Dropping a nested column from Spark DataFrame

It is just a programming exercise, but you can try something like this:

    import org.apache.spark.sql.{DataFrame, Column}
    import org.apache.spark.sql.types.{StructType, StructField}
    import org.apache.spark.sql.{functions => f}
    import scala.util.Try

    case class DFWithDropFrom(df: DataFrame) {
      def getSourceField(source: String): Try[StructField] = {
        Try(df.schema.fields.filter(_.name == source).head)
      }
      def getType(sourceField: StructField): Try[StructType] = {
        Try(sourceField.dataType.asInstanceOf[StructType])
      }
      def genOutputCol(names: Array[String], source: String): Column = … Read more
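The same idea, rebuilding the struct without the unwanted field, can be sketched in PySpark. This is a minimal illustration, not the truncated answer's code; the column and field names ("meta", "to_drop") are hypothetical:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()

    # Hypothetical data: a struct column "meta" with fields a, b, and to_drop.
    df = spark.createDataFrame(
        [(1, (10, 20, 30))],
        "id INT, meta STRUCT<a: INT, b: INT, to_drop: INT>")

    # Rebuild the struct from every field except the one being removed.
    kept = [F.col("meta." + fld.name).alias(fld.name)
            for fld in df.schema["meta"].dataType.fields
            if fld.name != "to_drop"]
    df.withColumn("meta", F.struct(*kept)).printSchema()  # meta keeps only a, b

On Spark 3.1+ the same thing is a one-liner: df.withColumn("meta", F.col("meta").dropFields("to_drop")).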

Access element of a vector in a Spark DataFrame (Logistic Regression probability vector) [duplicate]

Update: It seems there is a bug in Spark that prevents you from accessing individual elements of a dense vector in a select statement. Normally you would be able to access them just as you would a NumPy array, but when trying to run the code previously posted, you may get the error … Read more
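A common workaround (a sketch, not necessarily the truncated answer's exact code) is to extract the element with a small UDF; predictions and the "probability" column are assumed to come from a fitted LogisticRegressionModel:

    from pyspark.sql.functions import udf, col
    from pyspark.sql.types import DoubleType

    # Pull the probability of class 1 out of the ML Vector via a UDF,
    # since direct indexing inside select() hits the bug described above.
    prob_class_1 = udf(lambda v: float(v[1]), DoubleType())
    predictions.select(prob_class_1(col("probability")).alias("p1"))

On Spark 3.0+, pyspark.ml.functions.vector_to_array avoids the UDF entirely.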

How to extract model hyper-parameters from spark.ml in PySpark?

Ran into this problem as well. I found out that you need to call the Java property, for some reason I don't know why. So just do this:

    from pyspark.ml.tuning import TrainValidationSplit, ParamGridBuilder, CrossValidator
    from pyspark.ml.regression import LinearRegression
    from pyspark.ml.evaluation import RegressionEvaluator

    evaluator = RegressionEvaluator(metricName="mae")
    lr = LinearRegression()
    grid = ParamGridBuilder().addGrid(lr.maxIter, [500]) \
        .addGrid(lr.regParam, [0]) \
        … Read more
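A sketch of how the snippet above might continue, assuming grid.build() completes the builder chain and train_df is a training DataFrame (both assumptions, since the excerpt is truncated):

    # Fit, then read the hyper-parameters back through the underlying Java object.
    tvs = TrainValidationSplit(estimator=lr, estimatorParamMaps=grid.build(),
                               evaluator=evaluator)
    model = tvs.fit(train_df)

    best = model.bestModel
    print(best._java_obj.getMaxIter())   # 500
    print(best._java_obj.getRegParam())  # 0.0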

How to handle categorical features with spark-ml?

I just wanted to complete Holden's answer. Since Spark 2.3.0, OneHotEncoder has been deprecated and will be removed in 3.0.0. Please use OneHotEncoderEstimator instead. In Scala:

    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.feature.{OneHotEncoderEstimator, StringIndexer}

    val df = Seq((0, "a", 1), (1, "b", 2), (2, "c", 3),
                 (3, "a", 4), (4, "a", 4), (5, "c", 3))
      .toDF("id", "category1", "category2")
    val … Read more
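The PySpark equivalent of the Scala snippet above looks like this (a sketch; only the pipeline wiring beyond the shown imports is inferred):

    from pyspark.sql import SparkSession
    from pyspark.ml import Pipeline
    from pyspark.ml.feature import OneHotEncoderEstimator, StringIndexer

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(
        [(0, "a", 1), (1, "b", 2), (2, "c", 3),
         (3, "a", 4), (4, "a", 4), (5, "c", 3)],
        ["id", "category1", "category2"])

    # Index the string column, then one-hot encode the resulting indices.
    indexer = StringIndexer(inputCol="category1", outputCol="category1Index")
    encoder = OneHotEncoderEstimator(inputCols=["category1Index"],
                                     outputCols=["category1Vec"])
    Pipeline(stages=[indexer, encoder]).fit(df).transform(df).show()

Note that in Spark 3.0.0 OneHotEncoderEstimator was renamed back to OneHotEncoder, which now handles multiple columns itself.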

What’s the difference between Spark ML and MLLIB packages

o.a.s.mllib contains the old RDD-based API, while o.a.s.ml contains the new API built around Dataset and ML Pipelines. ml and mllib reached feature parity in 2.0.0, and mllib is slowly being deprecated (this has already happened in the case of linear regression) and will most likely be removed in the next major release. So unless your goal is backward … Read more
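To make the split concrete, here is a minimal PySpark sketch contrasting the two packages (the data and names are illustrative, not from the answer):

    from pyspark.sql import SparkSession

    # Old RDD-based API: lives under pyspark.mllib, works on RDDs of LabeledPoint.
    from pyspark.mllib.regression import LabeledPoint, LinearRegressionWithSGD

    # New API: lives under pyspark.ml, built around DataFrames and Pipelines.
    from pyspark.ml.regression import LinearRegression
    from pyspark.ml.linalg import Vectors

    spark = SparkSession.builder.getOrCreate()

    # ml estimators consume a DataFrame with a vector "features" column:
    train = spark.createDataFrame(
        [(Vectors.dense([1.0, 2.0]), 3.0),
         (Vectors.dense([2.0, 1.0]), 4.0)],
        ["features", "label"])
    model = LinearRegression().fit(train)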

How to split Vector into columns – using PySpark [duplicate]

Spark >= 3.0.0: Since Spark 3.0.0 this can be done without using a UDF.

    from pyspark.ml.functions import vector_to_array
    from pyspark.sql.functions import col

    (df
     .withColumn("xs", vector_to_array("vector"))
     .select(["word"] + [col("xs")[i] for i in range(3)]))

    ## +-------+-----+-----+-----+
    ## |   word|xs[0]|xs[1]|xs[2]|
    ## +-------+-----+-----+-----+
    ## | assert|  1.0|  2.0|  3.0|
    ## |require|  0.0|  2.0|  0.0|
    ## +-------+-----+-----+-----+

Spark < 3.0.0: One possible approach is to … Read more
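For the pre-3.0 case the excerpt cuts off, a UDF-based sketch along the same lines (one of several possible approaches, not necessarily the truncated answer's exact code):

    from pyspark.sql.functions import udf, col
    from pyspark.sql.types import ArrayType, DoubleType

    # Convert the ML Vector to a plain Python list, then index into it.
    to_array = udf(lambda v: v.toArray().tolist(), ArrayType(DoubleType()))
    (df
     .withColumn("xs", to_array(col("vector")))
     .select(["word"] + [col("xs")[i] for i in range(3)]))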

How do I convert an array (i.e. list) column to Vector

Personally, I would go with a Python UDF and wouldn't bother with anything else: Vectors are not native SQL types, so there will be performance overhead one way or another. In particular, this process requires two steps: data is first converted from the external type to a row, and then from the row to the internal representation using the generic … Read more
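A minimal sketch of the UDF approach the answer recommends; the DataFrame and column names here are assumptions for illustration:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import udf
    from pyspark.ml.linalg import Vectors, VectorUDT

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(1, [0.0, 1.0, 2.0])], ["id", "values"])

    # Wrap the array in a DenseVector; VectorUDT is the SQL type for ML vectors.
    list_to_vector = udf(lambda xs: Vectors.dense(xs), VectorUDT())
    df.withColumn("vector", list_to_vector("values")).printSchema()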
