Dropping a nested column from Spark DataFrame

It is just a programming exercise, but you can try something like this:

    import org.apache.spark.sql.{DataFrame, Column}
    import org.apache.spark.sql.types.{StructType, StructField}
    import org.apache.spark.sql.{functions => f}
    import scala.util.Try

    case class DFWithDropFrom(df: DataFrame) {
      def getSourceField(source: String): Try[StructField] = {
        Try(df.schema.fields.filter(_.name == source).head)
      }
      def getType(sourceField: StructField): Try[StructType] = {
        Try(sourceField.dataType.asInstanceOf[StructType])
      }
      def genOutputCol(names: Array[String], source: String): Column = … Read more
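The same idea, rebuilding the struct without the unwanted field, can be sketched in PySpark. This is a minimal illustration, not the truncated answer's code; the column and field names ("meta", "to_drop") are hypothetical:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()

    # Hypothetical data: a struct column "meta" with fields a, b, and to_drop.
    df = spark.createDataFrame(
        [(1, (10, 20, 30))],
        "id INT, meta STRUCT<a: INT, b: INT, to_drop: INT>")

    # Rebuild the struct from every field except the one being removed.
    kept = [F.col("meta." + fld.name).alias(fld.name)
            for fld in df.schema["meta"].dataType.fields
            if fld.name != "to_drop"]
    df.withColumn("meta", F.struct(*kept)).printSchema()  # meta keeps only a, b

On Spark 3.1+ the same thing is a one-liner: df.withColumn("meta", F.col("meta").dropFields("to_drop")).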

Access element of a vector in a Spark DataFrame (Logistic Regression probability vector) [duplicate]

Update: It seems there is a bug in Spark that prevents you from accessing individual elements of a dense vector in a select statement. Normally you would be able to access them just as you would a NumPy array, but when trying to run the code previously posted, you may get the error … Read more
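A common workaround (a sketch, not necessarily the truncated answer's exact code) is to extract the element with a small UDF; predictions and the "probability" column are assumed to come from a fitted LogisticRegressionModel:

    from pyspark.sql.functions import udf, col
    from pyspark.sql.types import DoubleType

    # Pull the probability of class 1 out of the ML Vector via a UDF,
    # since direct indexing inside select() hits the bug described above.
    prob_class_1 = udf(lambda v: float(v[1]), DoubleType())
    predictions.select(prob_class_1(col("probability")).alias("p1"))

On Spark 3.0+, pyspark.ml.functions.vector_to_array avoids the UDF entirely.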

How to extract model hyper-parameters from spark.ml in PySpark?

Ran into this problem as well. I found out that you need to call the Java property, for some reason I don't know why. So just do this:

    from pyspark.ml.tuning import TrainValidationSplit, ParamGridBuilder, CrossValidator
    from pyspark.ml.regression import LinearRegression
    from pyspark.ml.evaluation import RegressionEvaluator

    evaluator = RegressionEvaluator(metricName="mae")
    lr = LinearRegression()
    grid = ParamGridBuilder().addGrid(lr.maxIter, [500]) \
        .addGrid(lr.regParam, [0]) \
        … Read more
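A sketch of how the snippet above might continue, assuming grid.build() completes the builder chain and train_df is a training DataFrame (both assumptions, since the excerpt is truncated):

    # Fit, then read the hyper-parameters back through the underlying Java object.
    tvs = TrainValidationSplit(estimator=lr, estimatorParamMaps=grid.build(),
                               evaluator=evaluator)
    model = tvs.fit(train_df)

    best = model.bestModel
    print(best._java_obj.getMaxIter())   # 500
    print(best._java_obj.getRegParam())  # 0.0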

How to handle categorical features with spark-ml?

I just wanted to complete Holden's answer. Since Spark 2.3.0, OneHotEncoder has been deprecated and will be removed in 3.0.0. Please use OneHotEncoderEstimator instead. In Scala:

    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.feature.{OneHotEncoderEstimator, StringIndexer}

    val df = Seq((0, "a", 1), (1, "b", 2), (2, "c", 3),
                 (3, "a", 4), (4, "a", 4), (5, "c", 3))
      .toDF("id", "category1", "category2")
    val … Read more
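The PySpark equivalent of the Scala snippet above looks like this (a sketch; only the pipeline wiring beyond the shown imports is inferred):

    from pyspark.sql import SparkSession
    from pyspark.ml import Pipeline
    from pyspark.ml.feature import OneHotEncoderEstimator, StringIndexer

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(
        [(0, "a", 1), (1, "b", 2), (2, "c", 3),
         (3, "a", 4), (4, "a", 4), (5, "c", 3)],
        ["id", "category1", "category2"])

    # Index the string column, then one-hot encode the resulting indices.
    indexer = StringIndexer(inputCol="category1", outputCol="category1Index")
    encoder = OneHotEncoderEstimator(inputCols=["category1Index"],
                                     outputCols=["category1Vec"])
    Pipeline(stages=[indexer, encoder]).fit(df).transform(df).show()

Note that in Spark 3.0.0 OneHotEncoderEstimator was renamed back to OneHotEncoder, which now handles multiple columns itself.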

What’s the difference between Spark ML and MLLIB packages

o.a.s.mllib contains the old RDD-based API, while o.a.s.ml contains the new API built around Dataset and ML Pipelines. ml and mllib reached feature parity in 2.0.0, and mllib is slowly being deprecated (this has already happened in the case of linear regression) and will most likely be removed in the next major release. So unless your goal is backward … Read more
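To make the split concrete, here is a minimal PySpark sketch contrasting the two packages (the data and names are illustrative, not from the answer):

    from pyspark.sql import SparkSession

    # Old RDD-based API: lives under pyspark.mllib, works on RDDs of LabeledPoint.
    from pyspark.mllib.regression import LabeledPoint, LinearRegressionWithSGD

    # New API: lives under pyspark.ml, built around DataFrames and Pipelines.
    from pyspark.ml.regression import LinearRegression
    from pyspark.ml.linalg import Vectors

    spark = SparkSession.builder.getOrCreate()

    # ml estimators consume a DataFrame with a vector "features" column:
    train = spark.createDataFrame(
        [(Vectors.dense([1.0, 2.0]), 3.0),
         (Vectors.dense([2.0, 1.0]), 4.0)],
        ["features", "label"])
    model = LinearRegression().fit(train)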

How to split Vector into columns – using PySpark [duplicate]

Spark >= 3.0.0: Since Spark 3.0.0 this can be done without using a UDF.

    from pyspark.ml.functions import vector_to_array
    from pyspark.sql.functions import col

    (df
     .withColumn("xs", vector_to_array("vector"))
     .select(["word"] + [col("xs")[i] for i in range(3)]))

    ## +-------+-----+-----+-----+
    ## |   word|xs[0]|xs[1]|xs[2]|
    ## +-------+-----+-----+-----+
    ## | assert|  1.0|  2.0|  3.0|
    ## |require|  0.0|  2.0|  0.0|
    ## +-------+-----+-----+-----+

Spark < 3.0.0: One possible approach is to … Read more
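For the pre-3.0 case the excerpt cuts off, a UDF-based sketch along the same lines (one of several possible approaches, not necessarily the truncated answer's exact code):

    from pyspark.sql.functions import udf, col
    from pyspark.sql.types import ArrayType, DoubleType

    # Convert the ML Vector to a plain Python list, then index into it.
    to_array = udf(lambda v: v.toArray().tolist(), ArrayType(DoubleType()))
    (df
     .withColumn("xs", to_array(col("vector")))
     .select(["word"] + [col("xs")[i] for i in range(3)]))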

How do I convert an array (i.e. list) column to Vector

Personally, I would go with a Python UDF and wouldn't bother with anything else: Vectors are not native SQL types, so there will be performance overhead one way or another. In particular, this process requires two steps: data is first converted from the external type to a row, and then from the row to the internal representation using the generic … Read more
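A minimal sketch of the UDF approach the answer recommends; the DataFrame and column names here are assumptions for illustration:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import udf
    from pyspark.ml.linalg import Vectors, VectorUDT

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(1, [0.0, 1.0, 2.0])], ["id", "values"])

    # Wrap the array in a DenseVector; VectorUDT is the SQL type for ML vectors.
    list_to_vector = udf(lambda xs: Vectors.dense(xs), VectorUDT())
    df.withColumn("vector", list_to_vector("values")).printSchema()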
