How to handle categorical features with spark-ml?

I just wanted to complete Holden’s answer. Since Spark 2.3.0,OneHotEncoder has been deprecated and it will be removed in 3.0.0. Please use OneHotEncoderEstimator instead. In Scala: import org.apache.spark.ml.Pipeline import org.apache.spark.ml.feature.{OneHotEncoderEstimator, StringIndexer} val df = Seq((0, “a”, 1), (1, “b”, 2), (2, “c”, 3), (3, “a”, 4), (4, “a”, 4), (5, “c”, 3)).toDF(“id”, “category1”, “category2”) val … Read more

How to assign unique contiguous numbers to elements in a Spark RDD

Starting with Spark 1.0 there are two methods you can use to solve this easily: RDD.zipWithIndex is just like Seq.zipWithIndex, it adds contiguous (Long) numbers. This needs to count the elements in each partition first, so your input will be evaluated twice. Cache your input RDD if you want to use this. RDD.zipWithUniqueId also gives … Read more

What’s the difference between Spark ML and MLLIB packages

o.a.s.mllib contains old RDD-based API while o.a.s.ml contains new API build around Dataset and ML Pipelines. ml and mllib reached feature parity in 2.0.0 and mllib is slowly being deprecated (this already happened in case of linear regression) and most likely will be removed in the next major release. So unless your goal is backward … Read more

Hata!: SQLSTATE[HY000] [1045] Access denied for user 'divattrend_liink'@'localhost' (using password: YES)