apache-spark-mllib – Page 2

How to handle categorical features with spark-ml?

May 10, 2023 by Tarik

I just wanted to complete Holden’s answer. Since Spark 2.3.0, OneHotEncoder has been deprecated, and it will be removed in 3.0.0. Please use OneHotEncoderEstimator instead. In Scala:
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{OneHotEncoderEstimator, StringIndexer}
val df = Seq((0, "a", 1), (1, "b", 2), (2, "c", 3), (3, "a", 4), (4, "a", 4), (5, "c", 3)).toDF("id", "category1", "category2")
val …
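
The excerpt stops before the pipeline is assembled. As a rough sketch only (not necessarily how the original answer continues), an indexer and encoder could be wired together as below. It assumes Spark 2.3 or 2.4, where OneHotEncoderEstimator exists; from Spark 3.0 onward the same class is named OneHotEncoder. The object name and the output column names (category1Index, category1Vec, category2Vec) are made up for illustration.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{OneHotEncoderEstimator, StringIndexer}
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical helper object; assumes Spark 2.3.x or 2.4.x on the classpath.
object CategoricalSketch {
  def run(spark: SparkSession): DataFrame = {
    import spark.implicits._

    val df = Seq((0, "a", 1), (1, "b", 2), (2, "c", 3), (3, "a", 4), (4, "a", 4), (5, "c", 3))
      .toDF("id", "category1", "category2")

    // Map the string column to numeric indices; the encoder only accepts numeric input.
    val indexer = new StringIndexer()
      .setInputCol("category1")
      .setOutputCol("category1Index") // assumed column name

    // Encode both columns in one stage; category2 is already numeric.
    val encoder = new OneHotEncoderEstimator()
      .setInputCols(Array("category1Index", "category2"))
      .setOutputCols(Array("category1Vec", "category2Vec")) // assumed column names

    val pipeline = new Pipeline().setStages(Array(indexer, encoder))
    pipeline.fit(df).transform(df)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("categorical-sketch").getOrCreate()
    try run(spark).show(truncate = false)
    finally spark.stop()
  }
}
```

Unlike the deprecated transformer, the estimator is fitted on the data, so the set of categories is learned in fit and reused in transform.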

How to assign unique contiguous numbers to elements in a Spark RDD

May 8, 2023 by Tarik

Starting with Spark 1.0, there are two methods you can use to solve this easily. RDD.zipWithIndex works just like Seq.zipWithIndex: it adds contiguous (Long) numbers. This needs to count the elements in each partition first, so your input will be evaluated twice. Cache your input RDD if you want to use this. RDD.zipWithUniqueId also gives …
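
To make the difference concrete, here is a small sketch that applies both methods to the same cached RDD. The object name, the sample data, and the two-partition setup are assumptions chosen for illustration. As far as the RDD API documentation describes it, zipWithUniqueId assigns ids of the form k, n+k, 2n+k, ... within partition k (n being the number of partitions), so the ids are unique but may have gaps, and no extra Spark job is needed.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical helper object for illustration.
object ZipIdsSketch {
  def run(spark: SparkSession): (Array[(String, Long)], Array[(String, Long)]) = {
    // Cache the input because zipWithIndex runs an extra job to count
    // the elements of each partition before assigning indices.
    val input = spark.sparkContext
      .parallelize(Seq("a", "b", "c", "d", "e"), numSlices = 2)
      .cache()

    // Contiguous indices 0, 1, 2, ... in partition order.
    val indexed = input.zipWithIndex().collect()

    // Unique ids that are not necessarily contiguous.
    val unique = input.zipWithUniqueId().collect()

    (indexed, unique)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("zip-ids-sketch").getOrCreate()
    try {
      val (indexed, unique) = run(spark)
      println(indexed.mkString(", "))
      println(unique.mkString(", "))
    } finally spark.stop()
  }
}
```

If gaps in the numbering are acceptable, zipWithUniqueId avoids the extra pass over the data; if the numbers must be contiguous, zipWithIndex is the one to use.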

What’s the difference between Spark ML and MLLIB packages

May 3, 2023 by Tarik

o.a.s.mllib contains the old RDD-based API, while o.a.s.ml contains the new API built around Dataset and ML Pipelines. ml and mllib reached feature parity in 2.0.0, and mllib is slowly being deprecated (this has already happened for linear regression) and will most likely be removed in the next major release. So unless your goal is backward …
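
As an illustration of how the two packages differ in practice, the sketch below runs k-means on the same four points through each API. K-means is used here only because it exists in both packages; the object name, the sample points, and the parameter values are assumptions. Note that each package has its own Vector type, so the imports are aliased to keep them apart.

```scala
import org.apache.spark.ml.clustering.{KMeans => MlKMeans}
import org.apache.spark.ml.linalg.{Vectors => MlVectors}
import org.apache.spark.mllib.clustering.{KMeans => MllibKMeans}
import org.apache.spark.mllib.linalg.{Vectors => MllibVectors}
import org.apache.spark.sql.SparkSession

// Hypothetical helper object for illustration.
object MlVsMllibSketch {
  // Two well-separated groups of points (made-up data).
  val points: Seq[(Double, Double)] = Seq((0.0, 0.0), (0.1, 0.1), (9.0, 9.0), (9.1, 9.1))

  // o.a.s.mllib: operates on an RDD of mllib vectors via a static train method.
  def rddBased(spark: SparkSession): Int = {
    val rdd = spark.sparkContext.parallelize(points.map { case (x, y) => MllibVectors.dense(x, y) })
    val model = MllibKMeans.train(rdd, 2, 10) // k = 2, maxIterations = 10
    model.clusterCenters.length
  }

  // o.a.s.ml: an Estimator fitted on a DataFrame with a "features" column.
  def dataFrameBased(spark: SparkSession): Int = {
    import spark.implicits._
    val df = points.map { case (x, y) => Tuple1(MlVectors.dense(x, y)) }.toDF("features")
    val model = new MlKMeans().setK(2).setSeed(1L).fit(df)
    model.clusterCenters.length
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("ml-vs-mllib-sketch").getOrCreate()
    try {
      println(s"mllib centers: ${rddBased(spark)}")
      println(s"ml centers: ${dataFrameBased(spark)}")
    } finally spark.stop()
  }
}
```

The ml version slots directly into a Pipeline as a stage, which the RDD-based version cannot do.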

What is the difference between Apache Mahout and Apache Spark’s MLlib?

April 8, 2023 by Tarik

The main difference comes from the underlying frameworks: for Mahout it is Hadoop MapReduce, and for MLlib it is Spark. More specifically, it comes from the difference in per-job overhead. If your ML algorithm maps to a single MR job, the main difference will be only the startup overhead, which …