What is the difference between Apache Mahout and Apache Spark’s MLlib?

The main difference will come from underlying frameworks. In case of Mahout it is Hadoop MapReduce and in case of MLib it is Spark. To be more specific – from the difference in per job overhead
If your ML algorithm mapped to the single MR job – main difference will be only startup overhead, which is dozens of seconds for Hadoop MR, and let say 1 second for Spark. So in case of model training it is not that important.
Things will be different if your algorithm is mapped to many jobs.
In this case we will have the same difference on overhead per iteration and it can be game changer.
Lets assume that we need 100 iterations, each needed 5 seconds of cluster CPU.

  • On Spark: it will take 100*5 + 100*1 seconds = 600 seconds.
  • On Hadoop: MR (Mahout) it will take 100*5+100*30 = 3500 seconds.

In the same time Hadoop MR is much more mature framework then Spark and if you have a lot of data, and stability is paramount – I would consider Mahout as serious alternative.

Leave a Comment

techhipbettruvabetnorabahisbahis forumu