Where does the Hadoop MapReduce framework send my System.out.print() statements? (stdout)

Actually, stdout only shows the System.out.println() output of the non-map/reduce classes. The System.out.println() output from the map and reduce phases can be seen in the task logs. An easy way to access the logs is http://localhost:50030/jobtracker.jsp -> click on the completed job -> click on the map or reduce task -> click on the task number -> Task Logs -> stdout logs. Hope this helps

Is the MongoDB aggregation framework faster than map/reduce?

Every test I have personally run (including using your own data) shows the aggregation framework being multiple times faster than map/reduce, and usually an order of magnitude faster. Just taking 1/10th of the data you posted (but rather than clearing the OS cache, warming the cache first, because I want to measure the performance of … Read more

Find all duplicate documents in a MongoDB collection by a key field

The accepted answer is terribly slow on large collections, and doesn't return the _ids of the duplicate records. Aggregation is much faster and can return the _ids: db.collection.aggregate([ { $group: { _id: { name: "$name" }, // replace `name` here twice uniqueIds: { $addToSet: "$_id" }, count: { $sum: 1 } } }, { $match: … Read more
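To make the pipeline's logic concrete, here is a minimal plain-Python sketch (not the MongoDB shell) of what the $group stage with $addToSet/$sum followed by the $match stage computes over a collection. The field name "name" and the sample documents are illustrative assumptions.

```python
from collections import defaultdict

def find_duplicates(docs, key):
    """Group documents by `key`, collecting _ids per group
    (like $group with $addToSet: "$_id" and $sum: 1), then
    keep only groups whose count exceeds 1 (like $match)."""
    groups = defaultdict(list)
    for doc in docs:
        groups[doc[key]].append(doc["_id"])
    return {k: ids for k, ids in groups.items() if len(ids) > 1}

docs = [
    {"_id": 1, "name": "alice"},
    {"_id": 2, "name": "bob"},
    {"_id": 3, "name": "alice"},
]
print(find_duplicates(docs, "name"))  # → {'alice': [1, 3]}
```

Each value in the result is the set of _ids sharing a key, which is exactly what you need to delete all but one of the duplicates afterwards.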

Integration testing Hive jobs

Ideally one would be able to test hive queries with LocalJobRunner rather than resorting to mini-cluster testing. However, due to HIVE-3816 running hive with mapred.job.tracker=local results in a call to the hive CLI executable installed on the system (as described in your question). Until HIVE-3816 is resolved, mini-cluster testing is the only option. Below is … Read more

merge output files after reduce phase

Instead of doing the file merging on your own, you can delegate the entire merging of the reduce output files by calling: hadoop fs -getmerge /output/dir/on/hdfs/ /desired/local/output/file.txt Note: this combines the HDFS files into a single local file, so make sure you have enough local disk space before running it.
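For intuition, here is a small local sketch of what getmerge effectively does: concatenate the reducer part files (part-00000, part-00001, …) in sorted order into one destination file. The directory layout and file contents below are hypothetical examples.

```python
import glob
import os
import tempfile

def getmerge(src_dir, dest_file):
    """Concatenate part-* files from src_dir, in sorted order,
    into dest_file (a local analogue of `hadoop fs -getmerge`)."""
    with open(dest_file, "wb") as out:
        for part in sorted(glob.glob(os.path.join(src_dir, "part-*"))):
            with open(part, "rb") as f:
                out.write(f.read())

# Example: two fake reducer outputs merged into one file.
src = tempfile.mkdtemp()
with open(os.path.join(src, "part-00000"), "w") as f:
    f.write("apple\t3\n")
with open(os.path.join(src, "part-00001"), "w") as f:
    f.write("banana\t5\n")
dest = os.path.join(src, "merged.txt")
getmerge(src, dest)
print(open(dest).read())  # → "apple\t3\nbanana\t5\n"
```

Sorting the part files keeps the merge deterministic, matching the per-reducer ordering of the output directory.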

Hadoop truncated/inconsistent counter name

There's nothing in the Hadoop code that truncates counter names after their initialization. So, as you've already pointed out, mapreduce.job.counters.counter.name.max controls a counter name's maximum length (with 64 symbols as the default value). This limit is applied during calls to AbstractCounterGroup.addCounter/findCounter. The relevant source code is the following: @Override public synchronized T addCounter(String counterName, String displayName, long value) { … Read more
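The effect of that limit can be sketched in a few lines of plain Python (an illustration of the behavior described above, not Hadoop's actual implementation): any counter name longer than the configured maximum is cut to that length when the counter is first added.

```python
# Default of mapreduce.job.counters.counter.name.max, per the answer above.
COUNTER_NAME_MAX = 64

def limit_counter_name(name, limit=COUNTER_NAME_MAX):
    """Truncate a counter name to the configured maximum length,
    mirroring the limit applied at addCounter/findCounter time."""
    return name if len(name) <= limit else name[:limit]

long_name = "BYTES_READ_FROM_" + "X" * 80
print(len(limit_counter_name(long_name)))  # → 64
print(limit_counter_name("MAP_INPUT_RECORDS"))  # unchanged, under the limit
```

This is why a long custom counter name appears truncated everywhere it is displayed: the name is shortened once, at creation, and the truncated form is what gets stored.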

Error!: SQLSTATE[HY000] [1045] Access denied for user 'divattrend_liink'@'localhost' (using password: YES)