mapreduce – Page 2 – Tarik Billa

Best way to do one-to-many “JOIN” in CouchDB

September 21, 2023 by Tarik

Thank you! This is a great example to show off CouchDB 0.11’s new features! You must use the fetch-related-data feature to reference documents in the view. Optionally, for more convenient JSON, use a _list function to clean up the results. See Couchio’s writeup on “JOIN”s for details. Here is the plan: Firstly, you have a …

Hadoop one Map and multiple Reduce

September 14, 2023 by Tarik

Maybe a simple solution would be to write a job that doesn’t have a reduce function. So you would pass all the mapped data directly to the output of the job. You just set the number of reducers to zero for the job. Then you would write a job for each different reduce function that …

Hadoop DistributedCache is deprecated – what is the preferred API?

August 28, 2023 by Tarik

The APIs for the Distributed Cache can be found in the Job class itself. Check the documentation here: http://hadoop.apache.org/docs/stable2/api/org/apache/hadoop/mapreduce/Job.html
The code should be something like:
    Job job = new Job();
    …
    job.addCacheFile(new Path(filename).toUri());
In your mapper code:
    Path[] localPaths = context.getLocalCacheFiles();
    …

MongoDB aggregation comparison: group(), $group and MapReduce

August 14, 2023 by Tarik

It is somewhat confusing since the names are similar, but the group() command is a different feature and implementation from the $group pipeline operator in the Aggregation Framework. The group() command, Aggregation Framework, and MapReduce are collectively aggregation features of MongoDB. There is some overlap in features, but I’ll attempt to explain the differences and …

List the namenode and datanodes of a cluster from any node?

August 11, 2023 by Tarik

Use the dfsadmin command:
    bin/hadoop dfsadmin -report
Update (2015):
    bin/hdfs dfsadmin -report

How does Hadoop perform input splits?

August 1, 2023 by Tarik

The InputFormat is responsible for providing the splits. In general, if you have n nodes, HDFS will distribute the file over all n nodes. If you start a job, there will be n mappers by default. Thanks to Hadoop, the mapper on a machine will process the part of the data that is …
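As a rough illustration of what "providing the splits" involves, the sketch below cuts a single file into (offset, length) byte ranges. It is plain Java rather than the Hadoop API so that it runs on its own. The class and method names (`SplitRanges`, `splitsFor`) are hypothetical, and the 10% slop allowed on the last split is an assumption modelled on the behaviour commonly attributed to Hadoop's stock FileInputFormat.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Hadoop: lists the byte ranges a
// splittable file would be cut into for a given split size.
final class SplitRanges {
    // Assumption: the last split may be up to 10% larger than splitSize.
    private static final double SPLIT_SLOP = 1.1;

    /** Returns {offset, length} pairs covering a file of fileLength bytes. */
    static List<long[]> splitsFor(long fileLength, long splitSize) {
        if (fileLength <= 0 || splitSize <= 0) {
            // Empty files are out of scope for this sketch.
            throw new IllegalArgumentException("lengths must be positive");
        }
        List<long[]> splits = new ArrayList<>();
        long offset = 0;
        long remaining = fileLength;
        // Emit full-size splits while clearly more than one split remains.
        while ((double) remaining / splitSize > SPLIT_SLOP) {
            splits.add(new long[] {offset, splitSize});
            offset += splitSize;
            remaining -= splitSize;
        }
        // Whatever is left becomes the final (possibly short) split.
        if (remaining > 0) {
            splits.add(new long[] {offset, remaining});
        }
        return splits;
    }

    public static void main(String[] args) {
        long mib = 1024L * 1024L;
        for (long[] s : splitsFor(300 * mib, 128 * mib)) {
            System.out.println("offset=" + s[0] + " length=" + s[1]);
        }
    }
}
```

Each range would then be handed to one mapper, ideally scheduled on a node that already holds the corresponding block.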

Setting the number of map tasks and reduce tasks

July 29, 2023 by Tarik

The number of map tasks for a given job is driven by the number of input splits and not by the mapred.map.tasks parameter. For each input split a map task is spawned. So, over the lifetime of a MapReduce job the number of map tasks is equal to the number of input splits. mapred.map.tasks is …
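To make the relationship concrete, here is a minimal sketch that estimates the map task count from the input files alone. It assumes file-based, splittable input where each file is split independently, a split size of max(minSize, min(maxSize, blockSize)), and a 10% slop on the last split; those rules are modelled on the stock FileInputFormat as commonly described. The names `MapTaskEstimator`, `countSplits`, and `countMapTasks` are hypothetical, and empty files are ignored.

```java
// Hypothetical estimator, not part of Hadoop: map tasks = total input splits.
final class MapTaskEstimator {
    // Assumption: the last split may be up to 10% larger than the split size.
    private static final double SPLIT_SLOP = 1.1;

    /** Split size bounded below by minSize and above by maxSize. */
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    /** Number of splits for one file; an empty file counts as zero here. */
    static long countSplits(long fileLength, long splitSize) {
        long splits = 0;
        long remaining = fileLength;
        while ((double) remaining / splitSize > SPLIT_SLOP) {
            splits++;
            remaining -= splitSize;
        }
        return remaining > 0 ? splits + 1 : splits;
    }

    /** One map task per split, summed over all input files. */
    static long countMapTasks(long[] fileLengths, long blockSize,
                              long minSize, long maxSize) {
        long splitSize = computeSplitSize(blockSize, minSize, maxSize);
        long total = 0;
        for (long length : fileLengths) {
            total += countSplits(length, splitSize);
        }
        return total;
    }

    public static void main(String[] args) {
        long mib = 1024L * 1024L;
        long[] files = {1 * mib, 1 * mib, 1 * mib, 300 * mib};
        // Three small files plus one 300 MiB file with 128 MiB blocks.
        System.out.println(countMapTasks(files, 128 * mib, 1, Long.MAX_VALUE));
    }
}
```

Note that no requested task count appears among the inputs: under these assumptions the only levers are the files themselves and the split size bounds. Raising the minimum split size to 256 MiB in the example reduces the estimate from 6 to 5.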

MongoDB: Terrible MapReduce Performance

July 24, 2023 by Tarik

Excerpts from MongoDB: The Definitive Guide (O’Reilly): The price of using MapReduce is speed: group is not particularly speedy, but MapReduce is slower and is not supposed to be used in “real time.” You run MapReduce as a background job, it creates a collection of results, and then you can query that collection in real …

How to write ‘map only’ hadoop jobs?

June 13, 2023 by Tarik

This turns off the reducer:
    job.setNumReduceTasks(0);
http://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapreduce/Job.html#setNumReduceTasks(int)

What are SUCCESS and part-r-00000 files in hadoop

June 7, 2023 by Tarik

See http://www.cloudera.com/blog/2010/08/what%E2%80%99s-new-in-apache-hadoop-0-21/
On the successful completion of a job, the MapReduce runtime creates a _SUCCESS file in the output directory. This may be useful for applications that need to see if a result set is complete just by inspecting HDFS. (MAPREDUCE-947) This would typically be used by job scheduling systems (such as Oozie), to denote …
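A downstream application could test for the marker before reading any part files. The sketch below shows the idea against a local directory using java.nio, purely as a stand-in: against a real cluster the same check would go through Hadoop's FileSystem API (for example FileSystem.exists on a Path ending in _SUCCESS). The class and method names (`SuccessMarkerCheck`, `isJobOutputComplete`) are hypothetical.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper: treats an output directory as complete only if
// it contains the _SUCCESS marker file. Uses the local filesystem as a
// stand-in for HDFS.
final class SuccessMarkerCheck {
    static boolean isJobOutputComplete(Path outputDir) {
        return Files.isRegularFile(outputDir.resolve("_SUCCESS"));
    }

    public static void main(String[] args) throws IOException {
        Path out = Files.createTempDirectory("job-output");
        // Part files alone do not prove the job finished.
        Files.createFile(out.resolve("part-r-00000"));
        System.out.println("before marker: " + isJobOutputComplete(out));
        // Simulate the runtime writing the marker on successful completion.
        Files.createFile(out.resolve("_SUCCESS"));
        System.out.println("after marker: " + isJobOutputComplete(out));
    }
}
```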