mapreduce – Page 5 – Tarik Billa

When do reduce tasks start in Hadoop?

January 30, 2023 by Tarik

The reduce phase has 3 steps: shuffle, sort, reduce. Shuffle is where the data is collected by the reducer from each mapper. This can happen while mappers are generating data since it is only a data transfer. On the other hand, sort and reduce can only start once all the mappers are done. You can … Read more

What is Map/Reduce?

January 30, 2023 by Tarik

From the abstract of Google’s MapReduce research publication page: MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the … Read more
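The model described in the abstract can be sketched in a few lines of single-process Python. This is only an illustration of the programming model, not Google's or Hadoop's implementation; the names `map_fn`, `reduce_fn`, and `run_mapreduce` are hypothetical, and word counting is an assumed example task.

```python
from collections import defaultdict


def map_fn(key, value):
    """Map: take one (document id, text) pair, emit intermediate (word, 1) pairs."""
    for word in value.split():
        yield word.lower(), 1


def reduce_fn(key, values):
    """Reduce: merge all intermediate values that share the same key."""
    return key, sum(values)


def run_mapreduce(records, map_fn, reduce_fn):
    """Toy driver (hypothetical): group intermediate pairs by key, then reduce each group."""
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in map_fn(key, value):
            groups[out_key].append(out_value)
    # Keys are visited in sorted order, mirroring the sorted input a reducer sees.
    return [reduce_fn(k, groups[k]) for k in sorted(groups)]
```

Calling `run_mapreduce([("d1", "the cat"), ("d2", "The dog")], map_fn, reduce_fn)` groups the two occurrences of "the" under one key before the reduce function sums them.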

Container is running beyond memory limits

January 19, 2023 by Tarik

You should also properly configure the maximum memory allocations for MapReduce. From this Hortonworks tutorial: […] Each machine in our cluster has 48 GB of RAM. Some of this RAM should be reserved for Operating System usage. On each node, we’ll assign 40 GB RAM for YARN to use and keep 8 GB for the … Read more

Is there a .NET equivalent to Apache Hadoop? [closed]

January 8, 2023 by Tarik

Have you looked at using Hadoop’s streaming? I use it in Python all the time :-). I’m starting to see that the heterogeneous approach is often the best and it looks like other folks are doing the same. If you look at projects like protocol-buffers or Facebook’s Thrift you see that sometimes it’s just best … Read more

Does MongoDB’s $in clause guarantee order

December 27, 2022 by Tarik

As noted, the order of the arguments in the array of an $in clause does not determine the order in which the documents are retrieved. That will of course be the natural order or the selected index order, as shown. If you need to preserve this order, then you basically have two options. So … Read more

How does the MapReduce sort algorithm work?

December 22, 2022 by Tarik

Here are some details on Hadoop’s implementation for TeraSort: TeraSort is a standard map/reduce sort, except for a custom partitioner that uses a sorted list of N − 1 sampled keys that define the key range for each reduce. In particular, all keys such that sample[i − 1] <= key < sample[i] are sent to … Read more
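The range rule in the excerpt can be sketched with a binary search over the sampled keys. This is a minimal Python illustration, not Hadoop's TeraSort partitioner; `make_partitioner` is a hypothetical name, and the sketch assumes that a key in the range starting at sample[i − 1] goes to reduce i, with keys below the first sample going to reduce 0.

```python
from bisect import bisect_right


def make_partitioner(samples):
    """Build a range partitioner from N - 1 sampled keys (N reducers in total)."""
    cut_points = sorted(samples)

    def partition(key):
        # bisect_right returns i such that cut_points[i - 1] <= key < cut_points[i],
        # treating the ends as open: i == 0 below the first sample, i == N - 1 above the last.
        return bisect_right(cut_points, key)

    return partition


def range_sort(keys, samples):
    """Send each key to its reducer, sort inside each reducer, concatenate in reducer order."""
    partition = make_partitioner(samples)
    buckets = [[] for _ in range(len(samples) + 1)]
    for key in keys:
        buckets[partition(key)].append(key)
    return [key for bucket in buckets for key in sorted(bucket)]
```

Because every key in reducer i is smaller than every key in reducer i + 1, concatenating the sorted reducer outputs yields a globally sorted result without a final merge.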

Can Apache Spark run without Hadoop?

December 22, 2022 by Tarik

How does Hadoop process records split across block boundaries?

December 18, 2022 by Tarik

Interesting question. I spent some time looking at the code for the details, and here are my thoughts. The splits are computed on the client by InputFormat.getSplits, so a look at FileInputFormat gives the following info: For each input file, get the file length, the block size and calculate the split size as max(minSize, min(maxSize, … Read more
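The split-size formula can be tried out with a small Python sketch. The excerpt cuts off before the last argument; the sketch assumes it is the block size, and the function names are hypothetical. Carving the file into fixed-size ranges is also a simplification, since the real implementation may treat the final split differently.

```python
def compute_split_size(block_size, min_size, max_size):
    """Clamp the block size between the configured minimum and maximum split sizes.

    Assumption: the argument elided in the excerpt is the block size.
    """
    return max(min_size, min(max_size, block_size))


def split_ranges(file_length, split_size):
    """Return (offset, length) pairs covering the file in split_size chunks (simplified)."""
    return [
        (start, min(split_size, file_length - start))
        for start in range(0, file_length, split_size)
    ]
```

With a minimum of 1 and a very large maximum, the split size simply equals the block size, so splits line up with blocks. A record that straddles two of these byte ranges is the case the question asks about.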

Chaining multiple MapReduce jobs in Hadoop

December 17, 2022 by Tarik

I think this tutorial on Yahoo’s developer network will help you with this: Chaining Jobs. You use JobClient.runJob(). The output path of the data from the first job becomes the input path to your second job. These need to be passed in as arguments to your jobs with appropriate code to parse them and … Read more
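The path hand-off between jobs can be illustrated without Hadoop. The sketch below uses plain Python functions as stand-ins for jobs; it does not call JobClient.runJob(), and `word_count_job`, `top_word_job`, and `run_chain` are hypothetical names. The only point it shows is that the first job's output path is passed in as the second job's input path.

```python
import tempfile
from collections import Counter
from pathlib import Path


def word_count_job(input_path, output_path):
    """Job 1 (stand-in): count words and write tab-separated word/count lines."""
    counts = Counter(Path(input_path).read_text().split())
    lines = [f"{word}\t{count}" for word, count in sorted(counts.items())]
    Path(output_path).write_text("\n".join(lines))


def top_word_job(input_path, output_path):
    """Job 2 (stand-in): read job 1's output and keep the most frequent word."""
    rows = [line.split("\t") for line in Path(input_path).read_text().splitlines()]
    word, count = max(rows, key=lambda row: int(row[1]))
    Path(output_path).write_text(f"{word}\t{count}")


def run_chain(text):
    """Driver: run the jobs in sequence, wiring job 1's output path into job 2."""
    with tempfile.TemporaryDirectory() as tmp:
        raw = Path(tmp) / "input.txt"
        intermediate = Path(tmp) / "job1.out"
        final = Path(tmp) / "job2.out"
        raw.write_text(text)
        word_count_job(raw, intermediate)   # job 1 writes the intermediate path
        top_word_job(intermediate, final)   # job 2 reads that same path
        return final.read_text()
```

In a real driver the three paths would arrive as command-line arguments, as the excerpt suggests, rather than being created in a temporary directory.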

What is the purpose of shuffling and sorting phase in the reducer in Map Reduce Programming?

December 8, 2022 by Tarik

First of all, shuffling is the process of transferring data from the mappers to the reducers, so I think it is obvious that it is necessary for the reducers, since otherwise, they wouldn’t be able to have any input (or input from every mapper). Shuffling can start even before the map phase has finished, to … Read more
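What shuffling and sorting give each reducer can be sketched in plain Python. This is an in-memory illustration under assumed details, not Hadoop's implementation: `partition_for` and `shuffle_and_sort` are hypothetical names, and a CRC32 hash is used only so that the key-to-reducer assignment is stable between runs.

```python
import zlib
from collections import defaultdict


def partition_for(key, num_reducers):
    """Pick the reducer for a key with a hash that is stable across runs."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def shuffle_and_sort(mapper_outputs, num_reducers):
    """Route every mapper's (key, value) pairs to reducers, then sort each reducer's keys."""
    reducer_inputs = [defaultdict(list) for _ in range(num_reducers)]
    for output in mapper_outputs:            # one list of pairs per mapper
        for key, value in output:
            reducer_inputs[partition_for(key, num_reducers)][key].append(value)
    # Each reducer receives its keys in sorted order, with all values for a key together.
    return [sorted(groups.items()) for groups in reducer_inputs]
```

Every key lands on exactly one reducer, and that reducer sees the values for the key from all mappers, which is why the reduce function cannot run on a key until every mapper has finished.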