What is Hive? Is it a database? [closed]

Hive is a data warehousing package/infrastructure built on top of Hadoop. It provides an SQL dialect called Hive Query Language (HQL) for querying data stored in a Hadoop cluster. Like all SQL dialects in widespread use, HQL doesn’t fully conform to any particular revision of the ANSI SQL standard. It is perhaps closest to MySQL’s … Read more
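To make "SQL dialect" concrete, here is a hypothetical HQL snippet (the table and columns are invented for illustration); the syntax will look familiar to MySQL users:

```sql
-- Hypothetical example: a partitioned Hive table and an aggregate query
CREATE TABLE page_views (
  user_id BIGINT,
  url     STRING
)
PARTITIONED BY (dt STRING)
STORED AS ORC;

SELECT url, COUNT(*) AS hits
FROM page_views
WHERE dt = '2024-01-01'
GROUP BY url
ORDER BY hits DESC
LIMIT 10;
```

Behind the scenes Hive compiles such a query into jobs that run on the Hadoop cluster rather than executing it in a traditional database engine.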

Parquet without Hadoop?

Investigating the same question I found that apparently it’s not possible for the moment. I found this git issue, which proposes decoupling parquet from the hadoop api. Apparently it has not been done yet. In the Apache Jira I found an issue, which asks for a way to read a parquet file outside hadoop. It … Read more

Difference between `hadoop dfs` and `hadoop fs` [closed]

You can see the definitions of the two commands (`hadoop fs` and `hadoop dfs`) in `$HADOOP_HOME/bin/hadoop`:

```shell
elif [ "$COMMAND" = "datanode" ] ; then
  CLASS='org.apache.hadoop.hdfs.server.datanode.DataNode'
  HADOOP_OPTS="$HADOOP_OPTS $HADOOP_DATANODE_OPTS"
elif [ "$COMMAND" = "fs" ] ; then
  CLASS=org.apache.hadoop.fs.FsShell
  HADOOP_OPTS="$HADOOP_OPTS $HADOOP_CLIENT_OPTS"
elif [ "$COMMAND" = "dfs" ] ; then
  CLASS=org.apache.hadoop.fs.FsShell
  HADOOP_OPTS="$HADOOP_OPTS $HADOOP_CLIENT_OPTS"
elif [ "$COMMAND" = "dfsadmin" ]
```

… Read more

What is the purpose of “uber mode” in hadoop?

What is uber mode in Hadoop 2? Normally, mappers and reducers run in separate containers allocated through the ResourceManager (RM). With the uber configuration enabled, small jobs run their mappers and reducers in the same process as the ApplicationMaster (AM) instead. Uber jobs: uber jobs are jobs that are executed within the MapReduce ApplicationMaster. … Read more
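For reference, uber mode is controlled by a handful of properties in `mapred-site.xml`; the values below reflect the usual defaults, but check your distribution's documentation:

```xml
<!-- mapred-site.xml: allow small jobs to run inside the ApplicationMaster -->
<property>
  <name>mapreduce.job.ubertask.enable</name>
  <value>true</value> <!-- default is false -->
</property>
<property>
  <name>mapreduce.job.ubertask.maxmaps</name>
  <value>9</value> <!-- a job qualifies only with at most this many maps -->
</property>
<property>
  <name>mapreduce.job.ubertask.maxreduces</name>
  <value>1</value> <!-- at most one reducer is supported -->
</property>
```

A job that exceeds these thresholds falls back to the normal container-per-task execution.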

Hadoop speculative task execution

One problem with the Hadoop system is that by dividing the tasks across many nodes, it is possible for a few slow nodes to rate-limit the rest of the program. Tasks may be slow for various reasons, including hardware degradation or software misconfiguration, but the causes may be hard to detect since the tasks still … Read more
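Speculative execution is Hadoop's answer to those stragglers, and it can be toggled per job type. These are the standard `mapred-site.xml` properties (both default to true):

```xml
<!-- mapred-site.xml: run backup copies of slow tasks on other nodes -->
<property>
  <name>mapreduce.map.speculative</name>
  <value>true</value>
</property>
<property>
  <name>mapreduce.reduce.speculative</name>
  <value>true</value>
</property>
```

Whichever attempt finishes first wins and the duplicate is killed, trading some extra cluster capacity for a shorter job tail.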

Understand Spark: Cluster Manager, Master and Driver nodes

1. The Cluster Manager is a long-running service; on which node does it run? In Spark standalone mode, the Cluster Manager is the Master process. It can be started on any node with `./sbin/start-master.sh`; on YARN, it is the ResourceManager. 2. Is it possible that the Master and the Driver nodes will be the same machine? I … Read more
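A sketch of where each process can live in standalone mode (host names and `app.jar` are made up for illustration; older Spark releases name the worker script `start-slave.sh`):

```shell
# Start the standalone Cluster Manager (Master) on any node you choose
./sbin/start-master.sh            # logs a URL like spark://master-host:7077

# Start a Worker on each executor node, pointing it at the Master
./sbin/start-worker.sh spark://master-host:7077

# client deploy mode: the Driver runs in the spark-submit JVM on this machine
spark-submit --master spark://master-host:7077 --deploy-mode client app.jar

# cluster deploy mode: the Driver is launched inside the cluster on a Worker
spark-submit --master spark://master-host:7077 --deploy-mode cluster app.jar
```

So the Master and the Driver can indeed share a machine, for example when you run `spark-submit` in client mode on the same host where the Master was started.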

Merging multiple files into one within Hadoop

In order to keep everything on the grid, use Hadoop Streaming with a single reducer and `cat` as both the mapper and the reducer (basically a no-op), and add compression via MapReduce flags:

```shell
hadoop jar \
  $HADOOP_PREFIX/share/hadoop/tools/lib/hadoop-streaming.jar \
  -Dmapred.reduce.tasks=1 \
  -Dmapred.job.queue.name=$QUEUE \
  -input "$INPUT" \
  -output "$OUTPUT" \
  -mapper cat \
  -reducer cat
```

If you want compression … Read more
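If the merged result is allowed to land on the local filesystem instead of staying on the grid, a simpler alternative is the standard HDFS shell `getmerge` command (paths below are hypothetical):

```shell
# Concatenate every matching HDFS file into a single local file
hadoop fs -getmerge /user/me/output/part-* merged.txt

# Optionally push the single file back into HDFS
hadoop fs -put merged.txt /user/me/merged/merged.txt
```

Note that this routes all the data through the client machine, so the streaming-job approach above remains the better choice for very large inputs.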

Apache Hadoop Yarn – Underutilization of cores

The problem lies not with `yarn-site.xml` or `spark-defaults.conf` but with the resource calculator that assigns cores to the executors, or, in the case of MapReduce jobs, to the mappers and reducers. The default resource calculator, `org.apache.hadoop.yarn.util.resource.DefaultResourceCalculator`, uses only memory information when allocating containers; CPU scheduling is not enabled by default. To use both … Read more
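With the CapacityScheduler, switching to the DominantResourceCalculator makes YARN account for vcores as well as memory when sizing containers; this is the standard property in `capacity-scheduler.xml`:

```xml
<!-- capacity-scheduler.xml: consider CPU (vcores) in addition to memory -->
<property>
  <name>yarn.scheduler.capacity.resource-calculator</name>
  <value>org.apache.hadoop.yarn.util.resource.DominantResourceCalculator</value>
</property>
```

After restarting the ResourceManager, container requests that ask for multiple cores (for example Spark's `--executor-cores`) are actually honored instead of being collapsed to one vcore each.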

Error!: SQLSTATE[08004] [1040] Too many connections