Working With Hadoop: localhost: Error: JAVA_HOME is not set
I am using Hadoop 1.1 and faced the same problem. I solved it by setting the JAVA_HOME variable in /etc/hadoop/hadoop-env.sh: export JAVA_HOME=/usr/lib/jvm/<jdk folder>
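The JDK folder name differs between distributions, so it is worth looking it up before editing hadoop-env.sh; a minimal sketch, assuming an OpenJDK install under /usr/lib/jvm (the exact path below is only an example):
readlink -f $(which java)                            # prints e.g. /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64   # example folder; use whatever the command above points into
Restart the Hadoop daemons afterwards so the new value is picked up.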
There’s another thing to check when getting errors like this for classes that are Writables, Mappers, Reducers, etc. If the class is an inner class, make sure it’s declared static (i.e. it doesn’t need an instance of the enclosing class). Otherwise, Hadoop cannot instantiate your inner class and will give this same error – that a … Read more
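For illustration, here is a minimal Scala sketch of the same rule (class and job names are hypothetical): nesting the Mapper inside an object rather than a class is the Scala analogue of a static nested class, so Hadoop can instantiate it without an enclosing instance.
import org.apache.hadoop.io.{IntWritable, LongWritable, Text}
import org.apache.hadoop.mapreduce.Mapper

object WordCountJob {
  // Nested in an object, not in a class: there is no hidden reference to an outer
  // instance, so Hadoop's reflection-based instantiation succeeds.
  class TokenMapper extends Mapper[LongWritable, Text, Text, IntWritable] {
    private val one  = new IntWritable(1)
    private val word = new Text()
    override def map(key: LongWritable, value: Text,
                     ctx: Mapper[LongWritable, Text, Text, IntWritable]#Context): Unit = {
      value.toString.split("\\s+").filter(_.nonEmpty).foreach { tok =>
        word.set(tok)
        ctx.write(word, one)   // emit (token, 1) for each whitespace-separated token
      }
    }
  }
}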
1. The Cluster Manager is a long-running service; on which node does it run? The Cluster Manager is the Master process in Spark standalone mode. It can be started anywhere with ./sbin/start-master.sh; in YARN it would be the Resource Manager. 2. Is it possible that the Master and the Driver nodes will be the same machine? I … Read more
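As a rough illustration (host name and port below are hypothetical), the standalone Master started by ./sbin/start-master.sh is the cluster manager, while the driver is simply whichever JVM runs code like this; that JVM may or may not live on the same machine as the Master:
import org.apache.spark.sql.SparkSession

// The driver is the process executing this code; it only talks to the cluster
// manager at the given master URL and does not have to run on that host.
val spark = SparkSession.builder()
  .appName("driver-vs-master-example")     // hypothetical application name
  .master("spark://master-host:7077")      // hypothetical standalone Master URL
  .getOrCreate()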
In order to keep everything on the grid, use Hadoop Streaming with a single reducer and cat as the mapper and reducer (basically a noop) – add compression using MR flags.
hadoop jar \
  $HADOOP_PREFIX/share/hadoop/tools/lib/hadoop-streaming.jar \
  -Dmapred.reduce.tasks=1 \
  -Dmapred.job.queue.name=$QUEUE \
  -input "$INPUT" \
  -output "$OUTPUT" \
  -mapper cat \
  -reducer cat
If you want compression … Read more
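The compression part is cut off above; one common combination (my assumption, not necessarily the flags the post goes on to show) uses the same old-style property names as the command, placed with the other -D flags since generic options must precede -input/-output:
  -Dmapred.output.compress=true \
  -Dmapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec \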
The problem lies not with yarn-site.xml or spark-defaults.conf but with the resource calculator that assigns cores to the executors or, in the case of MapReduce jobs, to the Mappers/Reducers. The default resource calculator, i.e. org.apache.hadoop.yarn.util.resource.DefaultResourceCalculator, uses only memory information for allocating containers, and CPU scheduling is not enabled by default. To use both … Read more
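The usual way to switch calculators when running the Capacity Scheduler is a single property in capacity-scheduler.xml; a sketch of that setting:
<property>
  <name>yarn.scheduler.capacity.resource-calculator</name>
  <value>org.apache.hadoop.yarn.util.resource.DominantResourceCalculator</value>
</property>
Restart the ResourceManager afterwards so the change takes effect.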
Similar to how fsck stands for filesystem consistency check, msck is Hive’s metastore consistency check.
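In Hive it is run as (table name hypothetical):
MSCK REPAIR TABLE my_partitioned_table;   -- adds partitions that exist on HDFS but are missing from the metastore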
That error you are getting in the DN log is described here: http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-multi-node-cluster/#java-io-ioexception-incompatible-namespaceids From that page: At the moment, there seem to be two workarounds, as described below. Workaround 1: Start from scratch. I can testify that the following steps solve this error, but the side effects won’t make you happy (me neither). The crude … Read more
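For reference, a sketch of that first workaround on a Hadoop 1.x-style layout (paths are examples; this erases HDFS data, so only do it on a cluster you can afford to wipe):
bin/stop-all.sh
rm -rf /app/hadoop/tmp/dfs/data/*     # the DataNode storage directory (dfs.data.dir); adjust to your config
bin/hadoop namenode -format           # assigns a fresh namespaceID and wipes HDFS metadata
bin/start-all.sh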
hdfs dfs -pwd does not exist because there is no “working directory” concept in HDFS when you run commands from the command line. You also cannot execute hdfs dfs -cd and then run commands from there, since neither an interactive HDFS shell nor an hdfs dfs -cd command exists, thus making the idea of … Read more
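Instead of a working directory you either give absolute paths or rely on paths being resolved against your HDFS home directory; for example (user name hypothetical):
hdfs dfs -ls /user/alice/data    # absolute path
hdfs dfs -ls data                # relative paths resolve against /user/<current user>, not a shell cwd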
To remove all cached data: sqlContext.clearCache() Source: https://spark.apache.org/docs/2.0.1/api/java/org/apache/spark/sql/SQLContext.html If you want to remove a specific DataFrame from the cache: df.unpersist()
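A small sketch of both calls, assuming a Spark 2.x SparkSession named spark and a hypothetical Parquet path:
val df = spark.read.parquet("/data/events")   // hypothetical input
df.cache()
df.count()                        // an action, so the cache is actually materialized
df.unpersist()                    // drops only this DataFrame's cached blocks
spark.sqlContext.clearCache()     // drops every cached table/DataFrame in the session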
Yes, Apache Spark will unpersist the RDD when the RDD object is garbage collected. In RDD.persist you can see: sc.cleaner.foreach(_.registerRDDForCleanup(this)) This puts a WeakReference to the RDD in a ReferenceQueue leading to ContextCleaner.doCleanupRDD when the RDD is garbage collected. And there: sc.unpersistRDD(rddId, blocking) For more context see ContextCleaner in general and the commit that added … Read more
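One way to see this from user code (a sketch only; the timing is non-deterministic because it depends on when the JVM actually collects the object):
import org.apache.spark.storage.StorageLevel

var rdd = sc.parallelize(1 to 1000000).persist(StorageLevel.MEMORY_ONLY)
rdd.count()        // materializes the cached blocks
rdd = null         // drop the last strong reference to the RDD object
System.gc()        // only a hint; once the object is collected, ContextCleaner
                   // asynchronously unpersists its blocks via sc.unpersistRDD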