hadoop – Tarik Billa

Accessing stream output from hdfs of MRjob

April 11, 2024 by Tarik

As long as you have a path that contains hdfs:/ you will not succeed as that is never going to be valid. In the comments you mentioned that you tried to add hdfs:// manually, which may be a nice hack, but in your code I don’t see you ‘clean up’ the wrong hdfs:/. So even … Read more
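The prefix clean-up described in this excerpt can be sketched as a small helper. This is only an illustration: the function name `normalize_hdfs_uri` is hypothetical, and it assumes the intended form is `hdfs:///...` (an empty authority, meaning the cluster's default namenode).

```python
def normalize_hdfs_uri(path: str) -> str:
    """Rewrite a single-slash 'hdfs:/...' prefix to 'hdfs:///...'.

    Hypothetical helper: assumes an empty authority (default namenode)
    is what the caller wants. Other paths are returned unchanged.
    """
    bad_prefix = "hdfs:/"
    good_prefix = "hdfs://"
    if path.startswith(good_prefix):
        # Already has a double slash (with or without a host); leave it alone.
        return path
    if path.startswith(bad_prefix):
        # Replace the stray single-slash prefix instead of prepending to it.
        return "hdfs:///" + path[len(bad_prefix):]
    return path


if __name__ == "__main__":
    print(normalize_hdfs_uri("hdfs:/user/me/out/part-00000"))
```

The point is to replace the bad prefix rather than add a second one in front of it, so the wrong `hdfs:/` never survives in the final path.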

What is Hive? Is it a database? [closed]

April 4, 2024 by Tarik

Hive is a data warehousing package/infrastructure built on top of Hadoop. It provides an SQL dialect called Hive Query Language (HQL) for querying data stored in a Hadoop cluster. Like all SQL dialects in widespread use, HQL doesn’t fully conform to any particular revision of the ANSI SQL standard. It is perhaps closest to MySQL’s … Read more

What is the advantage of storing schema in avro?

January 9, 2024 by Tarik

Evolving schemas: Suppose you initially designed a schema like this for your Employee class: { {"name": "emp_name", "type": "string"}, {"name": "dob", "type": "string"}, {"name": "age", "type": "int"} } Later you realized that age is redundant and removed it from the schema: { {"name": "emp_name", "type": "string"}, {"name": "dob", "type": "string"} } What about the records that were serialized and stored before this schema … Read more
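Because Avro stores the writer's schema with the data, a reader using a newer schema can still resolve old records. The sketch below imitates that resolution rule in plain Python. It is not the Avro library's API: `resolve_record` and the sample values are hypothetical, and only the two basic rules are modelled (writer-only fields are ignored, reader-only fields need a default).

```python
def resolve_record(record, writer_fields, reader_fields):
    """Project a record written with writer_fields onto reader_fields."""
    writer_names = {field["name"] for field in writer_fields}
    resolved = {}
    for field in reader_fields:
        name = field["name"]
        if name in writer_names:
            # Field exists in both schemas: copy the stored value.
            resolved[name] = record[name]
        elif "default" in field:
            # Field is new to the reader: fall back to its default.
            resolved[name] = field["default"]
        else:
            raise ValueError(f"no value or default for field {name!r}")
    # Fields known only to the writer (e.g. 'age') are simply dropped.
    return resolved


OLD_FIELDS = [
    {"name": "emp_name", "type": "string"},
    {"name": "dob", "type": "string"},
    {"name": "age", "type": "int"},
]
NEW_FIELDS = [
    {"name": "emp_name", "type": "string"},
    {"name": "dob", "type": "string"},
]

if __name__ == "__main__":
    old_record = {"emp_name": "Ann", "dob": "1990-01-01", "age": 34}
    print(resolve_record(old_record, OLD_FIELDS, NEW_FIELDS))
```

Reading in the other direction (a new record with the old reader schema) would fail in this sketch unless `age` carried a default, which is why defaults matter when fields are added.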

Amazon EMR – What is the need for Task nodes when we have Core nodes?

January 8, 2024 by Tarik

Parquet without Hadoop?

January 7, 2024 by Tarik

Investigating the same question I found that apparently it’s not possible for the moment. I found this git issue, which proposes decoupling parquet from the hadoop api. Apparently it has not been done yet. In the Apache Jira I found an issue, which asks for a way to read a parquet file outside hadoop. It … Read more

Difference between `hadoop dfs` and `hadoop fs` [closed]

January 5, 2024 by Tarik

You can see the definitions of the two commands (hadoop fs and hadoop dfs) in $HADOOP_HOME/bin/hadoop: … elif [ "$COMMAND" = "datanode" ] ; then CLASS='org.apache.hadoop.hdfs.server.datanode.DataNode' HADOOP_OPTS="$HADOOP_OPTS $HADOOP_DATANODE_OPTS" elif [ "$COMMAND" = "fs" ] ; then CLASS=org.apache.hadoop.fs.FsShell HADOOP_OPTS="$HADOOP_OPTS $HADOOP_CLIENT_OPTS" elif [ "$COMMAND" = "dfs" ] ; then CLASS=org.apache.hadoop.fs.FsShell HADOOP_OPTS="$HADOOP_OPTS $HADOOP_CLIENT_OPTS" elif [ "$COMMAND" = "dfsadmin" ] … Read more

How to convert .txt file to Hadoop’s sequence file format

January 3, 2024 by Tarik

The simplest answer is an "identity" job that has a SequenceFile output. It looks like this in Java: public static void main(String[] args) throws IOException, InterruptedException, ClassNotFoundException { Configuration conf = new Configuration(); Job job = new Job(conf); job.setJobName("Convert Text"); job.setJarByClass(Mapper.class); job.setMapperClass(Mapper.class); job.setReducerClass(Reducer.class); // increase if you need sorting or a special … Read more

How to delete files from the HDFS?

January 3, 2024 by Tarik

You can use hdfs dfs -rm -R /path/to/HDFS/file since hadoop dfs has been deprecated.
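If the deletion needs to happen from a script, the same command can be wrapped in a few lines. This is a sketch only: the function names are hypothetical, and it assumes the `hdfs` binary is on the PATH of the machine running the script.

```python
import subprocess


def build_rm_command(path, skip_trash=False):
    """Return the argument list for a recursive HDFS delete."""
    command = ["hdfs", "dfs", "-rm", "-R"]
    if skip_trash:
        # -skipTrash deletes immediately instead of moving to .Trash.
        command.append("-skipTrash")
    command.append(path)
    return command


def delete_hdfs_path(path, skip_trash=False):
    """Run the delete; raises CalledProcessError if the command fails."""
    subprocess.run(build_rm_command(path, skip_trash), check=True)


if __name__ == "__main__":
    # Print the command rather than running it, so the demo is safe anywhere.
    print(" ".join(build_rm_command("/path/to/HDFS/file")))
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.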

What is the purpose of “uber mode” in hadoop?

January 2, 2024 by Tarik

What is uber mode in Hadoop 2? Normally, the ResourceManager (RM) allocates a separate container for each mapper and reducer. The uber configuration allows mappers and reducers to run in the same process as the ApplicationMaster (AM). Uber jobs: Uber jobs are jobs that are executed within the MapReduce ApplicationMaster. … Read more
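Whether a job runs in uber mode is governed by the `mapreduce.job.ubertask.enable` switch together with size limits (`mapreduce.job.ubertask.maxmaps`, `mapreduce.job.ubertask.maxreduces`, `mapreduce.job.ubertask.maxbytes`). The sketch below shows the general shape of such a "small enough" check. It is a simplification, not Hadoop's actual decision code, which also looks at memory and other settings; the function name and the sample limits are hypothetical.

```python
def is_uber_candidate(num_maps, num_reduces, input_bytes,
                      enabled, max_maps, max_reduces, max_bytes):
    """Return True if a job is small enough to run inside the AM.

    Simplified illustration: only the task counts and input size
    are compared against the configured limits.
    """
    if not enabled:
        return False
    return (num_maps <= max_maps
            and num_reduces <= max_reduces
            and input_bytes <= max_bytes)


if __name__ == "__main__":
    # Illustrative limits only; check your cluster's mapred-site.xml.
    limits = dict(enabled=True, max_maps=9, max_reduces=1,
                  max_bytes=128 * 1024 * 1024)
    print(is_uber_candidate(3, 1, 10 * 1024 * 1024, **limits))
    print(is_uber_candidate(50, 4, 10 * 1024 * 1024, **limits))
```

The idea is that for very small jobs the cost of requesting and launching separate containers can outweigh the work itself, so running everything in the AM's own process is cheaper.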

Hadoop speculative task execution

January 1, 2024 by Tarik

One problem with the Hadoop system is that by dividing the tasks across many nodes, it is possible for a few slow nodes to rate-limit the rest of the program. Tasks may be slow for various reasons, including hardware degradation or software misconfiguration, but the causes may be hard to detect since the tasks still … Read more
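Speculative execution responds to this by launching a duplicate attempt of a task that looks slow and keeping whichever attempt finishes first. How "slow" is detected can be sketched as below. This is a toy heuristic, not Hadoop's actual estimator: the function name, the `fraction` parameter, and the task ids are all hypothetical.

```python
from statistics import mean


def find_stragglers(progress_rates, fraction=0.5):
    """Return ids of tasks progressing slower than fraction * mean rate.

    progress_rates maps a task id to its progress per second.
    """
    if not progress_rates:
        return []
    cutoff = fraction * mean(progress_rates.values())
    # Sort so the result is deterministic regardless of dict order.
    return sorted(task for task, rate in progress_rates.items()
                  if rate < cutoff)


if __name__ == "__main__":
    rates = {"task_1": 1.0, "task_2": 1.0, "task_3": 0.2}
    print(find_stragglers(rates))
```

Comparing each task to its peers, rather than to a fixed threshold, matters because a slow task still makes progress and so never looks "failed" on its own.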