hdfs – Tarik Billa

Accessing stream output from hdfs of MRjob

April 11, 2024 by Tarik

As long as you have a path that contains hdfs:/ you will not succeed as that is never going to be valid. In the comments you mentioned that you tried to add hdfs:// manually, which may be a nice hack, but in your code I don’t see you ‘clean up’ the wrong hdfs:/. So even … Read more
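
A minimal sketch of the 'clean up' step mentioned above, in Python since MRjob is a Python library. The function name `normalize_hdfs_uri` is hypothetical, and the sketch assumes that whatever follows `hdfs:/` is an absolute path on the default NameNode, so it rewrites the prefix to the empty-authority form `hdfs:///`. If your paths need an explicit host, that part would have to change.

```python
import re

# "hdfs:" followed by exactly one slash, i.e. not already "hdfs://".
_MALFORMED_PREFIX = re.compile(r"^hdfs:/(?!/)")


def normalize_hdfs_uri(path: str) -> str:
    """Rewrite a leading 'hdfs:/' as 'hdfs:///'; leave other paths untouched.

    Assumption: the text after 'hdfs:/' is an absolute path on the default
    NameNode, so no host is inserted.
    """
    return _MALFORMED_PREFIX.sub("hdfs:///", path)
```

Paths that already start with `hdfs://` and plain local paths pass through unchanged, so the function can be applied to every path before it is used.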

Parquet without Hadoop?

January 7, 2024 by Tarik

Investigating the same question I found that apparently it’s not possible for the moment. I found this git issue, which proposes decoupling parquet from the hadoop api. Apparently it has not been done yet. In the Apache Jira I found an issue, which asks for a way to read a parquet file outside hadoop. It … Read more

How to delete files from the HDFS?

January 3, 2024 by Tarik

You can use hdfs dfs -rm -R /path/to/HDFS/file since hadoop dfs has been deprecated.
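
If the delete has to be issued from a script, one possible approach is to build the same command as an argument list. This is only a sketch: the helper name `build_hdfs_rm_command` is hypothetical, and it assumes the `hdfs` binary is on the PATH when the command is eventually run.

```python
def build_hdfs_rm_command(path, recursive=True, skip_trash=False):
    """Return the argument list for 'hdfs dfs -rm' on the given path."""
    cmd = ["hdfs", "dfs", "-rm"]
    if recursive:
        cmd.append("-R")  # delete directories and their contents
    if skip_trash:
        cmd.append("-skipTrash")  # bypass the trash directory, if enabled
    cmd.append(path)
    return cmd


# To actually run it (requires a working Hadoop client):
#   import subprocess
#   subprocess.run(build_hdfs_rm_command("/path/to/HDFS/file"), check=True)
```

Passing a list rather than a single string avoids shell quoting problems when the path contains spaces.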

No data nodes are started

December 20, 2023 by Tarik

That error you are getting in the DN log is described here: http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-multi-node-cluster/#java-io-ioexception-incompatible-namespaceids From that page: At the moment, there seem to be two workarounds as described below. Workaround 1: Start from scratch I can testify that the following steps solve this error, but the side effects won’t make you happy (me neither). The crude … Read more

Is there an equivalent to `pwd` in hdfs?

December 20, 2023 by Tarik

hdfs dfs -pwd does not exist because there is no “working directory” concept in HDFS when you run commands from the command line. You cannot execute hdfs dfs -cd in an HDFS shell and then run commands from there, since neither an HDFS shell nor an hdfs dfs -cd command exists, thus making the idea of … Read more

Spark-submit not working when application jar is in hdfs

December 16, 2023 by Tarik

The only way it worked for me, when I was using --master yarn-cluster

Find port number where HDFS is listening

December 11, 2023 by Tarik

The command below is available from Apache Hadoop 2.7.0 onwards and can be used to get the values of Hadoop configuration properties. fs.default.name is deprecated in Hadoop 2.0; fs.defaultFS is the updated property. I am not sure whether this will work in the case of maprfs. hdfs getconf -confKey fs.defaultFS # (new property) or hdfs getconf -confKey fs.default.name … Read more
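
The value printed by `hdfs getconf -confKey fs.defaultFS` is a URI, so the port can be pulled out of it with the standard library. The sketch below assumes the output looks like `hdfs://host:port`; the function name `namenode_host_port` and the example hostnames are hypothetical. When the URI carries no explicit port (for instance an HA nameservice name), the function returns `None` for the port rather than guessing a default.

```python
from urllib.parse import urlsplit


def namenode_host_port(default_fs: str):
    """Split an fs.defaultFS value into (host, port); port may be None."""
    parts = urlsplit(default_fs.strip())  # strip the trailing newline from CLI output
    return parts.hostname, parts.port


# Example with a hypothetical value:
#   namenode_host_port("hdfs://namenode.example.com:8020\n")
```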

Deploying Spark and HDFS on Docker Swarm doesn’t enable data locality

December 7, 2023 by Tarik

Isn’t it linked to the use of this:
<property>
  <name>dfs.client.use.datanode.hostname</name>
  <value>true</value>
</property>
Using the hostname means being bound to the container and not the service itself, if I’m correct.

How to specify username when putting files on HDFS from a remote machine?

November 30, 2023 by Tarik

If you set the HADOOP_USER_NAME environment variable, you can tell HDFS which user name to operate with. Note that this only works if your cluster isn’t using security features (e.g. Kerberos). For example: HADOOP_USER_NAME=hdfs hdfs dfs -put …
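
When the upload is driven from a script rather than typed at a prompt, the same variable can be set only for the child process. This is a sketch under assumptions: `env_with_hadoop_user` is a hypothetical helper, the local and remote paths in the usage comment are placeholders, and, as noted above, it has no effect on a Kerberos-secured cluster.

```python
import os


def env_with_hadoop_user(user, base_env=None):
    """Return a copy of the environment with HADOOP_USER_NAME set to `user`."""
    env = dict(os.environ if base_env is None else base_env)
    env["HADOOP_USER_NAME"] = user
    return env


# Usage (requires a working Hadoop client; both paths are placeholders):
#   import subprocess
#   subprocess.run(
#       ["hdfs", "dfs", "-put", "local.txt", "/tmp/remote.txt"],
#       env=env_with_hadoop_user("hdfs"),
#       check=True,
#   )
```

Copying the environment instead of mutating `os.environ` keeps the user name from leaking into unrelated commands run later by the same script.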

What are the pros and cons of the Apache Parquet format compared to other formats?

November 8, 2023 by Tarik

I think the main difference I can describe relates to record-oriented vs. column-oriented formats. Record-oriented formats are what we’re all used to: text files and delimited formats like CSV and TSV. Avro is slightly cooler than those because it can change schema over time, e.g. adding or removing columns from a record. Other … Read more
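
To make the record-oriented vs. column-oriented distinction concrete, here is a small in-memory sketch. It is not how Parquet or Avro store data on disk; the sample records and the helper `to_columns` are invented for illustration, and the helper assumes every record has the same keys.

```python
# Record-oriented: each record keeps all of its fields together.
records = [
    {"id": 1, "name": "a", "score": 0.5},
    {"id": 2, "name": "b", "score": 0.7},
    {"id": 3, "name": "c", "score": 0.9},
]


def to_columns(rows):
    """Pivot a list of records into one list of values per field."""
    columns = {}
    for row in rows:
        for key, value in row.items():
            columns.setdefault(key, []).append(value)
    return columns


# Column-oriented: all values of one field sit next to each other.
columns = to_columns(records)
```

Reading a single field such as `columns["score"]` touches only that list, whereas the record layout requires visiting every record to collect the same values.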