spark-submit not working when the application jar is in HDFS
The only way it worked for me was with --master yarn-cluster.
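For illustration, a minimal sketch of that invocation, assuming the assembled application jar has already been uploaded to HDFS (the path and main class here are hypothetical):

    spark-submit \
      --master yarn-cluster \
      --class com.example.Main \
      hdfs:///user/me/app.jar

In yarn-cluster mode the driver runs inside the cluster, which is why a jar fetched from HDFS can work there; in client mode, spark-submit generally expects a locally visible jar.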
Yeah, we can do that. You just need to run the three commands below in sequence. Let's say you have an external table test_1 in Hive, and you want to rename it to test_2, pointing at the test_2 location rather than test_1. First you need to convert the table into a managed table using the command below, as sketched after this entry. test_1 -> pointing … Read more
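A sketch of that three-command sequence, assuming test_1 is the external table described above:

    -- 1. Convert the external table to a managed table:
    ALTER TABLE test_1 SET TBLPROPERTIES('EXTERNAL'='FALSE');
    -- 2. Rename it; for a managed table, Hive moves the data to the new location:
    ALTER TABLE test_1 RENAME TO test_2;
    -- 3. Convert it back to an external table:
    ALTER TABLE test_2 SET TBLPROPERTIES('EXTERNAL'='TRUE');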
You cannot add a column with a default value in Hive. You have the right syntax for adding the column, ALTER TABLE test1 ADD COLUMNS (access_count1 int); you just need to get rid of default sum(max_count). No changes to the files backing your table will happen as a result of adding the column. Hive handles … Read more
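For example (the test1 schema here is assumed), the added column simply appears in the table's metadata, and rows in pre-existing data files read back as NULL for it:

    ALTER TABLE test1 ADD COLUMNS (access_count1 int);
    -- Existing rows carry no value for the new column, so Hive returns NULL:
    SELECT access_count1 FROM test1 LIMIT 5;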
Since Spark uses the Hadoop File System API to write data to files, this is sort of inevitable. If you do rdd.saveAsTextFile("foo"), it will be saved as "foo/part-XXXXX", with one part-* file per partition in the RDD you are trying to save. The reason each partition in the RDD is written to a separate file is for … Read more
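A minimal PySpark sketch of the behaviour, assuming an existing SparkContext sc; coalesce(1) is one way to force a single part file, at the cost of funnelling the write through one task:

    rdd = sc.parallelize(range(100), 4)    # an RDD with 4 partitions
    rdd.saveAsTextFile("foo")              # writes foo/part-00000 .. foo/part-00003
    rdd.coalesce(1).saveAsTextFile("bar")  # writes a single bar/part-00000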
The command below is available in Apache Hadoop 2.7.0 onwards; it can be used to get the values of Hadoop configuration properties. fs.default.name is deprecated as of Hadoop 2.0; fs.defaultFS is the updated property. Not sure whether this will work in the case of MapR-FS. hdfs getconf -confKey fs.defaultFS # ( new property ) or hdfs getconf -confKey fs.default.name … Read more
Yes, you can use LIMIT here. Try the query below (note ORDER BY rather than SORT BY: SORT BY only orders within each reducer, while LIMIT needs a total ordering): SELECT * FROM employee_list ORDER BY salary DESC LIMIT 2
Isn't it linked to the use of this:

    <property>
      <name>dfs.client.use.datanode.hostname</name>
      <value>true</value>
    </property>

Using the hostname means being bound to the container and not the service itself, if I'm correct.
The Hadoop NameNode is the centralized piece of an HDFS file system: it keeps the directory tree of all files in the file system and tracks where across the cluster each file's data is kept. In short, it holds the metadata about the DataNodes. When we format the NameNode, it is this metadata that gets formatted. By … Read more
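Concretely, the format command initializes a fresh, empty namespace in the NameNode's metadata directory; the block data left on the DataNodes is not touched, it simply becomes unreferenced:

    hdfs namenode -format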
You should look at GraphFrames (https://github.com/graphframes/graphframes), which wraps GraphX algorithms under the DataFrames API and provides a Python interface. Here is a quick example from https://graphframes.github.io/graphframes/docs/_site/quick-start.html, with a slight modification so that it works. First start pyspark with the graphframes package loaded: pyspark --packages graphframes:graphframes:0.1.0-spark1.6 Python code: from graphframes import * # Create a Vertex DataFrame … Read more
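Since the excerpt cuts off mid-example, here is a sketch in the spirit of that quick-start page (the toy vertex and edge data is illustrative), run inside the pyspark shell where sqlContext is predefined:

    from graphframes import *
    # Vertex DataFrame: one row per node, with an "id" column
    v = sqlContext.createDataFrame(
        [("a", "Alice", 34), ("b", "Bob", 36), ("c", "Charlie", 30)],
        ["id", "name", "age"])
    # Edge DataFrame: "src" and "dst" columns reference vertex ids
    e = sqlContext.createDataFrame(
        [("a", "b", "friend"), ("b", "c", "follow"), ("c", "b", "follow")],
        ["src", "dst", "relationship"])
    g = GraphFrame(v, e)
    g.inDegrees.show()  # run a GraphX-backed query through the DataFrame API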
Check out merge-sort. It turns out that merging already-sorted lists is much more efficient, in terms of operations and memory consumption, than sorting the complete list from scratch. If the reducer gets 4 sorted lists, it only needs to look at the smallest remaining element of each of the 4 lists and pick that one. If the number of … Read more
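A small Python sketch of that k-way merge, using heapq.merge to repeatedly pick the smallest head element among the sorted runs (the sample lists are made up):

    import heapq

    # Four already-sorted runs, as a reducer might receive from four mappers:
    sorted_runs = [[1, 4, 9], [2, 3, 11], [5, 6, 7], [0, 8, 10]]

    # heapq.merge keeps one cursor per run and always yields the smallest head:
    merged = list(heapq.merge(*sorted_runs))
    print(merged)  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]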