What is the difference between partitioning and bucketing a table in Hive?

Partitioning data is often used for distributing load horizontally; it has a performance benefit and helps organize the data in a logical fashion. Example: suppose we are dealing with a large employee table and often run queries with WHERE clauses that restrict the results to a particular country or department. For a faster query response … Read more
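The answer is about Hive DDL (PARTITIONED BY for partitioning, CLUSTERED BY … INTO n BUCKETS for bucketing). As a runnable illustration of the same two ideas, here is a minimal Scala Spark sketch; the table name, column names, and sample rows are assumptions, not part of the original answer:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("partition-vs-bucket-sketch")
  .master("local[*]")          // assumption: local run for illustration
  .getOrCreate()
import spark.implicits._

// Hypothetical stand-in for the large employee table discussed above.
val employees = Seq(
  (1, "Alice", "US", "Sales"),
  (2, "Bob",   "IN", "Engineering")
).toDF("emp_id", "name", "country", "department")

employees.write
  // Partitioning: one directory per (country, department) value, so WHERE
  // clauses on these columns prune whole directories instead of scanning all data.
  .partitionBy("country", "department")
  // Bucketing: rows are hashed on emp_id into a fixed number of files,
  // which helps with sampling and bucketed joins. The Hive DDL equivalent
  // is CLUSTERED BY (emp_id) INTO 8 BUCKETS.
  .bucketBy(8, "emp_id")
  .format("parquet")
  .saveAsTable("employee_part_bucketed")  // bucketBy requires saveAsTable

spark.stop()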

Spark – load CSV file as DataFrame?

spark-csv is part of core Spark functionality and doesn't require a separate library. So you could just do, for example:

df = spark.read.format("csv").option("header", "true").load("csvfile.csv")

In Scala (this works for any format; for the delimiter option, use "," for CSV, "\t" for TSV, etc.):

val df = sqlContext.read.format("com.databricks.spark.csv")
  .option("delimiter", ",")
  .load("csvfile.csv")
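For Spark 2.x and later the built-in CSV reader can be used directly, without the com.databricks.spark.csv package. A minimal Scala sketch, assuming a local file named csvfile.csv:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("csv-read-sketch")
  .master("local[*]")             // assumption: local run for illustration
  .getOrCreate()

val df = spark.read
  .option("header", "true")       // first line holds column names
  .option("inferSchema", "true")  // sample the file to guess column types
  .option("delimiter", ",")       // "," for CSV, "\t" for TSV, etc.
  .csv("csvfile.csv")

df.printSchema()
df.show(5)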

How to turn off INFO logging in Spark?

Just execute this command in the Spark directory:

cp conf/log4j.properties.template conf/log4j.properties

Edit log4j.properties:

# Set everything to be logged to the console
log4j.rootCategory=INFO, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n

# Settings to quiet third party logs that are too verbose
log4j.logger.org.eclipse.jetty=WARN
log4j.logger.org.eclipse.jetty.util.component.AbstractLifeCycle=ERROR
log4j.logger.org.apache.spark.repl.SparkIMain$exprTyper=INFO
log4j.logger.org.apache.spark.repl.SparkILoop$SparkILoopInterpreter=INFO

Replace at the first line:

log4j.rootCategory=INFO, console

by: … Read more
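As an alternative to editing log4j.properties, the log level can also be raised programmatically for a single application via SparkContext.setLogLevel. A minimal Scala sketch (app name and master are illustrative assumptions):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("quiet-logs-sketch")
  .master("local[*]")   // assumption: local run for illustration
  .getOrCreate()

// Suppress INFO messages for this application only; valid levels include
// "WARN", "ERROR", and "OFF".
spark.sparkContext.setLogLevel("WARN")

// ... job code ...

spark.stop()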

What are the pros and cons of parquet format compared to other formats?

I think the main difference I can describe relates to record-oriented vs. column-oriented formats. Record-oriented formats are what we're all used to: text files and delimited formats like CSV and TSV. AVRO is slightly cooler than those because it can change schema over time, e.g. adding or removing columns from a record. Other … Read more
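A small Scala sketch contrasting a row-oriented format (CSV) with a column-oriented one (Parquet); the dataset, column names, and /tmp output paths are illustrative assumptions:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("parquet-vs-csv-sketch")
  .master("local[*]")   // assumption: local run for illustration
  .getOrCreate()
import spark.implicits._

val events = Seq(
  ("2024-01-01", "click", 3),
  ("2024-01-02", "view",  7)
).toDF("day", "event_type", "count")

// Row-oriented: every query has to read whole records.
events.write.mode("overwrite").option("header", "true").csv("/tmp/events_csv")

// Column-oriented: values of each column are stored together and compressed per column.
events.write.mode("overwrite").parquet("/tmp/events_parquet")

// Selecting a single column from Parquet only reads that column's chunks plus
// metadata, which is where the scan/IO savings described above come from.
spark.read.parquet("/tmp/events_parquet").select("count").show()

spark.stop()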

Apache Spark: The number of cores vs. the number of executors

To hopefully make all of this a little more concrete, here’s a worked example of configuring a Spark app to use as much of the cluster as possible: Imagine a cluster with six nodes running NodeManagers, each equipped with 16 cores and 64GB of memory. The NodeManager capacities, yarn.nodemanager.resource.memory-mb and yarn.nodemanager.resource.cpu-vcores, should probably be set … Read more
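The worked numbers are in the full answer; as a sketch of where executor sizing lands in the configuration, here is the Scala form of those settings (the same values can be passed to spark-submit as --num-executors, --executor-cores, and --executor-memory). The values below are placeholders, not the answer's recommended figures:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("executor-sizing-sketch")
  .config("spark.executor.instances", "10")  // placeholder executor count
  .config("spark.executor.cores", "4")       // placeholder cores per executor
  .config("spark.executor.memory", "16g")    // placeholder heap per executor
  .getOrCreate()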

Hadoop “Unable to load native-hadoop library for your platform” warning

I assume you're running Hadoop on 64-bit CentOS. The reason you saw that warning is that the native Hadoop library $HADOOP_HOME/lib/native/libhadoop.so.1.0.0 was actually compiled for 32-bit. Anyway, it's just a warning and won't impact Hadoop's functionality. If you do want to eliminate this warning, download the source code of Hadoop and … Read more

Error!: SQLSTATE[HY000] [1045] Access denied for user 'divattrend_liink'@'localhost' (using password: YES)