hive – Tarik Billa

How to update partition metadata in Hive when partition data is manually deleted from HDFS

April 9, 2024 by Tarik

EDIT: Starting with Hive 3.0.0, MSCK can discover new partitions, remove missing partitions, or both, using the following syntax: MSCK [REPAIR] TABLE table_name [ADD/DROP/SYNC PARTITIONS]. This was implemented in HIVE-17824. As correctly stated by HakkiBuyukcengiz, MSCK REPAIR doesn’t remove partitions if the corresponding folder on HDFS was manually deleted; it only … Read more
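The difference between the three options can be modelled as set operations between the partitions registered in the metastore and the partition directories present on HDFS. The following is a minimal sketch of that idea in plain Python, not Hive's implementation; the function name msck_plan and the sample partition names are hypothetical.

```python
def msck_plan(metastore_partitions, hdfs_partitions, mode="ADD"):
    """Model which partitions an MSCK run would add to or drop from the metastore.

    mode mirrors the ADD / DROP / SYNC PARTITIONS options; ADD is the default,
    which is why a plain MSCK REPAIR never removes stale partitions.
    """
    if mode not in ("ADD", "DROP", "SYNC"):
        raise ValueError(f"unknown mode: {mode}")
    metastore = set(metastore_partitions)
    on_disk = set(hdfs_partitions)
    # ADD: directories on HDFS that the metastore does not know about yet
    to_add = on_disk - metastore if mode in ("ADD", "SYNC") else set()
    # DROP: metastore entries whose directory no longer exists on HDFS
    to_drop = metastore - on_disk if mode in ("DROP", "SYNC") else set()
    return sorted(to_add), sorted(to_drop)


# Hypothetical state: dt=2024-01-01 was deleted by hand, dt=2024-01-03 is new.
metastore = ["dt=2024-01-01", "dt=2024-01-02"]
hdfs = ["dt=2024-01-02", "dt=2024-01-03"]

for mode in ("ADD", "DROP", "SYNC"):
    print(mode, msck_plan(metastore, hdfs, mode))
```

In this model only DROP and SYNC ever touch the stale dt=2024-01-01 entry, which matches the behaviour described above for manually deleted folders.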

What is Hive? Is it a database?

April 4, 2024 by Tarik

Hive is a data warehousing package/infrastructure built on top of Hadoop. It provides an SQL dialect called Hive Query Language (HQL) for querying data stored in a Hadoop cluster. Like all SQL dialects in widespread use, HQL doesn’t fully conform to any particular revision of the ANSI SQL standard. It is perhaps closest to MySQL’s … Read more

Compress file on S3

April 3, 2024 by Tarik

Spark final task takes 100x longer than first 199, how to improve

January 9, 2024 by Tarik

Spark >= 3.0: Since 3.0, Spark provides built-in optimizations for handling skewed joins, which can be enabled using the spark.sql.adaptive.optimizeSkewedJoin.enabled property (named spark.sql.adaptive.skewJoin.enabled in the released versions). See SPARK-29544 for details. Spark < 3.0: You clearly have a problem with a huge right data skew. Let’s take a look at the statistics you’ve provided: df1 = [mean=4.989209978967438, stddev=2255.654165352454, count=2400088] df2 = … Read more
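To see why a single hot key can make one task out of 200 run far longer than the rest, here is a small illustrative sketch in plain Python. It uses synthetic data rather than the asker's dataset, and key % num_partitions is only a stand-in for Spark's real hash partitioner; the function and variable names are hypothetical.

```python
from collections import Counter
from statistics import median


def partition_sizes(key_counts, num_partitions=200):
    """Return rows per partition when each integer key goes to key % num_partitions."""
    sizes = Counter({p: 0 for p in range(num_partitions)})
    for key, rows in key_counts.items():
        sizes[key % num_partitions] += rows
    return sizes


# Synthetic, assumed data: 2000 keys with 5 rows each, plus one hot key.
key_counts = {key: 5 for key in range(2000)}
key_counts[42] = 100_000

sizes = partition_sizes(key_counts)
largest = max(sizes.values())
typical = median(sizes.values())
print(f"typical partition: {typical} rows, largest partition: {largest} rows")
```

All rows for the hot key land in the same partition, so adding more partitions does not shrink it; the work has to be split by other means, such as the skew-join handling mentioned above.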

How to create SparkSession with Hive support (fails with “Hive classes are not found”)?

January 5, 2024 by Tarik

Add the following dependency to your Maven project, adjusting the Scala suffix and the version to match your Spark build. <dependency> <groupId>org.apache.spark</groupId> <artifactId>spark-hive_2.11</artifactId> <version>2.0.0</version> </dependency>

How to convert .txt file to Hadoop’s sequence file format

January 3, 2024 by Tarik

The simplest answer is just an “identity” job that has a SequenceFile output. It looks like this in Java:

public static void main(String[] args) throws IOException, InterruptedException, ClassNotFoundException {
    Configuration conf = new Configuration();
    Job job = new Job(conf);
    job.setJobName("Convert Text");
    job.setJarByClass(Mapper.class);
    job.setMapperClass(Mapper.class);
    job.setReducerClass(Reducer.class);
    // increase if you need sorting or a special … Read more

What does msck stand for in the MSCK REPAIR command?

December 21, 2023 by Tarik

Similar to how fsck stands for filesystem consistency check, msck is Hive’s metastore consistency check.

Transferring hive table from one database to another

December 19, 2023 by Tarik

Since Hive 0.14, you can use the following statements to move a table from one database to another in the same metastore: use old_database; alter table table_a rename to new_database.table_a; The above statements will also move the table data on HDFS if table_a is a managed table.

How to make shark/spark clear the cache?

December 19, 2023 by Tarik

To remove all cached data: sqlContext.clearCache() Source: https://spark.apache.org/docs/2.0.1/api/java/org/apache/spark/sql/SQLContext.html If you want to remove a specific DataFrame from the cache: df.unpersist()

How to rename a hive table without changing location?

December 14, 2023 by Tarik

Yes, we can do that. You just need to run the three commands below in sequence. Let’s say you have an external table test_1 in Hive, and you want to rename it to test_2, which should point to the test_2 location, not test_1. First you need to convert this table into a managed table using the command below. test_1 -> pointing … Read more