How to update partition metadata in Hive when partition data is manually deleted from HDFS

EDIT: Starting with Hive 3.0.0, MSCK can now discover new partitions or remove missing partitions (or both) using the following syntax: MSCK [REPAIR] TABLE table_name [ADD/DROP/SYNC PARTITIONS]. This was implemented in HIVE-17824. As correctly stated by HakkiBuyukcengiz, MSCK REPAIR doesn't remove partitions if the corresponding folder on HDFS was manually deleted; it only … Read more
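To make the new syntax concrete, a minimal HiveQL sketch (the table name sales is hypothetical; ADD/DROP/SYNC requires Hive 3.0.0 or later):

    -- Partition folders were deleted by hand on HDFS:
    -- drop the now-dangling partition entries from the metastore.
    MSCK REPAIR TABLE sales DROP PARTITIONS;

    -- Or reconcile both ways: add new folders, drop missing ones.
    MSCK REPAIR TABLE sales SYNC PARTITIONS;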

Avoid performance impact of a single partition mode in Spark window functions

In practice, the performance impact will be almost the same as if you omitted the partitionBy clause altogether. All records will be shuffled to a single partition, sorted locally, and iterated sequentially one by one. The difference is only in the number of partitions created in total. Let's illustrate that with an example using a simple dataset … Read more
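A minimal PySpark sketch of the two window definitions (the column names key and value are hypothetical):

    from pyspark.sql import SparkSession, Window
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName('window-demo').getOrCreate()
    df = spark.createDataFrame(
        [('a', 1), ('a', 2), ('b', 3), ('b', 4)], ['key', 'value'])

    # No partitionBy: every row lands in ONE shuffle partition,
    # so a single task does all the sorting and iteration.
    w_single = Window.orderBy('value')

    # With partitionBy: rows for different keys are spread across
    # shuffle partitions and can be processed in parallel.
    w_keyed = Window.partitionBy('key').orderBy('value')

    df.withColumn('rn', F.row_number().over(w_single)).show()
    df.withColumn('rn', F.row_number().over(w_keyed)).show()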

How to understand the dynamic programming solution in linear partitioning?

Be aware that there's a small mistake in the explanation of the algorithm in the book; look in the errata for the text "(*) Page 297". About your questions: No, the items don't need to be sorted, only contiguous (that is, you can't rearrange them). I believe the easiest way to visualize the algorithm is … Read more
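For reference, a minimal Python sketch of the dynamic program, assuming the usual formulation where M[i][j] is the smallest achievable maximum range sum when splitting the first i items into j contiguous ranges:

    def linear_partition(items, k):
        """Split items into k contiguous ranges, minimizing the
        largest range sum; returns that minimized maximum."""
        n = len(items)
        # prefix[i] = sum of the first i items, so the sum of
        # items[x:i] is prefix[i] - prefix[x]
        prefix = [0] * (n + 1)
        for i, value in enumerate(items, 1):
            prefix[i] = prefix[i - 1] + value

        INF = float('inf')
        M = [[INF] * (k + 1) for _ in range(n + 1)]
        M[0][0] = 0
        for i in range(1, n + 1):
            for j in range(1, min(i, k) + 1):
                for x in range(j - 1, i):  # last range is items[x:i]
                    cost = max(M[x][j - 1], prefix[i] - prefix[x])
                    M[i][j] = min(M[i][j], cost)
        return M[n][k]

    # Nine equal items into 3 ranges -> best possible maximum is 3
    assert linear_partition([1] * 9, 3) == 3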

Table with 80 million records and adding an index takes more than 18 hours (or forever)! Now what?

OK, it turns out that this problem was more than just a simple create-a-table, index-it, and-forget problem 🙂 Here's what I did, just in case someone else faces the same problem (I have used an IP address as an example, but it works for other data types too): Problem: Your table has millions … Read more

Determining optimal number of Spark partitions based on workers, cores and DataFrame size

Yes, a Spark application has one and only one Driver. What is the relationship between numWorkerNodes and numExecutors? A worker can host multiple executors; you can think of the worker as the machine/node of your cluster and the executor as a process (executing on a core) that runs on that worker. So … Read more
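As a rough illustration, a minimal Python sketch of the sizing heuristic from the Spark tuning guide (roughly 2-3 tasks per CPU core; all of the concrete numbers below are hypothetical):

    # Hypothetical cluster shape
    num_worker_nodes = 4
    executors_per_worker = 2
    cores_per_executor = 5

    total_cores = num_worker_nodes * executors_per_worker * cores_per_executor
    # Tuning-guide rule of thumb: 2-3 tasks per core; take the upper end.
    num_partitions = total_cores * 3
    print(num_partitions)  # 120

    # Then, for example: df = df.repartition(num_partitions)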

How to partition and write DataFrame in Spark without deleting partitions with no new data?

This is an old topic, but I was having the same problem and found another solution: just set your partition overwrite mode to dynamic by using:

    spark.conf.set('spark.sql.sources.partitionOverwriteMode', 'dynamic')

So my Spark session is configured like this:

    spark = SparkSession.builder.appName('AppName').getOrCreate()
    spark.conf.set('spark.sql.sources.partitionOverwriteMode', 'dynamic')
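With that setting, a partitioned overwrite replaces only the partitions present in the incoming data. A minimal sketch of the write itself (the DataFrame df, partition column date, and output path are hypothetical):

    # Only the date partitions that appear in df are overwritten;
    # all other existing partitions on disk are left untouched.
    (df.write
        .mode('overwrite')
        .partitionBy('date')
        .parquet('/data/events'))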

LINQ Partition List into Lists of 8 members [duplicate]

Use the following extension method to break the input into subsets:

    using System.Collections.Generic;
    using System.Linq;

    public static class IEnumerableExtensions
    {
        public static IEnumerable<List<T>> InSetsOf<T>(this IEnumerable<T> source, int max)
        {
            List<T> toReturn = new List<T>(max);
            foreach (var item in source)
            {
                toReturn.Add(item);
                // A full set of max items: emit it and start a new one.
                if (toReturn.Count == max)
                {
                    yield return toReturn;
                    toReturn = new List<T>(max);
                }
            }
            // Emit the final, possibly shorter, set.
            if (toReturn.Any())
            {
                yield return toReturn;
            }
        }
    }
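For example, a short usage sketch (the input range is hypothetical):

    // Break 1..20 into lists of 8: [1..8], [9..16], [17..20]
    var sets = Enumerable.Range(1, 20).InSetsOf(8);
    foreach (var set in sets)
        Console.WriteLine(string.Join(", ", set));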