How to update partition metadata in Hive when partition data is manually deleted from HDFS

EDIT: Starting with Hive 3.0.0, MSCK can now discover new partitions or remove missing partitions (or both) using the following syntax: MSCK [REPAIR] TABLE table_name [ADD/DROP/SYNC PARTITIONS]. This was implemented in HIVE-17824. As correctly stated by HakkiBuyukcengiz, MSCK REPAIR doesn't remove partitions if the corresponding folder on HDFS was manually deleted; it only … Read more
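To make the new syntax concrete, a minimal HiveQL sketch (the table name sales is hypothetical; ADD/DROP/SYNC requires Hive 3.0.0 or later):

    -- Partition folders were deleted by hand on HDFS:
    -- drop the now-dangling partition entries from the metastore.
    MSCK REPAIR TABLE sales DROP PARTITIONS;

    -- Or reconcile both ways: add new folders, drop missing ones.
    MSCK REPAIR TABLE sales SYNC PARTITIONS;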

Avoid performance impact of a single partition mode in Spark window functions

In practice, the performance impact will be almost the same as if you omitted the partitionBy clause altogether. All records will be shuffled to a single partition, sorted locally, and iterated sequentially one by one. The difference is only in the number of partitions created in total. Let's illustrate that with an example using a simple dataset … Read more
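A minimal PySpark sketch of the two window definitions (the column names key and value are hypothetical):

    from pyspark.sql import SparkSession, Window
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName('window-demo').getOrCreate()
    df = spark.createDataFrame(
        [('a', 1), ('a', 2), ('b', 3), ('b', 4)], ['key', 'value'])

    # No partitionBy: every row lands in ONE shuffle partition,
    # so a single task does all the sorting and iteration.
    w_single = Window.orderBy('value')

    # With partitionBy: rows for different keys are spread across
    # shuffle partitions and can be processed in parallel.
    w_keyed = Window.partitionBy('key').orderBy('value')

    df.withColumn('rn', F.row_number().over(w_single)).show()
    df.withColumn('rn', F.row_number().over(w_keyed)).show()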

How to understand the dynamic programming solution in linear partitioning?

Be aware that there's a small mistake in the explanation of the algorithm in the book; look in the errata for the text "(*) Page 297". About your questions: No, the items don't need to be sorted, only contiguous (that is, you can't rearrange them). I believe the easiest way to visualize the algorithm is … Read more
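For reference, a minimal Python sketch of the dynamic program, assuming the usual formulation where M[i][j] is the smallest achievable maximum range sum when splitting the first i items into j contiguous ranges:

    def linear_partition(items, k):
        """Split items into k contiguous ranges, minimizing the
        largest range sum; returns that minimized maximum."""
        n = len(items)
        # prefix[i] = sum of the first i items, so the sum of
        # items[x:i] is prefix[i] - prefix[x]
        prefix = [0] * (n + 1)
        for i, value in enumerate(items, 1):
            prefix[i] = prefix[i - 1] + value

        INF = float('inf')
        M = [[INF] * (k + 1) for _ in range(n + 1)]
        M[0][0] = 0
        for i in range(1, n + 1):
            for j in range(1, min(i, k) + 1):
                for x in range(j - 1, i):  # last range is items[x:i]
                    cost = max(M[x][j - 1], prefix[i] - prefix[x])
                    M[i][j] = min(M[i][j], cost)
        return M[n][k]

    # Nine equal items into 3 ranges -> best possible maximum is 3
    assert linear_partition([1] * 9, 3) == 3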

Table with 80 million records and adding an index takes more than 18 hours (or forever)! Now what?

OK, it turns out that this problem was more than just a simple create-a-table, index-it, and-forget problem 🙂 Here's what I did, just in case someone else faces the same problem (I have used an IP address as an example, but it works for other data types too): Problem: Your table has millions … Read more

Determining optimal number of Spark partitions based on workers, cores and DataFrame size

Yes, a Spark application has one and only one Driver. What is the relationship between numWorkerNodes and numExecutors? A worker can host multiple executors; you can think of the worker as the machine/node of your cluster and the executor as a process (executing on a core) that runs on that worker. So … Read more
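As a rough illustration, a minimal Python sketch of the sizing heuristic from the Spark tuning guide (roughly 2-3 tasks per CPU core; all of the concrete numbers below are hypothetical):

    # Hypothetical cluster shape
    num_worker_nodes = 4
    executors_per_worker = 2
    cores_per_executor = 5

    total_cores = num_worker_nodes * executors_per_worker * cores_per_executor
    # Tuning-guide rule of thumb: 2-3 tasks per core; take the upper end.
    num_partitions = total_cores * 3
    print(num_partitions)  # 120

    # Then, for example: df = df.repartition(num_partitions)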

How to partition and write DataFrame in Spark without deleting partitions with no new data?

This is an old topic, but I was having the same problem and found another solution: just set your partition overwrite mode to dynamic by using:

    spark.conf.set('spark.sql.sources.partitionOverwriteMode', 'dynamic')

So my Spark session is configured like this:

    spark = SparkSession.builder.appName('AppName').getOrCreate()
    spark.conf.set('spark.sql.sources.partitionOverwriteMode', 'dynamic')
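With that setting, a partitioned overwrite replaces only the partitions present in the incoming data. A minimal sketch of the write itself (the DataFrame df, partition column date, and output path are hypothetical):

    # Only the date partitions that appear in df are overwritten;
    # all other existing partitions on disk are left untouched.
    (df.write
        .mode('overwrite')
        .partitionBy('date')
        .parquet('/data/events'))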

LINQ Partition List into Lists of 8 members [duplicate]

Use the following extension method to break the input into subsets:

    using System.Collections.Generic;
    using System.Linq;

    public static class IEnumerableExtensions
    {
        public static IEnumerable<List<T>> InSetsOf<T>(this IEnumerable<T> source, int max)
        {
            List<T> toReturn = new List<T>(max);
            foreach (var item in source)
            {
                toReturn.Add(item);
                // A full set of max items: emit it and start a new one.
                if (toReturn.Count == max)
                {
                    yield return toReturn;
                    toReturn = new List<T>(max);
                }
            }
            // Emit the final, possibly shorter, set.
            if (toReturn.Any())
            {
                yield return toReturn;
            }
        }
    }
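For example, a short usage sketch (the input range is hypothetical):

    // Break 1..20 into lists of 8: [1..8], [9..16], [17..20]
    var sets = Enumerable.Range(1, 20).InSetsOf(8);
    foreach (var set in sets)
        Console.WriteLine(string.Join(", ", set));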