How Spark reads a large file (petabytes) when the file cannot fit in Spark’s main memory

First of all, Spark only starts reading the data when an action (like count, collect or write) is called. Once an action is called, Spark loads the data in partitions – the number of concurrently loaded partitions depends on the number of cores you have available. So in Spark you can think of 1 … Read more
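A minimal PySpark sketch of that behavior, assuming a hypothetical input path: building the DataFrame reads nothing, and only the count() action triggers Spark to stream through the file partition by partition instead of loading it whole into memory.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lazy-read-sketch").getOrCreate()

# Nothing is read yet: this only records a lazy plan for the DataFrame.
df = spark.read.text("/data/huge_dataset.txt")  # hypothetical path

# Partitions come from the input splits; each core processes one at a time.
print("partitions:", df.rdd.getNumPartitions())

# Only now does Spark start reading, scanning partitions to compute the count.
print("lines:", df.count())

spark.stop()
```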

Oracle ‘Partition By’ and ‘Row_Number’ keywords

PARTITION BY segregates rows into sets, which lets you apply functions (ROW_NUMBER(), COUNT(), SUM(), etc.) to each related set independently. In your query, a related set consists of rows sharing the same cdt.country_code, cdt.account, and cdt.currency. When you partition on those columns and apply ROW_NUMBER, the rows within each combination/set receive their own sequential numbers from ROW_NUMBER. But … Read more
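A small runnable sketch of the idea: an in-memory SQLite database stands in for Oracle purely so the example runs standalone (SQLite 3.25+ also supports window functions), and the cdt table contents are made up; the PARTITION BY / ROW_NUMBER() syntax shown is the same as in the Oracle query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical cdt table with the columns mentioned above plus an amount to order by.
conn.executescript("""
    CREATE TABLE cdt (country_code TEXT, account TEXT, currency TEXT, amount REAL);
    INSERT INTO cdt VALUES
        ('US', 'A1', 'USD', 100), ('US', 'A1', 'USD', 250),
        ('US', 'A1', 'USD', 75),  ('GB', 'B2', 'GBP', 40),
        ('GB', 'B2', 'GBP', 60);
""")

rows = conn.execute("""
    SELECT country_code, account, currency, amount,
           -- Each (country_code, account, currency) set is numbered independently,
           -- restarting at 1 for every new combination.
           ROW_NUMBER() OVER (
               PARTITION BY country_code, account, currency
               ORDER BY amount DESC
           ) AS rn
    FROM cdt
""").fetchall()

for r in rows:
    print(r)
```

Running it prints the US/A1/USD rows numbered 1-3 by descending amount and the GB/B2/GBP rows numbered 1-2, which is exactly the per-set sequencing the excerpt describes.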