Spark: subtract two DataFrames
According to the Scala API docs, doing: dataFrame1.except(dataFrame2) will return a new DataFrame containing rows in dataFrame1 but not in dataFrame2.
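A minimal sketch of except in action, assuming two DataFrames with the same schema (the column name "id" and the sample values are illustrative, not from the original answer):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.master("local[*]").getOrCreate()
import spark.implicits._

val dataFrame1 = Seq(1, 2, 3, 4).toDF("id")
val dataFrame2 = Seq(3, 4).toDF("id")

// Set difference: rows present in dataFrame1 but not in dataFrame2
// (duplicates are removed; output order is not guaranteed)
val diff = dataFrame1.except(dataFrame2)
diff.show()  // leaves the rows with id 1 and 2
```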
Ongoing work SPARK-30569 – Add DSL functions invoking percentile_approx. Spark 2.0+: You can use the approxQuantile method, which implements the Greenwald-Khanna algorithm: Python: df.approxQuantile("x", [0.5], 0.25) Scala: df.stat.approxQuantile("x", Array(0.5), 0.25) where the last parameter is the relative error. The lower the number, the more accurate the results and the more expensive the computation. Since Spark 2.2 (SPARK-14352) it supports estimation … Read more
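A short Scala sketch of the approxQuantile call described above; the column "x" and the sample data are made up here, only the method signature comes from the answer:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.master("local[*]").getOrCreate()
import spark.implicits._

// Toy column of doubles to estimate a median on
val df = (1 to 100).map(_.toDouble).toDF("x")

// Median with a 25% relative error bound, as in the answer above
val roughMedian = df.stat.approxQuantile("x", Array(0.5), 0.25)

// Tighter error bound: more accurate result, more expensive computation
val preciseMedian = df.stat.approxQuantile("x", Array(0.5), 0.01)

println(roughMedian.mkString(", "))
println(preciseMedian.mkString(", "))
```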
Well, let's make your dataset marginally more interesting: val rdd = sc.parallelize(for { x <- 1 to 3 y <- 1 to 2 } yield (x, None), 8) We have six elements: rdd.count Long = 6 no partitioner: rdd.partitioner Option[org.apache.spark.Partitioner] = None and eight partitions: rdd.partitions.length Int = 8 Now let's define a small helper to … Read more
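The helper the answer goes on to define is cut off above. A plausible reconstruction, assuming a spark-shell style sc, is one that counts the elements landing in each partition:

```scala
import org.apache.spark.rdd.RDD

// Hypothetical reconstruction of the truncated helper:
// emit one count per partition so the distribution is visible.
def countByPartition[T](rdd: RDD[T]): RDD[Int] =
  rdd.mapPartitions(iter => Iterator(iter.size))

val rdd = sc.parallelize(for {
  x <- 1 to 3
  y <- 1 to 2
} yield (x, None), 8)

countByPartition(rdd).collect()
// e.g. Array(0, 1, 1, 1, 0, 1, 1, 1) -- six elements spread over eight partitions
```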
Typically it means that data has been fetched from cache and there was no need to re-execute the given stage. It is consistent with your DAG, which shows that the next stage requires shuffling (reduceByKey). Whenever shuffling is involved, Spark automatically caches the generated data: Shuffle also generates a large number of intermediate files on disk. … Read more
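A small sketch of the behaviour described above: when a second action runs on the same shuffled RDD, Spark can reuse the shuffle files written by the first action, so the map-side stage shows up as "skipped" in the UI. The data here is made up for illustration:

```scala
// Pair RDD followed by a shuffle (reduceByKey)
val pairs = sc.parallelize(1 to 1000000).map(i => (i % 10, 1))
val reduced = pairs.reduceByKey(_ + _)

reduced.count()   // first action: runs both stages and writes shuffle files
reduced.collect() // second action: the map-side stage is typically skipped,
                  // because the existing shuffle output is reused
```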
This code works perfectly from Spark 2.x with Scala 2.11. Import the necessary classes import org.apache.spark.sql.{Row, SparkSession} import org.apache.spark.sql.types.{DoubleType, StringType, StructField, StructType} Create a SparkSession object, here called spark val spark: SparkSession = SparkSession.builder.master("local").getOrCreate val sc = spark.sparkContext // Just used to create test RDDs Let's create an RDD to turn into a DataFrame val rdd = sc.parallelize( … Read more
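The snippet above is truncated at sc.parallelize(. A plausible continuation, using the classes already imported, builds an RDD of Rows plus an explicit schema and passes both to createDataFrame; the column names and sample values are illustrative:

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{DoubleType, StringType, StructField, StructType}

val spark: SparkSession = SparkSession.builder.master("local").getOrCreate
val sc = spark.sparkContext // Just used to create test RDDs

// RDD of Rows matching the schema below
val rdd = sc.parallelize(Seq(
  Row("first", 2.0),
  Row("second", 3.5)
))

// Explicit schema: one string column, one double column
val schema = StructType(Seq(
  StructField("label", StringType, nullable = true),
  StructField("value", DoubleType, nullable = true)
))

val df = spark.createDataFrame(rdd, schema)
df.show()
```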
Important tip: whenever you have heavyweight initialization that should be done once for many RDD elements rather than once per RDD element, and if this initialization, such as the creation of objects from a third-party library, cannot be serialized (so that Spark can transmit it across the cluster to the worker nodes), use mapPartitions() instead … Read more
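A hedged sketch of that pattern; ExpensiveResource is a stand-in for the third-party object, not a real library class:

```scala
// Stand-in for a non-serializable, expensive-to-build resource
// (imagine a DB connection or a heavyweight parser).
class ExpensiveResource {
  def process(s: String): String = s.toUpperCase
}

val lines = sc.parallelize(Seq("a", "b", "c"), 2)

val result = lines.mapPartitions { iter =>
  val resource = new ExpensiveResource()  // built once per partition, on the executor
  iter.map(resource.process)              // reused for every element in that partition
}

result.collect()
```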
Most RDD operations are lazy. Think of an RDD as a description of a series of operations. An RDD is not data. So this line: val textFile = sc.textFile("/user/emp.txt") does nothing. It creates an RDD that says "we will need to load this file". The file is not loaded at this point. RDD operations … Read more
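A minimal illustration of that laziness, reusing the path from the answer (assumed to exist on the cluster): nothing is read until an action such as reduce runs.

```scala
val textFile = sc.textFile("/user/emp.txt")   // nothing is read yet
val lineLengths = textFile.map(_.length)      // still nothing: just a recipe

val totalLength = lineLengths.reduce(_ + _)   // action: only now is the file loaded
println(totalLength)
```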
The original answer discussing the code can be found below. First of all, you have to distinguish between different types of API, each with its own performance considerations. RDD API (pure Python structures with JVM-based orchestration): this is the component that will be most affected by the performance of the Python code and the … Read more
With cache(), you use only the default storage level: MEMORY_ONLY for RDD, MEMORY_AND_DISK for Dataset. With persist(), you can specify which storage level you want, for both RDD and Dataset. From the official docs: You can mark an RDD to be persisted using the persist() or cache() methods on it. Each persisted RDD can … Read more
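A short sketch of that difference; the RDD contents are arbitrary:

```scala
import org.apache.spark.storage.StorageLevel

val first = sc.parallelize(1 to 1000)
first.cache() // for an RDD this is the same as persist(StorageLevel.MEMORY_ONLY)

val second = sc.parallelize(1 to 1000)
second.persist(StorageLevel.MEMORY_AND_DISK_SER) // explicitly chosen storage level
```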
First of all, DataFrame evolved from SchemaRDD. Yes, conversion between DataFrame and RDD is absolutely possible. Below are some sample code snippets. df.rdd is RDD[Row]. Below are some options to create a DataFrame. 1) yourrddOffrow.toDF converts to DataFrame. 2) Using createDataFrame of SQL context val df = spark.createDataFrame(rddOfRow, schema) where schema can be from … Read more
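A hedged sketch of the conversions listed above; the column names and sample rows are illustrative, not from the original answer:

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

val spark = SparkSession.builder.master("local[*]").getOrCreate()
import spark.implicits._
val sc = spark.sparkContext

// 1) RDD of tuples -> DataFrame via toDF, naming the columns inline
val peopleDF = sc.parallelize(Seq(("Ann", 30), ("Bob", 25))).toDF("name", "age")

// 2) RDD[Row] plus an explicit schema -> DataFrame via createDataFrame
val rowRDD = sc.parallelize(Seq(Row("Ann", 30), Row("Bob", 25)))
val schema = StructType(Seq(
  StructField("name", StringType, nullable = true),
  StructField("age", IntegerType, nullable = true)
))
val peopleDF2 = spark.createDataFrame(rowRDD, schema)

// And back again: df.rdd gives an RDD[Row]
val backToRows = peopleDF2.rdd
```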