Spark RDD – Mapping with extra arguments

You can use an anonymous function either directly in a flatMap:

    json_data_rdd.flatMap(lambda j: processDataLine(j, arg1, arg2))

or to curry processDataLine:

    f = lambda j: processDataLine(j, arg1, arg2)
    json_data_rdd.flatMap(f)

You can generate processDataLine like this:

    def processDataLine(arg1, arg2):
        def _processDataLine(dataline):
            return …  # Do something with dataline, arg1, arg2
        return _processDataLine

    json_data_rdd.flatMap(processDataLine(arg1, arg2))

The toolz library provides … Read more
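To make the closure version concrete, here is a minimal runnable sketch; the JSON payloads, the argument values, and the filtering logic inside _processDataLine are invented purely for illustration.

    import json
    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()

    def processDataLine(arg1, arg2):
        # Return a one-argument function for flatMap; arg1 and arg2 are
        # captured in the closure.
        def _processDataLine(dataline):
            record = json.loads(dataline)
            # Hypothetical logic: emit (key, value) pairs, skipping the key
            # named by arg1 and any value equal to arg2.
            return [(k, v) for k, v in record.items() if k != arg1 and v != arg2]
        return _processDataLine

    json_data_rdd = sc.parallelize(['{"a": 1, "b": 2}', '{"a": 3, "c": 4}'])
    print(json_data_rdd.flatMap(processDataLine("b", 4)).collect())
    # [('a', 1), ('a', 3)]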

Calculating the averages for each KEY in a Pairwise (K,V) RDD in Spark with Python

Now a much better way to do this is to use the rdd.aggregateByKey() method. Because this method is so poorly documented in the Apache Spark with Python documentation (which is why I wrote this Q&A), until recently I had been using the above code sequence. But again, it's less efficient, so avoid doing … Read more
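As a sketch of what the aggregateByKey approach looks like (the keys and values below are invented for illustration), the per-key average can be computed by carrying a (sum, count) accumulator and dividing at the end:

    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()

    # Pairwise (K, V) RDD of (key, numeric value).
    pairs = sc.parallelize([("a", 1.0), ("a", 3.0), ("b", 10.0), ("b", 20.0), ("b", 30.0)])

    sum_count = pairs.aggregateByKey(
        (0.0, 0),                                    # zero value: (sum, count)
        lambda acc, v: (acc[0] + v, acc[1] + 1),     # fold a value into one partition's accumulator
        lambda a, b: (a[0] + b[0], a[1] + b[1]),     # merge accumulators across partitions
    )

    averages = sum_count.mapValues(lambda sum_cnt: sum_cnt[0] / sum_cnt[1])
    print(sorted(averages.collect()))                # [('a', 2.0), ('b', 20.0)]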

Difference between SparkContext, JavaSparkContext, SQLContext, and SparkSession?

SparkContext is the Scala implementation's entry point, and JavaSparkContext is a Java wrapper around SparkContext. SQLContext is the entry point of Spark SQL, which can be obtained from a SparkContext. Prior to Spark 2.x, RDD, DataFrame and Dataset were three different data abstractions. Since Spark 2.x, all three data abstractions are unified, and SparkSession is the unified entry point of Spark. … Read more
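For reference, a minimal PySpark sketch of the unified entry point (the app name, data and SQL query are arbitrary): SparkSession exposes the older SparkContext as well as the SQL functionality that SQLContext used to provide.

    from pyspark.sql import SparkSession

    # Since Spark 2.x, SparkSession is the single entry point.
    spark = SparkSession.builder.appName("entry-points-demo").getOrCreate()

    sc = spark.sparkContext                              # underlying SparkContext (RDD API)
    rdd = sc.parallelize([(1, "a"), (2, "b")])

    df = spark.createDataFrame(rdd, ["id", "letter"])    # DataFrame API via the session
    df.createOrReplaceTempView("letters")
    spark.sql("SELECT id, letter FROM letters WHERE id > 1").show()  # SQL via the session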

Spark parquet partitioning : Large number of files

First, I would really avoid using coalesce, as it is often pushed further up the chain of transformations and may destroy the parallelism of your job (I asked about this issue here: Coalesce reduces parallelism of entire stage (spark)). Writing one file per Parquet partition is relatively easy (see Spark dataframe write method writing … Read more
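The rest of the answer is truncated here, but one common way to end up with a single file per Parquet partition, without resorting to coalesce, is to repartition by the same column that is passed to partitionBy; the DataFrame, column name and output path below are hypothetical:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("parquet-partitioning-sketch").getOrCreate()

    # Hypothetical DataFrame with an 'event_date' column used as the partition key.
    df = spark.createDataFrame(
        [("2024-01-01", 1), ("2024-01-01", 2), ("2024-01-02", 3)],
        ["event_date", "value"],
    )

    # Repartitioning on the partitionBy column sends all rows for a given
    # partition value to one task, so each partition directory gets one file,
    # and upstream parallelism is not collapsed the way coalesce can collapse it.
    (df.repartition("event_date")
       .write
       .partitionBy("event_date")
       .mode("overwrite")
       .parquet("/tmp/events_parquet"))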

Spark union of multiple RDDs

If these are RDDs you can use the SparkContext.union method:

    rdd1 = sc.parallelize([1, 2, 3])
    rdd2 = sc.parallelize([4, 5, 6])
    rdd3 = sc.parallelize([7, 8, 9])

    rdd = sc.union([rdd1, rdd2, rdd3])
    rdd.collect()
    ## [1, 2, 3, 4, 5, 6, 7, 8, 9]

There is no DataFrame equivalent, but it is just a matter of a simple one-liner: … Read more
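The one-liner itself is truncated above; one way to write it (not necessarily the exact form from the answer) is to fold DataFrame.union over the list with functools.reduce, mirroring what sc.union does for RDDs. The DataFrames below are invented for illustration:

    from functools import reduce
    from pyspark.sql import DataFrame, SparkSession

    spark = SparkSession.builder.appName("union-sketch").getOrCreate()

    df1 = spark.createDataFrame([(1,), (2,)], ["n"])
    df2 = spark.createDataFrame([(3,), (4,)], ["n"])
    df3 = spark.createDataFrame([(5,), (6,)], ["n"])

    # Fold union over the list of DataFrames (all must share the same schema).
    unioned = reduce(DataFrame.union, [df1, df2, df3])
    unioned.show()  # rows 1..6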

How to read from hbase using spark

A basic example of reading HBase data using Spark (Scala); you can also write this in Java:

    import org.apache.hadoop.hbase.client.{HBaseAdmin, Result}
    import org.apache.hadoop.hbase.{HBaseConfiguration, HTableDescriptor}
    import org.apache.hadoop.hbase.mapreduce.TableInputFormat
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable
    import org.apache.spark._

    object HBaseRead {
      def main(args: Array[String]) {
        val sparkConf = new SparkConf().setAppName("HBaseRead").setMaster("local[2]")
        val sc = new SparkContext(sparkConf)
        val conf = HBaseConfiguration.create()
        val … Read more

What is RDD in spark

An RDD is, essentially, the Spark representation of a set of data, spread across multiple machines, with APIs to let you act on it. An RDD could come from any data source, e.g. text files, a database via JDBC, etc. The formal definition is: RDDs are fault-tolerant, parallel data structures that let users explicitly persist intermediate … Read more
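As a tiny PySpark illustration of acting on an RDD (the collection and the file path below are hypothetical):

    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()

    # An RDD can come from an in-memory collection or an external source.
    nums = sc.parallelize(range(1, 6))      # distributed across the cluster
    lines = sc.textFile("/tmp/input.txt")   # hypothetical text-file path (read lazily)

    # Transformations are lazy; an action such as collect() runs the job.
    squares = nums.map(lambda n: n * n)
    print(squares.collect())                # [1, 4, 9, 16, 25]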

Difference between DataFrame, Dataset, and RDD in Spark

First of all, DataFrame evolved from SchemaRDD. Yes, conversion between a DataFrame and an RDD is absolutely possible. Below are some sample code snippets.

df.rdd is RDD[Row].

Below are some of the options to create a DataFrame:

1) yourrddOffrow.toDF converts to a DataFrame.

2) Using createDataFrame of the SQL context: val df = spark.createDataFrame(rddOfRow, schema), where schema can be from … Read more
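The snippets above are Scala; as a rough Python counterpart (the rows and schema are invented for illustration), the same round trip between an RDD of Row and a DataFrame looks like this:

    from pyspark.sql import Row, SparkSession
    from pyspark.sql.types import IntegerType, StringType, StructField, StructType

    spark = SparkSession.builder.appName("df-rdd-conversion-sketch").getOrCreate()
    sc = spark.sparkContext

    # Hypothetical RDD of Row objects.
    rdd_of_row = sc.parallelize([Row(id=1, name="alpha"), Row(id=2, name="beta")])

    # RDD -> DataFrame with an explicit schema.
    schema = StructType([
        StructField("id", IntegerType(), False),
        StructField("name", StringType(), True),
    ])
    df = spark.createDataFrame(rdd_of_row, schema)

    # DataFrame -> RDD: df.rdd gives back an RDD of Row objects.
    print(df.rdd.map(lambda r: (r.id, r.name)).collect())  # [(1, 'alpha'), (2, 'beta')]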