How to find median and quantiles using Spark

Ongoing work: SPARK-30569 – Add DSL functions invoking percentile_approx.

Spark 2.0+: You can use the approxQuantile method, which implements the Greenwald-Khanna algorithm:

Python:

    df.approxQuantile("x", [0.5], 0.25)

Scala:

    df.stat.approxQuantile("x", Array(0.5), 0.25)

where the last parameter is the relative error: the lower the number, the more accurate the result and the more expensive the computation. Since Spark 2.2 (SPARK-14352) it supports estimation …
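For reference, a minimal self-contained sketch of the Spark 2.0+ Scala API (the column name "x", the sample data, and the 0.01 relative error are illustrative choices, not from the original):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.master("local[*]").getOrCreate()
    import spark.implicits._

    val df = (1 to 100).map(_.toDouble).toDF("x")
    // Request the median (0.5 quantile) with a 1% relative error;
    // a smaller error gives a more accurate but more expensive answer
    val Array(median) = df.stat.approxQuantile("x", Array(0.5), 0.01)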

How does HashPartitioner work?

Well, let's make your dataset marginally more interesting:

    val rdd = sc.parallelize(for {
      x <- 1 to 3
      y <- 1 to 2
    } yield (x, None), 8)

We have six elements:

    rdd.count
    // Long = 6

no partitioner:

    rdd.partitioner
    // Option[org.apache.spark.Partitioner] = None

and eight partitions:

    rdd.partitions.length
    // Int = 8

Now let's define a small helper to …
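To make the HashPartitioner behaviour concrete, here is a small sketch (assuming the SparkContext sc and the rdd from the excerpt) that applies a HashPartitioner and counts the elements per partition:

    import org.apache.spark.HashPartitioner

    val partitioned = rdd.partitionBy(new HashPartitioner(8))
    // Each key k is assigned to partition nonNegativeMod(k.hashCode, 8),
    // so all pairs sharing a key land in the same partition
    val sizes = partitioned
      .mapPartitionsWithIndex((i, iter) => Iterator((i, iter.size)))
      .collect()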

What does “Stage Skipped” mean in Apache Spark web UI?

Typically it means that data has been fetched from cache and there was no need to re-execute the given stage. It is consistent with your DAG, which shows that the next stage requires shuffling (reduceByKey). Whenever shuffling is involved, Spark automatically caches the generated data: Shuffle also generates a large number of intermediate files on disk. …
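A quick way to see this in action (a sketch assuming an existing SparkContext sc): run the same shuffle-based job twice, and the map-side stage shows up as skipped the second time because the shuffle files it wrote are reused:

    val pairs = sc.parallelize(1 to 100).map(x => (x % 10, x))
    val reduced = pairs.reduceByKey(_ + _)
    reduced.count() // first action: all stages execute
    reduced.count() // second action: the map-side stage appears as "skipped" in the UI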

How to convert rdd object to dataframe in spark

This code works from Spark 2.x with Scala 2.11.

Import the necessary classes:

    import org.apache.spark.sql.{Row, SparkSession}
    import org.apache.spark.sql.types.{DoubleType, StringType, StructField, StructType}

Create a SparkSession object; here it is called spark:

    val spark: SparkSession = SparkSession.builder.master("local").getOrCreate
    val sc = spark.sparkContext // Just used to create test RDDs

Let's create an RDD to turn into a DataFrame:

    val rdd = sc.parallelize( …
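For completeness, a self-contained sketch of the remaining steps under the same imports (the Row contents and the schema below are made-up sample data, not from the original):

    // Hypothetical test data: (name, score) rows
    val rdd = sc.parallelize(Seq(Row("a", 1.0), Row("b", 2.0)))

    val schema = StructType(Seq(
      StructField("name", StringType, nullable = false),
      StructField("score", DoubleType, nullable = false)
    ))

    // Build the DataFrame from an RDD[Row] plus an explicit schema
    val df = spark.createDataFrame(rdd, schema)
    df.show()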

Apache Spark: map vs mapPartitions?

Important tip: Whenever you have heavyweight initialization that should be done once for many RDD elements rather than once per element, and that initialization (such as the creation of objects from a third-party library) cannot be serialized so that Spark could transmit it across the cluster to the worker nodes, use mapPartitions() instead …
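As a minimal illustration of that tip (assuming an existing SparkContext sc; the timestamps and the SimpleDateFormat are illustrative stand-ins for a genuinely heavyweight object), the costly object is built once per partition rather than once per element:

    import java.text.SimpleDateFormat
    import java.util.Date

    val timestamps = sc.parallelize(Seq(1577836800000L, 1609459200000L), 2)
    val formatted = timestamps.mapPartitions { iter =>
      // Heavyweight initialization: done once per partition, not per element
      val fmt = new SimpleDateFormat("yyyy-MM-dd")
      iter.map(ts => fmt.format(new Date(ts)))
    }
    formatted.collect()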

Difference between DataFrame, Dataset, and RDD in Spark

First of all, DataFrame evolved from SchemaRDD. Yes, conversion between a DataFrame and an RDD is absolutely possible; below are some sample code snippets.

df.rdd is RDD[Row].

Here are some options to create a DataFrame:

1) yourRddOfRow.toDF converts to a DataFrame.
2) Using createDataFrame of the SQL context:

    val df = spark.createDataFrame(rddOfRow, schema)

where the schema can be from …
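A short sketch of those conversions (the Point case class is a made-up example type; assumes spark and sc as in the earlier snippets):

    import spark.implicits._

    case class Point(x: Double, y: Double)

    val rddOfPoints = sc.parallelize(Seq(Point(1.0, 2.0), Point(3.0, 4.0)))
    val df = rddOfPoints.toDF() // RDD -> DataFrame via implicits
    val ds = df.as[Point]       // DataFrame -> typed Dataset
    val rows = df.rdd           // DataFrame -> RDD[Row]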
