Writing SQL vs using Dataframe APIs in Spark SQL

Question : What is the difference in these two approaches?
Is there any performance gain with using Dataframe APIs?


Answer :

There is comparative study done by horton works. source…

Gist is based on situation/scenario each one is right. there is no
hard and fast rule to decide this. pls go through below..

RDDs, DataFrames, and SparkSQL (infact 3 approaches not just 2):

At its core, Spark operates on the concept of Resilient Distributed Datasets, or RDD’s:

  • Resilient – if data in memory is lost, it can be recreated
  • Distributed – immutable distributed collection of objects in memory partitioned across many data nodes in a cluster
  • Dataset – initial data can from from files, be created programmatically, from data in memory, or from another RDD

DataFrames API is a data abstraction framework that organizes your data into named columns:

  • Create a schema for the data
  • Conceptually equivalent to a table in a relational database
  • Can be constructed from many sources including structured data files, tables in Hive, external databases, or existing RDDs
  • Provides a relational view of the data for easy SQL like data manipulations and aggregations
  • Under the hood, it is an RDD of Row’s

SparkSQL is a Spark module for structured data processing. You can interact with SparkSQL through:

  • SQL
  • DataFrames API
  • Datasets API

Test results:

  • RDD’s outperformed DataFrames and SparkSQL for certain types of data processing
  • DataFrames and SparkSQL performed almost about the same, although with analysis involving aggregation and sorting SparkSQL had a slight advantage

  • Syntactically speaking, DataFrames and SparkSQL are much more intuitive than using RDD’s

  • Took the best out of 3 for each test

  • Times were consistent and not much variation between tests

  • Jobs were run individually with no other jobs running

Random lookup against 1 order ID from 9 Million unique order ID’s
GROUP all the different products with their total COUNTS and SORT DESCENDING by product name

enter image description here

Leave a Comment

Hata!: SQLSTATE[HY000] [1045] Access denied for user 'divattrend_liink'@'localhost' (using password: YES)