Spark Driver in Apache Spark

The Spark driver is the program that declares the transformations and actions on RDDs of data and submits such requests to the master.

In practical terms, the driver is the program that creates the SparkContext, connecting to a given Spark master. In the case of a local cluster, as in your case, master_url takes the form spark://<host>:<port>.
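To make the spark://<host>:<port> form concrete, here is a minimal sketch in plain Scala that checks whether a string has that shape before it is handed to setMaster. The object name MasterUrl, the method parse, and the host name master-host are made up for illustration and are not part of Spark. Port 7077 is the default port of a standalone master.

```scala
import java.net.URI
import scala.util.Try

// Hypothetical helper, not part of Spark: validates a spark://<host>:<port> URL.
object MasterUrl {
  // Returns Some((host, port)) for a well-formed URL, None otherwise.
  def parse(url: String): Option[(String, Int)] =
    Try(new URI(url)).toOption            // malformed strings yield None
      .filter(_.getScheme == "spark")     // only the spark:// scheme is accepted
      .flatMap { u =>
        if (u.getHost != null && u.getPort > 0) Some((u.getHost, u.getPort))
        else None                         // host or port missing
      }
}
```

With this sketch, MasterUrl.parse("spark://master-host:7077") yields Some(("master-host", 7077)), while a string such as "spark://master-host" with no port yields None.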

Its location is independent of the master and workers. You can co-locate it with the master or run it from another node. The only requirement is that it must be on a network addressable from the Spark workers.

This is what the configuration of your driver looks like:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
      .setMaster("master_url") // this is where the master is specified
      .setAppName("SparkExamplesMinimal")
      .set("spark.local.ip","xx.xx.xx.xx") // helps when multiple network interfaces are present. The driver must be in the same network as the master and workers
      .set("spark.driver.host","xx.xx.xx.xx") // same as above. This duality might disappear in a future version

val sc = new SparkContext(conf) // the program that creates the SparkContext is the driver
    // etc...

To explain a bit more on the different roles:

The driver prepares the context and declares the operations on the data using RDD transformations and actions.
The driver submits the serialized RDD graph to the master. The master creates tasks out of it and submits them to the workers for execution. It coordinates the different job stages.
The workers are where the tasks are actually executed. They should have the resources and network connectivity required to execute the operations requested on the RDDs.
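As a rough illustration of what "declaring transformations and actions" means on the driver side, the sketch below builds a small RDD, chains two transformations, and triggers the computation with one action. It assumes Spark is on the classpath and uses the in-process master "local[2]" only so that it can run without a cluster. The names DriverSketch, run and DriverRolesSketch are hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical driver program, for illustration only.
object DriverSketch {
  def run(master: String): Int = {
    val conf = new SparkConf().setMaster(master).setAppName("DriverRolesSketch")
    val sc = new SparkContext(conf)       // creating the context makes this program the driver
    try {
      val doubled = sc.parallelize(1 to 10).map(_ * 2) // transformation: only declared, not run
      val large = doubled.filter(_ > 5)                // transformation: still nothing computed
      large.reduce(_ + _)                              // action: triggers the actual execution
    } finally {
      sc.stop()                           // release the context when done
    }
  }
}
```

Calling DriverSketch.run("local[2]") returns 104, the sum of the doubled values greater than 5. Passing a spark://<host>:<port> URL instead would run the same declarations against a real cluster, provided the driver is reachable from the workers.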
