apache-spark

What are workers, executors, cores in Spark Standalone cluster?

October 3, 2022 by Tarik

Spark uses a master/slave architecture. It has one central coordinator (the driver) that communicates with many distributed workers (executors). The driver and each of the executors run in their own Java processes.

DRIVER

The driver is the process where the main method runs. First it converts the user program …

Spark java.lang.OutOfMemoryError: Java heap space

October 3, 2022 by Tarik

I have a few suggestions:

If your nodes are configured to have 6g maximum for Spark (and are leaving a little for other processes), then use 6g rather than 4g: spark.executor.memory=6g. Make sure you’re using as much memory as possible by checking the UI (it will say how much memory you’re using).

Try using more …

How to show full column content in a Spark Dataframe?

September 30, 2022 by Tarik

results.show(20, false) will not truncate; the second argument, false, turns truncation off. Check the source: 20 is the default number of rows displayed when show() is called without any arguments.

What is the difference between map and flatMap and a good use case for each?

September 30, 2022 by Tarik

Here is an example of the difference, as a spark-shell session.

First, some data (two lines of text):

val rdd = sc.parallelize(Seq("Roses are red", "Violets are blue")) // lines
rdd.collect
res0: Array[String] = Array("Roses are red", "Violets are blue")

Now, map transforms an RDD of length N into another RDD of length N. For …
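The length-preserving behaviour of map, and the contrast with flatMap, can be sketched without a cluster. The block below is plain Python over ordinary lists, not Spark code: the names `lines`, `mapped`, and `flat_mapped` are illustrative assumptions, and the list comprehension and `itertools.chain` stand in for the RDD operations.

```python
from itertools import chain

# The same two lines of text used in the excerpt above.
lines = ["Roses are red", "Violets are blue"]

# map-like: exactly one output element per input element (N in, N out).
# Here each output element is itself a list of words.
mapped = [line.split(" ") for line in lines]

# flatMap-like: each input element may produce zero or more outputs,
# and all of them are flattened into a single sequence.
flat_mapped = list(chain.from_iterable(line.split(" ") for line in lines))

print(mapped)       # [['Roses', 'are', 'red'], ['Violets', 'are', 'blue']]
print(flat_mapped)  # ['Roses', 'are', 'red', 'Violets', 'are', 'blue']
```

The map-like result still has two elements, one per input line, while the flatMap-like result has six, one per word.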

How to change dataframe column names in PySpark?

September 30, 2022 by Tarik

Difference between DataFrame, Dataset, and RDD in Spark

September 27, 2022 by Tarik

First, DataFrame evolved from SchemaRDD. Yes, conversion between DataFrame and RDD is absolutely possible. Below are some sample code snippets.

df.rdd is an RDD[Row].

Below are some of the options for creating a DataFrame:

1) yourrddOffrow.toDF converts to DataFrame.
2) Using createDataFrame of the SQL context:

val df = spark.createDataFrame(rddOfRow, schema)

where schema can be from …

Spark – repartition() vs coalesce()

September 23, 2022 by Tarik

coalesce() avoids a full shuffle. If it’s known that the number of partitions is decreasing, then the executor can safely keep data on the minimum number of partitions, only moving the data off the extra nodes onto the nodes that are kept. So, it would go something like this:

Node 1 = 1,2,3
Node 2 = 4,5,6
…
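The idea of keeping some partitions in place and moving only the extras can be sketched in plain Python. This is a simplified model under stated assumptions, not Spark's implementation: the function `coalesce` below is a hypothetical helper, the round-robin choice of target is an assumption (Spark picks targets differently), and the data for nodes 3 and 4 is invented for illustration because the excerpt is cut off after Node 2.

```python
def coalesce(partitions, num_partitions):
    """Merge partitions down to num_partitions without touching kept data.

    The first num_partitions partitions stay where they are; each extra
    partition is appended whole onto one of the kept ones (round-robin).
    """
    if num_partitions < 1:
        raise ValueError("num_partitions must be at least 1")
    if num_partitions >= len(partitions):
        # Nothing to merge: this model never increases the partition count.
        return [list(p) for p in partitions]
    kept = [list(p) for p in partitions[:num_partitions]]
    for i, extra in enumerate(partitions[num_partitions:]):
        kept[i % num_partitions].extend(extra)
    return kept


# Nodes 1 and 2 follow the excerpt; nodes 3 and 4 are hypothetical.
nodes = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]
merged = coalesce(nodes, 2)
print(merged)  # [[1, 2, 3, 7, 8, 9], [4, 5, 6, 10, 11, 12]]
```

In this model the records already on the kept nodes never move; only the records on the two extra nodes are transferred, which is the saving compared with a full shuffle that would redistribute everything.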