apache-spark – Page 19

What does “Stage Skipped” mean in Apache Spark web UI?

December 27, 2022 by Tarik

Typically it means that the data was fetched from cache, so there was no need to re-execute the given stage. This is consistent with your DAG, which shows that the next stage requires shuffling (reduceByKey). Whenever shuffling is involved, Spark automatically caches the generated data: "Shuffle also generates a large number of intermediate files on disk. …"
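A minimal PySpark sketch of the situation described above, assuming `sc` is an already running SparkContext (the function name and sample data are made up for illustration). The second action reuses the shuffle output of the first, so the web UI would typically list the map-side stage of the second job as skipped.

```python
from operator import add


def run_two_jobs(sc):
    """Run two actions on the same shuffled RDD.

    `sc` is assumed to be an existing pyspark SparkContext; the data is
    illustrative only.
    """
    words = ["a", "b", "a", "c", "b", "a"]
    counts = (
        sc.parallelize(words, 2)
        .map(lambda w: (w, 1))   # map side of the shuffle (first stage)
        .reduceByKey(add)        # shuffle boundary, starts the second stage
    )
    first = sorted(counts.collect())   # job 0: both stages run
    second = sorted(counts.collect())  # job 1: map stage can be skipped,
                                       # its shuffle files already exist
    return first, second
```

Both calls return the same result; only the amount of work differs, which is what the "skipped" label in the UI reflects.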

Why do Spark jobs fail with org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 0 in speculation mode?

December 27, 2022 by Tarik

This happened to me when I gave the worker node more memory than it actually has. Since the node had no swap, Spark crashed while trying to store objects for shuffling once no memory was left. The solution was either to add swap, or to configure the worker/executor to use less memory, in addition to using the MEMORY_AND_DISK storage …

How to fix ‘TypeError: an integer is required (got type bytes)’ error when trying to run pyspark after installing spark 2.4.4

December 26, 2022 by Tarik

This happens because you're using Python 3.8. The latest pip release of pyspark (2.4.4 at the time of writing) doesn't support Python 3.8. Downgrade to Python 3.7 for now, and you should be fine.
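One way to surface this earlier is to check the interpreter version before importing pyspark. The sketch below is only an illustration: the helper names are hypothetical, and the 3.8 cutoff is taken from the answer above for pyspark 2.4.4, not from any official compatibility table.

```python
import sys


def python_ok_for_pyspark_244(version_info=None):
    """Return True if the interpreter is older than Python 3.8.

    The cutoff comes from the answer above (pyspark 2.4.4); treat it as an
    assumption for any other pyspark release.
    """
    info = sys.version_info if version_info is None else version_info
    return tuple(info[:2]) < (3, 8)


def require_supported_python():
    """Fail with a readable message instead of the TypeError at import time."""
    if not python_ok_for_pyspark_244():
        raise RuntimeError(
            "pyspark 2.4.4 is reported not to work on Python 3.8+; "
            "use Python 3.7 instead"
        )
```

Calling `require_supported_python()` at the top of a launcher script, before `import pyspark`, turns the cryptic error into an explicit one.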

What do the numbers on the progress bar mean in spark-shell?

December 25, 2022 by Tarik

What you get is a console progress bar. [Stage 7: shows the stage you are in now, and (14174 + 5) / 62500] is (numCompletedTasks + numActiveTasks) / totalNumOfTasksInThisStage. The bar itself shows numCompletedTasks / totalNumOfTasksInThisStage. It is shown when both spark.ui.showConsoleProgress is true (the default) and the log level in conf/log4j.properties is ERROR or …
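To make the fields concrete, here is a small sketch that pulls the numbers out of one progress-bar line. The function name and the sample line are made up for illustration, and the regular expression assumes the layout quoted above; it is not part of Spark.

```python
import re

# Matches e.g. "[Stage 7:=====>      (14174 + 5) / 62500]"
PROGRESS_RE = re.compile(r"\[Stage (\d+):.*?\((\d+) \+ (\d+)\) / (\d+)\]")


def parse_progress(line):
    """Return the stage id and task counts from a console progress line."""
    m = PROGRESS_RE.search(line)
    if m is None:
        return None
    stage, completed, active, total = (int(g) for g in m.groups())
    return {
        "stage": stage,
        "completed": completed,          # numCompletedTasks
        "active": active,                # numActiveTasks
        "total": total,                  # totalNumOfTasksInThisStage
        "fraction": completed / total,   # what the bar itself visualizes
    }


info = parse_progress("[Stage 7:=========>          (14174 + 5) / 62500]")
print(info)
```

For the sample line, 14174 tasks are done, 5 are running, and the stage has 62500 tasks in total.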

How to overwrite the output directory in spark

December 15, 2022 by Tarik

UPDATE: I suggest using DataFrames, plus something like … .write.mode(SaveMode.Overwrite) ….

A handy implicit class (the "pimp my library" pattern):

    implicit class PimpedStringRDD(rdd: RDD[String]) {
      def write(p: String)(implicit ss: SparkSession): Unit = {
        import ss.implicits._
        rdd.toDF().as[String].write.mode(SaveMode.Overwrite).text(p)
      }
    }

For older versions, try:

    yourSparkConf.set("spark.hadoop.validateOutputSpecs", "false")
    val sc = new SparkContext(yourSparkConf)

In 1.1.0 you can set conf settings using the spark-submit script with the --conf flag. …

How to kill a running Spark application?

December 11, 2022 by Tarik

Copy the application ID from the Spark scheduler (for instance application_1428487296152_25597), connect to the server that launched the job, and run:

    yarn application -kill application_1428487296152_25597
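If the kill needs to be scripted, the same command can be wrapped as in the sketch below. It assumes the `yarn` CLI is on the PATH of the machine running the script; the helper names and the ID check are illustrative additions, not part of YARN.

```python
import re
import subprocess

# YARN application IDs look like application_<clusterTimestamp>_<sequence>.
APP_ID_RE = re.compile(r"^application_\d+_\d+$")


def yarn_kill_command(app_id):
    """Build the argument list for `yarn application -kill <app_id>`."""
    if not APP_ID_RE.match(app_id):
        raise ValueError("not a YARN application id: %r" % (app_id,))
    return ["yarn", "application", "-kill", app_id]


def kill_application(app_id):
    """Run the kill command; raises CalledProcessError if yarn fails."""
    return subprocess.run(yarn_kill_command(app_id), check=True)
```

Validating the ID before shelling out avoids sending a mistyped or pasted-in string to the cluster.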

How to delete columns in pyspark dataframe

December 1, 2022 by Tarik

Reading the Spark documentation, I found an easier solution. Since Spark 1.4 there is a drop(col) function that can be used in PySpark on a DataFrame. You can use it in two ways:

    df.drop('age')
    df.drop(df.age)

Pyspark Documentation – Drop

How to check if spark dataframe is empty?

November 18, 2022 by Tarik

For Spark 2.1.0, my suggestion would be to use head(n: Int) or take(n: Int) with isEmpty, whichever one has the clearest intent to you:

    df.head(1).isEmpty
    df.take(1).isEmpty

with the Python equivalent:

    len(df.head(1)) == 0  # or: not df.head(1)
    len(df.take(1)) == 0  # or: not df.take(1)

In Scala, df.first() and df.head() both throw java.util.NoSuchElementException if the DataFrame is empty. …

How are stages split into tasks in Spark?

October 26, 2022 by Tarik

You have a pretty nice outline here. To answer your questions: a separate task does need to be launched for each partition of data for each stage. Consider that each partition will likely reside in a distinct physical location, e.g. blocks in HDFS or directories/volumes for a local file system. Note that the submission of …
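The "one task per partition per stage" rule can be turned into a quick back-of-the-envelope count. The sketch below is a toy model in plain Python, not a Spark API; the stage names and partition counts are invented for illustration.

```python
def tasks_per_job(partitions_per_stage):
    """Toy model: each stage launches one task per partition it processes.

    `partitions_per_stage` maps a stage label to its partition count.
    Returns the total number of tasks across all stages.
    """
    return sum(partitions_per_stage.values())


# Hypothetical job: a map stage over 8 input partitions, then a reduce
# stage over 4 shuffle partitions.
job = {"stage 0 (map)": 8, "stage 1 (reduce)": 4}
print(tasks_per_job(job))
```

In a real application the partition counts would come from the data layout and from shuffle settings, which is why the same job can launch very different numbers of tasks on different inputs.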

How to read multiple text files into a single RDD?

October 21, 2022 by Tarik

You can specify whole directories, use wildcards, and even pass a comma-separated list of directories and wildcards. E.g.:

    sc.textFile("/my/dir1,/my/paths/part-00[0-5]*,/another/dir,/a/specific/file")

As Nick Chammas points out, this is an exposure of Hadoop's FileInputFormat, and therefore it also works with Hadoop (and Scalding).
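When the paths are held in a list, the comma-separated string can be built programmatically. This is a small sketch assuming `sc` is an existing pyspark SparkContext; the helper name is hypothetical.

```python
def read_many(sc, paths):
    """Read several paths (directories, globs, files) into a single RDD.

    `sc` is assumed to be an existing pyspark SparkContext. The paths are
    joined with commas, so an individual path must not contain a comma.
    """
    for p in paths:
        if "," in p:
            raise ValueError("path contains a comma: %r" % (p,))
    return sc.textFile(",".join(paths))
```

For example, `read_many(sc, ["/my/dir1", "/my/paths/part-00[0-5]*"])` passes the same kind of string as the one-liner above.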