What does “Stage Skipped” mean in Apache Spark web UI?

Typically it means that data has been fetched from cache and there was no need to re-execute given stage. It is consistent with your DAG which shows that the next stage requires shuffling (reduceByKey). Whenever there is shuffling involved Spark automatically caches generated data: Shuffle also generates a large number of intermediate files on disk. … Read more

Why do Spark jobs fail with org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 0 in speculation mode?

This happened to me when I gave more memory to the worker node than it has. Since it didn’t have swap, spark crashed while trying to store objects for shuffling with no more memory left. Solution was to either add swap, or configure the worker/executor to use less memory in addition with using MEMORY_AND_DISK storage … Read more

What do the numbers on the progress bar mean in spark-shell?

What you get is a Console Progress Bar, [Stage 7: shows the stage you are in now, and (14174 + 5) / 62500] is (numCompletedTasks + numActiveTasks) / totalNumOfTasksInThisStage]. The progress bar shows numCompletedTasks / totalNumOfTasksInThisStage. It will be shown when both spark.ui.showConsoleProgress is true (by default) and log level in conf/log4j.properties is ERROR or … Read more

How to overwrite the output directory in spark

UPDATE: Suggest using Dataframes, plus something like … .write.mode(SaveMode.Overwrite) …. Handy pimp: implicit class PimpedStringRDD(rdd: RDD[String]) { def write(p: String)(implicit ss: SparkSession): Unit = { import ss.implicits._ rdd.toDF().as[String].write.mode(SaveMode.Overwrite).text(p) } } For older versions try yourSparkConf.set(“spark.hadoop.validateOutputSpecs”, “false”) val sc = SparkContext(yourSparkConf) In 1.1.0 you can set conf settings using the spark-submit script with the –conf flag. … Read more

How to check if spark dataframe is empty?

For Spark 2.1.0, my suggestion would be to use head(n: Int) or take(n: Int) with isEmpty, whichever one has the clearest intent to you. df.head(1).isEmpty df.take(1).isEmpty with Python equivalent: len(df.head(1)) == 0 # or bool(df.head(1)) len(df.take(1)) == 0 # or bool(df.take(1)) Using df.first() and df.head() will both return the java.util.NoSuchElementException if the DataFrame is empty. … Read more

Hata!: SQLSTATE[HY000] [1045] Access denied for user 'divattrend_liink'@'localhost' (using password: YES)