apache-spark – Page 3

Integrating Spark Structured Streaming with the Confluent Schema Registry

December 20, 2023 by Tarik

It took me a couple of months of reading source code and testing things out. In a nutshell, Spark can only handle String and Binary serialization, so you must deserialize the data manually. In Spark, create the Confluent REST service object to get the schema. Convert the schema string in the response object into an Avro schema …

spark createOrReplaceTempView vs createGlobalTempView

December 20, 2023 by Tarik

The answer to your questions comes down to understanding the difference between a Spark application and a Spark session.

A Spark application can be used for:

- a single batch job
- an interactive session with multiple jobs
- a long-lived server continually satisfying requests

A Spark job can consist of more than just a single map and reduce. A …

How to express a column which name contains spaces in Spark SQL?

December 19, 2023 by Tarik

Backticks seem to work just fine:

    scala> val df = sc.parallelize(Seq(("a", 1))).toDF("foo bar", "x")
    df: org.apache.spark.sql.DataFrame = [foo bar: string, x: int]

    scala> df.registerTempTable("df")

    scala> sqlContext.sql("""SELECT `foo bar` FROM df""").show
    foo bar
    a

The same goes for the DataFrame API:

    scala> df.select($"foo bar").show
    foo bar
    a

So it looks like it is supported, although I doubt it is …
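When column names come from data rather than from source code, the backtick quoting shown above can be wrapped in a small helper. This is a minimal sketch in plain Scala with no Spark dependency. `BacktickQuote`, `quoteIdent` and `selectFrom` are hypothetical names, and doubling an embedded backtick is assumed to be the escape rule Spark SQL uses for quoted identifiers.

```scala
object BacktickQuote {
  // Wrap an identifier in backticks. An embedded backtick is doubled
  // (assumed escape rule for quoted identifiers in Spark SQL).
  def quoteIdent(name: String): String =
    "`" + name.replace("`", "``") + "`"

  // Build a SELECT statement over quoted column names (hypothetical helper).
  def selectFrom(table: String, columns: Seq[String]): String =
    columns
      .map(quoteIdent)
      .mkString("SELECT ", ", ", s" FROM ${quoteIdent(table)}")
}
```

The resulting string could then be passed to `sqlContext.sql(...)` as in the session above; here the table name is quoted as well, which is harmless for a plain name such as `df`.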

Does spark.sql.autoBroadcastJoinThreshold work for joins using Dataset’s join operator?

December 18, 2023 by Tarik

First of all, spark.sql.autoBroadcastJoinThreshold and the broadcast hint are separate mechanisms. Even if autoBroadcastJoinThreshold is disabled, setting a broadcast hint will take precedence. With default settings:

    spark.conf.get("spark.sql.autoBroadcastJoinThreshold")
    String = 10485760

    val df1 = spark.range(100)
    val df2 = spark.range(100)

Spark will use autoBroadcastJoinThreshold and automatically broadcast data:

    df1.join(df2, Seq("id")).explain
    == Physical Plan ==
    *Project [id#0L]
    +- *BroadcastHashJoin [id#0L], …
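The relationship between the two mechanisms can be written down as a toy decision function. This is only a simplified model in plain Scala, not Spark's planner code: `BroadcastDecision` and `shouldBroadcast` are hypothetical names, the size estimate is taken as given, and a negative threshold is assumed to mean that automatic broadcasting is disabled.

```scala
object BroadcastDecision {
  // Default value of spark.sql.autoBroadcastJoinThreshold in bytes, as shown above.
  val DefaultThreshold: Long = 10485760L

  // Simplified model: an explicit broadcast hint always wins; otherwise
  // broadcast only if the threshold is enabled (non-negative) and the
  // estimated size of the relation fits under it.
  def shouldBroadcast(
      estimatedSizeBytes: Long,
      threshold: Long,
      hasBroadcastHint: Boolean
  ): Boolean =
    hasBroadcastHint || (threshold >= 0 && estimatedSizeBytes <= threshold)
}
```

Under this model a small relation is broadcast with default settings, is not broadcast once the threshold is set to -1, and is broadcast again as soon as a hint is supplied.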

Would Spark unpersist the RDD itself when it realizes it won’t be used anymore?

December 17, 2023 by Tarik

Yes, Apache Spark will unpersist the RDD when the RDD object is garbage collected. In RDD.persist you can see:

    sc.cleaner.foreach(_.registerRDDForCleanup(this))

This registers a WeakReference to the RDD with a ReferenceQueue, which leads to ContextCleaner.doCleanupRDD when the RDD is garbage collected. And there:

    sc.unpersistRDD(rddId, blocking)

For more context see ContextCleaner in general and the commit that added …
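The WeakReference and ReferenceQueue pattern described above can be sketched in plain Scala on the JVM. This is an illustrative stand-in, not Spark's ContextCleaner: `CleanupRef`, `MiniCleaner`, `register` and `drain` are hypothetical names, and the cleanup callback plays the role that unpersisting plays in Spark.

```scala
import java.lang.ref.{ReferenceQueue, WeakReference}
import scala.collection.mutable

// Weak reference that remembers which resource id to clean up (hypothetical).
final class CleanupRef(owner: AnyRef, val id: Int, queue: ReferenceQueue[AnyRef])
    extends WeakReference[AnyRef](owner, queue)

final class MiniCleaner(onCleanup: Int => Unit) {
  private val queue = new ReferenceQueue[AnyRef]
  // Hold the reference objects strongly so they survive until they are enqueued.
  private val pending = mutable.Set.empty[CleanupRef]

  // Track `owner`; once it becomes unreachable the reference lands in `queue`.
  def register(owner: AnyRef, id: Int): CleanupRef = {
    val ref = new CleanupRef(owner, id, queue)
    pending += ref
    ref
  }

  // Run the cleanup callback for every enqueued reference; returns the cleaned ids.
  def drain(): List[Int] = {
    val cleaned = mutable.ListBuffer.empty[Int]
    var next = queue.poll()
    while (next != null) {
      next match {
        case ref: CleanupRef =>
          pending -= ref
          onCleanup(ref.id)
          cleaned += ref.id
        case _ => ()
      }
      next = queue.poll()
    }
    cleaned.toList
  }
}
```

In a real run the garbage collector enqueues the reference at some point after the owner becomes unreachable, so the timing is not predictable. Calling `enqueue()` on the returned reference by hand simulates that step deterministically when experimenting.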

Avoid performance impact of a single partition mode in Spark window functions

December 16, 2023 by Tarik

In practice, the performance impact will be almost the same as if you omitted the partitionBy clause altogether. All records will be shuffled to a single partition, sorted locally, and iterated sequentially one by one. The only difference is in the total number of partitions created. Let’s illustrate that with an example using a simple dataset …

Require kryo serialization in Spark (Scala)

December 15, 2023 by Tarik

“As I understand it, this does not actually guarantee that Kryo serialization is used; if a serializer is not available, Kryo will fall back to Java serialization.”

No. If you set spark.serializer to org.apache.spark.serializer.KryoSerializer then Spark will use Kryo. If Kryo is not available, you will get an error. There is no fallback. So …

Spark: How to kill running process without exiting shell?

December 7, 2023 by Tarik

You can use the Master web interface to kill or visualize the job. You will also find other things there, such as log files and charts showing how your cluster is working…

How do you control the size of the output file?

December 2, 2023 by Tarik

It’s impossible for Spark to control the size of Parquet files, because the DataFrame in memory needs to be encoded and compressed before it is written to disk. Until this process finishes, there is no way to estimate the actual file size on disk. So my solution is:

- Write the DataFrame to HDFS: df.write.parquet(path)
- Get the directory …

Spark + s3 – error – java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found

November 27, 2023 by Tarik