apache-spark – Page 2

What’s the most efficient way to filter a DataFrame

January 9, 2024 by Tarik

My code (following the description of your first method) runs normally in Spark 1.4.0-SNAPSHOT on these two configurations: (1) IntelliJ IDEA’s test; (2) a Spark Standalone cluster with 8 nodes (1 master, 7 workers). Please check whether any differences exist:
val bc = sc.broadcast(Array[String]("login3", "login4"))
val x = Array(("login1", 192), ("login2", 146), ("login3", 72))
val xdf = sqlContext.createDataFrame(x).toDF("name", … Read more

Spark : Read file only if the path exists

January 8, 2024 by Tarik

You can filter out the irrelevant files as in @Psidom’s answer. In Spark, the best way to do so is to use Spark’s internal Hadoop configuration. Given that the SparkSession variable is called "spark", you can do:
import org.apache.hadoop.fs.FileSystem
import org.apache.hadoop.fs.Path
val hadoopfs: FileSystem = FileSystem.get(spark.sparkContext.hadoopConfiguration)
def testDirExist(path: String): Boolean = { val p … Read more
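The idea of keeping only the paths that exist before reading can be sketched in Python as below. This is a minimal illustration, not the code from the post: `existing_paths` is a hypothetical helper, and the default predicate `os.path.exists` only covers the local filesystem. For HDFS or S3 the predicate would have to wrap a Hadoop `FileSystem` existence check like the one the excerpt starts to define.

```python
import os
from typing import Callable, Iterable, List


def existing_paths(
    paths: Iterable[str],
    exists: Callable[[str], bool] = os.path.exists,
) -> List[str]:
    """Return only the paths for which `exists` is true, preserving order.

    `exists` is pluggable (an assumption of this sketch) so that a
    Hadoop-backed check can be swapped in for non-local filesystems.
    """
    return [p for p in paths if exists(p)]
```

With a live session, the filtered list could then be passed on, for example `spark.read.parquet(*paths)`, after checking that the list is not empty.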

Defining a UDF that accepts an Array of objects in a Spark DataFrame?

January 8, 2024 by Tarik

What you’re looking for is Seq[o.a.s.sql.Row]:
import org.apache.spark.sql.Row
val my_size = udf { subjects: Seq[Row] => subjects.size }
Explanation: The current representation of ArrayType is, as you already know, WrappedArray, so Array won’t work, and it is better to stay on the safe side. According to the official specification, the local (external) type for StructType is … Read more

How to create SparkSession with Hive support (fails with “Hive classes are not found”)?

January 5, 2024 by Tarik

Add the following dependency to your Maven project:
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-hive_2.11</artifactId>
  <version>2.0.0</version>
</dependency>

Spark save(write) parquet only one file

January 5, 2024 by Tarik

Use coalesce before the write operation:
dataFrame.coalesce(1).write.format("parquet").mode("append").save("temp.parquet")
EDIT-1: Upon a closer look, the docs do warn about coalesce: "However, if you’re doing a drastic coalesce, e.g. to numPartitions = 1, this may result in your computation taking place on fewer nodes than you like (e.g. one node in the case of numPartitions = 1)." Therefore as … Read more

How to read parquet data from S3 to spark dataframe Python?

January 5, 2024 by Tarik

Unpivot in Spark SQL / PySpark

January 2, 2024 by Tarik

You can use the built-in stack function, for example in Scala:
scala> val df = Seq(("G",Some(4),2,None),("H",None,4,Some(5))).toDF("A","X","Y","Z")
df: org.apache.spark.sql.DataFrame = [A: string, X: int ... 2 more fields]
scala> df.show
+---+----+---+----+
|  A|   X|  Y|   Z|
+---+----+---+----+
|  G|   4|  2|null|
|  H|null|  4|   5|
+---+----+---+----+
scala> df.select($"A", expr("stack(3, 'X', X, 'Y', Y, 'Z', … Read more
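Since the section title also mentions PySpark, here is a small Python sketch of how a stack expression of this shape could be generated for an arbitrary list of columns. `build_stack_expr`, `key_name`, and `value_name` are hypothetical names introduced for illustration, and the sketch assumes the column names are plain identifiers that need no quoting.

```python
from typing import Sequence


def build_stack_expr(
    columns: Sequence[str],
    key_name: str = "key",
    value_name: str = "value",
) -> str:
    """Build a Spark SQL stack(...) expression that unpivots `columns`.

    Each column contributes a ('name', name) pair: the string literal
    becomes the key and the column reference supplies the value.
    """
    if not columns:
        raise ValueError("at least one column is required")
    pairs = ", ".join(f"'{c}', {c}" for c in columns)
    return f"stack({len(columns)}, {pairs}) as ({key_name}, {value_name})"


print(build_stack_expr(["X", "Y", "Z"]))
# prints: stack(3, 'X', X, 'Y', Y, 'Z', Z) as (key, value)
```

With a DataFrame like the one above, the result could be used as `df.selectExpr("A", build_stack_expr(["X", "Y", "Z"]))`. The stacked columns need compatible types for this to work.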

Pyspark – converting json string to DataFrame

January 2, 2024 by Tarik

You can do the following:
newJson = '{"Name":"something","Url":"https://stackoverflow.com","Author":"jangcy","BlogEntries":100,"Caller":"jangcy"}'
df = spark.read.json(sc.parallelize([newJson]))
df.show(truncate=False)
which should give
+------+-----------+------+---------+-------------------------+
|Author|BlogEntries|Caller|Name     |Url                      |
+------+-----------+------+---------+-------------------------+
|jangcy|100        |jangcy|something|https://stackoverflow.com|
+------+-----------+------+---------+-------------------------+
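Before handing a JSON string to Spark, it can help to confirm that it parses and to see which fields it carries. The sketch below uses only the standard library. The sorted key list matches the column order in the output above, which is consistent with Spark ordering inferred JSON fields by name; treat that ordering as an assumption to verify on your Spark version.

```python
import json

# Same JSON string as in the example above (variable renamed to snake_case).
new_json = (
    '{"Name":"something","Url":"https://stackoverflow.com",'
    '"Author":"jangcy","BlogEntries":100,"Caller":"jangcy"}'
)

# json.loads raises json.JSONDecodeError if the string is malformed.
record = json.loads(new_json)

# Field names sorted alphabetically, for comparison with the DataFrame header.
expected_columns = sorted(record)
print(expected_columns)
# prints: ['Author', 'BlogEntries', 'Caller', 'Name', 'Url']
```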

Retrieve SparkContext from SparkSession

January 2, 2024 by Tarik

Just to post this as an answer: the SparkContext can be accessed from a SparkSession using spark.sparkContext (no parentheses).

Improve PySpark DataFrame.show output to fit Jupyter notebook

January 2, 2024 by Tarik

After playing around with my table, which has a lot of columns, I decided the best way to get a feel for the data is to use:
df.show(n=5, truncate=False, vertical=True)
This displays each row vertically without truncation and is the cleanest view I can come up with.