How to check if spark dataframe is empty?

For Spark 2.1.0, my suggestion would be to use head(n: Int) or take(n: Int) with isEmpty, whichever has the clearest intent to you: df.head(1).isEmpty or df.take(1).isEmpty, with the Python equivalents len(df.head(1)) == 0 (or bool(df.head(1))) and len(df.take(1)) == 0 (or bool(df.take(1))). Both df.first() and df.head() will throw java.util.NoSuchElementException if the DataFrame is empty. … Read more

Filter Pyspark dataframe column with None value

You can use Column.isNull / Column.isNotNull: df.where(col("dt_mvmt").isNull()) df.where(col("dt_mvmt").isNotNull()) If you simply want to drop NULL values, you can use na.drop with the subset argument: df.na.drop(subset=["dt_mvmt"]) Equality-based comparisons with NULL won't work, because in SQL NULL is undefined, so any attempt to compare it with another value returns NULL: sqlContext.sql("SELECT NULL = NULL").show() … Read more

Spark – load CSV file as DataFrame?

Since Spark 2.0, CSV support is part of core Spark functionality and doesn't require a separate library. So you can just do, for example: df = spark.read.format("csv").option("header", "true").load("csvfile.csv") In Scala (this works for any delimited format; specify "," as the delimiter for CSV, "\t" for TSV, etc.): val df = sqlContext.read.format("com.databricks.spark.csv") .option("delimiter", ",") .load("csvfile.csv")

How to sort by column in descending order in Spark SQL?

You can also sort the column by importing the Spark SQL functions: import org.apache.spark.sql.functions._ df.orderBy(asc("col1")) Or: import org.apache.spark.sql.functions._ df.sort(desc("col1")) Alternatively, importing sqlContext.implicits._: import sqlContext.implicits._ df.orderBy($"col1".desc) Or: import sqlContext.implicits._ df.sort($"col1".desc)

Concatenate columns in Apache Spark DataFrame

With raw SQL you can use CONCAT: In Python: df = sqlContext.createDataFrame([("foo", 1), ("bar", 2)], ("k", "v")) df.registerTempTable("df") sqlContext.sql("SELECT CONCAT(k, ' ', v) FROM df") In Scala: import sqlContext.implicits._ val df = sc.parallelize(Seq(("foo", 1), ("bar", 2))).toDF("k", "v") df.registerTempTable("df") sqlContext.sql("SELECT CONCAT(k, ' ', v) FROM df") Since Spark 1.5.0 you can use the concat function with DataFrame … Read more

How do I add a new column to a Spark DataFrame (using PySpark)?

You cannot add an arbitrary column to a DataFrame in Spark. New columns can be created only by using literals (other literal types are described in How to add a constant column in a Spark DataFrame?): from pyspark.sql.functions import lit df = sqlContext.createDataFrame( [(1, "a", 23.0), (3, "B", -23.0)], ("x1", "x2", "x3")) df_with_x4 = df.withColumn("x4", … Read more

How can I change column types in Spark SQL’s DataFrame?

Edit (newest version): Since Spark 2.x you should use the Dataset API instead when using Scala [1]. Check the docs here: https://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/Dataset.html#withColumn(colName:String,col:org.apache.spark.sql.Column):org.apache.spark.sql.DataFrame If working with Python, even though it's easier, I leave the link here as it's a very highly voted question: https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.DataFrame.withColumn.html >>> df.withColumn('age2', df.age + 2).collect() [Row(age=2, name='Alice', age2=4), Row(age=5, name='Bob', age2=7)] [1] https://spark.apache.org/docs/latest/sql-programming-guide.html: In … Read more

How to select the first row of each group?

Window functions: Something like this should do the trick: import org.apache.spark.sql.functions.{row_number, max, broadcast} import org.apache.spark.sql.expressions.Window val df = sc.parallelize(Seq( (0,"cat26",30.9), (0,"cat13",22.1), (0,"cat95",19.6), (0,"cat105",1.3), (1,"cat67",28.5), (1,"cat4",26.8), (1,"cat13",12.6), (1,"cat23",5.3), (2,"cat56",39.6), (2,"cat40",29.7), (2,"cat187",27.9), (2,"cat68",9.8), (3,"cat8",35.6))).toDF("Hour", "Category", "TotalValue") val w = Window.partitionBy($"Hour").orderBy($"TotalValue".desc) val dfTop = df.withColumn("rn", row_number.over(w)).where($"rn" === 1).drop("rn") dfTop.show // +----+--------+----------+ // |Hour|Category|TotalValue| // +----+--------+----------+ // | 0| … Read more

How to add a constant column in a Spark DataFrame?

Spark 2.2+: Spark 2.2 introduces typedLit to support Seq, Map, and Tuples (SPARK-19254), and the following calls should be supported (Scala): import org.apache.spark.sql.functions.typedLit df.withColumn("some_array", typedLit(Seq(1, 2, 3))) df.withColumn("some_struct", typedLit(("foo", 1, 0.3))) df.withColumn("some_map", typedLit(Map("key1" -> 1, "key2" -> 2))) Spark 1.3+ (lit), 1.4+ (array, struct), 2.0+ (map): The second argument for DataFrame.withColumn should be a Column, so … Read more