apache-spark-sql – Page 27

How to find count of Null and Nan values for each column in a PySpark dataframe efficiently?

January 13, 2023 by Tarik

You can use the method shown here and replace isNull with isnan:

    from pyspark.sql.functions import isnan, when, count, col

    df.select([count(when(isnan(c), c)).alias(c) for c in df.columns]).show()
    +-------+----------+---+
    |session|timestamp1|id2|
    +-------+----------+---+
    |      0|         0|  3|
    +-------+----------+---+

or, to count NaN and null values together:

    df.select([count(when(isnan(c) | col(c).isNull(), c)).alias(c) for c in df.columns]).show()
    +-------+----------+---+
    |session|timestamp1|id2|
    +-------+----------+---+
    |      0|         0|  5|
    +-------+----------+---+
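The two queries give different totals because NaN and null are different things: NaN is a floating-point value, while null means the value is missing. The following is a plain-Python sketch of that counting logic, not Spark code. The rows, the helper names `is_nan` and `count_matching`, and the use of `None` to stand in for null are all assumptions made for illustration.

```python
import math

# Hypothetical rows standing in for a DataFrame; None plays the role of null.
rows = [
    {"session": "a", "id2": 1.0},
    {"session": "b", "id2": float("nan")},
    {"session": "c", "id2": None},
    {"session": "d", "id2": float("nan")},
]


def is_nan(value):
    # Only floats can be NaN; None (null) is not NaN.
    return isinstance(value, float) and math.isnan(value)


def count_matching(rows, predicate):
    # Count, per column, how many values satisfy the predicate.
    columns = rows[0].keys()
    return {c: sum(1 for r in rows if predicate(r[c])) for c in columns}


nan_counts = count_matching(rows, is_nan)
nan_or_null_counts = count_matching(rows, lambda v: v is None or is_nan(v))

print(nan_counts)
print(nan_or_null_counts)
```

A check for NaN alone skips the null in `id2`, so the combined check is the one to use when both kinds of missing data matter.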

Overwrite specific partitions in spark dataframe write method

January 6, 2023 by Tarik

Finally! This is now a feature in Spark 2.3.0: SPARK-20236

To use it, you need to set the spark.sql.sources.partitionOverwriteMode setting to dynamic, the dataset needs to be partitioned, and the write mode must be overwrite. Example:

    spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
    data.write.mode("overwrite").insertInto("partitioned_table")

I recommend doing a repartition based on your partition column before writing, so you won’t end up with 400 … Read more
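The difference between the default static mode and dynamic mode can be hard to picture. The following is a plain-Python sketch of the semantics, not Spark code: it models a partitioned table as a dictionary from partition value to rows. The function name `overwrite_partitions` and the sample data are made up for illustration, and the sketch assumes an overwrite with no explicit partition specification.

```python
def overwrite_partitions(table, incoming, mode):
    """Model an overwrite into a partitioned table.

    table and incoming map a partition value to its list of rows.
    """
    if mode == "static":
        # Static mode (the default): existing partitions are dropped first,
        # so only the incoming partitions remain.
        return {part: list(rows) for part, rows in incoming.items()}
    if mode == "dynamic":
        # Dynamic mode: only partitions present in the incoming data are replaced.
        result = {part: list(rows) for part, rows in table.items()}
        for part, rows in incoming.items():
            result[part] = list(rows)
        return result
    raise ValueError(f"unknown mode: {mode!r}")


# Hypothetical table partitioned by date.
table = {"2023-01-01": ["a", "b"], "2023-01-02": ["c"]}
incoming = {"2023-01-02": ["d"], "2023-01-03": ["e"]}

static_result = overwrite_partitions(table, incoming, "static")
dynamic_result = overwrite_partitions(table, incoming, "dynamic")

print(static_result)
print(dynamic_result)
```

In this model the untouched `2023-01-01` partition survives only in dynamic mode, which is the behavior the setting is meant to provide.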

Renaming column names of a DataFrame in Spark Scala

January 5, 2023 by Tarik

If the structure is flat:

    val df = Seq((1L, "a", "foo", 3.0)).toDF
    df.printSchema
    // root
    //  |-- _1: long (nullable = false)
    //  |-- _2: string (nullable = true)
    //  |-- _3: string (nullable = true)
    //  |-- _4: double (nullable = false)

the simplest thing you can do is to use the toDF method: val newNames … Read more

Split Spark dataframe string column into multiple columns

January 4, 2023 by Tarik

pyspark.sql.functions.split() is the right approach here: you simply need to flatten the nested ArrayType column into multiple top-level columns. In this case, where each array only contains 2 items, it’s very easy. You simply use Column.getItem() to retrieve each part of the array as a column itself:

    split_col = pyspark.sql.functions.split(df['my_str_col'], '-')
    df = df.withColumn('NAME1', … Read more

pyspark dataframe filter or include based on list

January 1, 2023 by Tarik

The error says that "df.score in l" cannot be evaluated: df.score gives you a column, and "in" is not defined on that column type. Use "isin" instead. The code should look like this:

    # define a dataframe
    rdd = sc.parallelize([(0,1), (0,1), (0,2), (1,2), (1,10), (1,20), (3,18), (3,18), (3,18)])
    df = sqlContext.createDataFrame(rdd, ["id", "score"])
    # … Read more
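Why Python's `in` operator cannot work on a column expression may be easier to see with a toy model. The sketch below is plain Python, not PySpark. The class `ToyColumn` is hypothetical and only imitates the general idea that comparing a column builds a new expression rather than returning True or False. PySpark's real classes and error messages differ.

```python
class ToyColumn:
    """Hypothetical stand-in for a lazily evaluated column expression."""

    def __init__(self, expr):
        self.expr = expr

    def __eq__(self, other):
        # Comparison builds a new expression instead of returning True/False.
        return ToyColumn(f"({self.expr} = {other!r})")

    def __bool__(self):
        # An unevaluated expression has no truth value.
        raise ValueError("cannot convert a column expression into a bool")

    def isin(self, values):
        # Membership is expressed as one more expression.
        listed = ", ".join(repr(v) for v in values)
        return ToyColumn(f"({self.expr} IN ({listed}))")


score = ToyColumn("score")
allowed = [1, 2]

try:
    score in allowed  # list membership needs a real bool from each == test
    in_operator_failed = False
except ValueError:
    in_operator_failed = True

predicate = score.isin(allowed)

print(in_operator_failed)
print(predicate.expr)
```

In this model, `in` fails because list membership asks each `==` result for a truth value, while `isin` simply returns another expression that can be evaluated later against the data.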

Renaming columns for PySpark DataFrame aggregates

January 1, 2023 by Tarik

Although I still prefer dplyr syntax, this code snippet will do:

    import pyspark.sql.functions as sf

    (df.groupBy("group")
       .agg(sf.sum('money').alias('money'))
       .show(100))

It gets verbose.

Extract column values of Dataframe as List in Apache Spark

December 29, 2022 by Tarik

This should return a collection containing a single list:

    dataFrame.select("YOUR_COLUMN_NAME").rdd.map(r => r(0)).collect()

Without the mapping, you just get a Row object, which contains every column from the database. Keep in mind that this will probably get you a list of Any type. If you want to specify the result type, you can use .asInstanceOf[YOUR_TYPE] in r … Read more

How to create an empty DataFrame with a specified schema?

December 27, 2022 by Tarik

Let's assume you want a data frame with the following schema:

    root
     |-- k: string (nullable = true)
     |-- v: integer (nullable = false)

You simply define a schema for the data frame and use an empty RDD[Row]:

    import org.apache.spark.sql.types.{
      StructType, StructField, StringType, IntegerType}
    import org.apache.spark.sql.Row

    val schema = StructType(
      StructField("k", StringType, true) ::
      StructField("v", IntegerType, false) … Read more

Join two data frames, select all columns from one and some columns from the other

December 26, 2022 by Tarik

Asterisk (*) works with alias. Ex:

    from pyspark.sql.functions import *

    df1 = df1.alias('df1')
    df2 = df2.alias('df2')

    df1.join(df2, df1.id == df2.id).select('df1.*')

Concatenate two PySpark dataframes

December 26, 2022 by Tarik

Maybe you can try creating the missing columns and calling union (unionAll for Spark 1.6 or lower):

    from pyspark.sql.functions import lit

    cols = ['id', 'uniform', 'normal', 'normal_2']

    df_1_new = df_1.withColumn("normal_2", lit(None)).select(cols)
    df_2_new = df_2.withColumn("normal", lit(None)).select(cols)

    result = df_1_new.union(df_2_new)

    # To remove the duplicates:
    result = result.dropDuplicates()
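Selecting both frames into the same column order matters because union matches columns by position, not by name. The following plain-Python sketch (not Spark code) walks through the same three steps: fill the missing columns with None, line the columns up in one order, then concatenate and de-duplicate. The helper names `align` and `drop_duplicates` and the sample rows are assumptions made for illustration.

```python
def align(rows, columns):
    """Return each row as a tuple in the given column order, None where a column is missing."""
    return [tuple(row.get(name) for name in columns) for row in rows]


def drop_duplicates(rows):
    """Remove repeated rows, keeping the first occurrence."""
    seen = set()
    unique = []
    for row in rows:
        if row not in seen:
            seen.add(row)
            unique.append(row)
    return unique


cols = ["id", "uniform", "normal", "normal_2"]

# Hypothetical sample rows; each frame lacks one of the four columns.
df_1 = [{"id": 1, "uniform": 0.5, "normal": -1.2}]
df_2 = [
    {"id": 1, "uniform": 0.5, "normal_2": 0.7},
    {"id": 1, "uniform": 0.5, "normal_2": 0.7},
]

# Positional union: rows are simply concatenated once the columns line up.
result = drop_duplicates(align(df_1, cols) + align(df_2, cols))

print(result)
```

Unlike this sketch, Spark makes no promise about row order after removing duplicates, so sort explicitly if order matters.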