dataframe – Page 2 – Tarik Billa

Fetching distinct values on a column using Spark DataFrame

March 29, 2023 by Tarik

To obtain all the distinct values in a DataFrame, you can use distinct. As the documentation shows, that method returns another DataFrame. After that you can create a UDF to transform each record. For example: val df = sc.parallelize(Array((1, 2), (3, 4), (1, 6))).toDF("age", "salary") // I obtain all different … Read more

spark dataframe drop duplicates and keep first

March 28, 2023 by Tarik

To everyone saying that dropDuplicates keeps the first occurrence: this is not strictly correct. dropDuplicates keeps the 'first occurrence' of a sort operation only if there is a single partition. See below for some examples. However, a single partition is not practical for most Spark datasets, so I'm also including an example of 'first occurrence' drop … Read more

How to replace all Null values of a dataframe in Pyspark

March 22, 2023 by Tarik

You can use df.na.fill to replace nulls with zeros, for example:

    >>> df = spark.createDataFrame([(1,), (2,), (3,), (None,)], ['col'])
    >>> df.show()
    +----+
    | col|
    +----+
    |   1|
    |   2|
    |   3|
    |null|
    +----+
    >>> df.na.fill(0).show()
    +---+
    |col|
    +---+
    |  1|
    |  2|
    |  3|
    |  0|
    +---+

Spark: subtract two DataFrames

February 25, 2023 by Tarik

According to the Scala API docs, dataFrame1.except(dataFrame2) returns a new DataFrame containing the rows in dataFrame1 that are not in dataFrame2.

How to get name of dataframe column in PySpark?

February 17, 2023 by Tarik

You can get the names from the schema with spark_df.schema.names. Printing the schema with spark_df.printSchema() can also be useful to visualize it.

How to pivot Spark DataFrame?

January 25, 2023 by Tarik

As mentioned by David Anderson, Spark has provided a pivot function since version 1.6. The general syntax looks as follows:

    df
      .groupBy(grouping_columns)
      .pivot(pivot_column, [values])
      .agg(aggregate_expressions)

Usage examples using nycflights13 and csv format:

Python:

    from pyspark.sql.functions import avg

    flights = (sqlContext
        .read
        .format("csv")
        .options(inferSchema="true", header="true")
        .load("flights.csv")
        .na.drop())

    flights.registerTempTable("flights")
    sqlContext.cacheTable("flights")

    gexprs = ("origin", "dest", "carrier")
    aggexpr = avg("arr_delay")

    flights.count() ## … Read more

Renaming columns for PySpark DataFrame aggregates

January 1, 2023 by Tarik

Although I still prefer dplyr syntax, this code snippet will do:

    import pyspark.sql.functions as sf

    (df.groupBy("group")
       .agg(sf.sum("money").alias("money"))
       .show(100))

It gets verbose.

How to create an empty DataFrame with a specified schema?

December 27, 2022 by Tarik

Let's assume you want a data frame with the following schema:

    root
     |-- k: string (nullable = true)
     |-- v: integer (nullable = false)

You simply define a schema for the data frame and use an empty RDD[Row]:

    import org.apache.spark.sql.types.{
        StructType, StructField, StringType, IntegerType}
    import org.apache.spark.sql.Row

    val schema = StructType(
        StructField("k", StringType, true) ::
        StructField("v", IntegerType, false) … Read more

Join two data frames, select all columns from one and some columns from the other

December 26, 2022 by Tarik

Asterisk (*) works with alias. Ex:

    from pyspark.sql.functions import *

    df1 = df1.alias('df1')
    df2 = df2.alias('df2')

    df1.join(df2, df1.id == df2.id).select('df1.*')

Difference between DataFrame, Dataset, and RDD in Spark

September 27, 2022 by Tarik

First, DataFrame evolved from SchemaRDD. And yes, conversion between DataFrame and RDD is absolutely possible. Below are some sample code snippets.

df.rdd is an RDD[Row].

Below are some of the options for creating a DataFrame:

1) yourrddOffrow.toDF converts to a DataFrame.
2) Using createDataFrame of the SQL context: val df = spark.createDataFrame(rddOfRow, schema), where schema can be from … Read more