PySpark DataFrame Column Reference: df.col vs. df['col'] vs. F.col('col')?

In most practical applications, there is almost no difference. However, they are implemented by calls to different underlying functions (source) and thus are not exactly the same. We can illustrate with a small example: df = spark.createDataFrame( [(1, 'a', 0), (2, 'b', None), (None, 'c', 3)], ['col', '2col', 'third col'] ) df.show() #+----+----+---------+ #| col|2col|third col| #+----+----+---------+ #| 1| a| … Read more
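
To make the truncated example concrete, here is a minimal sketch (column names taken from the snippet above) of where the three access styles diverge: attribute access only works for names that are valid Python identifiers, while bracket access and F.col accept any name.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(1, 'a', 0), (2, 'b', None), (None, 'c', 3)],
    ['col', '2col', 'third col'])

# Attribute access works only for names that are valid Python identifiers;
# df.2col would be a SyntaxError, and a name with a space cannot be written.
df.select(df.col).show()

# Bracket access and F.col accept any column name, including '2col'
# and names containing spaces.
df.select(df['2col'], F.col('third col')).show()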

How to read a CSV without a header and name the columns while reading in PySpark?

You can import the CSV file into a DataFrame with a predefined schema. The way you define a schema is by using the StructType and StructField objects. Assuming your data is all IntegerType data: from pyspark.sql.types import StructType, StructField, IntegerType schema = StructType([ StructField("member_srl", IntegerType(), True), StructField("click_day", IntegerType(), True), StructField("productid", IntegerType(), True)]) df = spark.read.csv("user_click_seq.csv", header=False, schema=schema) … Read more
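
As a variation on the snippet above, the same schema can be expressed as a DDL-formatted string (Spark >= 2.3), or the auto-generated column names can be replaced after reading; a minimal sketch, assuming the same file and column names:

# Same schema as a DDL string instead of StructType/StructField
df = spark.read.csv(
    "user_click_seq.csv",
    header=False,
    schema="member_srl INT, click_day INT, productid INT")

# Or read first and rename the auto-generated _c0, _c1, ... columns;
# note that without a schema every column is read as a string
df = (spark.read.csv("user_click_seq.csv", header=False)
      .toDF("member_srl", "click_day", "productid"))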

How to query JSON data column using Spark DataFrames?

Spark >= 2.4 If needed, the schema can be determined using the schema_of_json function (please note that this assumes that an arbitrary row is a valid representative of the schema). import org.apache.spark.sql.functions.{lit, schema_of_json, from_json} import collection.JavaConverters._ val schema = schema_of_json(lit(df.select($"jsonData").as[String].first)) df.withColumn("jsonData", from_json($"jsonData", schema, Map[String, String]().asJava)) Spark >= 2.1 You can use the from_json function: import org.apache.spark.sql.functions.from_json import org.apache.spark.sql.types._ … Read more
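
For completeness, a PySpark sketch of the same approach (the jsonData column name is taken from the answer above; schema_of_json assumes Spark >= 2.4, and the inferred schema assumes the first row is representative):

from pyspark.sql import functions as F

# Take one JSON string from the column to infer a schema
sample = df.select("jsonData").first()[0]

# schema_of_json needs a foldable (literal) column, hence F.lit
schema = F.schema_of_json(F.lit(sample))

# Parse the string column into a struct column and expand its fields
parsed = df.withColumn("jsonData", F.from_json("jsonData", schema))
parsed.select("jsonData.*").show()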

Filter Spark DataFrame based on another DataFrame that specifies denylist criteria

You'll need to use a left_anti join in this case. The left anti join is the opposite of a left semi join: it keeps only the rows of the left table that have no match in the right table on the given key: largeDataFrame .join(smallDataFrame, Seq("some_identifier"), "left_anti") .show // +---------------+----------+ // |some_identifier|first_name| // +---------------+----------+ // | 222| mary| // … Read more
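
The same denylist filter in PySpark, as a minimal sketch (DataFrame and column names follow the answer above):

# Keep only rows of largeDataFrame whose some_identifier does NOT
# appear in smallDataFrame (the denylist)
result = largeDataFrame.join(
    smallDataFrame,
    on="some_identifier",
    how="left_anti")
result.show()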

Difference between DataFrame, Dataset, and RDD in Spark

First of all, DataFrame evolved from SchemaRDD. And yes, conversion between a DataFrame and an RDD is absolutely possible. Below are some sample code snippets. df.rdd is RDD[Row]. Below are some of the options to create a DataFrame: 1) yourRddOfRow.toDF converts to a DataFrame. 2) Using createDataFrame of the SQL context: val df = spark.createDataFrame(rddOfRow, schema) where the schema can be from … Read more
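
The PySpark equivalents of those conversions, as a minimal sketch (an existing spark session is assumed; schema and data are illustrative):

from pyspark.sql.types import StructType, StructField, IntegerType, StringType

# RDD -> DataFrame, option 1: toDF with column names
df = spark.sparkContext.parallelize([(1, "a"), (2, "b")]).toDF(["id", "value"])

# DataFrame -> RDD: df.rdd gives an RDD of Row objects
rdd_of_rows = df.rdd

# RDD -> DataFrame, option 2: createDataFrame with an explicit schema
schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("value", StringType(), True)])
df2 = spark.createDataFrame(
    spark.sparkContext.parallelize([(1, "a"), (2, "b")]), schema)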

Spark DataFrame: collect() vs select()

Actions vs Transformations collect() (action) – Return all the elements of the dataset as an array at the driver program. This is usually useful after a filter or other operation that returns a sufficiently small subset of the data. (spark-sql doc) select(*cols) (transformation) – Projects a set of expressions and returns a new DataFrame. Parameters: … Read more
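
A minimal sketch of the difference, reusing the df from the first question above: select is lazy and just returns a new DataFrame, while collect triggers execution and materializes rows on the driver.

from pyspark.sql import functions as F

# Transformation: nothing runs on the cluster yet, we only get a new
# DataFrame describing the projection
projected = df.select("col").filter(F.col("col").isNotNull())

# Action: execution is triggered and the rows come back to the driver
# as a list of Row objects, so only use it on small results
rows = projected.collect()
print(rows)  # e.g. [Row(col=1), Row(col=2)]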

Error!: SQLSTATE[HY000] [1045] Access denied for user 'divattrend_liink'@'localhost' (using password: YES)