How to find count of Null and Nan values for each column in a PySpark dataframe efficiently?

You can use the method shown here and replace isNull with isnan:

from pyspark.sql.functions import isnan, when, count, col

df.select([count(when(isnan(c), c)).alias(c) for c in df.columns]).show()
+-------+----------+---+
|session|timestamp1|id2|
+-------+----------+---+
|      0|         0|  3|
+-------+----------+---+

or

df.select([count(when(isnan(c) | col(c).isNull(), c)).alias(c) for c in df.columns]).show()
+-------+----------+---+
|session|timestamp1|id2|
+-------+----------+---+
|      0|         0|  5|
+-------+----------+---+
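
To see this end to end, here is a minimal sketch that builds a small test frame and runs both counts; the SparkSession named spark, the column names, and the sample values are assumptions for illustration:

from pyspark.sql.functions import col, count, isnan, when

# Hypothetical sample data: two double columns with a NaN and a null mixed in.
df = spark.createDataFrame(
    [(1.0, float("nan")), (float("nan"), 2.0), (None, 3.0)],
    ["a", "b"],
)

# Count NaN values per column, then NaN-or-null values per column.
df.select([count(when(isnan(c), c)).alias(c) for c in df.columns]).show()
df.select([count(when(isnan(c) | col(c).isNull(), c)).alias(c) for c in df.columns]).show()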

Split Spark dataframe string column into multiple columns

pyspark.sql.functions.split() is the right approach here – you simply need to flatten the nested ArrayType column into multiple top-level columns. In this case, where each array only contains 2 items, it's very easy. You simply use Column.getItem() to retrieve each part of the array as a column itself:

split_col = pyspark.sql.functions.split(df['my_str_col'], '-')
df = df.withColumn('NAME1', … Read more
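
Since the excerpt cuts off mid-statement, here is a minimal sketch of the full split-and-flatten pattern; the column name my_str_col, the output names, and the sample values are hypothetical:

from pyspark.sql import functions as F

# Hypothetical input: a string column with values like "foo-bar".
df = spark.createDataFrame([("foo-bar",), ("baz-qux",)], ["my_str_col"])

split_col = F.split(df["my_str_col"], "-")
df = (df.withColumn("NAME1", split_col.getItem(0))
        .withColumn("NAME2", split_col.getItem(1)))
df.show()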

pyspark dataframe filter or include based on list

What it says is that "df.score in l" cannot be evaluated, because df.score gives you a column and "in" is not defined on that column type; use "isin" instead. The code should be like this:

# define a dataframe
rdd = sc.parallelize([(0,1), (0,1), (0,2), (1,2), (1,10), (1,20), (3,18), (3,18), (3,18)])
df = sqlContext.createDataFrame(rdd, ["id", "score"])
# … Read more
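
The excerpt ends before the actual filter; here is a minimal sketch of the isin step, assuming the list of ids to keep is called l (the values are hypothetical):

# Hypothetical list of ids to keep.
l = [0, 1]

# Keep rows whose "id" is in the list; ~ negates the condition for the exclude case.
df.where(df.id.isin(l)).show()
df.where(~df.id.isin(l)).show()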

Pyspark: Exception: Java gateway process exited before sending the driver its port number

One possible reason is that JAVA_HOME is not set because Java is not installed. I encountered the same issue. It says:

Exception in thread "main" java.lang.UnsupportedClassVersionError: org/apache/spark/launcher/Main : Unsupported major.minor version 51.0
    at java.lang.ClassLoader.defineClass1(Native Method)
    at java.lang.ClassLoader.defineClass(ClassLoader.java:643)
    at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
    at java.net.URLClassLoader.defineClass(URLClassLoader.java:277)
    at java.net.URLClassLoader.access$000(URLClassLoader.java:73)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:212)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:205)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:323)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:296)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:268)
… Read more
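
The error above points to a missing or too-old JDK. A minimal sketch of pointing PySpark at a specific Java install before the gateway starts; the JDK path is a hypothetical example and depends on your system:

import os

# Hypothetical JDK location; recent Spark releases need Java 8+.
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["PATH"] = os.environ["JAVA_HOME"] + "/bin:" + os.environ["PATH"]

from pyspark.sql import SparkSession

# With a valid JAVA_HOME, the Java gateway can start and report its port.
spark = SparkSession.builder.master("local[*]").appName("gateway-check").getOrCreate()
print(spark.version)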

Concatenate two PySpark dataframes

Maybe you can try creating the missing columns and calling union (unionAll for Spark 1.6 or lower):

from pyspark.sql.functions import lit

cols = ['id', 'uniform', 'normal', 'normal_2']
df_1_new = df_1.withColumn("normal_2", lit(None)).select(cols)
df_2_new = df_2.withColumn("normal", lit(None)).select(cols)
result = df_1_new.union(df_2_new)

# To remove the duplicates:
result = result.dropDuplicates()
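
For context, here is a minimal sketch that builds two frames with mismatched columns and applies the same pattern; the column names follow the snippet above, while the sample data is hypothetical:

from pyspark.sql.functions import lit, rand, randn

# Hypothetical inputs: df_1 has (id, uniform, normal), df_2 has (id, uniform, normal_2).
df_1 = spark.range(3).select("id", rand(seed=10).alias("uniform"), randn(seed=27).alias("normal"))
df_2 = spark.range(3).select("id", rand(seed=10).alias("uniform"), randn(seed=27).alias("normal_2"))

cols = ["id", "uniform", "normal", "normal_2"]
result = (
    df_1.withColumn("normal_2", lit(None)).select(cols)
    .union(df_2.withColumn("normal", lit(None)).select(cols))
)
result.show()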

Convert pyspark string to date format

Update (1/10/2018): For Spark 2.2+ the best way to do this is probably using the to_date or to_timestamp functions, which both support the format argument. From the docs:

>>> from pyspark.sql.functions import to_timestamp
>>> df = spark.createDataFrame([('1997-02-28 10:30:00',)], ['t'])
>>> df.select(to_timestamp(df.t, 'yyyy-MM-dd HH:mm:ss').alias('dt')).collect()
[Row(dt=datetime.datetime(1997, 2, 28, 10, 30))]

Original Answer (for Spark < 2.2): It … Read more
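
The excerpt stops before the pre-2.2 details; for the Spark 2.2+ path, here is a minimal sketch of converting a string column to a proper DateType column, assuming a hypothetical input column date_str and format:

from pyspark.sql.functions import to_date

# Hypothetical input: dates stored as strings.
df = spark.createDataFrame([("2017-05-01",), ("2017-06-15",)], ["date_str"])

# Parse the string column into a DateType column using an explicit format.
df = df.withColumn("date", to_date("date_str", "yyyy-MM-dd"))
df.printSchema()
df.show()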

Error!: SQLSTATE[HY000] [1045] Access denied for user 'divattrend_liink'@'localhost' (using password: YES)