Best way to get the max value in a Spark dataframe column

>df1.show()
+-----+--------------------+--------+----------+-----------+
|floor|           timestamp|     uid|         x|          y|
+-----+--------------------+--------+----------+-----------+
|    1|2014-07-19T16:00:...|600dfbe2| 103.79211|71.50419418|
|    1|2014-07-19T16:00:...|5e7b40e1| 110.33613|100.6828393|
|    1|2014-07-19T16:00:...|285d22e4|110.066315|86.48873585|
|    1|2014-07-19T16:00:...|74d917a1| 103.78499|71.45633073|

>row1 = df1.agg({"x": "max"}).collect()[0]
>print row1
Row(max(x)=110.33613)
>print row1["max(x)"]
110.33613

The answer is almost the same as method 3, but it seems the asDict() call in method 3 can be removed.
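For comparison, here is a minimal, self-contained sketch of the same idea using pyspark.sql.functions.max; the tiny DataFrame below is a stand-in for df1, not the original data:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical stand-in for df1, just to make the snippet runnable
df1 = spark.createDataFrame(
    [(103.79211,), (110.33613,), (110.066315,), (103.78499,)], ["x"])

# agg returns a one-row DataFrame; first()[0] extracts the scalar
max_x = df1.agg(F.max("x")).first()[0]
print(max_x)  # 110.33613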

Load CSV file with PySpark

Spark 2.0.0+

You can use the built-in csv data source directly:

spark.read.csv(
    "some_input_file.csv",
    header=True,
    mode="DROPMALFORMED",
    schema=schema
)

or

(
    spark.read
    .schema(schema)
    .option("header", "true")
    .option("mode", "DROPMALFORMED")
    .csv("some_input_file.csv")
)

without including any external dependencies.

Spark < 2.0.0: Instead of manual parsing, which is far from trivial in the general case, I would recommend spark-csv: Make sure that Spark … Read more
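A runnable sketch of the Spark 2.x path, assuming a hypothetical file some_input_file.csv with uid, x and y columns (the schema below is an illustration, not taken from the question):

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.getOrCreate()

# Hypothetical schema matching an example file layout
schema = StructType([
    StructField("uid", StringType(), True),
    StructField("x", DoubleType(), True),
    StructField("y", DoubleType(), True),
])

df = (
    spark.read
    .schema(schema)
    .option("header", "true")
    .option("mode", "DROPMALFORMED")  # drop rows that do not match the schema
    .csv("some_input_file.csv")
)
df.printSchema()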

Sort in descending order in PySpark

In PySpark 1.3 the sort method doesn't take an ascending parameter. You can use the desc method instead:

from pyspark.sql.functions import col

(group_by_dataframe
    .count()
    .filter("`count` >= 10")
    .sort(col("count").desc()))

or the desc function:

from pyspark.sql.functions import desc

(group_by_dataframe
    .count()
    .filter("`count` >= 10")
    .sort(desc("count")))

Both methods can be used with Spark >= 1.3 (including Spark 2.x).
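A self-contained sketch of the same pattern on a throwaway DataFrame; the data and the key column are assumptions made for illustration:

from pyspark.sql import SparkSession
from pyspark.sql.functions import desc

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("a",), ("a",), ("a",), ("b",)], ["key"])

# Count rows per key, keep the frequent keys, then sort by count descending
(df.groupBy("key")
   .count()
   .filter("`count` >= 1")
   .sort(desc("count"))
   .show())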

importing pyspark in python shell

Assuming one of the following:

- Spark is downloaded on your system and you have an environment variable SPARK_HOME pointing to it
- You have run pip install pyspark

Here is a simple method (if you don't care about how it works!!!): use findspark.

Go to your Python shell:

pip install findspark

import findspark
findspark.init()

import the … Read more
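A minimal sketch of the findspark route; the explicit Spark home path in the comment is only an assumption for the case where SPARK_HOME is not set:

# Run once in your OS shell, not inside Python:
#   pip install findspark

import findspark
findspark.init()  # or findspark.init("/path/to/spark") if SPARK_HOME is not set

import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()
print(spark.version)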

Spark Dataframe distinguish columns with duplicated name

Let's start with some data:

from pyspark.mllib.linalg import SparseVector
from pyspark.sql import Row

df1 = sqlContext.createDataFrame([
    Row(a=107831, f=SparseVector(
        5, {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0})),
    Row(a=125231, f=SparseVector(
        5, {0: 0.0, 1: 0.0, 2: 0.0047, 3: 0.0, 4: 0.0043})),
])

df2 = sqlContext.createDataFrame([
    Row(a=107831, f=SparseVector(
        5, {0: 0.0, 1: 0.0, 2: 0.0, … Read more
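The excerpt cuts off before the actual fix. One common way to keep duplicated column names addressable after a join is to alias each side; a minimal sketch under that assumption, with deliberately simplified example data:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Simplified stand-ins for df1 and df2: both have columns "a" and "f"
df1 = spark.createDataFrame([(107831, 0.0), (125231, 0.0047)], ["a", "f"])
df2 = spark.createDataFrame([(107831, 1.0), (125231, 2.0)], ["a", "f"])

# Alias each DataFrame so "f" from either side can still be selected unambiguously
joined = df1.alias("d1").join(df2.alias("d2"), F.col("d1.a") == F.col("d2.a"))
joined.select(
    F.col("d1.a").alias("a"),
    F.col("d1.f").alias("f1"),
    F.col("d2.f").alias("f2"),
).show()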

How to change a dataframe column from String type to Double type in PySpark?

There is no need for a UDF here. Column already provides a cast method with a DataType instance:

from pyspark.sql.types import DoubleType

changedTypedf = joindf.withColumn("label", joindf["show"].cast(DoubleType()))

or a short string:

changedTypedf = joindf.withColumn("label", joindf["show"].cast("double"))

where canonical string names (other variations can be supported as well) correspond to the simpleString value. So for atomic types:

from pyspark.sql import types … Read more
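A self-contained sketch of both forms; the DataFrame and its "show" column are invented here so the snippet can run on its own:

from pyspark.sql import SparkSession
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.getOrCreate()

# Hypothetical DataFrame with a string column named "show"
joindf = spark.createDataFrame([("1.5",), ("2.0",), ("bad",)], ["show"])

# DataType instance and canonical string name are equivalent; values that
# cannot be parsed as double become null rather than raising an error
a = joindf.withColumn("label", joindf["show"].cast(DoubleType()))
b = joindf.withColumn("label", joindf["show"].cast("double"))
a.show()
b.printSchema()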

How to check if spark dataframe is empty?

For Spark 2.1.0, my suggestion would be to use head(n: Int) or take(n: Int) with isEmpty, whichever one has the clearest intent to you.

df.head(1).isEmpty
df.take(1).isEmpty

with the Python equivalent:

len(df.head(1)) == 0  # or bool(df.head(1))
len(df.take(1)) == 0  # or bool(df.take(1))

Using df.first() and df.head() will both throw a java.util.NoSuchElementException if the DataFrame is empty. … Read more
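A runnable Python sketch of the head(1) check, using an explicitly constructed empty DataFrame for illustration:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, DoubleType

spark = SparkSession.builder.getOrCreate()

# An empty DataFrame with a single double column, built just for the demo
schema = StructType([StructField("x", DoubleType(), True)])
empty_df = spark.createDataFrame([], schema)

def is_empty(df):
    # head(1) fetches at most one row, so this avoids a full count()
    return len(df.head(1)) == 0

print(is_empty(empty_df))                                  # True
print(is_empty(spark.createDataFrame([(1.0,)], schema)))   # False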

Error!: SQLSTATE[HY000] [1045] Access denied for user 'divattrend_liink'@'localhost' (using password: YES)