Best way to get the max value in a Spark dataframe column

>df1.show()
+-----+--------------------+--------+----------+-----------+
|floor|           timestamp|     uid|         x|          y|
+-----+--------------------+--------+----------+-----------+
|    1|2014-07-19T16:00:...|600dfbe2| 103.79211|71.50419418|
|    1|2014-07-19T16:00:...|5e7b40e1| 110.33613|100.6828393|
|    1|2014-07-19T16:00:...|285d22e4|110.066315|86.48873585|
|    1|2014-07-19T16:00:...|74d917a1| 103.78499|71.45633073|

>row1 = df1.agg({"x": "max"}).collect()[0]
>print row1
Row(max(x)=110.33613)
>print row1["max(x)"]
110.33613

The answer is almost the same as method 3, but it seems the asDict() call in method 3 can be removed.
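For comparison, here is a minimal, self-contained sketch of the same idea using pyspark.sql.functions.max; the tiny DataFrame below is a stand-in for df1, not the original data:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical stand-in for df1, just to make the snippet runnable
df1 = spark.createDataFrame(
    [(103.79211,), (110.33613,), (110.066315,), (103.78499,)], ["x"])

# agg returns a one-row DataFrame; first()[0] extracts the scalar
max_x = df1.agg(F.max("x")).first()[0]
print(max_x)  # 110.33613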

Load CSV file with PySpark

Spark 2.0.0+

You can use the built-in csv data source directly:

spark.read.csv(
    "some_input_file.csv",
    header=True,
    mode="DROPMALFORMED",
    schema=schema
)

or

(
    spark.read
    .schema(schema)
    .option("header", "true")
    .option("mode", "DROPMALFORMED")
    .csv("some_input_file.csv")
)

without including any external dependencies.

Spark < 2.0.0: Instead of manual parsing, which is far from trivial in the general case, I would recommend spark-csv: Make sure that Spark … Read more
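A runnable sketch of the Spark 2.x path, assuming a hypothetical file some_input_file.csv with uid, x and y columns (the schema below is an illustration, not taken from the question):

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.getOrCreate()

# Hypothetical schema matching an example file layout
schema = StructType([
    StructField("uid", StringType(), True),
    StructField("x", DoubleType(), True),
    StructField("y", DoubleType(), True),
])

df = (
    spark.read
    .schema(schema)
    .option("header", "true")
    .option("mode", "DROPMALFORMED")  # drop rows that do not match the schema
    .csv("some_input_file.csv")
)
df.printSchema()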

Sort in descending order in PySpark

In PySpark 1.3 the sort method doesn't take an ascending parameter. You can use the desc method instead:

from pyspark.sql.functions import col

(group_by_dataframe
    .count()
    .filter("`count` >= 10")
    .sort(col("count").desc()))

or the desc function:

from pyspark.sql.functions import desc

(group_by_dataframe
    .count()
    .filter("`count` >= 10")
    .sort(desc("count")))

Both methods can be used with Spark >= 1.3 (including Spark 2.x).
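A self-contained sketch of the same pattern on a throwaway DataFrame; the data and the key column are assumptions made for illustration:

from pyspark.sql import SparkSession
from pyspark.sql.functions import desc

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("a",), ("a",), ("a",), ("b",)], ["key"])

# Count rows per key, keep the frequent keys, then sort by count descending
(df.groupBy("key")
   .count()
   .filter("`count` >= 1")
   .sort(desc("count"))
   .show())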

importing pyspark in python shell

Assuming one of the following:

- Spark is downloaded on your system and you have an environment variable SPARK_HOME pointing to it
- You have run pip install pyspark

Here is a simple method (if you don't care about how it works!!!): use findspark.

Go to your Python shell:

pip install findspark

import findspark
findspark.init()

import the … Read more
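A minimal sketch of the findspark route; the explicit Spark home path in the comment is only an assumption for the case where SPARK_HOME is not set:

# Run once in your OS shell, not inside Python:
#   pip install findspark

import findspark
findspark.init()  # or findspark.init("/path/to/spark") if SPARK_HOME is not set

import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()
print(spark.version)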

Spark Dataframe distinguish columns with duplicated name

Let's start with some data:

from pyspark.mllib.linalg import SparseVector
from pyspark.sql import Row

df1 = sqlContext.createDataFrame([
    Row(a=107831, f=SparseVector(
        5, {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0})),
    Row(a=125231, f=SparseVector(
        5, {0: 0.0, 1: 0.0, 2: 0.0047, 3: 0.0, 4: 0.0043})),
])

df2 = sqlContext.createDataFrame([
    Row(a=107831, f=SparseVector(
        5, {0: 0.0, 1: 0.0, 2: 0.0, … Read more
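The excerpt cuts off before the actual fix. One common way to keep duplicated column names addressable after a join is to alias each side; a minimal sketch under that assumption, with deliberately simplified example data:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Simplified stand-ins for df1 and df2: both have columns "a" and "f"
df1 = spark.createDataFrame([(107831, 0.0), (125231, 0.0047)], ["a", "f"])
df2 = spark.createDataFrame([(107831, 1.0), (125231, 2.0)], ["a", "f"])

# Alias each DataFrame so "f" from either side can still be selected unambiguously
joined = df1.alias("d1").join(df2.alias("d2"), F.col("d1.a") == F.col("d2.a"))
joined.select(
    F.col("d1.a").alias("a"),
    F.col("d1.f").alias("f1"),
    F.col("d2.f").alias("f2"),
).show()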

How to change a dataframe column from String type to Double type in PySpark?

There is no need for a UDF here. Column already provides a cast method with a DataType instance:

from pyspark.sql.types import DoubleType

changedTypedf = joindf.withColumn("label", joindf["show"].cast(DoubleType()))

or a short string:

changedTypedf = joindf.withColumn("label", joindf["show"].cast("double"))

where canonical string names (other variations can be supported as well) correspond to the simpleString value. So for atomic types:

from pyspark.sql import types … Read more
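A self-contained sketch of both forms; the DataFrame and its "show" column are invented here so the snippet can run on its own:

from pyspark.sql import SparkSession
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.getOrCreate()

# Hypothetical DataFrame with a string column named "show"
joindf = spark.createDataFrame([("1.5",), ("2.0",), ("bad",)], ["show"])

# DataType instance and canonical string name are equivalent; values that
# cannot be parsed as double become null rather than raising an error
a = joindf.withColumn("label", joindf["show"].cast(DoubleType()))
b = joindf.withColumn("label", joindf["show"].cast("double"))
a.show()
b.printSchema()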

How to check if spark dataframe is empty?

For Spark 2.1.0, my suggestion would be to use head(n: Int) or take(n: Int) with isEmpty, whichever one has the clearest intent to you.

df.head(1).isEmpty
df.take(1).isEmpty

with the Python equivalent:

len(df.head(1)) == 0  # or bool(df.head(1))
len(df.take(1)) == 0  # or bool(df.take(1))

Using df.first() and df.head() will both throw a java.util.NoSuchElementException if the DataFrame is empty. … Read more
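A runnable Python sketch of the head(1) check, using an explicitly constructed empty DataFrame for illustration:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, DoubleType

spark = SparkSession.builder.getOrCreate()

# An empty DataFrame with a single double column, built just for the demo
schema = StructType([StructField("x", DoubleType(), True)])
empty_df = spark.createDataFrame([], schema)

def is_empty(df):
    # head(1) fetches at most one row, so this avoids a full count()
    return len(df.head(1)) == 0

print(is_empty(empty_df))                                  # True
print(is_empty(spark.createDataFrame([(1.0,)], schema)))   # False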

Error!: SQLSTATE[HY000] [1045] Access denied for user 'divattrend_liink'@'localhost' (using password: YES)