apache-spark – Page 18

Mac spark-shell Error initializing SparkContext

January 19, 2023 by Tarik

Following steps might help: Get your hostname by using “hostname” command. Make an entry in the /etc/hosts file for your hostname if not present as follows: 127.0.0.1 your_hostname

Spark SQL: apply aggregate functions to a list of columns

January 17, 2023 by Tarik

There are multiple ways of applying aggregate functions to multiple columns. GroupedData class provides a number of methods for the most common functions, including count, max, min, mean and sum, which can be used directly as follows: Python: df = sqlContext.createDataFrame( [(1.0, 0.3, 1.0), (1.0, 0.5, 0.0), (-1.0, 0.6, 0.5), (-1.0, 5.6, 0.2)], (“col1”, “col2”, … Read more

Is it possible to get the current spark context settings in PySpark?

January 16, 2023 by Tarik

Spark 2.1+ spark.sparkContext.getConf().getAll() where spark is your sparksession (gives you a dict with all configured settings)

Removing duplicates from rows based on specific columns in an RDD/Spark DataFrame

January 15, 2023 by Tarik

Pyspark does include a dropDuplicates() method, which was introduced in 1.4. https://spark.apache.org/docs/3.1.2/api/python/reference/api/pyspark.sql.DataFrame.dropDuplicates.html >>> from pyspark.sql import Row >>> df = sc.parallelize([ \ … Row(name=”Alice”, age=5, height=80), \ … Row(name=”Alice”, age=5, height=80), \ … Row(name=”Alice”, age=10, height=80)]).toDF() >>> df.dropDuplicates().show() +—+——+—–+ |age|height| name| +—+——+—–+ | 5| 80|Alice| | 10| 80|Alice| +—+——+—–+ >>> df.dropDuplicates([‘name’, ‘height’]).show() +—+——+—–+ |age|height| name| … Read more

Spark difference between reduceByKey vs. groupByKey vs. aggregateByKey vs. combineByKey

January 14, 2023 by Tarik

groupByKey: Syntax: sparkContext.textFile(“hdfs://”) .flatMap(line => line.split(” “) ) .map(word => (word,1)) .groupByKey() .map((x,y) => (x,sum(y))) groupByKey can cause out of disk problems as data is sent over the network and collected on the reduced workers. reduceByKey: Syntax: sparkContext.textFile(“hdfs://”) .flatMap(line => line.split(” “)) .map(word => (word,1)) .reduceByKey((x,y)=> (x+y)) Data are combined at each partition, with only … Read more

How to find count of Null and Nan values for each column in a PySpark dataframe efficiently?

January 13, 2023 by Tarik

You can use method shown here and replace isNull with isnan: from pyspark.sql.functions import isnan, when, count, col df.select([count(when(isnan(c), c)).alias(c) for c in df.columns]).show() +——-+———-+—+ |session|timestamp1|id2| +——-+———-+—+ | 0| 0| 3| +——-+———-+—+ or df.select([count(when(isnan(c) | col(c).isNull(), c)).alias(c) for c in df.columns]).show() +——-+———-+—+ |session|timestamp1|id2| +——-+———-+—+ | 0| 0| 5| +——-+———-+—+

Overwrite specific partitions in spark dataframe write method

January 6, 2023 by Tarik

Finally! This is now a feature in Spark 2.3.0: SPARK-20236 To use it, you need to set the spark.sql.sources.partitionOverwriteMode setting to dynamic, the dataset needs to be partitioned, and the write mode overwrite. Example: spark.conf.set(“spark.sql.sources.partitionOverwriteMode”,”dynamic”) data.write.mode(“overwrite”).insertInto(“partitioned_table”) I recommend doing a repartition based on your partition column before writing, so you won’t end up with 400 … Read more

pyspark dataframe filter or include based on list

January 1, 2023 by Tarik

what it says is “df.score in l” can not be evaluated because df.score gives you a column and “in” is not defined on that column type use “isin” The code should be like this: # define a dataframe rdd = sc.parallelize([(0,1), (0,1), (0,2), (1,2), (1,10), (1,20), (3,18), (3,18), (3,18)]) df = sqlContext.createDataFrame(rdd, [“id”, “score”]) # … Read more

How to tune spark executor number, cores and executor memory?

January 1, 2023 by Tarik

The following answer covers the 3 main aspects mentioned in title – number of executors, executor memory and number of cores. There may be other parameters like driver memory and others which I did not address as of this answer, but would like to add in near future. Case 1 Hardware – 6 Nodes, and … Read more

What are the benefits of Apache Beam over Spark/Flink for batch processing?

December 30, 2022 by Tarik

There’s a few things that Beam adds over many of the existing engines. Unifying batch and streaming. Many systems can handle both batch and streaming, but they often do so via separate APIs. But in Beam, batch and streaming are just two points on a spectrum of latency, completeness, and cost. There’s no learning/rewriting cliff … Read more