apache-spark – Page 5

pyspark : NameError: name ‘spark’ is not defined

September 14, 2023 by Tarik

To define a SparkSession, add the following to the beginning of your code; spark.createDataFrame() should then work:

    from pyspark.context import SparkContext
    from pyspark.sql.session import SparkSession

    sc = SparkContext('local')
    spark = SparkSession(sc)
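On Spark 2.0 and later, a session is also commonly obtained through the builder API. The following is a minimal sketch of that alternative, assuming a local master; the function name get_or_create_session and its default arguments are hypothetical and not part of the original answer.

```python
def get_or_create_session(app_name="example-app", master="local[*]"):
    """Return a SparkSession, reusing an active one if it exists.

    app_name and master are illustrative defaults (assumptions).
    """
    # Imported lazily so this module loads even where pyspark is absent.
    from pyspark.sql import SparkSession

    return (
        SparkSession.builder
        .master(master)
        .appName(app_name)
        .getOrCreate()
    )
```

With pyspark installed, spark = get_or_create_session() gives an object on which spark.createDataFrame([(1, "a")], ["id", "value"]) can be called.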

Does Spark support true column scans over parquet files in S3?

September 13, 2023 by Tarik

How to check status of Spark applications from the command line?

September 13, 2023 by Tarik

If it’s for the Spark Standalone or Apache Mesos cluster managers, @sb0709’s answer is the way to go. For YARN, use the yarn application command:

    $ yarn application -help
    usage: application
     -appStates <States>   Works with -list to filter applications based on
                           input comma-separated list of application states.
                           The valid application state can be one of …
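As a rough illustration of combining -list with -appStates from a script, the sketch below builds and runs the command through Python's subprocess module. The function names are hypothetical, RUNNING and ACCEPTED are only example states, and the yarn executable is assumed to be on PATH.

```python
import subprocess


def yarn_list_command(states):
    """Build the argument list for `yarn application -list -appStates ...`."""
    # -appStates takes a single comma-separated value.
    return ["yarn", "application", "-list", "-appStates", ",".join(states)]


def list_yarn_applications(states):
    """Run the command and return its standard output as text.

    Assumes the `yarn` executable is available on PATH.
    """
    result = subprocess.run(
        yarn_list_command(states),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

For example, list_yarn_applications(["RUNNING", "ACCEPTED"]) would return the table that the yarn CLI prints for applications in those two states.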

Dealing with unbalanced datasets in Spark MLlib

September 10, 2023 by Tarik

Class weight with Spark ML

As of this writing, class weighting for the Random Forest algorithm is still under development (see here). If you’re willing to try other classifiers, this functionality has already been added to Logistic Regression. Consider a case where we have 80% positives (label == 1) in …
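The excerpt is cut off before it shows how the weights are chosen, so the following is only a sketch under an assumed scheme: each class is weighted by the share of the other class, which makes both classes contribute equally in total. The function names and the classWeight column name are hypothetical, and labels are assumed to be 0 or 1.

```python
def class_weights(labels):
    """Return {label: weight} so both classes carry equal total weight.

    Assumed scheme: weight each class by the share of the other class.
    """
    total = len(labels)
    positives = sum(1 for label in labels if label == 1)
    negative_share = (total - positives) / total
    return {1: negative_share, 0: 1.0 - negative_share}


def add_weight_column(df, weights, label_col="label", weight_col="classWeight"):
    """Attach a per-row weight column to a Spark DataFrame."""
    # Imported lazily so class_weights() can be used without pyspark.
    from pyspark.sql import functions as F

    return df.withColumn(
        weight_col,
        F.when(F.col(label_col) == 1, weights[1]).otherwise(weights[0]),
    )
```

With 80% positives this gives a weight of 0.2 for positives and 0.8 for negatives. The resulting column can then be named in the weightCol parameter of pyspark.ml.classification.LogisticRegression.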

Fill in null with previously known good value with pyspark

September 10, 2023 by Tarik

This uses last and ignores nulls. Let’s re-create something similar to the original data:

    import sys
    from pyspark.sql.window import Window
    import pyspark.sql.functions as func

    d = [{'session': 1, 'ts': 1},
         {'session': 1, 'ts': 2, 'id': 109},
         {'session': 1, 'ts': 3},
         {'session': 1, 'ts': 4, 'id': 110},
         {'session': 1, 'ts': 5},
         {'session': 1, 'ts': 6}]
    df …
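Since the excerpt stops before the window is defined, here is one way the "last, ignoring nulls" idea might look. It is a sketch rather than the author's exact code: the function names and the _filled column suffix are hypothetical, and the plain-Python helper is included only to make the intended semantics checkable without a cluster.

```python
def fill_forward(df, value_col="id", partition_col="session", order_col="ts"):
    """Add a column holding the last non-null value seen so far."""
    # Imported lazily so the reference helper below works without pyspark.
    from pyspark.sql import functions as func
    from pyspark.sql.window import Window

    # Frame: every row from the start of the partition up to the current row.
    window = (
        Window.partitionBy(partition_col)
        .orderBy(order_col)
        .rowsBetween(Window.unboundedPreceding, Window.currentRow)
    )
    filled = func.last(df[value_col], ignorenulls=True).over(window)
    return df.withColumn(value_col + "_filled", filled)


def fill_forward_reference(values):
    """Plain-Python equivalent for one partition that is already ordered."""
    result, last_seen = [], None
    for value in values:
        if value is not None:
            last_seen = value
        result.append(last_seen)
    return result
```

Rows that precede the first known value stay null, because there is nothing earlier in the frame to carry forward.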

Spark union column order

September 9, 2023 by Tarik

Spark’s union is implemented according to standard SQL and therefore resolves columns by position. The API documentation states this as well:

    Return a new DataFrame containing union of rows in this and another frame. This is equivalent to UNION ALL in SQL. To do a SQL-style set union (that does deduplication of …
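Because columns are matched by position, a frame whose columns are in a different order needs to be realigned before the union. A minimal sketch, assuming both frames have the same column names; union_aligned is a hypothetical helper name.

```python
def union_aligned(left, right):
    """Union two DataFrames after reordering right's columns to match left.

    Assumes both frames share the same column names, possibly in a
    different order.
    """
    # select() by name puts right's columns into left's positional order,
    # so the positional union() then pairs like with like.
    return left.union(right.select(*left.columns))
```

On Spark 2.3 and later, left.unionByName(right) resolves columns by name directly and can replace this helper.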

How to convert List to JavaRDD

September 9, 2023 by Tarik

You’re looking for JavaSparkContext.parallelize(List) and similar. This is just like in the Scala API.

How to drop rows with nulls in one column pyspark

September 4, 2023 by Tarik

Use either drop with subset:

    df.na.drop(subset=["col_X"])

or isNotNull():

    df.filter(df.col_X.isNotNull())
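Both variants can be wrapped in small helpers that take the column name as a parameter. This is an illustrative sketch; the function names are hypothetical, and df is assumed to be a pyspark DataFrame.

```python
def drop_null_rows(df, column):
    """Drop rows where `column` is null, via DataFrame.na.drop."""
    return df.na.drop(subset=[column])


def keep_non_null_rows(df, column):
    """Same result expressed as a filter on the column."""
    # df[column] also works for names that are not valid Python attributes.
    return df.filter(df[column].isNotNull())
```

The bracket form df[column] is used instead of attribute access so that the helper also handles column names containing spaces or dots.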

How do I add a persistent column of row ids to Spark DataFrame?

September 3, 2023 by Tarik

Spark 2.0

This issue has been resolved in Spark 2.0 with SPARK-14241. Another similar issue has been resolved in Spark 2.1 with SPARK-14393.

Spark 1.x

The problem you experience is rather subtle, but it can be reduced to a simple fact: monotonically_increasing_id is an extremely ugly function. It is clearly not pure and its value depends …

Change Executor Memory (and other configs) for Spark Shell

August 31, 2023 by Tarik

As of Spark 1.2.0 you can set memory and cores by passing the following arguments to spark-shell:

    spark-shell --driver-memory 10G --executor-memory 15G --executor-cores 8

To see the other options, run:

    spark-shell --help
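Each of these flags corresponds to a Spark configuration property, so the same settings can also be passed as generic --conf key=value arguments. The sketch below shows that mapping when launching spark-shell from a script; the helper names are hypothetical, and the values are the ones from the command above.

```python
# Command-line flags from the post and the config properties they set.
FLAG_TO_PROPERTY = {
    "--driver-memory": "spark.driver.memory",
    "--executor-memory": "spark.executor.memory",
    "--executor-cores": "spark.executor.cores",
}


def spark_shell_command(settings):
    """Build a spark-shell argument list from {flag: value} settings."""
    command = ["spark-shell"]
    for flag, value in settings.items():
        if flag not in FLAG_TO_PROPERTY:
            raise ValueError(f"unsupported flag in this sketch: {flag}")
        command += [flag, str(value)]
    return command


def as_conf_arguments(settings):
    """Express the same settings as generic --conf key=value arguments."""
    args = []
    for flag, value in settings.items():
        args += ["--conf", f"{FLAG_TO_PROPERTY[flag]}={value}"]
    return args
```

The returned list can be handed to subprocess.run, which avoids shell quoting issues with the memory values.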