How to check status of Spark applications from the command line?

If it’s for the Spark Standalone or Apache Mesos cluster managers, @sb0709’s answer is the way to go. For YARN, use the yarn application command: $ yarn application -help usage: application -appStates <States> Works with -list to filter applications based on input comma-separated list of application states. The valid application state can be one of … Read more

Dealing with unbalanced datasets in Spark MLlib

Class weight with Spark ML: As of this writing, class weighting for the Random Forest algorithm is still under development (see here). But if you’re willing to try other classifiers, this functionality has already been added to Logistic Regression. Consider a case where we have 80% positives (label == 1) in … Read more
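
To make the weighting idea concrete, here is a minimal sketch in plain Python (not Spark) of one common way to derive per-row weights for an imbalanced binary label. The function name balancing_weights and the sample data are hypothetical, and this scheme is an assumption for illustration rather than the exact one from the truncated post: each row is weighted by the fraction of the opposite class, so both classes end up with the same total weight.

```python
import math


def balancing_weights(labels):
    """Return one weight per label so both classes carry equal total weight.

    Hypothetical helper: positives (label == 1) get the fraction of
    negatives, and negatives get the fraction of positives.
    """
    total = len(labels)
    positives = sum(1 for y in labels if y == 1)
    negative_fraction = (total - positives) / total
    return [
        negative_fraction if y == 1 else 1.0 - negative_fraction
        for y in labels
    ]


# Illustrative data only: 80% positives, 20% negatives.
labels = [1] * 8 + [0] * 2
weights = balancing_weights(labels)

# Sum the weights per class to confirm they are balanced.
positive_total = sum(w for w, y in zip(weights, labels) if y == 1)
negative_total = sum(w for w, y in zip(weights, labels) if y == 0)
print(positive_total, negative_total)
```

In Spark ML, weights like these would typically be stored in a DataFrame column and passed to LogisticRegression through its weightCol parameter.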

Fill in null with previously known good value with pyspark

This uses last and ignores nulls. Let’s re-create something similar to the original data: import sys from pyspark.sql.window import Window import pyspark.sql.functions as func d = [{'session': 1, 'ts': 1}, {'session': 1, 'ts': 2, 'id': 109}, {'session': 1, 'ts': 3}, {'session': 1, 'ts': 4, 'id': 110}, {'session': 1, 'ts': 5}, {'session': 1, 'ts': 6}] df … Read more
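
The semantics of "last value, ignoring nulls" can be sketched in plain Python (not Spark) as follows. The helper forward_fill is hypothetical; it assumes rows are grouped by session and ordered by ts, and that each row takes the most recent non-null id seen so far within its group. It is meant to show what result to expect, not how the truncated post implements it.

```python
from itertools import groupby
from operator import itemgetter


def forward_fill(rows, partition_key, order_key, column):
    """Carry the last non-null value of `column` forward within each partition.

    Hypothetical helper mirroring a window partitioned by `partition_key`,
    ordered by `order_key`, spanning from the first row to the current row.
    """
    filled = []
    ordered = sorted(rows, key=itemgetter(partition_key, order_key))
    for _, group in groupby(ordered, key=itemgetter(partition_key)):
        last_seen = None  # nothing known yet at the start of a partition
        for row in group:
            value = row.get(column)
            if value is not None:
                last_seen = value
            filled.append({**row, column: last_seen})
    return filled


# Same sample rows as in the excerpt above.
d = [{'session': 1, 'ts': 1}, {'session': 1, 'ts': 2, 'id': 109},
     {'session': 1, 'ts': 3}, {'session': 1, 'ts': 4, 'id': 110},
     {'session': 1, 'ts': 5}, {'session': 1, 'ts': 6}]

result = forward_fill(d, 'session', 'ts', 'id')
for row in result:
    print(row)
```

The first row stays empty because no earlier value exists in its session; every later row picks up 109 or 110, whichever was seen most recently.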

How do I add a persistent column of row ids to Spark DataFrame?

Spark 2.0: This issue has been resolved in Spark 2.0 with SPARK-14241. Another similar issue has been resolved in Spark 2.1 with SPARK-14393. Spark 1.x: The problem you experience is rather subtle but can be reduced to a simple fact: monotonically_increasing_id is an extremely ugly function. It is clearly not pure and its value depends … Read more
