The value of “spark.yarn.executor.memoryOverhead” setting?

spark.yarn.executor.memoryOverhead is just the max value. The goal is to calculate the overhead as a percentage of the real executor memory, as used by RDDs and DataFrames. --executor-memory/spark.executor.memory controls the executor heap size, but JVMs can also use some memory off heap, for example for interned Strings and direct byte buffers. The value of the spark.yarn.executor.memoryOverhead property … Read more
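
As a rough illustration of the "percentage of executor memory, with a floor" idea, here is a minimal sketch in plain Python. The 0.10 factor and the 384 MiB minimum are assumptions taken from the default documented for several Spark releases (the factor differs between versions, so check the documentation for yours), and the function names are hypothetical helpers, not part of any Spark API.

```python
def yarn_memory_overhead_mb(executor_memory_mb, factor=0.10, minimum_mb=384):
    """Estimate off-heap overhead in MiB: a fraction of the heap, floored at a minimum.

    factor and minimum_mb are assumed defaults, not values read from a cluster.
    """
    return max(int(executor_memory_mb * factor), minimum_mb)


def yarn_container_request_mb(executor_memory_mb, overhead_mb=None):
    """Total memory requested from YARN per executor: heap plus overhead."""
    if overhead_mb is None:
        overhead_mb = yarn_memory_overhead_mb(executor_memory_mb)
    return executor_memory_mb + overhead_mb


if __name__ == "__main__":
    for heap_mb in (2048, 8192):
        print(heap_mb,
              yarn_memory_overhead_mb(heap_mb),
              yarn_container_request_mb(heap_mb))
```

With these assumed defaults, a small heap is dominated by the 384 MiB floor, while a larger heap gets the proportional share; passing overhead_mb explicitly mimics setting the property by hand.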

Dealing with unbalanced datasets in Spark MLlib

Class weight with Spark ML: As of this very moment, class weighting for the Random Forest algorithm is still under development (see here). But if you're willing to try other classifiers, this functionality has already been added to Logistic Regression. Consider a case where we have 80% positives (label == 1) in … Read more
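
One common way to derive such class weights for a binary problem can be sketched in plain Python. This is an assumption about the general technique, not necessarily the exact formula the full answer uses: each class is weighted by the share of the other class, so both classes end up with the same total weight. The helper name balancing_weights is hypothetical. In Spark ML, per-row weights like these would be stored in a column and passed to LogisticRegression through its weightCol parameter.

```python
from collections import Counter


def balancing_weights(labels):
    """Map each label to 1 - (its share of the data); intended for binary labels."""
    counts = Counter(labels)
    total = len(labels)
    return {label: 1.0 - count / total for label, count in counts.items()}


# Toy data matching the 80% positives scenario from the excerpt.
labels = [1] * 80 + [0] * 20
weights = balancing_weights(labels)

# Per-row weights, i.e. what a weight column would contain.
row_weights = [weights[y] for y in labels]

# Total weight carried by each class after reweighting.
positive_mass = sum(w for w, y in zip(row_weights, labels) if y == 1)
negative_mass = sum(w for w, y in zip(row_weights, labels) if y == 0)

if __name__ == "__main__":
    print(weights, positive_mass, negative_mass)
```

The majority class gets the small weight and the minority class the large one, so the two classes contribute equally to the loss.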

How to extract model hyper-parameters from spark.ml in PySpark?

Ran into this problem as well. I found out that you need to call the Java property; I don't know why. So just do this:

    from pyspark.ml.tuning import TrainValidationSplit, ParamGridBuilder, CrossValidator
    from pyspark.ml.regression import LinearRegression
    from pyspark.ml.evaluation import RegressionEvaluator

    evaluator = RegressionEvaluator(metricName="mae")
    lr = LinearRegression()
    grid = ParamGridBuilder().addGrid(lr.maxIter, [500]) \
        .addGrid(lr.regParam, [0]) \
        … Read more

Calling Java/Scala function from a task

Communication using the default Py4J gateway is simply not possible. To understand why, we have to take a look at the following diagram from the PySpark Internals document [1]: Since the Py4J gateway runs on the driver, it is not accessible to the Python interpreters, which communicate with the JVM workers through sockets (see for example PythonRDD / rdd.py). … Read more

Optimal way to create a ml pipeline in Apache Spark for dataset with high number of columns

The janino error is due to the number of constant variables created during the optimizer process. The maximum number of constant variables allowed in the JVM is ((2^16) - 1). If this limit is exceeded, you get the error "Constant pool for class … has grown past JVM limit of 0xFFFF". The JIRA that will fix this … Read more
