Apache Spark vs Apache Ignite [closed]

I would say that Spark is a good product for interactive analytics, while Ignite is better for real-time analytics and high performance transactional processing. Ignite achieves this by providing efficient and scalable in-memory key-value storage, as well as rich capabilities for indexing, querying the data and running computations. Another common use for Ignite is distributed … Read more

PySpark error: AttributeError: ‘NoneType’ object has no attribute ‘_jvm’

Mariusz's answer didn’t really help me. So if you, like me, found this because it’s the only result on Google and you’re new to PySpark (and Spark in general), here’s what worked for me. In my case I was getting that error because I was trying to execute PySpark code before the PySpark environment had … Read more

Querying on multiple Hive stores using Apache Spark

I think this is possible by making use of Spark SQL's capability of connecting to and reading data from remote databases using JDBC. After exhaustive R&D, I was able to connect to two different Hive environments using JDBC and load the Hive tables as DataFrames into Spark for further processing. Environment details … Read more

Pyspark: repartition vs partitionBy

repartition() is used for specifying the number of partitions, taking into account the number of cores and the amount of data you have. partitionBy() is used for making shuffling functions more efficient, such as reduceByKey(), join(), cogroup(), etc. It is only beneficial in cases where an RDD is used multiple times, so it is usually followed … Read more

The value of “spark.yarn.executor.memoryOverhead” setting?

spark.yarn.executor.memoryOverhead is just the max value. The goal is to calculate the overhead as a percentage of the real executor memory, as used by RDDs and DataFrames. --executor-memory/spark.executor.memory controls the executor heap size, but JVMs can also use some memory off heap, for example for interned Strings and direct byte buffers. The value of the spark.yarn.executor.memoryOverhead property … Read more
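The arithmetic behind the default can be sketched in plain Python. The 10% factor and the 384 MB floor are assumptions based on the defaults documented for recent Spark releases (older releases used a smaller factor), and the function names are hypothetical.

```python
# Assumed defaults; check the "Running on YARN" page for your Spark version.
MIN_OVERHEAD_MB = 384
OVERHEAD_FACTOR = 0.10


def default_memory_overhead_mb(executor_memory_mb, factor=OVERHEAD_FACTOR,
                               minimum_mb=MIN_OVERHEAD_MB):
    """Overhead is a fraction of the executor heap, but never below the floor."""
    return max(int(executor_memory_mb * factor), minimum_mb)


def yarn_container_request_mb(executor_memory_mb):
    """YARN is asked for the heap plus the off-heap overhead."""
    return executor_memory_mb + default_memory_overhead_mb(executor_memory_mb)


small = default_memory_overhead_mb(2048)   # 10% would be 204 MB, so the floor applies
large = default_memory_overhead_mb(8192)   # 10% of 8192 MB, rounded down
container = yarn_container_request_mb(8192)
```

With these assumed defaults, a 2 GB executor gets the 384 MB floor, while an 8 GB executor gets roughly 819 MB on top of its heap.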

Difference between createOrReplaceTempView and registerTempTable

registerTempTable is part of the 1.x API and has been deprecated in Spark 2.0. createOrReplaceTempView and createTempView were introduced in Spark 2.0 as replacements for registerTempTable. Other than that, registerTempTable and createOrReplaceTempView are functionally equivalent, and the former calls the latter.

pyspark join multiple conditions

Quoting from the Spark docs: (https://spark.apache.org/docs/1.5.2/api/python/pyspark.sql.html?highlight=dataframe%20join#pyspark.sql.DataFrame.join) join(other, on=None, how=None) Joins with another DataFrame, using the given join expression. The following performs a full outer join between df1 and df2. Parameters: other – Right side of the join on – a string for join column name, a list of column names, a join expression (Column) or … Read more

How to create an empty DataFrame? Why “ValueError: RDD is empty”?

Extending Joe Widen’s answer, you can actually create the schema with no fields like so:
    schema = StructType([])
So when you create the DataFrame using that as your schema, you’ll end up with a DataFrame[].
    >>> empty = sqlContext.createDataFrame(sc.emptyRDD(), schema)
    DataFrame[]
    >>> empty.schema
    StructType(List())
In Scala, if you choose to use sqlContext.emptyDataFrame and check out … Read more

How to split a list to multiple columns in Pyspark?

It depends on the type of your “list”. If it is of type ArrayType():
    df = hc.createDataFrame(sc.parallelize([['a', [1,2,3]], ['b', [2,3,4]]]), ["key", "value"])
    df.printSchema()
    df.show()
    root
     |-- key: string (nullable = true)
     |-- value: array (nullable = true)
     |    |-- element: long (containsNull = true)
you can access the values like you would with Python using … Read more
