apache-spark – Page 4

Apache Spark vs Apache Ignite [closed]

November 27, 2023 by Tarik

I would say that Spark is a good product for interactive analytics, while Ignite is better for real-time analytics and high-performance transactional processing. Ignite achieves this by providing efficient and scalable in-memory key-value storage, as well as rich capabilities for indexing, querying the data and running computations. Another common use for Ignite is distributed … Read more

PySpark error: AttributeError: 'NoneType' object has no attribute '_jvm'

September 24, 2023 by Tarik

Mariusz’s answer didn’t really help me. So if you, like me, found this because it’s the only result on Google and you’re new to PySpark (and Spark in general), here’s what worked for me. In my case I was getting that error because I was trying to execute PySpark code before the PySpark environment had … Read more

Querying on multiple Hive stores using Apache Spark

September 24, 2023 by Tarik

I think this is possible by using Spark SQL’s ability to connect to and read data from remote databases over JDBC. After exhaustive R&D, I was able to connect to two different Hive environments using JDBC and load the Hive tables as DataFrames into Spark for further processing. Environment details … Read more

Pyspark: repartition vs partitionBy

September 23, 2023 by Tarik

repartition() sets the number of partitions, which you choose based on the number of cores and the amount of data you have. partitionBy() makes shuffling functions such as reduceByKey(), join() and cogroup() more efficient. It is only beneficial when an RDD is used multiple times, so it is usually followed … Read more
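
To illustrate why partitioning by key helps key-based operations, here is a minimal sketch in plain Python, not the PySpark API. It assumes a CRC32 of the key stands in for Spark's own hash function, and the helper names (partition_index, partition_by_key, reduce_by_key_locally) are hypothetical. Once every record for a key sits in the same partition, a per-key reduction can run inside each partition without exchanging data between them.

```python
import zlib
from collections import defaultdict


def partition_index(key, num_partitions):
    # Assumption: CRC32 of the key's repr stands in for Spark's real hash function.
    return zlib.crc32(repr(key).encode("utf-8")) % num_partitions


def partition_by_key(pairs, num_partitions):
    """Group (key, value) pairs so that equal keys share one partition."""
    partitions = defaultdict(list)
    for key, value in pairs:
        partitions[partition_index(key, num_partitions)].append((key, value))
    return dict(partitions)


def reduce_by_key_locally(partitions, combine):
    """Reduce each partition on its own; no key spans two partitions."""
    result = {}
    for records in partitions.values():
        for key, value in records:
            result[key] = combine(result[key], value) if key in result else value
    return result


if __name__ == "__main__":
    pairs = [("a", 1), ("b", 2), ("a", 3), ("c", 4), ("b", 5)]
    parts = partition_by_key(pairs, num_partitions=3)
    print(reduce_by_key_locally(parts, lambda x, y: x + y))
```

In this sketch the partitioning cost is paid once, which matches the point above that the benefit shows up when the same partitioned data is reused several times.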

Why does Spark job fail with “too many open files”?

September 22, 2023 by Tarik

This has been answered on the spark user list: The best way is definitely just to increase the ulimit if possible, this is sort of an assumption we make in Spark that clusters will be able to move it around. You might be able to hack around this by decreasing the number of reducers [or … Read more
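
Before raising the ulimit, it can help to see what limit a process actually runs with. The following is a small sketch using Python's standard resource module (Unix only); the helper names are hypothetical, and it only reads the limit of the Python process itself, which may differ from the limit applied to the Spark executors on your cluster.

```python
import resource


def open_file_limits():
    """Return the (soft, hard) limit on open file descriptors for this process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def describe(limit):
    # RLIM_INFINITY is the sentinel the OS uses for "no limit".
    return "unlimited" if limit == resource.RLIM_INFINITY else str(limit)


if __name__ == "__main__":
    soft, hard = open_file_limits()
    print("soft limit:", describe(soft))
    print("hard limit:", describe(hard))
```

The soft limit is the one a process hits first; it can be raised up to the hard limit without extra privileges.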

The value of “spark.yarn.executor.memoryOverhead” setting?

September 22, 2023 by Tarik

spark.yarn.executor.memoryOverhead is just the max value. The goal is to calculate OVERHEAD as a percentage of real executor memory, as used by RDDs and DataFrames. --executor-memory/spark.executor.memory controls the executor heap size, but JVMs can also use some memory off heap, for example for interned Strings and direct byte buffers. The value of the spark.yarn.executor.memoryOverhead property … Read more
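
As a rough illustration of the "percentage with a floor" idea described above, the sketch below does the arithmetic in plain Python. The 10% factor and the 384 MiB floor are assumptions based on commonly documented Spark-on-YARN defaults (some releases used a different factor), and the function names are hypothetical; check the configuration reference for your Spark version before relying on these numbers.

```python
def estimate_memory_overhead_mib(executor_memory_mib, factor=0.10, floor_mib=384):
    """Estimate off-heap overhead in MiB.

    Assumption: overhead = max(floor, factor * executor heap size).
    """
    return max(floor_mib, int(executor_memory_mib * factor))


def container_request_mib(executor_memory_mib, factor=0.10, floor_mib=384):
    """Heap plus estimated overhead: roughly what one executor container asks for."""
    overhead = estimate_memory_overhead_mib(executor_memory_mib, factor, floor_mib)
    return executor_memory_mib + overhead


if __name__ == "__main__":
    for heap in (1024, 4096, 8192):
        print(heap, estimate_memory_overhead_mib(heap), container_request_mib(heap))
```

With these assumed defaults, small executors are dominated by the floor, while larger ones scale with the heap size.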

Difference between createOrReplaceTempView and registerTempTable

September 20, 2023 by Tarik

registerTempTable is part of the 1.x API and was deprecated in Spark 2.0. createOrReplaceTempView and createTempView were introduced in Spark 2.0 as a replacement for registerTempTable. Other than that, registerTempTable and createOrReplaceTempView are functionally equivalent, and the former calls the latter.

pyspark join multiple conditions

September 18, 2023 by Tarik

Quoting from the Spark docs (https://spark.apache.org/docs/1.5.2/api/python/pyspark.sql.html?highlight=dataframe%20join#pyspark.sql.DataFrame.join): join(other, on=None, how=None) Joins with another DataFrame, using the given join expression. The following performs a full outer join between df1 and df2. Parameters: other – Right side of the join on – a string for join column name, a list of column names, a join expression (Column) or … Read more

How to create an empty DataFrame? Why “ValueError: RDD is empty”?

September 17, 2023 by Tarik

Extending Joe Widen’s answer, you can actually create the schema with no fields like so: schema = StructType([]) so when you create the DataFrame using that as your schema, you’ll end up with a DataFrame[]. >>> empty = sqlContext.createDataFrame(sc.emptyRDD(), schema) DataFrame[] >>> empty.schema StructType(List()) In Scala, if you choose to use sqlContext.emptyDataFrame and check out … Read more

How to split a list to multiple columns in Pyspark?

September 16, 2023 by Tarik

It depends on the type of your “list”: If it is of type ArrayType(): df = hc.createDataFrame(sc.parallelize([['a', [1,2,3]], ['b', [2,3,4]]]), ["key", "value"]) df.printSchema() df.show() root |-- key: string (nullable = true) |-- value: array (nullable = true) | |-- element: long (containsNull = true) you can access the values like you would with python using … Read more