apache-spark – Page 2

How to find the master URL for an existing spark cluster

December 28, 2023 by Tarik

I found that passing --master yarn-cluster works best. This makes sure that Spark uses all the nodes of the Hadoop cluster.
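As a rough sketch of how that flag fits into a launch command, the helper below assembles a spark-submit argument list without running anything. The function name `build_submit_command` and the application file `my_app.py` are hypothetical. Note that Spark 2.0 and later deprecate the `yarn-cluster` master value in favor of `--master yarn --deploy-mode cluster`, so the sketch emits that form unless the legacy spelling is requested.

```python
from typing import List


def build_submit_command(app_path: str, legacy: bool = False) -> List[str]:
    """Assemble a spark-submit argument list for YARN cluster mode.

    `app_path` is a hypothetical application file; nothing is executed here.
    """
    if legacy:
        # Spark 1.x spelling, as used in the answer above.
        master_args = ["--master", "yarn-cluster"]
    else:
        # Spark 2.0+ spelling of the same deployment.
        master_args = ["--master", "yarn", "--deploy-mode", "cluster"]
    return ["spark-submit", *master_args, app_path]


if __name__ == "__main__":
    print(" ".join(build_submit_command("my_app.py")))
```

Returning a list rather than a single string keeps the arguments safe to hand to `subprocess.run` without shell quoting.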

Debugging “Managed memory leak detected” in Spark 1.6.0

December 27, 2023 by Tarik

The short answer is that users are not supposed to see this message. Users are not supposed to be able to create memory leaks in the unified memory manager. That such leaks happen is a Spark bug: SPARK-11293. But if you want to understand the cause of a memory leak, this is how I did … Read more

Syntax while setting schema for Pyspark.sql using StructType

December 27, 2023 by Tarik

It indicates whether the column allows null values: true for nullable, and false for not nullable. StructField(name, dataType, nullable): Represents a field in a StructType. The name of a field is indicated by name. The data type of a field is indicated by dataType. nullable is used to indicate if values of this field can … Read more

How to pass whole Row to UDF – Spark DataFrame filter

December 26, 2023 by Tarik

You have to use the struct() function to construct the row when calling the function. Follow these steps.
Import Row: import org.apache.spark.sql._
Define the UDF: def myFilterFunction(r: Row) = {r.get(0) == r.get(1)}
Register the UDF: sqlContext.udf.register("myFilterFunction", myFilterFunction _)
Create the DataFrame: val records = sqlContext.createDataFrame(Seq(("sachin", "sachin"), ("aggarwal", "aggarwal1"))).toDF("text", "text2")
Use the UDF: records.filter(callUdf("myFilterFunction", struct($"text", $"text2"))).show
When you want … Read more

How to change job/stage description in web UI?

December 25, 2023 by Tarik

That’s where one of the lesser-known features of Spark Core, called local properties, applies so well. Spark SQL uses it to group different Spark jobs under a single structured query so you can use the SQL tab and navigate easily. You can control local properties using SparkContext.setLocalProperty: Set a local property that affects jobs submitted … Read more

Why does Spark report “java.net.URISyntaxException: Relative path in absolute URI” when working with DataFrames?

December 25, 2023 by Tarik

It’s the SPARK-15565 issue in Spark 2.0 on Windows, and it has a simple solution (the fix appears to be part of Spark’s codebase and may soon be released as 2.0.2 or 2.1.0). The solution in Spark 2.0.0 is to set spark.sql.warehouse.dir to some properly-referenced directory, say file:///c:/Spark/spark-2.0.0-bin-hadoop2.7/spark-warehouse, which uses /// (triple slashes). Start spark-shell with --conf argument … Read more
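To illustrate the triple-slash form, here is a small sketch that turns a Windows path into a `file:///` URI and formats it as a `spark.sql.warehouse.dir` setting. The helper name `warehouse_conf` is hypothetical, and the path is the example directory from the answer above; the sketch assumes an absolute path that starts with a drive letter.

```python
from urllib.parse import quote


def warehouse_conf(windows_path: str) -> str:
    """Return a spark.sql.warehouse.dir setting with a file:/// URI.

    Assumes `windows_path` is absolute and starts with a drive letter.
    """
    # Normalize backslashes so the URI uses forward slashes throughout.
    posix_style = windows_path.replace("\\", "/")
    # Percent-encode anything unsafe, but keep "/" and the drive colon.
    uri = "file:///" + quote(posix_style, safe="/:")
    return f"spark.sql.warehouse.dir={uri}"


if __name__ == "__main__":
    # The printed value is what would follow --conf on the command line.
    print(warehouse_conf("c:/Spark/spark-2.0.0-bin-hadoop2.7/spark-warehouse"))
```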

How to do left outer join in spark sql?

December 24, 2023 by Tarik

I don’t see any issues in your code. Both “left join” and “left outer join” will work fine. Please check the data again; the data you are showing is only for matches. You can also perform a Spark SQL join by using: # Left outer join, explicit df1.join(df2, df1["col1"] == df2["col1"], "left_outer")
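To make the point about matches concrete, here is a plain-Python sketch of left outer join semantics. It is not Spark code; the function `left_outer_join` and the sample rows are hypothetical, and it assumes the two sides share no column names other than the key. Every left row appears in the result, and left rows without a match get None in the right-hand columns.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


def left_outer_join(left: List[Row], right: List[Row], key: str) -> List[Row]:
    """Join two lists of dict rows on `key`, keeping every left row."""
    # Columns contributed by the right side, excluding the join key.
    right_columns = sorted({c for row in right for c in row if c != key})
    joined: List[Row] = []
    for l_row in left:
        # As in SQL, a null (None) key never matches anything.
        matches = [
            r for r in right
            if l_row[key] is not None and r[key] == l_row[key]
        ]
        if not matches:
            # Unmatched left row: pad the right-hand columns with None.
            joined.append({**l_row, **{c: None for c in right_columns}})
        for r_row in matches:
            joined.append({**l_row, **{c: r_row.get(c) for c in right_columns}})
    return joined


if __name__ == "__main__":
    df1 = [{"col1": 1, "a": "x"}, {"col1": 2, "a": "y"}]
    df2 = [{"col1": 1, "b": "p"}]
    for row in left_outer_join(df1, df2, "col1"):
        print(row)
```

If the output of a left join contains only matched rows, every key on the left has a counterpart on the right, which is why checking the data is the first step.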

How to handle null values when writing to parquet from Spark

December 24, 2023 by Tarik

You misinterpreted SPARK-10943. Spark does support writing null values to numeric columns. The problem is that null alone carries no type information at all:
scala> spark.sql("SELECT null as comments").printSchema
root
 |-- comments: null (nullable = true)
As per the comment by Michael Armbrust, all you have to do is cast: scala> spark.sql("""SELECT CAST(null as DOUBLE) AS … Read more

Best Practice to launch Spark Applications via Web Application?

December 24, 2023 by Tarik

Very basic answer: you can use the SparkLauncher class to launch Spark applications and add some listeners to watch progress. However, you may be interested in Livy server, which is a RESTful server for Spark jobs. As far as I know, Zeppelin is using Livy to submit jobs and retrieve status. You can also use … Read more

Google Dataflow vs Apache Spark

December 23, 2023 by Tarik