apache-spark – Page 16

Get datatype of column using PySpark

March 11, 2023 by Tarik

Your question is broad, so my answer will be broad too. To get the data types of your DataFrame columns, you can use dtypes:

    >>> df.dtypes
    [('age', 'int'), ('name', 'string')]

This means your column age is of type int and name is of type string.
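Since dtypes is a plain list of (column name, type string) pairs, the type of a single column can be looked up by turning that list into a dict. A minimal sketch, where column_type is a hypothetical helper name and the sample list stands in for df.dtypes on a real DataFrame:

```python
def column_type(dtypes, name):
    """Return the type string for one column from a df.dtypes-style list."""
    types = dict(dtypes)  # [('age', 'int'), ...] -> {'age': 'int', ...}
    if name not in types:
        raise KeyError(f"no such column: {name!r}")
    return types[name]


# Stand-in for df.dtypes; with a real DataFrame pass df.dtypes instead.
sample_dtypes = [("age", "int"), ("name", "string")]
age_type = column_type(sample_dtypes, "age")
```

With a live DataFrame the call would be column_type(df.dtypes, "age").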

PySpark: Pass multiple columns in UDF

March 9, 2023 by Tarik

If all the columns you want to pass to the UDF have the same data type, you can use array as the input parameter, for example:

    >>> from pyspark.sql.types import IntegerType
    >>> from pyspark.sql.functions import udf, array
    >>> sum_cols = udf(lambda arr: sum(arr), IntegerType())
    >>> spark.createDataFrame([(101, 1, 16)], ['ID', 'A', 'B']) \
    ...     .withColumn('Result', sum_cols(array('A', 'B'))).show()
    +---+---+---+------+
    | …
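One caveat with the lambda above: it fails if any of the columns holds a null, because sum() cannot add None. A minimal sketch of a null-tolerant variant follows. The names sum_non_null and with_row_sum are hypothetical, skipping nulls is an assumption about the desired behaviour, and the PySpark imports sit inside the wrapper only so that the plain function can be exercised without a Spark session:

```python
def sum_non_null(values):
    """Sum the ints in a list, skipping None (Spark nulls arrive as None)."""
    return sum(v for v in values if v is not None)


def with_row_sum(df, cols, out_col="Result"):
    """Add out_col holding the per-row sum of the integer columns in cols."""
    # Imported lazily so sum_non_null stays usable without PySpark installed.
    from pyspark.sql.functions import array, udf
    from pyspark.sql.types import IntegerType

    sum_cols = udf(sum_non_null, IntegerType())
    # array() packs the named columns into one array column for the UDF.
    return df.withColumn(out_col, sum_cols(array(*cols)))
```

It would be called as with_row_sum(df, ['A', 'B']).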

Spark load data and add filename as dataframe column

March 9, 2023 by Tarik

You can use input_file_name, which creates a string column for the file name of the current Spark task:

    from pyspark.sql.functions import input_file_name
    df.withColumn("filename", input_file_name())

The same thing in Scala:

    import org.apache.spark.sql.functions.input_file_name
    df.withColumn("filename", input_file_name())

“Container killed by YARN for exceeding memory limits. 10.4 GB of 10.4 GB physical memory used” on an EMR cluster with 75GB of memory

March 7, 2023 by Tarik

Spark Driver in Apache Spark

March 6, 2023 by Tarik

The Spark driver is the program that declares the transformations and actions on RDDs of data and submits such requests to the master. In practical terms, the driver is the program that creates the SparkContext, connecting to a given Spark Master. In the case of a local cluster, as in your case, the master URL has the form spark://<host>:<port>. Its …

Apache Spark vs Akka [closed]

March 4, 2023 by Tarik

Early versions of Apache Spark were built on Akka, which Spark used for its internal communication; the dependency was removed in Spark 2.0. Akka is a general-purpose framework for creating reactive, distributed, parallel and resilient concurrent applications in Scala or Java. Akka uses the Actor model to hide all the thread-related code and gives you simple, helpful interfaces for implementing a scalable and fault-tolerant system. A good …

How DAG works under the covers in RDD?

March 2, 2023 by Tarik

I have also been searching the web to learn how Spark computes the DAG from the RDD and subsequently executes the tasks. At a high level, when any action is called on the RDD, Spark creates the DAG and submits it to the DAG scheduler. The DAG scheduler divides operators into stages of tasks. …

Errors when using OFF_HEAP Storage with Spark 1.4.0 and Tachyon 0.6.4

February 27, 2023 by Tarik

There seems to be a related bug report (https://issues.apache.org/jira/browse/SPARK-10314). Since there seems to be a pull request for it, a fix may arrive soon. From this thread, https://groups.google.com/forum/#!topic/tachyon-users/xb8zwqIjIa4, it looks like Spark uses TRY_CACHE mode to write to Tachyon, so the data seems to get lost when …

What is the concept of application, job, stage and task in Spark?

February 19, 2023 by Tarik

The main function is the application. When you invoke an action on an RDD, a “job” is created. Jobs are work submitted to Spark. Jobs are divided into “stages” based on the shuffle boundary. This can help you understand. Each stage is further divided into tasks based on the number of partitions in the RDD. …
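The job, stage and task hierarchy described here can be illustrated with a toy model in plain Python. This is only a sketch and not how Spark is implemented: the function names and the set of shuffle operations are illustrative assumptions, the job is modelled as a linear chain, and the partition count is assumed to stay fixed across stages, which real shuffles do not guarantee.

```python
# Toy model only: these names are illustrative and are not Spark APIs.
SHUFFLE_OPS = {"reduceByKey", "groupByKey", "join", "repartition"}


def split_into_stages(ops):
    """Group a linear chain of operations into stages.

    A shuffle operation starts a new stage, because its input must be
    redistributed across partitions before it can run.
    """
    stages = [[]]
    for op in ops:
        if op in SHUFFLE_OPS and stages[-1]:
            stages.append([])
        stages[-1].append(op)
    return stages


def count_tasks(stages, num_partitions):
    """One task per partition per stage (assuming a fixed partition count)."""
    return len(stages) * num_partitions


# One job: the chain of operations triggered by a single action.
job = ["textFile", "flatMap", "map", "reduceByKey", "map"]
stages = split_into_stages(job)
total_tasks = count_tasks(stages, num_partitions=4)
```

In this model the word-count-style chain splits into two stages at reduceByKey, and with 4 partitions each stage runs 4 tasks, for 8 in total.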

PySpark: How to fillna values in dataframe for specific columns?

February 16, 2023 by Tarik

    df.fillna(0, subset=['a', 'b'])

There is a parameter named subset to choose the columns, unless your Spark version is lower than 1.3.1.
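fillna also accepts a dict mapping column names to replacement values, which lets each column get its own default; in that form subset is ignored. A small sketch, where build_fill_map, the default values and the column names are hypothetical:

```python
def build_fill_map(numeric_cols, string_cols, number=0, text="unknown"):
    """Map each column name to the value that should replace its nulls."""
    fill_map = {name: number for name in numeric_cols}
    fill_map.update({name: text for name in string_cols})
    return fill_map


fill_map = build_fill_map(["a", "b"], ["name"])
# With a real DataFrame: df = df.fillna(fill_map)
```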