Persistent storage for Apache Mesos

Excellent question. Here are a few upcoming features in Mesos that improve support for stateful services, and the corresponding current workarounds. Persistent volumes (0.23): when launching a task, you can create a volume that lives outside the task's sandbox and persists on the node even after the task dies or completes. When the task exits, its … Read more
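Concretely, a persistent volume is requested as a CREATE offer operation whose disk resource carries a persistence ID and a container path. A minimal sketch of what such a request might look like, expressed as a Python dict mirroring the JSON resource format (the role, persistence ID, and size here are illustrative values, not defaults):

```python
# Sketch of a persistent volume inside a Mesos CREATE offer operation.
# The role name, persistence ID, and size are made-up illustrative values.
volume_resource = {
    "name": "disk",
    "type": "SCALAR",
    "scalar": {"value": 2048},              # MB of disk to reserve
    "role": "stateful-service",             # volumes require role-reserved resources
    "disk": {
        "persistence": {"id": "my-volume-id"},   # volume survives task exit
        "volume": {
            "container_path": "data",       # path the task sees in its sandbox
            "mode": "RW",
        },
    },
}

create_operation = {
    "type": "CREATE",
    "create": {"volumes": [volume_resource]},
}
```

A later task launched with the same persistence ID on that node gets the volume, and its data, mounted back at the same container path.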

Concatenating datasets of different RDDs in Apache Spark using Scala

I think you are looking for RDD.union:

val rddPart1 = ???
val rddPart2 = ???
val rddAll = rddPart1.union(rddPart2)

Example (in spark-shell):

val rdd1 = sc.parallelize(Seq((1, "Aug", 30), (1, "Sep", 31), (2, "Aug", 15), (2, "Sep", 10)))
val rdd2 = sc.parallelize(Seq((1, "Oct", 10), (1, "Nov", 12), (2, "Oct", 5), (2, "Nov", 15)))
rdd1.union(rdd2).collect
res0: Array[(Int, String, Int)] = Array((1,Aug,30), (1,Sep,31), (2,Aug,15), … Read more

Python Multiprocessing with Distributed Cluster

If you want a very easy solution, there isn't one. However, there is a library that provides the multiprocessing interface, pathos, which can establish connections to remote servers through a parallel map and do multiprocessing across them. If you want an ssh-tunneled connection, you can do that… or if … Read more
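pathos deliberately mirrors the standard library's pool interface, so code written against a stdlib pool ports over with little change. A sketch of that map-style interface using the stdlib's ThreadPool (for remote execution you would swap in a pathos pool such as pathos.multiprocessing.ProcessingPool; the work function here is just a placeholder):

```python
from multiprocessing.pool import ThreadPool

def work(x):
    # Placeholder task; with pathos this could run on a remote worker.
    return x * x

# Same pool.map shape that pathos pools expose.
with ThreadPool(4) as pool:
    results = pool.map(work, range(5))

print(results)  # [0, 1, 4, 9, 16]
```

The point is the interface: once your job is expressed as a function plus an iterable, switching the pool implementation (threads, processes, or pathos's remote pools) does not change the calling code.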

In distributed computing, what are world size and rank?

These concepts come from parallel computing, so it helps to learn a little about it, e.g., MPI. You can think of the world as the group containing all the processes in your distributed training job. Usually, each GPU corresponds to one process. Processes in the world can communicate with each other, which is why … Read more
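A concrete way to see rank and world size: each process uses its rank to pick its own slice of the work out of the shared total. A minimal sketch (the shard helper is ours for illustration, not part of any framework):

```python
def shard(data, rank, world_size):
    """Return the slice of `data` owned by the process with this rank."""
    return data[rank::world_size]  # simple strided split across processes

world_size = 4                     # total number of processes (e.g. 4 GPUs)
batches = list(range(8))           # the full workload

for rank in range(world_size):     # in real training, each process knows only its own rank
    print(f"rank {rank} -> {shard(batches, rank, world_size)}")
# rank 0 -> [0, 4], rank 1 -> [1, 5], rank 2 -> [2, 6], rank 3 -> [3, 7]
```

This is exactly what distributed samplers do: same code runs everywhere, and (rank, world_size) determines which portion of the data each process touches.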

Determining optimal number of Spark partitions based on workers, cores and DataFrame size

Yes, a Spark application has one and only one Driver. What is the relationship between numWorkerNodes and numExecutors? A worker can host multiple executors: you can think of the worker as the machine/node of your cluster and the executor as a process (executing in a core) that runs on that worker. So … Read more
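The arithmetic behind the answer can be sketched directly: executors per node times nodes gives total executors, and a common rule of thumb is 2-4 partitions per available core. The cluster numbers below are illustrative, not a recommendation:

```python
# Illustrative cluster shape (not a tuning recommendation).
num_worker_nodes = 4        # machines in the cluster
executors_per_node = 2      # executor processes per worker node
cores_per_executor = 4      # task slots per executor

num_executors = num_worker_nodes * executors_per_node   # 8 executors total
total_cores = num_executors * cores_per_executor        # 32 concurrent task slots

# Rule of thumb: 2-4 partitions per core, so tasks stay small and spread evenly.
partitions_low = 2 * total_cores
partitions_high = 4 * total_cores
print(partitions_low, partitions_high)  # 64 128
```

Fewer partitions than cores leaves slots idle; vastly more adds scheduling overhead, which is why the per-core multiplier is kept small.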

Easiest way to install Python dependencies on Spark executor nodes?

Having actually tried it, I think the link I posted as a comment doesn't do exactly what you want with dependencies. What you are quite reasonably asking for is a way to have Spark play nicely with setuptools and pip when installing dependencies. It blows my mind that this isn't supported better in Spark. … Read more
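One common workaround is to zip pure-Python dependencies and ship the archive to executors with SparkContext.addPyFile (or spark-submit --py-files). A minimal sketch of building such an archive; the mydep package here is a stand-in for your real dependency tree:

```python
import os
import zipfile

# Build a tiny stand-in package; in practice you would zip a pip-installed
# site-packages directory or a vendored dependency tree.
os.makedirs("mydep", exist_ok=True)
with open("mydep/__init__.py", "w") as f:
    f.write("VERSION = '0.1'\n")

# Archive it so executors can add it to their sys.path.
with zipfile.ZipFile("deps.zip", "w") as zf:
    for root, _dirs, files in os.walk("mydep"):
        for name in files:
            path = os.path.join(root, name)
            zf.write(path, path.replace(os.sep, "/"))

# On the Spark side (requires a running SparkContext `sc`):
#   sc.addPyFile("deps.zip")            # executors can then `import mydep`
# or at submit time:
#   spark-submit --py-files deps.zip my_job.py
```

Note the limitation that motivates the answer's frustration: this only covers pure-Python packages. Dependencies with compiled extensions need matching wheels or a shared environment on every node.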

Error!: SQLSTATE[HY000] [1045] Access denied for user 'divattrend_liink'@'localhost' (using password: YES)