Persistent storage for Apache Mesos

Excellent question. Here are a few upcoming features in Mesos that improve support for stateful services, and the corresponding current workarounds. Persistent volumes (0.23): when launching a task, you can create a volume that lives outside the task's sandbox and persists on the node even after the task dies or completes. When the task exits, its … Read more
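Concretely, a persistent volume is requested as a CREATE offer operation whose disk resource carries a persistence ID and a container path. A minimal sketch of what such a request might look like, expressed as a Python dict mirroring the JSON resource format (the role, persistence ID, and size here are illustrative values, not defaults):

```python
# Sketch of a persistent volume inside a Mesos CREATE offer operation.
# The role name, persistence ID, and size are made-up illustrative values.
volume_resource = {
    "name": "disk",
    "type": "SCALAR",
    "scalar": {"value": 2048},              # MB of disk to reserve
    "role": "stateful-service",             # volumes require role-reserved resources
    "disk": {
        "persistence": {"id": "my-volume-id"},   # volume survives task exit
        "volume": {
            "container_path": "data",       # path the task sees in its sandbox
            "mode": "RW",
        },
    },
}

create_operation = {
    "type": "CREATE",
    "create": {"volumes": [volume_resource]},
}
```

A later task launched with the same persistence ID on that node gets the volume, and its data, mounted back at the same container path.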

Concatenating datasets of different RDDs in Apache Spark using Scala

I think you are looking for RDD.union:

val rddPart1 = ???
val rddPart2 = ???
val rddAll = rddPart1.union(rddPart2)

Example (in spark-shell):

val rdd1 = sc.parallelize(Seq((1, "Aug", 30), (1, "Sep", 31), (2, "Aug", 15), (2, "Sep", 10)))
val rdd2 = sc.parallelize(Seq((1, "Oct", 10), (1, "Nov", 12), (2, "Oct", 5), (2, "Nov", 15)))
rdd1.union(rdd2).collect
res0: Array[(Int, String, Int)] = Array((1,Aug,30), (1,Sep,31), (2,Aug,15), … Read more

Python Multiprocessing with Distributed Cluster

If you want a very easy solution, there isn't one. However, there is a library that provides the multiprocessing interface, pathos, which can establish connections to remote servers through a parallel map and do multiprocessing across them. If you want an ssh-tunneled connection, you can do that… or if … Read more
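pathos deliberately mirrors the standard library's pool interface, so code written against a stdlib pool ports over with little change. A sketch of that map-style interface using the stdlib's ThreadPool (for remote execution you would swap in a pathos pool such as pathos.multiprocessing.ProcessingPool; the work function here is just a placeholder):

```python
from multiprocessing.pool import ThreadPool

def work(x):
    # Placeholder task; with pathos this could run on a remote worker.
    return x * x

# Same pool.map shape that pathos pools expose.
with ThreadPool(4) as pool:
    results = pool.map(work, range(5))

print(results)  # [0, 1, 4, 9, 16]
```

The point is the interface: once your job is expressed as a function plus an iterable, switching the pool implementation (threads, processes, or pathos's remote pools) does not change the calling code.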

In distributed computing, what are world size and rank?

These concepts come from parallel computing, so it helps to learn a little about it, e.g., MPI. You can think of the world as the group containing all the processes in your distributed training job. Usually, each GPU corresponds to one process. Processes in the world can communicate with each other, which is why … Read more
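A concrete way to see rank and world size: each process uses its rank to pick its own slice of the work out of the shared total. A minimal sketch (the shard helper is ours for illustration, not part of any framework):

```python
def shard(data, rank, world_size):
    """Return the slice of `data` owned by the process with this rank."""
    return data[rank::world_size]  # simple strided split across processes

world_size = 4                     # total number of processes (e.g. 4 GPUs)
batches = list(range(8))           # the full workload

for rank in range(world_size):     # in real training, each process knows only its own rank
    print(f"rank {rank} -> {shard(batches, rank, world_size)}")
# rank 0 -> [0, 4], rank 1 -> [1, 5], rank 2 -> [2, 6], rank 3 -> [3, 7]
```

This is exactly what distributed samplers do: same code runs everywhere, and (rank, world_size) determines which portion of the data each process touches.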

Determining optimal number of Spark partitions based on workers, cores and DataFrame size

Yes, a Spark application has one and only one Driver. What is the relationship between numWorkerNodes and numExecutors? A worker can host multiple executors: you can think of the worker as the machine/node of your cluster and the executor as a process (executing in a core) that runs on that worker. So … Read more
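The arithmetic behind the answer can be sketched directly: executors per node times nodes gives total executors, and a common rule of thumb is 2-4 partitions per available core. The cluster numbers below are illustrative, not a recommendation:

```python
# Illustrative cluster shape (not a tuning recommendation).
num_worker_nodes = 4        # machines in the cluster
executors_per_node = 2      # executor processes per worker node
cores_per_executor = 4      # task slots per executor

num_executors = num_worker_nodes * executors_per_node   # 8 executors total
total_cores = num_executors * cores_per_executor        # 32 concurrent task slots

# Rule of thumb: 2-4 partitions per core, so tasks stay small and spread evenly.
partitions_low = 2 * total_cores
partitions_high = 4 * total_cores
print(partitions_low, partitions_high)  # 64 128
```

Fewer partitions than cores leaves slots idle; vastly more adds scheduling overhead, which is why the per-core multiplier is kept small.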

Easiest way to install Python dependencies on Spark executor nodes?

Having actually tried it, I think the link I posted as a comment doesn't do exactly what you want with dependencies. What you are quite reasonably asking for is a way to have Spark play nicely with setuptools and pip when installing dependencies. It blows my mind that this isn't supported better in Spark. … Read more
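One common workaround is to zip pure-Python dependencies and ship the archive to executors with SparkContext.addPyFile (or spark-submit --py-files). A minimal sketch of building such an archive; the mydep package here is a stand-in for your real dependency tree:

```python
import os
import zipfile

# Build a tiny stand-in package; in practice you would zip a pip-installed
# site-packages directory or a vendored dependency tree.
os.makedirs("mydep", exist_ok=True)
with open("mydep/__init__.py", "w") as f:
    f.write("VERSION = '0.1'\n")

# Archive it so executors can add it to their sys.path.
with zipfile.ZipFile("deps.zip", "w") as zf:
    for root, _dirs, files in os.walk("mydep"):
        for name in files:
            path = os.path.join(root, name)
            zf.write(path, path.replace(os.sep, "/"))

# On the Spark side (requires a running SparkContext `sc`):
#   sc.addPyFile("deps.zip")            # executors can then `import mydep`
# or at submit time:
#   spark-submit --py-files deps.zip my_job.py
```

Note the limitation that motivates the answer's frustration: this only covers pure-Python packages. Dependencies with compiled extensions need matching wheels or a shared environment on every node.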

Error!: SQLSTATE[HY000] [1045] Access denied for user 'divattrend_liink'@'localhost' (using password: YES)