Paxos vs two phase commit

2PC blocks if the transaction manager fails, requiring human intervention to restart. 3PC algorithms (there are several variants) try to fix 2PC by electing a new transaction manager when the original one fails. Paxos does not block as long as a majority of processes (managers) are correct. Paxos actually solves the more general problem … Read more
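The fault-tolerance thresholds the answer contrasts can be sketched in a few lines of pure Python. The function names here are illustrative, not from any real library: 2PC must hear from every participant before it can decide, so one crash blocks it, while Paxos can make progress with any majority.

```python
def two_pc_can_decide(votes_received, n_participants):
    # 2PC needs a vote from *every* participant; a single crashed
    # participant (or a crashed coordinator) blocks the protocol.
    return len(votes_received) == n_participants

def paxos_can_decide(acks_received, n_acceptors):
    # Paxos only needs acknowledgements from a majority of acceptors,
    # so it tolerates the failure of any minority of processes.
    return len(acks_received) > n_acceptors // 2

# 5 processes, one of which crashed before responding:
responses = ["yes"] * 4
print(two_pc_can_decide(responses, 5))   # False -> 2PC is blocked
print(paxos_can_decide(responses, 5))    # True  -> Paxos makes progress
```

The same asymmetry explains why 2PC needs human intervention (or a 3PC-style recovery protocol) after a coordinator crash, while Paxos keeps going as long as a majority is up.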

How does Spark aggregate function – aggregateByKey work?

aggregateByKey() is quite different from reduceByKey(); in fact, reduceByKey() is essentially a special case of aggregateByKey(). aggregateByKey() will combine the values for a particular key, and the result of that combination can be any object type you specify. You have to specify how the values are combined (“added”) inside one partition (that … Read more
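The two combine steps the answer mentions can be illustrated with a pure-Python sketch of aggregateByKey's semantics (a hypothetical helper, not the Spark API): seqOp folds raw values into an accumulator within each partition, and combOp merges accumulators across partitions. Note the result type (a (sum, count) pair) differs from the value type (int), which is exactly what reduceByKey cannot do.

```python
from collections import defaultdict

def aggregate_by_key(partitions, zero_value, seq_op, comb_op):
    """Pure-Python sketch of Spark's aggregateByKey semantics."""
    # Step 1: within each partition, seq_op folds each value
    # into an accumulator that starts at zero_value.
    per_partition = []
    for part in partitions:
        acc = defaultdict(lambda: zero_value)
        for key, value in part:
            acc[key] = seq_op(acc[key], value)
        per_partition.append(acc)
    # Step 2: across partitions, comb_op merges the accumulators.
    merged = {}
    for acc in per_partition:
        for key, a in acc.items():
            merged[key] = comb_op(merged[key], a) if key in merged else a
    return merged

# Per-key (sum, count) -- the accumulator type is a tuple, not an int.
data = [[("a", 1), ("b", 2)], [("a", 3)]]   # two partitions
result = aggregate_by_key(
    data,
    zero_value=(0, 0),
    seq_op=lambda acc, v: (acc[0] + v, acc[1] + 1),
    comb_op=lambda x, y: (x[0] + y[0], x[1] + y[1]),
)
print(result)  # {'a': (4, 2), 'b': (2, 1)}
```

reduceByKey corresponds to the special case where zero_value is the identity and seq_op and comb_op are the same function.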

How does pytorch’s parallel method and distributed method work?

That’s a great question. PyTorch’s DataParallel paradigm is actually quite simple, and the implementation is open-sourced. Note that this paradigm is not recommended today, as it bottlenecks at the master GPU and is not efficient in data transfer. This container parallelizes the application of the given module by splitting the input across the specified … Read more
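The scatter/apply/gather flow that DataParallel performs on each forward pass can be sketched without any torch dependency. All names below are illustrative, not the PyTorch API; the real implementation replicates the module onto each GPU and gathers outputs on the master device, which is the bottleneck the answer warns about.

```python
def scatter(batch, n_devices):
    """Split the input batch into roughly equal chunks, one per device."""
    k, r = divmod(len(batch), n_devices)
    chunks, start = [], 0
    for i in range(n_devices):
        size = k + (1 if i < r else 0)
        chunks.append(batch[start:start + size])
        start += size
    return chunks

def data_parallel_forward(module, batch, n_devices):
    chunks = scatter(batch, n_devices)                   # split input
    outputs = [[module(x) for x in c] for c in chunks]   # run one replica per chunk
    return [y for out in outputs for y in out]           # gather on the master device

double = lambda x: 2 * x
print(data_parallel_forward(double, [1, 2, 3, 4, 5], n_devices=2))
# -> [2, 4, 6, 8, 10]
```

Because every forward pass re-scatters inputs and gathers outputs through one device, throughput is limited by that device, which is why DistributedDataParallel is preferred today.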

What is spark.driver.maxResultSize?

Assuming that a worker wants to send 4G of data to the driver, will setting spark.driver.maxResultSize=1G cause the worker to send 4 messages (instead of 1 with an unlimited spark.driver.maxResultSize)? No. If the estimated size of the data is larger than maxResultSize, the given job will be aborted. The goal here is to protect your application from … Read more
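The behavior the answer describes is a guard, not a chunking mechanism. A minimal sketch (names and error text are illustrative, not Spark internals): if the estimated result size exceeds the limit, the job is aborted rather than split into smaller messages.

```python
MAX_RESULT_SIZE = 1 * 1024**3  # spark.driver.maxResultSize = 1G

def collect_results(estimated_size_bytes, max_result_size=MAX_RESULT_SIZE):
    # Spark aborts the job up front instead of streaming the result in pieces.
    if max_result_size and estimated_size_bytes > max_result_size:
        raise RuntimeError(
            f"Job aborted: total size of results ({estimated_size_bytes} bytes) "
            f"is bigger than spark.driver.maxResultSize ({max_result_size} bytes)"
        )
    return "collected"

print(collect_results(512 * 1024**2))   # 512M < 1G -> "collected"
# collect_results(4 * 1024**3)          # 4G > 1G -> RuntimeError, job aborted
```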

Flattening Rows in Spark

You can use the explode function:

scala> import org.apache.spark.sql.functions.explode
import org.apache.spark.sql.functions.explode

scala> val test = sqlContext.read.json(sc.parallelize(Seq("""{"a":1,"b":[2,3]}""")))
test: org.apache.spark.sql.DataFrame = [a: bigint, b: array<bigint>]

scala> test.printSchema
root
 |-- a: long (nullable = true)
 |-- b: array (nullable = true)
 |    |-- element: long (containsNull = true)

scala> val flattened = test.withColumn("b", explode($"b"))
flattened: org.apache.spark.sql.DataFrame = [a: bigint, … Read more
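What explode does to the rows in the Scala example above can be shown with a small pure-Python sketch (a hypothetical helper, not the Spark API): each input row yields one output row per element of the array column, with the other columns repeated.

```python
def explode_rows(rows, array_col):
    """Emit one output row per element of rows[array_col]."""
    out = []
    for row in rows:
        for element in row[array_col]:
            flat = dict(row)          # copy the other columns
            flat[array_col] = element # replace the array with one element
            out.append(flat)
    return out

rows = [{"a": 1, "b": [2, 3]}]
print(explode_rows(rows, "b"))
# -> [{'a': 1, 'b': 2}, {'a': 1, 'b': 3}]
```

Note that, like Spark's explode, a row whose array is empty produces no output rows at all.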

What is a task in Spark? How does the Spark worker execute the jar file?

When you create the SparkContext, each worker starts an executor. This is a separate process (JVM), and it loads your jar too. The executors connect back to your driver program. Now the driver can send them commands, like flatMap, map and reduceByKey in your example. When the driver quits, the executors shut down. RDDs are … Read more

Why isn’t RDBMS Partition Tolerant in CAP Theorem and why is it Available?

It is very easy to misunderstand the CAP properties, hence I’m providing some illustrations to make it easier. Consistency: a query Q will produce the same answer A regardless of the node that handles the request. In order to guarantee full consistency we need to ensure that all nodes agree on the same value at all … Read more
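The consistency property stated above reduces to a simple check, sketched here purely for illustration (not a real distributed query layer): a read is consistent when every node that answers query Q returns the same value.

```python
def is_consistent(answers):
    """True iff all nodes returned the same answer A for query Q."""
    return len(set(answers)) <= 1

print(is_consistent(["A", "A", "A"]))   # True  -> all nodes agree
print(is_consistent(["A", "A", "B"]))   # False -> a stale replica breaks it
```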

“Eventual Consistency” vs “Strong Eventual Consistency” vs “Strong Consistency”?

DISCLAIMER: The text below should give you a rough idea of the differences among Eventual Consistency, Strong Eventual Consistency and Strong Consistency. But they are in some ways an over-simplification, so take them with a grain of salt 😉 First things first: when we talk about consistency we refer to a scenario where different entities … Read more

Difference between cloud computing and distributed computing? [closed]

In my mind, what defines cloud computing is that the underlying compute resources (storage, processors, RAM, load balancers, etc.) of cloud-based services and software are entirely abstracted from the consumer of the software / services. This means that the vendor of cloud-based resources is taking responsibility for the performance / reliability / scalability of … Read more
