When do you start additional Elasticsearch nodes? [closed]

Let’s clarify the terminology a little first. Node: a running Elasticsearch instance (a Java process). Usually every node runs on its own machine. Cluster: one or more nodes with the same cluster name. Index: more or less like a database. Type: more or less like a database table. Shard: effectively a Lucene index. Every index …

Determining optimal number of Spark partitions based on workers, cores and DataFrame size

Yes, a Spark application has one and only one driver. What is the relationship between numWorkerNodes and numExecutors? A worker can host multiple executors: think of the worker as a machine/node of your cluster and of the executor as a process (using one or more cores) that runs on that worker. So …
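The worker/executor relationship above can be turned into a small piece of arithmetic. The following is a minimal sketch, not the answer's own method: the function names and parameters are hypothetical, it assumes every worker hosts the same number of executors and every executor is given the same number of cores, and the default of 3 tasks per core is only a rule-of-thumb assumption (the Spark tuning guide suggests 2-3 tasks per CPU core).

```python
def total_task_slots(num_workers: int, executors_per_worker: int,
                     cores_per_executor: int) -> int:
    """Number of tasks the cluster can run at the same time.

    Assumes a homogeneous cluster: identical executor counts per worker
    and identical core counts per executor (hypothetical simplification).
    """
    num_executors = num_workers * executors_per_worker
    return num_executors * cores_per_executor


def suggested_partitions(task_slots: int, tasks_per_core: int = 3) -> int:
    """Rule-of-thumb partition count: a few tasks per available core.

    tasks_per_core=3 is an assumption, not a value from the answer.
    """
    return task_slots * tasks_per_core


if __name__ == "__main__":
    slots = total_task_slots(num_workers=4, executors_per_worker=2,
                             cores_per_executor=4)
    print(slots)                        # 32
    print(suggested_partitions(slots))  # 96
```

Treat the result as a starting point only: the size of the DataFrame and the cost of each task matter as much as the core count.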

Spark parquet partitioning : Large number of files

First, I would really avoid using coalesce, as it is often pushed up further in the chain of transformations and may destroy the parallelism of your job (I asked about this issue here: Coalesce reduces parallelism of entire stage (spark)). Writing 1 file per Parquet partition is relatively easy (see Spark dataframe write method writing …

Is there something like Redis DB, but not limited with RAM size? [closed]

Yes, there are two alternatives to Redis that are not limited by RAM size while remaining compatible with the Redis protocol: Ardb (C++), replication (Master-Slave/Master-Master): https://github.com/yinqiwen/ardb A Redis-protocol compatible persistent storage server that supports LevelDB/KyotoCabinet/LMDB as storage engines. Edis (Erlang): https://github.com/cbd/edis Edis is a protocol-compatible server replacement for Redis, written in Erlang. Edis’s goal is to be a …
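"Protocol-compatible" generally means the server accepts the same wire format that Redis clients already send, so existing client libraries can connect unchanged. As an illustration, here is a minimal sketch of how a command is framed in the Redis serialization protocol (RESP) as an array of bulk strings. The helper name is hypothetical, and the sketch makes no claim about which commands Ardb or Edis actually implement.

```python
def encode_resp_command(*args: str) -> bytes:
    """Encode a command as a RESP array of bulk strings.

    Hypothetical helper for illustration; real client libraries do this
    framing internally before writing to the socket.
    """
    # Array header: '*' followed by the number of elements.
    parts = [b"*%d\r\n" % len(args)]
    for arg in args:
        data = arg.encode("utf-8")
        # Bulk string: '$' + byte length, then the payload, each CRLF-terminated.
        parts.append(b"$%d\r\n%s\r\n" % (len(data), data))
    return b"".join(parts)


if __name__ == "__main__":
    print(encode_resp_command("SET", "key", "value"))
    # b'*3\r\n$3\r\nSET\r\n$3\r\nkey\r\n$5\r\nvalue\r\n'
```

Any server that parses these bytes and replies in the same format can stand in for Redis from the client's point of view, whatever storage engine sits behind it.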