bigdata – Tarik Billa

Dynamodb query error – Query key condition not supported

December 16, 2023 by Tarik

When do you start additional Elasticsearch nodes? [closed]

September 6, 2023 by Tarik

Let’s clarify the terminology a little first: Node: a running Elasticsearch instance (a Java process). Usually every node runs on its own machine. Cluster: one or more nodes with the same cluster name. Index: more or less like a database. Type: more or less like a database table. Shard: effectively a Lucene index. Every index … Read more

Machine Learning & Big Data [closed]

August 25, 2023 by Tarik

First of all, your question needs to define more clearly what you mean by Big Data. Indeed, Big Data is a buzzword that may refer to problems of various sizes. I tend to define Big Data as the category of problems where the data size or the computation time is big enough for “the hardware … Read more

How can I tell when my dataset in R is going to be too large?

August 25, 2023 by Tarik

Determining optimal number of Spark partitions based on workers, cores and DataFrame size

August 17, 2023 by Tarik

Yes, a Spark application has one and only one driver. What is the relationship between numWorkerNodes and numExecutors? A worker can host multiple executors: think of the worker as the machine/node of your cluster and the executor as a process (executing in a core) that runs on that worker. So … Read more

What methods can we use to reshape VERY large data sets?

August 10, 2023 by Tarik

How to get started with Big Data Analysis [closed]

August 3, 2023 by Tarik

Recommended package for very large dataset processing and machine learning in R [closed]

July 26, 2023 by Tarik

Spark parquet partitioning : Large number of files

July 12, 2023 by Tarik

First, I would really avoid using coalesce, as this is often pushed up further in the chain of transformations and may destroy the parallelism of your job (I asked about this issue here: Coalesce reduces parallelism of entire stage (spark)). Writing 1 file per parquet-partition is relatively easy (see Spark dataframe write method writing … Read more

Is there something like Redis DB, but not limited with RAM size? [closed]

June 9, 2023 by Tarik

Yes, there are two alternatives to Redis that are not limited by RAM size while remaining compatible with the Redis protocol: Ardb (C++), replication (Master-Slave/Master-Master): https://github.com/yinqiwen/ardb A redis-protocol compatible persistent storage server, support LevelDB/KyotoCabinet/LMDB as storage engine. Edis (Erlang): https://github.com/cbd/edis Edis is a protocol-compatible Server replacement for Redis, written in Erlang. Edis’s goal is to be a … Read more