This page compares Dask to Apache Spark.
Apache Spark is an all-inclusive framework combining distributed
computing, SQL queries, machine learning, and more that runs on the
JVM and is commonly co-deployed with other Big Data frameworks like
Hadoop. It was originally optimized for bulk data ingest and querying
common in data engineering and business analytics but has since
broadened out. Spark is typically used on small to medium-sized
clusters, but also runs well on a single machine.

Dask is a parallel programming library that combines with the numeric
Python ecosystem to provide parallel arrays, dataframes, machine
learning, and custom algorithms. It is based on Python and the
foundational C/Fortran stack. Dask was originally designed to
complement other libraries with parallelism, particularly for numeric
computing and advanced analytics, but has since broadened out. Dask is
typically used on a single machine, but also runs well on a
distributed cluster.

Generally Dask is smaller and lighter weight than Spark. This means
that it has fewer features and instead is intended to be used in
conjunction with other libraries, particularly those in the numeric
Python ecosystem.
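As a rough sketch of what the parallel arrays and custom algorithms mentioned above look like in practice, the snippet below uses Dask's array and delayed interfaces; the array size and chunk size are arbitrary choices for illustration:

```python
import dask
import dask.array as da

# Parallel array: 10,000 elements split into chunks of 1,000,
# so the sum is computed chunk-by-chunk in parallel
x = da.arange(10_000, chunks=1_000)
total = x.sum().compute()  # triggers the actual parallel computation
print(total)  # 49995000

# Custom algorithm via dask.delayed: build a lazy task graph,
# then execute it with .compute()
@dask.delayed
def double(n):
    return 2 * n

result = dask.delayed(sum)([double(i) for i in range(5)]).compute()
print(result)  # 20
```

Note that Dask is lazy by default: operations build a task graph, and nothing runs until `.compute()` is called, at which point a scheduler executes the graph on local threads or a distributed cluster.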