ElasticSearch – high indexing throughput

Question

Long story short, I ended up with 5 virtual linux machines, 8 cpu, 16 GB, using puppet to deploy elasticsearch.
My documents got a little bigger, but so did the throuhgput rate (slightly).
I was able to reach 150K index requests / second on average, indexing 1 billion documents in 2 hours.
Throughput is not constant, and i observed similar diminishing throughput behavior as before, but to a lesser extent. Since I will be using daily indices for same amount of data, I would expect these performance metrics to be roughly similar every day.

The transition from windows machines to linux was primarily due to convenience and compliance with IT conventions. Though i don’t know for sure, I suspect the same results could be achieved on windows as well.

In several of my trials I attempted indexing without specifying document ids as Christian Dahlqvist suggested. The results were astonishing. I observed a significant throughput increase, reaching 300k and higher in some cases. The conclusion of this is obvious: Do not specify document ids, unless you absolutely have to.

Also, i’m using less shards per machine, which also contributed to throughput increase.

Leave a Comment Cancel reply