How dangerous are high-cardinality labels in Prometheus?

High-cardinality labels (i.e. labels with a large number of unique values) aren't dangerous on their own. The danger lies in the total number of active time series.
A single Prometheus instance can handle up to ten million active time series when running on a host with more than 100GB of RAM, according to https://www.robustperception.io/why-does-prometheus-use-so-much-ram.

An example: suppose the exported metric has a step_id label with 10K unique values.

If the metric has no other labels (i.e. it is exported as wfengine_duration_seconds{step_id="..."}), then it generates 10K active time series, which is a tiny number for Prometheus.
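For illustration, here is a minimal Go sketch (using the client_golang library) of how such a metric could be exported. The metric and label names come from the example above; the handler wiring and observed values are made up:

```go
package main

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// wfengine_duration_seconds with a single step_id label: one label set
// per unique step_id value, i.e. 10K label sets in the example above.
// Note that a histogram additionally exposes _bucket/_sum/_count series
// per label set, so the real series count is a small constant multiple
// of the label-set count.
var stepDuration = promauto.NewHistogramVec(
	prometheus.HistogramOpts{
		Name: "wfengine_duration_seconds",
		Help: "Duration of workflow engine steps.",
	},
	[]string{"step_id"},
)

func main() {
	// Hypothetical observation for one step.
	stepDuration.WithLabelValues("step-42").Observe(0.25)

	http.Handle("/metrics", promhttp.Handler())
	http.ListenAndServe(":8080", nil)
}
```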

If the metric contains another label, such as workflow_id with 100 unique values, and each workflow has 10K unique steps, then the total number of exported time series skyrockets to 100*10K=1M. This is still a fairly low number of active time series for Prometheus.

Now suppose that the app which exports the metric runs on 50 hosts (or Kubernetes pods). Prometheus stores the scrape target address in the instance label – see these docs. This means that the total number of active time series collected from the 50 hosts jumps to 50*1M=50M. This number may be too big for a single Prometheus instance. Other systems can handle this many active time series in a single-node setup, but they also have an upper limit; it is just N times bigger (1 < N < 10).
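The arithmetic above can be written down explicitly. A tiny sketch with the numbers from this example:

```go
package main

import "fmt"

func main() {
	const (
		stepIDs     = 10_000 // unique step_id values per workflow
		workflowIDs = 100    // unique workflow_id values
		instances   = 50     // hosts/pods, i.e. unique instance label values
	)
	perInstance := workflowIDs * stepIDs // 1M series exported by each app instance
	total := perInstance * instances     // 50M active series seen by Prometheus
	fmt.Printf("per instance: %d, total: %d\n", perInstance, total)
}
```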

So the rule of thumb is to take into account the total number of active time series, not the number of unique values of a single label.
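To apply the rule of thumb in practice, you can count the active series behind a metric with a PromQL count() query against the Prometheus HTTP API. A minimal Go sketch, assuming a Prometheus server listening on localhost:9090:

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/url"
)

func main() {
	// count() over all series whose name starts with the metric name,
	// which for a histogram also covers the _bucket/_sum/_count series.
	expr := `count({__name__=~"wfengine_duration_seconds.*"})`
	resp, err := http.Get("http://localhost:9090/api/v1/query?query=" + url.QueryEscape(expr))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(body)) // JSON response containing the instant count
}
```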
