Metric Cardinality Explosions: How One Label Took Down Prometheus
Someone added a single label to a counter to make a dashboard a little more useful. A week later the Prometheus server was using 40GB of RAM, queries timed out, and during the next incident the monitoring was down right when we needed it most. The label was user_id, and it had quietly created millions of time series.
Our Prometheus had run happily on a few gigabytes of memory for a year. Then over the course of a week it crept up to 40GB, started getting OOM-killed, and queries that used to return instantly began timing out. The breaking point came during an actual incident: we went to look at the dashboards and the monitoring itself was down, which is about the worst time for that to happen. The change that caused it was one line in a pull request that nobody flagged, because it looked like an improvement. Someone had added a label to a metric.
Every label combination is a separate time series
The thing to internalize about Prometheus, and most dimensional metrics systems, is that a metric is not one number. It is a separate stored time series for every unique combination of label values. A counter with a method label (say 5 values) and a status label (say 8 values) is up to 5 × 8 = 40 series. That is totally fine. The cost is the product of the label cardinalities, and as long as every label has a small, bounded set of values, the product stays small.
// bounded labels: method has ~5 values, status ~8, route a fixed handful. The product stays small.
httpRequests
  .labels({ method: "GET", route: "/checkout", status: "200" })
  .inc();
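To make the multiplication concrete, here is a minimal sketch of the upper bound for one metric. The function name and the cardinalities (including 20 route templates) are illustrative assumptions, not measurements from the incident.

```javascript
// Upper bound on the series count for one metric: the product of each
// label's number of distinct values. All cardinalities are illustrative.
function seriesUpperBound(labelCardinalities) {
  return Object.values(labelCardinalities).reduce((product, n) => product * n, 1);
}

const methodAndStatus = seriesUpperBound({ method: 5, status: 8 });
const withRouteTemplates = seriesUpperBound({ method: 5, status: 8, route: 20 });

console.log(methodAndStatus);    // 40
console.log(withRouteTemplates); // 800
```

The real count is usually lower than the bound, because not every combination occurs in practice, but the bound is what you have to budget for.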
The disaster happens when a label's set of values is unbounded, meaning it grows with your data or your users instead of being a fixed small list. The instant you put something like a user ID, an email, a full URL with IDs in it, or a request ID into a label, the number of series stops being a small product and starts tracking the size of your user base or your traffic.
The line that did it
The well-meaning change put the request path into a label to see traffic per endpoint. But the paths contained IDs.
// looks reasonable, is catastrophic
httpRequests
  .labels({ method: "GET", route: "/user/" + userId + "/orders", status: "200" })
  .inc();
Every distinct userId produces a brand-new route value, and therefore a brand-new time series that Prometheus must hold in memory and keep until retention expires. With two million users, that one metric became up to two million series. Each series carries its labels and a chunk of in-memory state, on the order of a few kilobytes, so a couple million series is multiple gigabytes for a single metric. Multiply by a few such mistakes and you get 40GB and an OOM loop.
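The arithmetic can be sketched as follows. The 3 KiB per series is an assumed stand-in for "a few kilobytes", not a measured value; the real figure depends on label sizes, scrape interval, and Prometheus version.

```javascript
// Rough head-memory estimate. BYTES_PER_SERIES is an assumed, illustrative
// figure; measure your own server before relying on it.
const BYTES_PER_SERIES = 3 * 1024;

function estimateHeadMemoryGiB(seriesCount, bytesPerSeries = BYTES_PER_SERIES) {
  return (seriesCount * bytesPerSeries) / 1024 ** 3;
}

const users = 2_000_000; // one distinct route value, hence one series, per user
console.log(estimateHeadMemoryGiB(users).toFixed(1)); // 5.7
```

Under that assumption, one leaking metric costs roughly 6 GiB, so a handful of them is enough to reach the 40GB range.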
Normalize the label to a bounded set
The fix is to make the label value come from a small fixed set again. For routes, that means using the route template, not the filled-in path. The ID belongs to the individual request, not to the identity of the metric.
// bounded again: the template is one of a fixed handful of routes
httpRequests
  .labels({ method: "GET", route: "/user/:id/orders", status: "200" })
  .inc();
Most web frameworks expose the matched route pattern (/user/:id/orders) separately from the actual URL; use that. The rule is simple to state: a label value must be drawn from a set you could write down in advance. Method, status code, route template, region, and cache-hit-or-miss are all fine. User ID, session ID, full path, error message, and timestamp are not, because you cannot enumerate them ahead of time.
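When the framework's matched pattern is not available (in Express it is typically req.route.path), a fallback is to normalize the path yourself and check it against an allowlist. The sketch below is one way to do that; KNOWN_ROUTES, normalizeRoute, and the "other" bucket are hypothetical names chosen for illustration.

```javascript
// Hypothetical allowlist of the route templates this service serves.
const KNOWN_ROUTES = new Set(["/checkout", "/user/:id", "/user/:id/orders"]);

// Matches path segments that look like numeric IDs or UUIDs.
const ID_LIKE = /^(\d+|[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12})$/i;

// Fallback normalizer: drop the query string, then collapse ID-like
// segments to ":id". Prefer the framework's matched pattern when you have it.
function normalizeRoute(path) {
  const template = path
    .split("?")[0]
    .split("/")
    .map((segment) => (ID_LIKE.test(segment) ? ":id" : segment))
    .join("/");
  // Anything outside the allowlist shares one bucket, so the label stays bounded.
  return KNOWN_ROUTES.has(template) ? template : "other";
}

console.log(normalizeRoute("/user/12345/orders"));      // /user/:id/orders
console.log(normalizeRoute("/user/987/orders?page=2")); // /user/:id/orders
console.log(normalizeRoute("/wp-admin/setup.php"));     // other
```

The allowlist is the important part: even if the ID pattern misses a case, an unknown path lands in "other" instead of creating a new series.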
Where the high-cardinality data should go
This does not mean per-user information is forbidden; it means metrics are the wrong tool for it. The three pillars split the work: metrics answer "how many, how fast, what rate" over bounded dimensions; logs and traces answer "what happened to this specific user or request" over unbounded ones. If you need to find the one slow checkout for user 12345, that is a trace lookup or a log query, not a metric label. Exemplars bridge the two: the metric stays low-cardinality, but a histogram bucket can carry the trace ID of a sampled request, which lets you jump to that specific request. You get the drill-down without paying the cardinality cost on every series.
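To show where an exemplar lives, here is a minimal sketch that formats one histogram bucket sample in OpenMetrics text form. In practice your client library emits this for you; the helper name, metric name, count, and trace ID are hypothetical, and the sketch skips label-value escaping.

```javascript
// Format one bucket sample with an exemplar attached. The exemplar follows
// the "#" and is not part of the series identity, so it adds no new series.
function bucketLineWithExemplar(name, labels, count, exemplar) {
  const fmt = (obj) =>
    Object.entries(obj).map(([key, value]) => `${key}="${value}"`).join(",");
  return `${name}{${fmt(labels)}} ${count} # {${fmt(exemplar.labels)}} ${exemplar.value}`;
}

const line = bucketLineWithExemplar(
  "http_request_duration_seconds_bucket",
  { route: "/user/:id/orders", le: "0.5" }, // bounded labels only
  1027,
  { labels: { trace_id: "4bf92f3577b34da6a3ce929d0e0e4736" }, value: 0.43 }
);
console.log(line);
```

The labels inside the first pair of braces define the series; the trace ID sits after the "#" and only points at one example request.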
Put a ceiling in place so it can't recur
Relying on every engineer to remember this forever does not scale. Prometheus lets you cap the damage: sample_limit in a scrape config fails the scrape of any target that exposes more samples than the limit, so that target's data is rejected instead of blowing up the whole server, and label_limit and label_value_length_limit reject scrapes whose series carry too many labels or oversized label values. It is far better to lose one misbehaving target's data and get an alert than to take down the monitoring for everything. Add a meta-alert on prometheus_tsdb_head_series trending up so a slow cardinality leak pages you while it is still small, not after it has eaten all the memory.
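A possible shape for those guardrails is sketched below as two config fragments. The job name, target, thresholds, and alert name are placeholders to tune for your own setup; label_limit and label_value_length_limit cap the number of labels per series and the length of each value, and a scrape that exceeds any limit is treated as failed.

```yaml
# prometheus.yml (fragment): per-job scrape limits. All values are illustrative.
scrape_configs:
  - job_name: checkout-api              # hypothetical job
    sample_limit: 50000                 # fail the scrape above this many samples
    label_limit: 30                     # max labels per series
    label_value_length_limit: 200       # max length of any label value
    static_configs:
      - targets: ["checkout-api:9100"]  # hypothetical target
```

```yaml
# rules.yml (fragment): page when head series grow more than 50% in a day.
groups:
  - name: prometheus-meta
    rules:
      - alert: HeadSeriesGrowingFast    # hypothetical alert name
        expr: prometheus_tsdb_head_series > 1.5 * (prometheus_tsdb_head_series offset 1d)
        for: 1h
        labels:
          severity: page
        annotations:
          summary: "Head series up more than 50% in a day; possible cardinality leak"
```

When the alert fires, a query such as topk(10, count by (__name__)({__name__=~".+"})) shows which metric names hold the most series.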
Rules of thumb
- A metric's cost is the product of its label cardinalities. One unbounded label multiplies that product by your user count or traffic volume.
- A label value must come from a set you could enumerate in advance: method, status, route template, region. Never user ID, full path, email, or request ID.
- Use the route template (/user/:id), never the filled-in path. The identifier goes in a trace or log, not in the metric's identity.
- High-cardinality "what happened to this one request" questions belong to logs and traces; metrics are for bounded aggregates. Exemplars link the two.
- Set sample_limit and label limits so a single bad target gets dropped instead of OOM-ing the whole server.
- Alert on total head series trending up. A cardinality leak is cheap to fix when it's small and an outage when it's large.