Metrics, Logs, Traces: When to Reach for Which
The three pillars overlap, but they answer different questions. Putting the right signal against the right question is half the win.
Every observability talk shows the same diagram: metrics, logs, traces, three overlapping circles. Then it stops. The actual question — which one do I reach for when X is happening? — gets answered by experience. This article skips the years.
1. What each one is actually good at
- Metrics: pre-aggregated numbers over time. Cheap to store, cheap to query, terrible at "why was this one request slow?"
- Logs: per-event records. Great at "exactly what happened with this one thing," terrible at "what is the p95 over the last 24h?"
- Traces: the path of a single request through every service. Great at "where did the latency go," unaffordable at scale unless sampled.
2. Use metrics for SLOs, alerts, and dashboards
Anything you want to alert on belongs in metrics. Counters for how often, histograms for how long. Histograms beat averages every time — averages hide tail latency.
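To make the point about averages concrete, here is a small sketch with made-up latencies (90 fast requests and 10 slow ones; the numbers are purely illustrative). The `mean` and `percentile` helpers are hypothetical names, and the percentile uses the simple nearest-rank method rather than the bucket interpolation a metrics backend would apply.

```javascript
// Hypothetical latencies in seconds: 90 fast requests and 10 slow ones.
const latencies = [
  ...Array(90).fill(0.05),
  ...Array(10).fill(2.0),
];

function mean(values) {
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

// Nearest-rank percentile: the smallest sample such that at least
// p percent of all samples are less than or equal to it.
function percentile(values, p) {
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(rank, 1) - 1];
}

const avg = mean(latencies);           // roughly 0.245 s
const p50 = percentile(latencies, 50); // 0.05 s
const p95 = percentile(latencies, 95); // 2 s

console.log({ avg, p50, p95 });
```

The average sits at about a quarter of a second, which describes neither the typical request (50 ms) nor the one in ten that takes two seconds. The percentiles show both.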
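// Prometheus client example (assumes the prom-client package for Node.js)
const client = require("prom-client");

const httpDuration = new client.Histogram({
  name: "http_request_duration_seconds",
  help: "Duration of HTTP requests in seconds", // prom-client requires a help string
  labelNames: ["method", "route", "status"],
  buckets: [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10],
});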
Label cardinality is the gotcha. Every distinct combination of label values becomes its own time series, so user_id as a label is how you take Prometheus down. As a rule of thumb, anything with more than ~100 distinct values belongs in a trace attribute, not a metric label.
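A rough back-of-the-envelope sketch of why this matters. For each label combination, a Prometheus histogram exposes one `_bucket` series per configured bucket, one for the implicit `+Inf` bucket, plus `_sum` and `_count`. The function name and the per-label counts below are assumptions chosen for illustration, and the result is an upper bound, since not every combination occurs in practice.

```javascript
// Upper bound on the time series one histogram can create:
// (product of distinct values per label) * (buckets + the +Inf bucket + _sum + _count).
function histogramSeriesCount(labelCardinalities, bucketCount) {
  const combinations = Object.values(labelCardinalities)
    .reduce((product, n) => product * n, 1);
  return combinations * (bucketCount + 1 + 2);
}

// Hypothetical cardinalities for the labels used in the histogram above.
const bounded = histogramSeriesCount(
  { method: 5, route: 30, status: 8 },
  11,
);

// Same metric with a user_id label added, assuming 100,000 users.
const withUserId = histogramSeriesCount(
  { method: 5, route: 30, status: 8, user_id: 100000 },
  11,
);

console.log({ bounded, withUserId });
// bounded    = 16800
// withUserId = 1680000000
```

Under these assumptions, one extra label takes a single metric from tens of thousands of potential series to over a billion.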
3. Use traces to find where time went
A trace is one request, broken into spans across every service it touched. The single most useful view is the waterfall — you immediately see whether the slow part is the database call, the downstream API, or the queue wait.
Sample wisely. Tracing 100% of requests is expensive, and the vast majority of traces look alike. Two common patterns:
- Head sampling: decide at the entry point (for example, keep 1 in 100 requests). Cheapest, but the decision is made before the outcome is known, so you mostly keep ordinary fast requests and miss the rare slow or failing ones.
- Tail sampling: buffer all spans and decide after the request ends. Keep 100% of errors and slow requests, sample the rest. Costs RAM at the collector, but pays for itself the first time you debug a tail-latency bug.
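The difference between the two patterns comes down to what information the decision can use. The sketch below is a simplified illustration, not the algorithm of any particular tracing library or collector; `headSample`, `tailSample`, and the fields `error`, `durationMs`, `slowMs`, and `baselineRatio` are hypothetical names.

```javascript
// Head sampling: decide from the trace ID alone, before the request runs.
// Deriving the decision from the ID (rather than Math.random) means every
// service that sees the same trace ID reaches the same verdict.
function headSample(traceId, ratio) {
  // Interpret the last 8 hex characters as a number in [0, 2^32).
  const bucket = parseInt(traceId.slice(-8), 16);
  return bucket < ratio * 0x100000000;
}

// Tail sampling: decide after the request has finished, when the outcome
// is known. Keep every error and every slow request, sample the rest.
function tailSample(trace, { slowMs, baselineRatio }) {
  if (trace.error) return true;
  if (trace.durationMs >= slowMs) return true;
  return headSample(trace.traceId, baselineRatio);
}

const policy = { slowMs: 1000, baselineRatio: 0.01 };
const traceId = "4bf92f3577b34da6a3ce929d0e0e4736";

console.log(tailSample({ traceId, error: true, durationMs: 20 }, policy));    // true
console.log(tailSample({ traceId, error: false, durationMs: 1500 }, policy)); // true
console.log(tailSample({ traceId, error: false, durationMs: 20 }, policy));   // false
```

Note that `tailSample` needs `error` and `durationMs`, which only exist once the request is over. That is why tail sampling has to buffer spans, and where the collector's RAM cost comes from.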
4. Use logs for the "exactly this happened" detail
The metric tells you "errors spiked at 14:02," the trace tells you which request was slow, and the log tells you which row in which table caused the validation to fail. Logs are the bottom of the funnel.
The mistake most teams make: trying to use logs for everything. You can compute a count from logs, but the cost is 10–100× that of a metric. You can compute a p95 from logs, but the query will not come back fast enough at 14:02, mid-incident, when you need it.
5. The correlation glue: trace_id
Propagate trace_id everywhere. Logs include it, metrics emit exemplars that link to it, the trace itself is keyed by it. With trace_id threaded through, you can jump from a metric spike → exemplar → trace → logs in three clicks. Without it, you are pivoting on timestamp ranges and praying.
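As a minimal sketch of the logging half of this, assuming the trace context arrives in a W3C Trace Context `traceparent` header (format: version, 32-hex-digit trace ID, 16-hex-digit parent ID, and flags, separated by dashes). The helper names `traceIdFromHeader` and `logLine`, and the example fields, are hypothetical; a real service would normally let its tracing library do the parsing.

```javascript
// W3C traceparent: version-traceid-parentid-flags, all lowercase hex.
const TRACEPARENT = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

// Returns the trace ID, or null if the header is missing or malformed.
function traceIdFromHeader(traceparent) {
  const match = TRACEPARENT.exec(traceparent ?? "");
  if (!match) return null;
  const traceId = match[2];
  // An all-zero trace ID is not valid.
  return /^0+$/.test(traceId) ? null : traceId;
}

// Hypothetical structured logger: one JSON object per line,
// always carrying trace_id so logs can be joined to traces.
function logLine(level, message, traceId, fields = {}) {
  return JSON.stringify({ level, message, trace_id: traceId, ...fields });
}

const incomingHeader = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01";
const currentTraceId = traceIdFromHeader(incomingHeader);
const line = logLine("error", "validation failed", currentTraceId, { table: "orders" });

console.log(line);
```

With every log line shaped like this, "show me the logs for this trace" becomes a plain equality filter on `trace_id` instead of a timestamp-range guess.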
6. The decision tree
- Alert fires → metrics dashboard.
- "Why is this slower than last week?" → metrics histograms, then traces.
- "Where did the time go for that one slow request?" → trace waterfall.
- "Why did that specific request fail?" → logs, filtered by trace_id.
- "What is happening right now?" → metrics.
- "What happened five minutes ago to that one user?" → logs.
The honest part
You will not get to do everything at once. If you have nothing, start with structured logs and one RED dashboard (rate, errors, duration) per service. Add traces when you have at least two services worth correlating. Add an SLO once the dashboard is honest. Observability is a stack you build floor by floor; skipping floors makes the higher ones useless.