
Metrics, Logs, Traces: When to Reach for Which

The three pillars overlap, but they answer different questions. Putting the right signal against the right question is half the win.

January 15, 2026 · 9 min read · Observability, Architecture

Every observability talk shows the same diagram: metrics, logs, traces, three overlapping circles. Then it stops. The actual question — which one do I reach for when X is happening? — gets answered by experience. This article skips the years.

1. What each one is actually good at

  • Metrics — pre-aggregated numbers over time. Cheap to store, cheap to query, terrible at "why this one request was slow."
  • Logs — per-event records. Great at "exactly what happened with this one thing," terrible at "what is the p95 over the last 24h."
  • Traces — the path of a single request through every service. Great at "where did the latency go," too expensive to keep for every request at scale, so they are usually sampled.

2. Use metrics for SLOs, alerts, and dashboards

Anything you want to alert on belongs in metrics. Counters for how often, histograms for how long. Histograms beat averages every time — averages hide tail latency.

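// Prometheus client example (prom-client for Node.js)
const client = require("prom-client");

const httpDuration = new client.Histogram({
  name: "http_request_duration_seconds",
  help: "Duration of HTTP requests in seconds", // prom-client requires a help string
  labelNames: ["method", "route", "status"],
  buckets: [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10],
});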
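
To make the "averages hide tail latency" point concrete, here is a small self-contained sketch. The latency values are made up for illustration (90 fast requests, 10 slow ones), and the helper names `mean` and `percentile` are hypothetical, not part of any library.

```javascript
// Hypothetical latency sample in seconds: 90 fast requests, 10 slow ones.
const latencies = [
  ...Array(90).fill(0.05),
  ...Array(10).fill(3.0),
];

// Arithmetic mean of a list of numbers.
function mean(values) {
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

// Nearest-rank percentile: the smallest sample such that at least
// p percent of all samples are less than or equal to it.
function percentile(values, p) {
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(rank, 1) - 1];
}

console.log("mean:", mean(latencies));
console.log("p50:", percentile(latencies, 50));
console.log("p95:", percentile(latencies, 95));
```

With this sample the mean lands near 0.345 s, a latency that no request actually experienced, while p50 is 0.05 s and p95 is 3 s. The histogram keeps both ends visible; the average blends them away.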

Label cardinality is the gotcha. Every distinct combination of label values becomes its own time series, so user_id as a label is how you take Prometheus down. As a rule of thumb, anything with more than ~100 distinct values belongs in a trace attribute, not a metric label.
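
The arithmetic behind the cardinality warning can be sketched in a few lines. The per-label counts below are assumptions chosen for illustration, and `worstCaseSeries` is a hypothetical helper, not a Prometheus API.

```javascript
// Worst-case number of label combinations for one metric: the product of
// the distinct-value counts of its labels, since every combination that
// actually occurs is stored as its own time series.
function worstCaseSeries(labelCardinalities) {
  return Object.values(labelCardinalities).reduce((product, n) => product * n, 1);
}

// Hypothetical counts for the labels used in the histogram example.
const bounded = { method: 5, route: 40, status: 8 };

// The same metric with a hypothetical user_id label added.
const withUserId = { ...bounded, user_id: 100000 };

console.log(worstCaseSeries(bounded));    // 1600
console.log(worstCaseSeries(withUserId)); // 160000000
```

A histogram multiplies this again, because each bucket is exported as its own series alongside the sum and count. One unbounded label is enough to turn a modest metric into millions of series.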

3. Use traces to find where time went

A trace is one request, broken into spans across every service it touched. The single most useful view is the waterfall — you immediately see whether the slow part is the database call, the downstream API, or the queue wait.

Sample wisely. Tracing 100% of requests is expensive, and the vast majority of traces look alike. Two common patterns:

  • Head sampling — decide at the entry point (1 in 100 requests). Cheapest, but the decision is made before the outcome is known, so rare slow or failing requests are mostly dropped.
  • Tail sampling — buffer all spans, decide after the request ends. Keep 100% of errors and slow requests, sample the rest. Costs RAM at the collector, pays for itself the first time you debug a tail-latency bug.
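
The two sampling decisions can be sketched as plain functions. This is a simplified illustration, not how any particular collector implements it: the hash is a toy, the `trace` summary shape and the `slowMs` and `baselineRate` options are hypothetical, and the buffering a real tail sampler needs is left out.

```javascript
// Head sampling: decide from the trace ID alone, before the request runs.
// Hashing the ID (rather than calling Math.random) means every service
// that sees the same trace ID reaches the same decision.
function headSample(traceId, rate) {
  let hash = 0;
  for (const ch of traceId) {
    hash = (hash * 31 + ch.charCodeAt(0)) >>> 0; // keep as unsigned 32-bit
  }
  return hash / 2 ** 32 < rate;
}

// Tail sampling: decide after the request has finished, when the outcome
// is known. `trace` is a hypothetical summary: { traceId, durationMs, error }.
function tailSample(trace, { slowMs = 1000, baselineRate = 0.01 } = {}) {
  if (trace.error) return true;                   // keep every error
  if (trace.durationMs >= slowMs) return true;    // keep every slow request
  return headSample(trace.traceId, baselineRate); // sample the rest
}

console.log(tailSample({ traceId: "a1", durationMs: 40, error: true }));    // true
console.log(tailSample({ traceId: "a2", durationMs: 2500, error: false })); // true
```

The head sampler never sees `durationMs` or `error`, which is exactly why it cannot favor the interesting requests. The tail sampler can, at the price of holding every span until the request completes.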

4. Use logs for the "exactly this happened" detail

When the metric tells you "errors spiked at 14:02," the trace tells you which request was slow, and the log tells you which row in which table caused the validation to fail. Logs are the bottom of the funnel.

The mistake most teams make: trying to use logs for everything. You can compute a count from logs, but the cost is 10–100× a metric. You can compute a p95 from logs, but you cannot do it at 14:02 when you need it.
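
For contrast, here is a rough sketch of why the metric side stays cheap: a percentile can be estimated from a handful of cumulative bucket counters, however many requests were served. The approach is linear interpolation inside the matching bucket, similar in spirit to Prometheus's histogram_quantile but simplified. The bucket counts are invented, `estimateQuantile` is a hypothetical helper, and it assumes 0 < q <= 1 with at least one observation.

```javascript
// Cumulative bucket counts, in the shape a Prometheus-style histogram
// exposes them: `le` is the bucket's upper bound in seconds and `count`
// is how many observations were <= le. The numbers are made up.
const buckets = [
  { le: 0.1, count: 800 },
  { le: 0.25, count: 900 },
  { le: 0.5, count: 960 },
  { le: 1, count: 990 },
  { le: Infinity, count: 1000 },
];

// Estimate quantile q by linear interpolation inside the bucket that
// contains the target rank.
function estimateQuantile(q, cumulative) {
  const total = cumulative[cumulative.length - 1].count;
  const rank = q * total;
  let lowerBound = 0;
  let lowerCount = 0;
  for (const { le, count } of cumulative) {
    if (count >= rank) {
      if (le === Infinity) return lowerBound; // cannot interpolate into +Inf
      const fraction = (rank - lowerCount) / (count - lowerCount);
      return lowerBound + (le - lowerBound) * fraction;
    }
    lowerBound = le;
    lowerCount = count;
  }
  return lowerBound;
}

console.log("p95 estimate (s):", estimateQuantile(0.95, buckets));
```

The estimate reads five numbers. Getting the same answer from logs means scanning every event in the time window, which is the part that hurts during an incident.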

5. The correlation glue: trace_id

Propagate trace_id everywhere. Logs include it, metrics emit exemplars that link to it, the trace itself is keyed by it. With trace_id threaded through, you can jump from a metric spike → exemplar → trace → logs in three clicks. Without it, you are pivoting on timestamp ranges and praying.
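
A minimal sketch of the logging half of that glue, assuming the trace context arrives in a W3C `traceparent` header (the article does not say which propagation format is in use). The function names and the log field names are hypothetical conventions, not a specific logging library's API.

```javascript
// Extract the trace ID from a W3C `traceparent` header, which has the form
//   version "-" trace-id (32 hex) "-" parent-id (16 hex) "-" flags (2 hex)
// Returns null when the header is missing or malformed.
function traceIdFromTraceparent(header) {
  const pattern = /^[0-9a-f]{2}-([0-9a-f]{32})-[0-9a-f]{16}-[0-9a-f]{2}$/;
  const match = pattern.exec(header ?? "");
  return match ? match[1] : null;
}

// Build one structured log line that carries the trace ID, so the log
// store can be filtered by the same key the trace backend uses.
function logLine(traceId, level, msg, fields = {}) {
  return JSON.stringify({
    ts: new Date().toISOString(),
    level,
    msg,
    trace_id: traceId,
    ...fields,
  });
}

const header = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01";
const traceId = traceIdFromTraceparent(header);
console.log(logLine(traceId, "error", "validation failed", { table: "orders" }));
```

Once every log line has a `trace_id` field, "show me the logs for this trace" becomes a single equality filter instead of a timestamp-range guess.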

6. The decision tree

  1. Alert fires → metrics dashboard.
  2. "Why is this slower than last week?" → metrics histograms, then traces.
  3. "Where did the time go for that one slow request?" → trace waterfall.
  4. "Why did that specific request fail?" → logs, filtered by trace_id.
  5. "What is happening right now?" → metrics.
  6. "What happened five minutes ago to that one user?" → logs.

The honest part

You will not get to do everything at once. If you have nothing, start with structured logs and one RED dashboard (rate, errors, duration) per service. Add traces when you have at least two services worth correlating. Add an SLO once the dashboard is honest. Observability is a stack you build floor by floor; skipping floors makes the higher ones useless.


Reader Discussion

  1. Isabella Costa · Junior Engineer · Kind words

    saved this. sharing at standup tomorrow — we've had exactly this problem for 2 sprints and nobody on the team had framed it this way 🙏

    Jan 17, 2026 · 2 days later
  2. Kenta Yamada · Tech Lead · Asks

    would love a war-story follow-up. principles are clear; the actual debugging session is where the interesting stuff lives. there's a real shortage of "here's the dashboard, here's the thread we pulled, here's where we got stuck for 90 mins" content.

    Jan 19, 2026 · 4 days later

Worked on something similar? Email ducminhldm@gmail.com — I read every one. The good ones become future posts.
