Liveness, Readiness, and Startup Probes: The Three Health Checks and How They Differ

Readiness gates traffic, liveness restarts containers, startup protects slow boots — and wiring them up the same way is how a slow dependency turns into a restart loop. The differences that matter.

June 14, 2026 · 9 min read · Kubernetes

Kubernetes gives a container three health checks, and the most common production mistake with them is treating all three as one question: "is the app up?". They answer three different questions, and pointing them at the same endpoint is how a slow database pushes a healthy pod into CrashLoopBackOff. Here is what each one actually controls.

Readiness: should traffic come to this pod?

A failing readiness probe does not restart the container. It removes the pod from the Service's Endpoints, so no new traffic is routed to it; existing connections are untouched. When it passes again, the pod is added back. This is the probe for transient "I'm temporarily busy" states — warming a cache, waiting on a dependency, draining before shutdown.

readinessProbe:
  httpGet:
    path: /readyz
    port: 8080
  periodSeconds: 5
  failureThreshold: 3       # marked unready after about 5 * 3 = 15s of failures

Liveness: should this container be restarted?

A failing liveness probe makes the kubelet kill the container and restart it (subject to the pod's restartPolicy and the restart backoff). It exists for one situation only: the process is running but wedged, as in a deadlock, a stuck event loop, or any other state that a restart fixes. If a restart would not help, it should not be a liveness failure.

livenessProbe:
  httpGet:
    path: /livez
    port: 8080
  periodSeconds: 10
  failureThreshold: 3       # restarted after about 10 * 3 = 30s of failures

The mistake that causes restart storms

The anti-pattern is making /livez check downstream dependencies — the database, a cache, another service. Now picture the database getting slow. Every pod's liveness probe fails, so Kubernetes restarts all of them at once. Restarting an app never fixes a slow database, so they fail again, and you have converted a degraded dependency into a full outage plus a thundering herd of reconnects.

The rule: liveness checks only "is my own process wedged?" (usually a trivial in-process handler). Readiness checks "can I serve a real request right now?", dependencies included. A slow DB should make pods unready, so traffic stops reaching them; it should never get them killed.
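One way to picture that split is a pair of handlers that share a server but not a definition of "healthy". The sketch below uses only Python's standard library; `check_dependencies` and `probe_status` are hypothetical names invented for illustration, and the dependency check is a stub you would replace with a real ping that has a short timeout.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


def check_dependencies() -> bool:
    """Hypothetical stub: replace with a short-timeout DB or cache ping."""
    return True


def probe_status(path: str, deps_ok: bool) -> int:
    """Map a probe path to the HTTP status the kubelet will see."""
    if path == "/livez":
        # Liveness: being able to answer at all shows the process is not
        # wedged. Dependencies are deliberately ignored here.
        return 200
    if path == "/readyz":
        # Readiness: accept traffic only while dependencies are usable.
        return 200 if deps_ok else 503
    return 404


class ProbeHandler(BaseHTTPRequestHandler):
    def do_GET(self) -> None:
        # Only the readiness path pays for the dependency check.
        deps_ok = check_dependencies() if self.path == "/readyz" else True
        self.send_response(probe_status(self.path, deps_ok))
        self.end_headers()

    def log_message(self, format, *args) -> None:
        pass  # keep probe traffic out of the application logs


def serve(port: int = 8080) -> None:
    """Serve both probe endpoints on one port (blocks forever)."""
    HTTPServer(("", port), ProbeHandler).serve_forever()
```

Calling `serve()` exposes both endpoints on port 8080, matching the probe configs above. With the database down, `/readyz` answers 503 and the pod drops out of rotation, while `/livez` keeps answering 200 and nothing is restarted.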

Startup: protect slow boots from liveness

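Apps with a long cold start (JVM warmup, large model load, migrations) hit a chicken-and-egg problem: a liveness probe aggressive enough for steady state will kill the container before it finishes booting. The naive fixes both cost you. A big initialDelaySeconds on liveness is a fixed guess: too short and slow boots still get killed, too long and every restart waits out the full delay before a hang can be noticed. Loosening the liveness failureThreshold or periodSeconds instead delays detection of real hangs forever after.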

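The startup probe solves it cleanly: until it succeeds, the liveness and readiness probes are held off. After its first success it never runs again for that container, and the other two take over. If it exhausts its failureThreshold, the container is killed and restarted like a liveness failure. Size it for the worst-case boot: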

startupProbe:
  httpGet:
    path: /livez
    port: 8080
  periodSeconds: 5
  failureThreshold: 30      # allows up to 5 * 30 = 150s to boot

This gives the app a 150-second budget to start while keeping the tight 30-second liveness window (10s period * 3 failures) once it is running.
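The budget arithmetic is simple enough to check in a few lines. This is a rough sketch: the `Probe` class is a hypothetical helper, not a Kubernetes API, and the estimate ignores timeoutSeconds and initialDelaySeconds, so treat the numbers as approximate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Probe:
    period_seconds: int
    failure_threshold: int

    def failure_window(self) -> int:
        """Approximate seconds of consecutive failures before the kubelet acts."""
        return self.period_seconds * self.failure_threshold


# Values copied from the three probe configs in this post.
STARTUP = Probe(period_seconds=5, failure_threshold=30)
LIVENESS = Probe(period_seconds=10, failure_threshold=3)
READINESS = Probe(period_seconds=5, failure_threshold=3)

print(STARTUP.failure_window())    # 150: boot budget before the container is killed
print(LIVENESS.failure_window())   # 30: time a wedged process survives
print(READINESS.failure_window())  # 15: time before an unready pod loses traffic
```

The useful comparison is between the first two numbers: the startup window can be as generous as the slowest boot needs without loosening the liveness window at all.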

Rules of thumb

  • Readiness = traffic gate (no restart). Liveness = restart trigger. Startup = a one-time grace period that gates the other two.
  • Never check external dependencies in liveness — that is what turns a slow dependency into a cluster-wide restart loop. Put dependency checks in readiness.
  • If a restart would not fix the failure, it must not be a liveness failure.
  • Slow boot? Add a startup probe with a generous failureThreshold instead of inflating liveness initialDelaySeconds.
  • Keep probe handlers cheap and dependency-light; an expensive probe under load becomes its own failure source.
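One possible way to keep a readiness handler cheap is to cache the dependency check for a second or two, so probe traffic never multiplies load on a struggling database. The sketch below is an assumption about how you might do it, not a prescribed pattern; `CachedCheck` is a hypothetical helper, and it is not thread-safe as written.

```python
import time
from typing import Callable


class CachedCheck:
    """Run an expensive check at most once per ttl seconds."""

    def __init__(
        self,
        check: Callable[[], bool],
        ttl: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._check = check
        self._ttl = ttl
        self._clock = clock
        self._expires_at = float("-inf")  # forces a real check on first call
        self._result = False

    def __call__(self) -> bool:
        now = self._clock()
        if now >= self._expires_at:
            try:
                self._result = self._check()
            except Exception:
                # A check that raises counts as "not ready", never as a crash.
                self._result = False
            self._expires_at = now + self._ttl
        return self._result
```

Wrapping a database ping as `CachedCheck(ping_db, ttl=2.0)`, where `ping_db` stands in for your own function, means a readiness probe with periodSeconds: 5 triggers at most one real ping per probe, and any extra callers within the TTL get the cached answer.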

Reader Discussion

7 replies

  1. Anders Lindqvist · Staff SRE · Story · Highlighted by author

    the preStop sleep trick. THE preStop sleep trick. we spent 3 days debugging mystery 5xx during deploys and the answer was a 10-second sleep. there should be a billboard.

    Jun 15, 2026·1 day later
  2. Vasili Kurov · Platform Engineer · From experience

    HPA on CPU only is the silent killer. moved ours to requests-per-pod via prometheus-adapter and our scaling went from "vaguely correct" to actually correlated with load. 2 hours of work, immediate ROI.

    Jun 18, 2026·4 days later
  3. Tiến Hồ 🇻🇳 Hà Nội · DevOps Engineer · Agrees

    PDB = the thing 90% of our team skipped until a GKE node upgrade took 3 pods down at once. minAvailable: 2 plus replicas: 3 is the default i deploy with now, no thinking required.

    Jun 17, 2026·3 days later
  4. Priscilla Owens · Backend Lead · Pushback

    small pushback — "never set CPU limits" is too strong imo. on cgroups v2 with steady traffic profiles, soft caps prevent one noisy neighbour from starving the whole node. it's a per-workload call. great post otherwise.

    Jun 21, 2026·1 week later·edited
  5. Jiwoo Park · Junior Engineer · Kind words

    the "control loop, not deploy tool" framing finally made k8s click for me. been fighting with it for 4 months. wish onboarding docs led with this paragraph instead of YAML.

    Jun 16, 2026·2 days later
  6. Léa Dubois · SRE · Asks

    any chance you'd publish these as a PDF collection? would love to print and read offline on flights. screen-fatigue is real.

    Jun 20, 2026·6 days later
  7. Ahmed Rahman · Full Stack · Kind words

    concise + opinionated = my favourite kind of engineering post. so many blogs hedge every claim into mush. give me the spicy take with the receipts. more please.

    Jun 15, 2026·1 day later

Worked on something similar? Email ducminhldm@gmail.com — I read every one. The good ones become future posts.

Comments seeded · live discussion via email