The Production Kubernetes Checklist: 11 Things That Will Bite You

The defaults work fine in dev. They will betray you in prod. Here's the checklist I run on every cluster before I let it hold real traffic.

March 19, 2026 · 12 min read · Kubernetes · Reliability · Production

The K8s defaults are designed for one thing: getting started. They're tuned for low cognitive load on a fresh cluster, not for a pod that's serving 4,000 RPS to paying customers at 3am. Every item on this list is something I've personally been bitten by, or watched a team get bitten by, in production.

1. No resource requests / limits

Without resources.requests, the scheduler thinks your pod needs zero. It'll happily pack it onto a node already 90% utilised. First traffic spike, the node OOMs, your pod gets killed, the next pod gets scheduled onto another packed node. Welcome to the cascade.

resources:
  requests:
    cpu: 200m
    memory: 256Mi
  limits:
    memory: 512Mi
    # CPU limit: deliberate omission — see below

Set requests always. Set memory limits always. Be careful with CPU limits — CFS throttling on cgroups v1 has bitten many teams. On cgroups v2 it's better, but still: most of the time you want CPU requests for scheduling and no hard cap.
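
If you'd rather not rely on every team remembering this, one option is a namespace-level LimitRange that fills in defaults for containers that declare nothing. A minimal sketch, assuming a namespace called prod; the name default-requests and the numbers are placeholders, not recommendations:

```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: default-requests   # hypothetical name
  namespace: prod          # assumed namespace
spec:
  limits:
  - type: Container
    # Applied as resources.requests when a container sets none
    defaultRequest:
      cpu: 100m
      memory: 128Mi
    # Applied as resources.limits when a container sets none;
    # no CPU entry here, so no default CPU cap is imposed
    default:
      memory: 256Mi
```

Explicit values in the pod spec still win; this only catches the workloads that forgot.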

2. No liveness / readiness probes

Kubernetes will route traffic to a pod the moment its container starts — even if your app is still warming up the JVM, loading caches, or running migrations. Without a readiness probe, your first thirty seconds of pod startup are a 5xx generator.

readinessProbe:
  httpGet:
    path: /health/ready
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 5
livenessProbe:
  httpGet:
    path: /health/live
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 15
  failureThreshold: 3

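Two probes, two endpoints. ready tells the Service "send traffic." live tells the kubelet "if I keep failing this, restart me." Don't make them the same endpoint. Readiness should reflect whether this pod can serve traffic right now (caches warm, connections established). Liveness should reflect fundamental brokenness only, such as a deadlocked process, and should not depend on downstream services. Otherwise a database blip restarts your whole fleet.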
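
For slow starters like the JVM case above, a startupProbe is one way to avoid guessing at initialDelaySeconds: liveness and readiness checks are held off until it succeeds. A sketch that reuses the /health/live endpoint from the example above; the thresholds are assumptions to tune per service:

```yaml
startupProbe:
  httpGet:
    path: /health/live
    port: 8080
  periodSeconds: 5
  # The container gets up to 24 x 5s = 120s to start
  # before the kubelet gives up and restarts it
  failureThreshold: 24
```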

3. No PodDisruptionBudget

You drained a node. Three pods of your service ran on it. They all moved at once. There were only three replicas. Your service was down for the eviction window. This is not a hypothetical — it happens any time anyone kubectl drains a node.

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: orders-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: orders

PDB is the contract that voluntary disruptions (node drains, cluster-autoscaler scale-down) respect. Set minAvailable on every important Deployment. Don't set it equal to your replica count: no pod can ever be evicted, so every drain blocks until someone intervenes.
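
If your replica count changes often (an HPA, say), expressing the budget as maxUnavailable is one way to avoid accidentally matching it. A sketch for the same orders Deployment; use one of minAvailable or maxUnavailable, not both:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: orders-pdb
spec:
  # At most one pod may be evicted at a time,
  # regardless of how many replicas currently exist
  maxUnavailable: 1
  selector:
    matchLabels:
      app: orders
```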

4. The HPA on CPU only

The HPA's default metric is CPU utilization. For a Java service with a thread pool, CPU correlates poorly with how loaded the service actually is. I've seen HPAs sit at 30% CPU while the request queue was 4 minutes deep, because the bottleneck was a downstream DB connection, not CPU.

Use a metric that tracks real load: requests-per-pod-per-second, queue depth, p95 latency. The metrics-server only provides CPU and memory, so custom metrics need an adapter that implements the custom metrics API, such as the Prometheus adapter. Worth the setup.
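
Roughly what that looks like with the autoscaling/v2 API. This is a sketch, not a drop-in: the metric name http_requests_per_second is hypothetical and depends entirely on how your adapter is configured, and the target of 50 and the replica bounds are placeholders.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: orders-hpa            # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: orders
  minReplicas: 3
  maxReplicas: 20
  metrics:
  - type: Pods
    pods:
      metric:
        # Must be exposed through the custom metrics API by your adapter
        name: http_requests_per_second
      target:
        type: AverageValue
        # Scale so that each pod averages about 50 requests/sec
        averageValue: "50"
```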

5. imagePullPolicy: Always on a tag like :latest

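This is the single fastest way to take down a service the next time the registry has a hiccup. With Always, the kubelet contacts the registry every time a container starts, even when the image is already cached on the node. The registry is unavailable for 60 seconds during AWS Tuesday? Every pod that restarts or scales up in that window is stuck in ImagePullBackOff. And because :latest is mutable, two replicas of the same Deployment can end up running different code.

Pin images to immutable digests (image: foo@sha256:abc...) or to semver tags that your registry treats as immutable. Set imagePullPolicy: IfNotPresent for pinned images.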
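
In a pod template that looks something like the fragment below. The registry host and container name are made up, and <digest> is a placeholder you'd fill from your build output:

```yaml
containers:
- name: orders                     # hypothetical container name
  # Digest reference: the same bytes on every node, every time
  image: registry.example.com/orders@sha256:<digest>
  # Cached copies are reused, so a registry blip
  # only affects nodes that have never pulled this image
  imagePullPolicy: IfNotPresent
```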

6. Cluster autoscaler with no priority classes

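Without priority classes, every pod runs at the same default priority, so nothing in the cluster can tell the difference between your daily cron job and your customer-facing API. When capacity is tight, the scheduler has no grounds to preempt the cron job to make room for the API, and the API can sit Pending while the cluster autoscaler brings up a node. Under node pressure, the kubelet's eviction ranking can't favour the API either. Sometimes the pod that loses is your API.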

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: user-facing
value: 1000
globalDefault: false

Tag user-facing services with high priority, batch jobs with low. When a high-priority pod needs room, the scheduler will preempt the low-priority work first.
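
A sketch of the other half: a low class for batch work, and how a pod opts in. The name batch-low and the value 100 are placeholders; preemptionPolicy: Never means these pods wait in the queue rather than evicting anything themselves.

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: batch-low            # hypothetical name
value: 100
globalDefault: false
preemptionPolicy: Never
description: "Batch and cron work that can wait."
---
# Fragment of a pod template (Deployment, CronJob, ...):
spec:
  priorityClassName: user-facing   # or batch-low for the cron job
```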

7. No NetworkPolicy

By default, every pod in a cluster can talk to every other pod. The compromised nginx-test pod that someone forgot about can talk to your customer database service. NetworkPolicy is the firewall you forgot to install.

Start with a default deny in each namespace and explicitly allow:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny
  namespace: prod
spec:
  podSelector: {}
  policyTypes: [Ingress, Egress]

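Then add per-service allow rules. One caveat: policies are only enforced if your CNI plugin implements NetworkPolicy. On one that doesn't, they are accepted and silently ignored. Tedious to set up. Game-changing for security.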
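
Two allow rules you will likely need on day one, as a sketch. The api-gateway label is a hypothetical caller, and the DNS rule assumes cluster DNS runs in kube-system, which is common but not universal. Without something like the second policy, a default-deny on Egress also blocks name resolution.

```yaml
# Let the (hypothetical) api-gateway pods reach orders on 8080
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: orders-allow-ingress
  namespace: prod
spec:
  podSelector:
    matchLabels:
      app: orders
  policyTypes: [Ingress]
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: api-gateway
    ports:
    - protocol: TCP
      port: 8080
---
# Let every pod in prod resolve DNS via kube-system
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns-egress
  namespace: prod
spec:
  podSelector: {}
  policyTypes: [Egress]
  egress:
  - to:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: kube-system
    ports:
    - protocol: UDP
      port: 53
    - protocol: TCP
      port: 53
```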

8. Logs writing only to stdout, no rotation strategy

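Kubernetes captures stdout/stderr to disk on the node. The default rotation on most distributions is 10 MB × 5 files per container, about 50 MB in total. A service writing 1 MB/sec of logs cycles through that in under a minute, after which the oldest logs are deleted whether or not anything has shipped them. Rotation is also periodic rather than instant, so a fast enough writer can overshoot the cap and eat node disk between checks.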

Either: (a) write to stdout but log at sane volumes (don't log every request body), or (b) ship logs to a sidecar that streams them off-node. Don't fight the host disk.
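
If you control the kubelet configuration (on many managed clusters you don't, or only through the provider's own settings), the rotation limits themselves are tunable. A sketch with placeholder values; this buys time for your log shipper, it doesn't replace logging less:

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# Rotate each container log file at 50Mi instead of the default
containerLogMaxSize: 50Mi
# Keep at most 5 files per container
containerLogMaxFiles: 5
```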

9. Single-zone Deployment

By default, the scheduler's zone spreading is best-effort: if one zone has the most free capacity, it will happily place all 5 of your replicas there. AZ outages happen. Spread your replicas:

topologySpreadConstraints:
- maxSkew: 1
  topologyKey: topology.kubernetes.io/zone
  whenUnsatisfiable: DoNotSchedule
  labelSelector:
    matchLabels:
      app: orders

10. kube-system with no resource limits

An exhausted node will OOM-kill processes. The kube-system pods (kubelet helpers, CNI, kube-proxy) are also on that node. Without QoS guarantees, they get killed alongside your app, and the node goes NotReady. Suddenly the autoscaler is reacting to a node disappearing — for no reason except that it ran out of memory.

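Use the system-node-critical priority class for per-node daemons such as the CNI plugin and kube-proxy (system-cluster-critical for cluster-wide add-ons), and keep some headroom on each node (around 10% of memory) reserved via the kubelet's --system-reserved and --kube-reserved settings.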
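
The same reservation expressed in a kubelet config file rather than flags, as a sketch. The numbers are placeholders to size against your nodes, not recommendations:

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# Held back for OS daemons (sshd, systemd, ...)
systemReserved:
  cpu: 500m
  memory: 1Gi
# Held back for the kubelet and container runtime
kubeReserved:
  cpu: 250m
  memory: 512Mi
# Start evicting pods before the kernel OOM killer picks for you
evictionHard:
  memory.available: 500Mi
```

Whatever you reserve is subtracted from the node's allocatable capacity, so the scheduler stops packing pods into it.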

11. No pre-stop hook

When a pod is terminated, the kubelet sends SIGTERM and waits terminationGracePeriodSeconds (default 30) before SIGKILL. But the Service endpoints can take 5–10 seconds to remove the pod's IP across all kube-proxies in the cluster. Pod's already exiting, traffic still arriving, 5xx for everyone.

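# Container-level field (spec.containers[].lifecycle)
lifecycle:
  preStop:
    exec:
      command: ["sh", "-c", "sleep 10"]
# Pod-level field (spec.terminationGracePeriodSeconds).
# The clock covers the preStop sleep plus your app's own shutdown time.
terminationGracePeriodSeconds: 30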

Yes, a literal sleep. The pod stays alive for 10 seconds after being marked for termination, while endpoints propagate. Then your app gets SIGTERM and shuts down cleanly. This is the single fastest fix for "why does our deploy emit 5xx errors."

The honest summary

Production Kubernetes isn't a different system from dev Kubernetes. It's the same system with about a dozen knobs turned to a different position. Every item on this list is one of those knobs. None of them are documented as "required." Every one of them is required, in the sense that you'll regret skipping it eventually.

The default Kubernetes experience is friendly. The production-correct Kubernetes experience is opinionated. The gap between them is most of what makes K8s feel hard. It's also where the value lives.

Reader Discussion

  1. Highlighted by author
    Anders Lindqvist · Staff SRE · Story

    the preStop sleep trick. THE preStop sleep trick. we spent 3 days debugging mystery 5xx during deploys and the answer was a 10-second sleep. there should be a billboard.

    Mar 20, 2026 · 1 day later
  2. Vasili Kurov · Platform Engineer · From experience

    HPA on CPU only is the silent killer. moved ours to requests-per-pod via prometheus-adapter and our scaling went from "vaguely correct" to actually correlated with load. 2 hours of work, immediate ROI.

    Mar 23, 2026 · 4 days later
  3. Tiến Hồ 🇻🇳 Hà Nội · DevOps Engineer · Agrees

    PDB = the thing 90% of our teams skip until a GKE node upgrade takes 3 pods down at once. minAvailable: 2 plus replicas: 3 is what I deploy by default now, no thinking needed.

    Mar 22, 2026 · 3 days later
  4. Priscilla Owens · Backend Lead · Pushback

    small pushback — "never set CPU limits" is too strong imo. on cgroups v2 with steady traffic profiles, soft caps prevent one noisy neighbour from starving the whole node. it's a per-workload call. great post otherwise.

    Mar 26, 2026 · 1 week later · edited
  5. Jiwoo Park · Junior Engineer · Kind words

    the "control loop, not deploy tool" framing finally made k8s click for me. been fighting with it for 4 months. wish onboarding docs led with this paragraph instead of YAML.

    Mar 21, 2026 · 2 days later
  6. Isabella Costa · Junior Engineer · Kind words

    saved this. sharing at standup tomorrow — we've had exactly this problem for 2 sprints and nobody on the team had framed it this way 🙏

    Mar 21, 2026 · 2 days later
  7. Kenta Yamada · Tech Lead · Asks

    would love a war-story follow-up. principles are clear; the actual debugging session is where the interesting stuff lives. there's a real shortage of "here's the dashboard, here's the thread we pulled, here's where we got stuck for 90 mins" content.

    Mar 23, 2026 · 4 days later

Worked on something similar? Email ducminhldm@gmail.com — I read every one. The good ones become future posts.

Comments seeded · live discussion via email