The Production Kubernetes Checklist: 11 Things That Will Bite You
The defaults work fine in dev. They will betray you in prod. Here's the checklist I run on every cluster before I let it hold real traffic.
The K8s defaults are designed for one thing: getting started. They're tuned for low cognitive load on a fresh cluster, not for a pod that's serving 4,000 RPS to paying customers at 3am. Every item on this list is something I've personally been bitten by, or watched a team get bitten by, in production.
1. No resource requests / limits
Without resources.requests, the scheduler thinks your pod needs zero. It'll happily pack it onto a node already 90% utilised. First traffic spike, the node OOMs, your pod gets killed, the next pod gets scheduled onto another packed node. Welcome to the cascade.
resources:
  requests:
    cpu: 200m
    memory: 256Mi
  limits:
    memory: 512Mi
    # CPU limit: deliberately omitted, see below
Set requests always. Set memory limits always. Be careful with CPU limits: a limit is enforced as a CFS quota, so a container that burns through its quota gets throttled even when the node has idle CPU. This has bitten many teams on cgroups v1. On cgroups v2 it's better, but still: most of the time you want CPU requests for scheduling and no hard cap.
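One way to make "set requests always" hold even for workloads whose authors forgot is a namespace-level LimitRange. The following is a minimal sketch, not a recommendation for specific numbers: the object name and all values are assumptions, and the namespace is borrowed from the NetworkPolicy example later in this list.

```yaml
# Sketch only: defaults injected into containers that declare no resources.
# Name and values are illustrative assumptions.
apiVersion: v1
kind: LimitRange
metadata:
  name: container-defaults
  namespace: prod
spec:
  limits:
    - type: Container
      defaultRequest:      # used when a container sets no requests
        cpu: 100m
        memory: 128Mi
      default:             # used when a container sets no limits
        memory: 256Mi      # no CPU default, so no CPU cap is injected
```

Defaults like these are a safety net, not a substitute for measuring each service and setting its own values.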
2. No liveness / readiness probes
Kubernetes will route traffic to a pod the moment its container starts — even if your app is still warming up the JVM, loading caches, or running migrations. Without a readiness probe, your first thirty seconds of pod startup are a 5xx generator.
readinessProbe:
  httpGet:
    path: /health/ready
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 5
livenessProbe:
  httpGet:
    path: /health/live
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 15
  failureThreshold: 3
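Two probes, two endpoints. ready tells the Service "send traffic." live tells the kubelet "if I keep failing this, restart me." Don't make them the same endpoint: readiness can check whatever the pod needs in order to serve (warm caches, required connections), while liveness should stay cheap and reflect fundamental brokenness only. A liveness probe that depends on a downstream service turns one slow dependency into a fleet-wide restart loop.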
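For apps with a long or variable warm-up (the JVM case above), a startupProbe is often a better fit than a large initialDelaySeconds. This is a sketch under assumptions: it reuses the /health/live endpoint from the example above, and the thresholds are illustrative.

```yaml
# Sketch: allows up to 30 x 5s = 150s for startup.
# Liveness and readiness probes do not run until this probe succeeds.
startupProbe:
  httpGet:
    path: /health/live     # assumes the liveness endpoint shown above
    port: 8080
  periodSeconds: 5
  failureThreshold: 30
```

With this in place, the liveness probe's initialDelaySeconds can usually be dropped, since the kubelet holds it back until startup has succeeded.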
3. No PodDisruptionBudget
You drained a node. Three pods of your service ran on it. They all moved at once. There were only three replicas. Your service was down for the eviction window. This is not a hypothetical — it happens any time anyone kubectl drains a node.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: orders-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: orders
PDB is the contract that voluntary disruptions (node drains, cluster-autoscaler scale-down) respect. Set minAvailable on every important Deployment. Don't set it equal to your replica count: no pod can ever be evicted, so every drain blocks until someone intervenes.
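If the replica count changes over time (for example under an HPA), a fixed minAvailable can drift into the "equal to replica count" trap. A possible alternative, sketched here with the same name and labels as the example above, is to express the budget as maxUnavailable instead. A PDB accepts one of the two fields, not both.

```yaml
# Sketch: at most one orders pod may be voluntarily evicted at a time,
# regardless of how many replicas currently exist.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: orders-pdb
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      app: orders
```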
4. The HPA on CPU only
The metric almost every HPA starts with is CPU utilization (it's what kubectl autoscale gives you). For a Java service with a thread pool, CPU correlates poorly with how loaded the service actually is. I've seen HPAs sit at 30% CPU while the request queue was 4 minutes deep, because the bottleneck was a downstream DB connection, not CPU.
Use a meaningful business metric: requests-per-pod-per-second, queue depth, p95 latency. metrics-server only serves CPU and memory, so custom metrics need an adapter that implements the custom metrics API, such as the Prometheus adapter. Worth the setup.
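As a rough sketch of what that looks like, here is an autoscaling/v2 HPA scaling on a per-pod request rate. The metric name http_requests_per_second, the replica bounds, and the target value are all assumptions; the metric only exists if your adapter is configured to expose it under that name.

```yaml
# Sketch: scale the orders Deployment on average requests per pod.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: orders-hpa                       # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: orders
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second # assumed to be served by your adapter
        target:
          type: AverageValue
          averageValue: "50"             # illustrative target per pod
```

Keeping a CPU metric alongside the custom one is a reasonable hedge: with several metrics listed, the HPA scales to whichever asks for the most replicas.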
5. imagePullPolicy: Always on a tag like :latest
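This is the single fastest way to take down a service the next time the registry has a hiccup. With Always, the kubelet contacts the registry on every container start to resolve the tag, even when the image is already cached on the node. The registry is unavailable for 60 seconds during AWS Tuesday? Every pod that tries to start in that window sits in ImagePullBackOff. And because :latest is mutable, two pods from the same Deployment can end up running different code.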
Pin images to immutable digests (image: foo@sha256:abc...) or to semver tags that your registry treats as immutable. Set imagePullPolicy: IfNotPresent for pinned images.
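In a container spec, the combination looks roughly like this. The registry host and container name are hypothetical, and <digest> is a placeholder for the real digest your build pipeline produces.

```yaml
# Sketch: digest-pinned image that starts from the node cache when present.
containers:
  - name: orders
    image: registry.example.com/orders@sha256:<digest>   # placeholder, not a real digest
    imagePullPolicy: IfNotPresent
```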
6. Cluster autoscaler with no priority classes
Without priority classes, nothing in the cluster can tell the difference between your daily cron job and your customer-facing API. When capacity is tight, every pod is an equally valid candidate for preemption by the scheduler or eviction by the kubelet. Sometimes that's your API.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: user-facing
value: 1000
globalDefault: false
Tag user-facing services with high priority, batch jobs with low. When a high-priority pod can't be placed, the scheduler preempts lower-priority pods to make room for it.
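A PriorityClass does nothing until pods reference it. The sketch below adds a low-priority class for batch work and shows where priorityClassName goes in a Deployment's pod template. The batch-low name, its value, and the description are assumptions.

```yaml
# Sketch: a low-priority class for batch and cron workloads.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: batch-low            # hypothetical name
value: 100
globalDefault: false
preemptionPolicy: Never      # batch pods wait for room instead of preempting others
description: "Batch and cron workloads that may be preempted first."
---
# Fragment of the user-facing Deployment: reference the class from the pod template.
spec:
  template:
    spec:
      priorityClassName: user-facing
```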
7. No NetworkPolicy
By default, every pod in a cluster can talk to every other pod. The compromised nginx-test pod that someone forgot about can talk to your customer database service. NetworkPolicy is the firewall you forgot to install.
Start with a default deny in each namespace and explicitly allow:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny
  namespace: prod
spec:
  podSelector: {}
  policyTypes: [Ingress, Egress]
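Then add per-service allow rules. Two things to know before you apply it: denying egress also blocks DNS until you explicitly allow it, and NetworkPolicy is only enforced if your CNI plugin supports it. Tedious to set up. Game-changing for security.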
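Here is a sketch of the first two allow rules most namespaces need once default deny is in place. It assumes cluster DNS runs in kube-system, that namespaces carry the automatic kubernetes.io/metadata.name label, and that a hypothetical "web" app is the only caller of orders on port 8080.

```yaml
# Sketch: let every pod in prod resolve DNS.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns-egress
  namespace: prod
spec:
  podSelector: {}
  policyTypes: [Egress]
  egress:
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system   # assumes DNS lives here
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
---
# Sketch: let "web" pods (hypothetical) call "orders" on 8080.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: orders-allow-web
  namespace: prod
spec:
  podSelector:
    matchLabels:
      app: orders
  policyTypes: [Ingress]
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: web
      ports:
        - protocol: TCP
          port: 8080
```

Because the default deny covers egress too, the web pods also need a matching egress rule toward orders. Ingress on one side is not enough.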
8. Logs writing only to stdout, no rotation strategy
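Kubernetes captures stdout/stderr to disk on the node. The kubelet's default rotation is 10 MB × 5 files per container. A service writing 1 MB/sec of logs cycles through that in under a minute, so anything your log shipper hasn't collected by then is gone. Where rotation is misconfigured or missing, the same volume fills the node disk instead.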
Either: (a) write to stdout but log at sane volumes (don't log every request body), or (b) ship logs to a sidecar that streams them off-node. Don't fight the host disk.
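If you manage the kubelet configuration yourself, the rotation window can also be widened to buy the log shipper more time. This is a sketch with illustrative values. Managed distributions may not expose this file, in which case the equivalent setting (if any) lives in the provider's node configuration.

```yaml
# Sketch: larger per-container log files before rotation discards them.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
containerLogMaxSize: 50Mi    # kubelet default is 10Mi
containerLogMaxFiles: 5      # kubelet default is 5
```

This trades node disk for retention, so it complements logging at sane volumes rather than replacing it.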
9. Single-zone Deployment
The scheduler's built-in zone spreading is best-effort, so when capacity is uneven it will happily place all 5 of your replicas on nodes in the same availability zone. AZ outages happen. Spread your replicas explicitly:
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app: orders
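DoNotSchedule is a hard rule: if a zone has no room, the pod stays Pending rather than landing somewhere unbalanced. A common variation, sketched here as an assumption about what you might want rather than a prescription, keeps the zone rule hard and adds a soft per-node rule on top.

```yaml
# Sketch: hard spread across zones, best-effort spread across nodes.
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app: orders
  - maxSkew: 1
    topologyKey: kubernetes.io/hostname
    whenUnsatisfiable: ScheduleAnyway    # prefer separate nodes, but never block scheduling
    labelSelector:
      matchLabels:
        app: orders
```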
10. kube-system with no resource limits
An exhausted node will OOM-kill processes. The kube-system pods (kubelet helpers, CNI, kube-proxy) are also on that node. Without QoS guarantees, they get killed alongside your app, and the node goes NotReady. Suddenly the autoscaler is reacting to a node disappearing — for no reason except that it ran out of memory.
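Use the system-node-critical priority class for per-node daemons such as the CNI plugin and kube-proxy, system-cluster-critical for cluster-wide add-ons, and keep some headroom on each node (around 10% of memory) reserved via the kubelet's --system-reserved and --kube-reserved flags.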
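The same reservations can be expressed in the kubelet config file instead of flags. The values below are assumptions sized for a node with roughly 16 GiB of memory. Treat them as a starting point to measure against, not as tuned numbers.

```yaml
# Sketch: reserve headroom for the OS and Kubernetes daemons on each node.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
systemReserved:              # OS daemons (sshd, journald, ...)
  cpu: 500m
  memory: 1Gi
kubeReserved:                # kubelet and container runtime
  cpu: 250m
  memory: 512Mi
evictionHard:
  memory.available: "500Mi"  # kubelet starts evicting pods below this
```

Reserved resources are subtracted from the node's allocatable capacity, so the scheduler stops packing pods into memory the node itself needs.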
11. No pre-stop hook
When a pod is terminated, the kubelet sends SIGTERM and waits terminationGracePeriodSeconds (default 30) before SIGKILL. Endpoint removal happens in parallel, not first, and it can take 5–10 seconds for every kube-proxy in the cluster to drop the pod's IP. Pod's already exiting, traffic still arriving, 5xx for everyone.
# container level
lifecycle:
  preStop:
    exec:
      command: ["sh", "-c", "sleep 10"]
# pod spec level
terminationGracePeriodSeconds: 30
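Yes, a literal sleep (which means the image needs a shell). The pod stays alive for 10 seconds after being marked for termination, while endpoints propagate. Then your app gets SIGTERM and shuts down cleanly. The sleep counts against terminationGracePeriodSeconds, so with these values the app has the remaining 20 seconds to exit before SIGKILL. This is the single fastest fix for "why does our deploy emit 5xx errors."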
The honest summary
Production Kubernetes isn't a different system from dev Kubernetes. It's the same system with about a dozen knobs turned to a different position. Every item on this list is one of those knobs. None of them are documented as "required." Every one of them is required, in the sense that you'll regret skipping it eventually.
The default Kubernetes experience is friendly. The production-correct Kubernetes experience is opinionated. The gap between them is most of what makes K8s feel hard. It's also where the value lives.