Requests and Limits: The Two Numbers That Decide Your Bill and Your Latency

Get requests and limits wrong and you either waste half your cluster or get pods OOM-killed and CPU-throttled at the worst moment. A practical guide to setting them.

May 28, 2026 · 10 min read · Kubernetes, Scaling

Every container in Kubernetes can declare resource requests (what it's guaranteed) and limits (what it can't exceed). These two numbers quietly decide where pods get scheduled, which ones get killed under pressure, and how much of your cloud bill is paying for idle headroom. Most teams set them by guessing once and never revisiting.

1. Requests are for scheduling; limits are for protection

The scheduler places a pod only on a node whose unreserved capacity covers the pod's requests. A request is a reservation that holds whether or not the pod ever uses it. Limits are the ceiling the kubelet enforces at runtime. Set requests too high and you pay for a half-empty cluster; set them too low and you over-pack nodes, and pods starve under load.

resources:
  requests: { cpu: "250m", memory: "256Mi" }   # guaranteed, used for scheduling
  limits:   { cpu: "1",    memory: "512Mi" }   # hard ceiling

2. CPU and memory behave completely differently at the limit

This is the part that bites people. CPU is compressible: hit the limit and you get throttled (slowed), not killed. Memory is incompressible: exceed the limit and the container is OOM-killed. So an aggressive CPU limit costs you tail latency; an aggressive memory limit costs you a crash loop.
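
One way to tell the two failure modes apart is to look at the pod's status after the fact. The fragment below is an illustrative excerpt of what `kubectl get pod <name> -o yaml` typically reports for a container that exceeded its memory limit. The container name and restart count are made up for the example.

```yaml
# Illustrative excerpt from a pod's status; "app" and the restart count are hypothetical.
status:
  containerStatuses:
    - name: app
      restartCount: 4
      lastState:
        terminated:
          reason: OOMKilled   # the container was killed for exceeding its memory limit
          exitCode: 137       # 128 + 9 (SIGKILL)
```

CPU throttling leaves no such trace in the pod status. It usually shows up only in metrics, for example the throttled-period counter that cAdvisor exposes (`container_cpu_cfs_throttled_periods_total`), so a service can be slow at its CPU limit while looking perfectly healthy to Kubernetes.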

3. QoS classes decide who dies first

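Kubernetes derives a Quality-of-Service class for each pod from your settings. The class sets how the kernel's OOM killer ranks the pod's containers and, together with pod priority and how far usage exceeds requests, shapes eviction order when a node runs out of memory:

Guaranteed: every container sets CPU and memory requests equal to its limits. Last to be evicted.
Burstable: at least one container sets a request or limit, but the pod does not meet the Guaranteed rule (typically requests < limits). Evicted after BestEffort.
BestEffort: no requests or limits set on any container. First against the wall.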

Critical workloads should be Guaranteed. Never run important pods as BestEffort.
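
To make the three classes concrete, here is a minimal sketch with one pod per class. The pod names and the image are placeholders invented for illustration, and the resource values are arbitrary.

```yaml
# Hypothetical pods; names and the image are placeholders.
apiVersion: v1
kind: Pod
metadata:
  name: qos-guaranteed
spec:
  containers:
    - name: app
      image: registry.example.com/app:1.0
      resources:
        requests: { cpu: "500m", memory: "256Mi" }
        limits:   { cpu: "500m", memory: "256Mi" }   # equal to requests -> Guaranteed
---
apiVersion: v1
kind: Pod
metadata:
  name: qos-burstable
spec:
  containers:
    - name: app
      image: registry.example.com/app:1.0
      resources:
        requests: { cpu: "250m", memory: "256Mi" }
        limits:   { memory: "512Mi" }                # differs from requests -> Burstable
---
apiVersion: v1
kind: Pod
metadata:
  name: qos-besteffort
spec:
  containers:
    - name: app
      image: registry.example.com/app:1.0            # no resources block -> BestEffort
```

Kubernetes records the class it derived in the pod's `status.qosClass` field, so you can check your assumption with `kubectl get pod qos-burstable -o jsonpath='{.status.qosClass}'`.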

4. The CPU-limit controversy

Setting requests but no CPU limit is a defensible, popular choice: pods can burst into idle capacity, and you avoid throttling latency-sensitive services on CPUs they could have used. Memory limits, by contrast, you almost always want, to stop one leaking pod from taking down the node. A reasonable default: always set memory request == limit; set CPU request, leave CPU limit off unless you need hard multi-tenancy. The trade-off is that a pod without a CPU limit is Burstable, not Guaranteed, so this default gives up the strongest eviction protection in exchange for burst headroom.
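
As a container-level fragment, that default might look like the sketch below. The numbers are illustrative, not recommendations.

```yaml
# Sketch of the default described above; values are illustrative only.
resources:
  requests:
    cpu: "250m"        # scheduling reservation; also sets the CPU share under contention
    memory: "512Mi"
  limits:
    memory: "512Mi"    # equal to the request, so this container's memory is never overcommitted
    # no cpu limit: the container may burst into idle CPU (pod is Burstable, not Guaranteed)
```

One caveat: if the namespace has a LimitRange with a default CPU limit, that default is applied to containers that omit one, so check for it before assuming the limit is really off.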

5. Right-size from real data

Don't guess. Read actual usage from metrics (the Vertical Pod Autoscaler in recommendation mode will suggest values), set requests near the p95 of real usage, add headroom, and revisit after load changes. Over-requesting is the single biggest source of "why is our cluster 30% utilised and still expensive?"
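
A minimal sketch of running the Vertical Pod Autoscaler in recommendation mode follows. It assumes the VPA add-on is installed in the cluster (it is not part of core Kubernetes) and that the workload is a Deployment called `web`. Both `web` and `web-vpa` are hypothetical names.

```yaml
# Recommendation-only VPA; "web" and "web-vpa" are hypothetical names.
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: web-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web
  updatePolicy:
    updateMode: "Off"   # compute recommendations only; never evict or resize pods
```

After it has observed some traffic, `kubectl describe vpa web-vpa` should show per-container `lowerBound`, `target`, and `upperBound` values under the object's status. Treat them as an input to your own sizing, and compare them against the usage percentiles from your metrics before changing requests.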

Rules of thumb

Requests = scheduling reservation; limits = runtime ceiling. They are not the same knob.
Memory over limit = OOM-kill; CPU over limit = throttle. Treat them differently.
Set memory request == limit for important pods. Consider leaving CPU limit off, but know that Guaranteed QoS also needs CPU request == limit; without a CPU limit the pod is Burstable.
Size requests from p95 real usage plus headroom, not from a first-day guess.
Never run anything you care about as BestEffort.
