Requests and Limits: The Two Numbers That Decide Your Bill and Your Latency

Get requests and limits wrong and you either waste half your cluster or get pods OOM-killed and CPU-throttled at the worst moment. A practical guide to setting them.

May 28, 2026 · 10 min read · Kubernetes, Scaling

Every container in Kubernetes can declare resource requests (what it's guaranteed) and limits (what it can't exceed). These two numbers quietly decide where pods get scheduled, which ones get killed under pressure, and how much of your cloud bill is paying for idle headroom. Most teams set them by guessing once and never revisiting.

1. Requests are for scheduling; limits are for protection

The scheduler places a pod only on a node whose unreserved capacity covers the pod's requests. A request is a reservation: it counts against the node whether or not the pod uses it. Limits are the ceiling enforced at runtime, set up by the kubelet and applied by the kernel through cgroups. Set requests too high and you reserve a half-empty cluster; too low and you over-pack nodes and starve under load.

resources:
  requests: { cpu: "250m", memory: "256Mi" }   # guaranteed, used for scheduling
  limits:   { cpu: "1",    memory: "512Mi" }   # hard ceiling

2. CPU and memory behave completely differently at the limit

This is the part that bites people. CPU is compressible: hit the limit and you get throttled (slowed), not killed. Memory is incompressible: exceed the limit and the container is OOM-killed. So an aggressive CPU limit costs you tail latency; an aggressive memory limit costs you a crash loop.
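To make the difference concrete, here is a minimal sketch of a pod spec with both kinds of limit, annotated with what happens when each is hit. The pod name, image, and values are hypothetical, and the CPU arithmetic assumes the default 100ms CFS period.

```yaml
# Illustrative only: name, image, and numbers are placeholders.
apiVersion: v1
kind: Pod
metadata:
  name: limit-demo
spec:
  containers:
    - name: app
      image: registry.example.com/app:1.0
      resources:
        requests:
          cpu: "250m"
          memory: "256Mi"
        limits:
          # CPU: with a 100ms period, "500m" allows about 50ms of CPU time
          # per period. Once that is spent the container waits for the next
          # period (throttled), but keeps running.
          cpu: "500m"
          # Memory: there is nothing to slow down. Going past 512Mi gets the
          # container OOM-killed (last state "OOMKilled", usually exit code 137).
          memory: "512Mi"
```

The throttle shows up as latency rather than as an error, which is why it tends to go unnoticed until someone looks at tail percentiles.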

3. QoS classes decide who dies first

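Kubernetes derives a Quality-of-Service class from your settings, and it largely determines which pods are killed or evicted first when a node runs out of memory:

  • Guaranteed: every container sets both CPU and memory, with requests == limits. Last to be evicted.
  • Burstable: at least one request or limit is set, but the pod falls short of Guaranteed (typically requests < limits). Evicted after BestEffort.
  • BestEffort: no requests or limits on any container. First against the wall.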

Critical workloads should be Guaranteed. Never run important pods as BestEffort.
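The three classes can be sketched as three small pod specs. Pod names, the image, and the values are hypothetical; only the shape of each `resources` block matters.

```yaml
# Guaranteed: CPU and memory both set, requests equal to limits.
apiVersion: v1
kind: Pod
metadata:
  name: qos-guaranteed
spec:
  containers:
    - name: app
      image: registry.example.com/app:1.0
      resources:
        requests: { cpu: "500m", memory: "512Mi" }
        limits:   { cpu: "500m", memory: "512Mi" }
---
# Burstable: something is set, but requests are below limits.
apiVersion: v1
kind: Pod
metadata:
  name: qos-burstable
spec:
  containers:
    - name: app
      image: registry.example.com/app:1.0
      resources:
        requests: { cpu: "250m", memory: "256Mi" }
        limits:   { cpu: "1",    memory: "512Mi" }
---
# BestEffort: no resources block at all.
apiVersion: v1
kind: Pod
metadata:
  name: qos-besteffort
spec:
  containers:
    - name: app
      image: registry.example.com/app:1.0
```

Kubernetes records the class it assigned in the pod's `status.qosClass` field, so you can check a running pod with `kubectl get pod qos-guaranteed -o jsonpath='{.status.qosClass}'` rather than inferring it from the spec.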

4. The CPU-limit controversy

Setting requests but no CPU limit is a defensible, popular choice: pods can burst into idle capacity, and you avoid throttling latency-sensitive services on CPUs they could have used. Memory limits, by contrast, you almost always want, to stop one leaking pod from taking down the node. A reasonable default: always set memory request == limit; set CPU request, leave CPU limit off unless you need hard multi-tenancy. The trade-off is that a pod with no CPU limit is Burstable, not Guaranteed, so workloads that need the Guaranteed class from section 3 must set CPU request == limit as well.
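As a rough sketch, that default looks like the fragment below inside a container spec. The numbers are placeholders, not recommendations.

```yaml
# Fragment of a container spec; values are illustrative.
resources:
  requests: { cpu: "250m", memory: "512Mi" }   # CPU request drives scheduling and the pod's share under contention
  limits:   { memory: "512Mi" }                # memory request == limit; no CPU limit, so no throttling
# Resulting QoS class: Burstable (Guaranteed would also need a CPU limit equal to the request).
```

With no CPU limit the pod can use idle cores freely, and when the node is contended the CPU request still sets its relative share.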

5. Right-size from real data

Don't guess. Read actual usage from metrics (the Vertical Pod Autoscaler in recommendation mode will suggest values), set requests near the p95 of real usage, add headroom, and revisit after load changes. Over-requesting is the single biggest source of "why is our cluster 30% utilised and still expensive?"
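One way to get those suggestions is a VerticalPodAutoscaler with updates switched off, so it only reports and never changes pods. This is a minimal sketch that assumes the VPA add-on is installed in the cluster and that the workload is a Deployment called `app`; both names are hypothetical.

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: app-vpa            # hypothetical name
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: app              # hypothetical target workload
  updatePolicy:
    updateMode: "Off"      # recommend only; never evict or resize pods
```

After it has watched some real traffic, `kubectl describe vpa app-vpa` shows per-container recommendations (a target plus lower and upper bounds). Treat them as input to your own sizing rather than values to paste in blindly.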

Rules of thumb

  • Requests = scheduling reservation; limits = runtime ceiling. They are not the same knob.
  • Memory over limit = OOM-kill; CPU over limit = throttle. Treat them differently.
  • Set memory request == limit for important pods. Consider leaving CPU limit off, but note that Guaranteed QoS needs CPU request == limit too; without a CPU limit the pod is Burstable.
  • Size requests from p95 real usage plus headroom, not from a first-day guess.
  • Never run anything you care about as BestEffort.

Reader Discussion

7 replies

  1. Highlighted by author
    Anders Lindqvist · Staff SRE · Story

    the preStop sleep trick. THE preStop sleep trick. we spent 3 days debugging mystery 5xx during deploys and the answer was a 10-second sleep. there should be a billboard.

    May 29, 2026 · 1 day later
  2. Priscilla Owens · Backend Lead · Pushback

    small pushback — "never set CPU limits" is too strong imo. on cgroups v2 with steady traffic profiles, soft caps prevent one noisy neighbour from starving the whole node. it's a per-workload call. great post otherwise.

    Jun 04, 2026 · 1 week later · edited
  3. Jiwoo Park · Junior Engineer · Kind words

    the "control loop, not deploy tool" framing finally made k8s click for me. been fighting with it for 4 months. wish onboarding docs led with this paragraph instead of YAML.

    May 30, 2026 · 2 days later
  4. Vasili Kurov · Platform Engineer · From experience

    HPA on CPU only is the silent killer. moved ours to requests-per-pod via prometheus-adapter and our scaling went from "vaguely correct" to actually correlated with load. 2 hours of work, immediate ROI.

    Jun 01, 2026 · 4 days later
  5. Tiến Hồ 🇻🇳 Hà Nội · DevOps Engineer · Agrees

    PDB = the thing 90% of the teams I've been on skip until a GKE node upgrade takes 3 pods down at once. minAvailable: 2 plus replicas: 3 is the default I deploy now, no thinking required.

    May 31, 2026 · 3 days later
  6. Isabella Costa · Junior Engineer · Kind words

    saved this. sharing at standup tomorrow — we've had exactly this problem for 2 sprints and nobody on the team had framed it this way 🙏

    May 30, 2026 · 2 days later
  7. Kenta Yamada · Tech Lead · Asks

    would love a war-story follow-up. principles are clear; the actual debugging session is where the interesting stuff lives. there's a real shortage of "here's the dashboard, here's the thread we pulled, here's where we got stuck for 90 mins" content.

    Jun 01, 2026 · 4 days later

Worked on something similar? Email ducminhldm@gmail.com — I read every one. The good ones become future posts.

Comments seeded · live discussion via email