Kubernetes For People Who Already Hate Kubernetes

If you've ever stared at a 200-line YAML file and wondered who hurt the people who designed this, this is for you. The mental model that makes K8s actually click, in plain language.

March 04, 2026 · 11 min read · Kubernetes, Architecture

I've onboarded probably twenty engineers onto Kubernetes over the years. The pattern is always the same: they hate it for two weeks, then something clicks, then they start defending it at parties. The difference between week one and week three is a single mental model. Not more knowledge: a model. This post is that model.

1. K8s is a control loop, not a deploy tool

Most engineers come to Kubernetes from "deploy tools" — Heroku, ECS, plain SSH + systemd. Those tools take an action: "deploy this version." Kubernetes is fundamentally different. You don't tell it to deploy. You tell it the state you want, and a control loop reconciles toward it, forever.

The core operation is kubectl apply, which means "I want this state", even when the resource already exists. The reconciler is constantly asking: does what I observe equal what was declared? If not, it does something to close the gap. Once you internalise this, half of Kubernetes' weirdness stops being weird:

  • You don't "start" a pod. You declare a Deployment, and its controllers create the pod for you. If you delete the pod, a replacement appears, because the declaration didn't change.
  • Rolling updates aren't an action. They're a diff between old and new desired state, plus a strategy for closing it.
  • Failed pods don't "crash" the deploy. They're just signals to the controller that the desired state isn't yet reached.
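
To make the desired-versus-observed split concrete, here is an abbreviated sketch of what a Deployment object might look like partway through a rolling update. The name orders, the image, and every number are illustrative assumptions; the field names come from the apps/v1 Deployment API.

```yaml
# Abbreviated, illustrative Deployment as stored by the API server.
# You write `spec`; controllers write `status`.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: orders                 # hypothetical name
  generation: 4                # bumped by the API server on each spec change
spec:                          # desired state: what you declared
  replicas: 3
  strategy:
    type: RollingUpdate        # the strategy for closing the old/new gap
    rollingUpdate:
      maxSurge: 1              # at most 1 extra pod during the rollout
      maxUnavailable: 0        # never dip below 3 available pods
  selector:
    matchLabels:
      app: orders
  template:
    metadata:
      labels:
        app: orders
    spec:
      containers:
        - name: orders
          image: registry.example.com/orders:1.4.2   # hypothetical image
status:                        # observed state: what controllers report
  observedGeneration: 4        # the controller has seen the latest spec
  replicas: 4                  # 3 desired + 1 surge pod
  updatedReplicas: 2           # pods already running the new template
  readyReplicas: 3
  availableReplicas: 3
```

Roughly speaking, the rollout is finished once the replica counts under status all equal spec.replicas. Until then, the gap between the two halves of the object is the controller's to-do list.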

2. The five primitives that explain 80% of K8s

You don't need to learn all of K8s. Five resources cover most production work. Master these and the rest is reading the docs:

  1. Pod. One or more containers that share a network namespace and run together. The atomic unit of work. You almost never create one directly.
  2. Deployment. "I want N copies of this pod, please keep them alive and roll updates safely." The 90% case for stateless services.
  3. Service. A stable virtual IP and DNS name in front of a set of pods, selected by labels. The primitive that solves "how do other things in the cluster reach my app."
  4. ConfigMap / Secret. Key-value objects you expose to pods as env vars or mounted files. ConfigMap for normal config, Secret for credentials (by default only base64-encoded, not encrypted).
  5. Ingress. The external HTTP(S) entry point. "Route this hostname/path on port 80/443 to that Service." The way the outside world finds your app, provided an ingress controller is running in the cluster.

That's it. Stateful apps need StatefulSet, jobs need Job/CronJob, custom logic needs Operators — but those are extensions of this same model.
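
For a feel of how small the less familiar primitives are, here is a minimal sketch of a ConfigMap and an Ingress. The names (orders, orders-config), the keys, and the hostname orders.example.com are assumptions for illustration, and the Ingress assumes a Service called orders listening on port 80 already exists.

```yaml
# Illustrative only: names, keys, and hostname are assumptions.
apiVersion: v1
kind: ConfigMap
metadata:
  name: orders-config
data:
  LOG_LEVEL: info
  PAYMENTS_URL: http://payments
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: orders
spec:
  rules:
    - host: orders.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: orders     # points at the Service, never at pods
                port:
                  number: 80
```

A container would typically pick up the ConfigMap through envFrom with a configMapRef, or by mounting it as a volume. Note that the Ingress only names a Service; which pods sit behind that Service is decided elsewhere.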

3. Labels and selectors: the secret glue

Most beginners think K8s couples resources by name. It doesn't. It couples them by labels. A Deployment doesn't say "create pods with names app-1, app-2." It says "there should be 3 pods matching this label selector." The Service in front says "send traffic to all pods matching this label selector."

# Deployment pod template (spec.template.metadata)
metadata:
  labels:
    app: orders
    tier: api

# Service (spec.selector): must match the pod labels above
selector:
  app: orders
  tier: api

This is why labels feel weirdly load-bearing. They're not metadata — they're the actual primary key the system uses to wire things up. Mess up a label, and the Service can't find pods that obviously exist five feet away.
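
Here is a fuller sketch of the same wiring as two complete manifests, so you can see where each label block lives. The image, ports, and replica count are assumptions; only the app: orders and tier: api labels come from the snippet above.

```yaml
# Illustrative only: image, ports, and replica count are assumptions.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: orders
spec:
  replicas: 3
  selector:
    matchLabels:            # which pods this Deployment owns
      app: orders
      tier: api
  template:
    metadata:
      labels:               # stamped onto every pod it creates
        app: orders
        tier: api
    spec:
      containers:
        - name: orders
          image: registry.example.com/orders:1.4.2
          ports:
            - containerPort: 8080
---
apiVersion: v1
kind: Service
metadata:
  name: orders
spec:
  selector:                 # any Ready pod carrying BOTH labels gets traffic
    app: orders
    tier: api
  ports:
    - port: 80              # port clients connect to on the Service
      targetPort: 8080      # containerPort on the pods
```

Nothing in the Service mentions the Deployment by name. If the selector said tier: backend instead, both objects would apply cleanly and the Service would simply have no endpoints, which kubectl get endpoints orders would show.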

4. The lifecycle nobody explains

When you kubectl apply a Deployment, here's what actually happens:

  1. The API server authenticates the request, validates the object, and writes it to etcd.
  2. The Deployment controller wakes up, sees the diff, and creates/updates a ReplicaSet.
  3. The ReplicaSet controller sees it doesn't have N pods matching its selector, and creates pods.
  4. The scheduler picks a node for each new pod based on resource fit, taints, affinities.
  5. The kubelet on that node pulls the image and starts the container.
  6. Once the pod reports Ready, the Endpoints controller (EndpointSlice in current clusters) adds its IP to the Service's endpoint list.
  7. kube-proxy on every node updates iptables (or IPVS) rules so traffic to the Service IP routes to the right pods.

This is a lot of moving parts. The thing that makes it feel "slow" sometimes is that all seven steps are happening asynchronously, in different processes, across multiple nodes, all reconciling toward the desired state. When something looks broken, work backward through the chain — usually one of these steps is failing silently.
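
One way to work backward is to read the pod object itself, because most steps leave a fingerprint on it. Below is an abbreviated, hypothetical excerpt of what kubectl get pod -o yaml might return for a pod that is stuck; the names, IP, and node are made up, while the field names and condition types are from the core v1 Pod API.

```yaml
# Abbreviated, hypothetical Pod excerpt. Comments map fields to the steps above.
apiVersion: v1
kind: Pod
metadata:
  name: orders-7c9d8b6f5d-x2k4q
  labels:
    app: orders
    tier: api
    pod-template-hash: 7c9d8b6f5d
  ownerReferences:                 # steps 2-3: created by a ReplicaSet
    - apiVersion: apps/v1
      kind: ReplicaSet
      name: orders-7c9d8b6f5d
      controller: true
spec:
  nodeName: node-a                 # step 4: set once the scheduler picks a node
status:
  phase: Running
  podIP: 10.42.1.17
  conditions:
    - type: PodScheduled           # step 4 succeeded
      status: "True"
    - type: Initialized
      status: "True"
    - type: ContainersReady        # step 5: not all containers are ready
      status: "False"
    - type: Ready                  # step 6 waits on this
      status: "False"
```

In this sketch the pod was created and scheduled, but its containers never became ready, so it will not appear in the Service's endpoint list. The next place to look would be the container statuses and events from kubectl describe pod.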

5. The good parts

Once you accept the model, K8s gives you things that are genuinely hard to build:

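  • Self-healing. Container crashes? The kubelet restarts it. Pod disappears? The controller replaces it. Node dies? Pods get rescheduled onto other nodes.
  • Rolling updates gated on readiness probes. A bad deploy stalls instead of taking down all replicas, provided you have defined the probes.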
  • Horizontal autoscaling on CPU, memory, or custom metrics — declaratively.
  • Service discovery without an extra system. http://orders Just Works from any pod in the same namespace; from another namespace it's http://orders.<namespace>.
  • A consistent API surface across cloud providers. "Move from AWS to GCP" goes from a quarter-long project to a week.
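
As a sketch of what "declaratively" means for autoscaling, here is a minimal HorizontalPodAutoscaler. It assumes a Deployment named orders exists; the replica bounds and the 70% target are placeholder values, not recommendations.

```yaml
# Illustrative only: target name, bounds, and threshold are assumptions.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: orders
spec:
  scaleTargetRef:              # the object whose replica count gets adjusted
    apiVersion: apps/v1
    kind: Deployment
    name: orders
  minReplicas: 3
  maxReplicas: 12
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # percent of each pod's CPU request
```

Utilization is measured against the CPU request, so this only works if the containers declare one and the cluster runs a metrics source such as metrics-server. It is the same control loop again: you state a target, and a controller keeps adjusting the replica count toward it.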

Are these things worth the YAML? For a single service running 5 RPS, almost certainly not — use Heroku, Fly, Render, or systemd. For a team running 30 services with real availability requirements, the answer flips. The break-even is somewhere around 5–10 production services and 5+ engineers.

6. The honest list of what still sucks

  • YAML. We could have had a saner config language. We didn't get one.
  • Networking is two abstractions away from anything you can debug with tcpdump alone.
  • Helm is necessary and also frequently the worst part of a stack.
  • The error messages are written by people who already understood the system. Not for people learning it.
  • The default behaviour is rarely the production-correct behaviour. Resource limits, liveness probes, PDBs — all opt-in, all important.
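
To show how much of "production-correct" is opt-in, here is a sketch of a Deployment with resource settings and probes, plus a PodDisruptionBudget. Every value is a placeholder, and the probe paths /ready and /healthz are hypothetical endpoints your app would have to serve.

```yaml
# Illustrative only: all values and probe paths are assumptions.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: orders
spec:
  replicas: 3
  selector:
    matchLabels:
      app: orders
  template:
    metadata:
      labels:
        app: orders
    spec:
      containers:
        - name: orders
          image: registry.example.com/orders:1.4.2
          ports:
            - containerPort: 8080
          resources:
            requests:              # what the scheduler reserves on the node
              cpu: 250m
              memory: 256Mi
            limits:                # hard caps enforced at runtime
              cpu: "1"
              memory: 512Mi
          readinessProbe:          # gates traffic and rolling updates
            httpGet:
              path: /ready
              port: 8080
            periodSeconds: 5
          livenessProbe:           # restarts the container when it fails
            httpGet:
              path: /healthz
              port: 8080
            initialDelaySeconds: 10
            periodSeconds: 10
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: orders
spec:
  minAvailable: 2                  # voluntary evictions must leave 2 pods up
  selector:
    matchLabels:
      app: orders
```

Delete the resources block, both probes, and the second document and the Deployment still applies without complaint. That silence is the problem the last bullet describes.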

The thing that finally clicks

Kubernetes is a database with a runtime attached. The database stores your desired state. The runtime is a swarm of controllers fighting to make reality match the database. Every problem in K8s is one of three things: the database has the wrong state, a controller failed, or your reality doesn't match because of resources you can't see (RBAC, taints, network policies).

That's the whole game. Once you stop reading it as "a deploy tool with extra steps" and start reading it as "a state-reconciliation engine with deploys as a side effect," the YAML stops feeling absurd. Or — fine — it still feels a bit absurd. But you stop being mad about it.

Reader Discussion

  1. Highlighted by author
    Anders Lindqvist · Staff SRE · Story

    the preStop sleep trick. THE preStop sleep trick. we spent 3 days debugging mystery 5xx during deploys and the answer was a 10-second sleep. there should be a billboard.

    Mar 05, 2026 · 1 day later
  2. Jiwoo Park · Junior Engineer · Kind words

    the "control loop, not deploy tool" framing finally made k8s click for me. been fighting with it for 4 months. wish onboarding docs led with this paragraph instead of YAML.

    Mar 06, 2026 · 2 days later
  3. Vasili Kurov · Platform Engineer · From experience

    HPA on CPU only is the silent killer. moved ours to requests-per-pod via prometheus-adapter and our scaling went from "vaguely correct" to actually correlated with load. 2 hours of work, immediate ROI.

    Mar 08, 2026 · 4 days later
  4. Tiến Hồ 🇻🇳 Hà Nội · DevOps Engineer · Agrees

    PDB = the thing 90% of our teams skip until a GKE node upgrade takes 3 pods down at once. minAvailable: 2 plus replicas: 3 is the default I deploy now, no thinking required.

    Mar 07, 2026 · 3 days later
  5. Priscilla Owens · Backend Lead · Pushback

    small pushback — "never set CPU limits" is too strong imo. on cgroups v2 with steady traffic profiles, soft caps prevent one noisy neighbour from starving the whole node. it's a per-workload call. great post otherwise.

    Mar 11, 2026 · 1 week later · edited
  6. Léa Dubois · SRE · Asks

    any chance you'd publish these as a PDF collection? would love to print and read offline on flights. screen-fatigue is real.

    Mar 10, 2026 · 6 days later

Worked on something similar? Email ducminhldm@gmail.com — I read every one. The good ones become future posts.