The TTL We Forgot: Why Traffic Kept Hitting a Server We'd Already Killed

We moved a service to a new host, repointed the DNS record, watched traffic shift over, and shut the old box down. Then the error rate climbed and stayed up for almost an hour, because a meaningful slice of clients were still sending requests to an IP that no longer answered. Nobody had ever looked at the record's TTL.

June 26, 2026 · 8 min read · Networking · DNS

The migration looked clean. New host provisioned, app deployed and healthy, the DNS A record for api.example.com updated to point at the new IP. We watched our dashboards, saw traffic appear on the new box, and after a few minutes we terminated the old one. Within seconds the error rate jumped, and it did not come back down for the better part of an hour. A steady trickle of clients kept trying to reach the IP of a server we had already deleted.

The cause was a number none of us had ever touched: the record's TTL. Ours was set to 3600, one hour, the provider default. That single value decided how long every cache between us and our clients would keep handing out the old address after we changed it.

"DNS propagation" is just cache expiry

People talk about DNS changes "propagating", as if an update pushes out to the world and you wait for it to arrive. That is not how it works, and the wrong mental model is exactly what bit us. DNS is pull, not push. When a resolver looks up your record, it gets the answer plus a TTL, and it is allowed to reuse that answer without asking again until the TTL elapses. Nothing notifies anyone when you change the record. The change only becomes visible to a given cache when that cache's copy expires and it asks again.
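A minimal sketch of that pull model, assuming a toy in-memory cache rather than a real resolver. The class name `TtlCache`, the `authoritative` dictionary, and the 192.0.2.x addresses are made up for illustration, and time is passed in explicitly so the behaviour is easy to follow.

```python
class TtlCache:
    """Toy resolver cache: reuses an answer until its TTL elapses."""

    def __init__(self, lookup):
        self._lookup = lookup      # callable: name -> (ip, ttl_seconds)
        self._entries = {}         # name -> (ip, expires_at)

    def resolve(self, name, now):
        entry = self._entries.get(name)
        if entry and now < entry[1]:
            return entry[0]        # still fresh: no upstream query at all
        ip, ttl = self._lookup(name)
        self._entries[name] = (ip, now + ttl)
        return ip


# Hypothetical authoritative data: name -> (ip, ttl)
authoritative = {"api.example.com": ("192.0.2.10", 3600)}
cache = TtlCache(lambda name: authoritative[name])

before = cache.resolve("api.example.com", now=0)        # fetches and caches
authoritative["api.example.com"] = ("192.0.2.20", 60)   # record changes; nobody is notified
stale = cache.resolve("api.example.com", now=1800)      # cached copy is still valid
fresh = cache.resolve("api.example.com", now=3600)      # copy expired, so it asks again
```

The cache only learns about the new address on the third call, when its own copy has run out. Nothing in the code path lets the authoritative side reach into the cache.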

The consequence that catches people: lowering the TTL does not take effect until the old TTL has already expired. If your record has a one-hour TTL and you drop it to 60 seconds right before a migration, caches that fetched the record five minutes ago still hold the old answer, with the old one-hour TTL, for another 55 minutes. The low TTL only governs lookups that happen after the change is seen. You have to lower the TTL, then wait out the old one, before the short TTL is actually in force.
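The arithmetic behind that example, as a small sketch. The function name is hypothetical, and it assumes the cache honors exactly the TTL it was handed.

```python
def seconds_still_stale(old_ttl, age_at_change):
    """Time a cache keeps the old answer after the record changes.

    old_ttl: TTL (seconds) the cache received with its copy.
    age_at_change: how long ago (seconds) the cache fetched that copy.
    """
    return max(0, old_ttl - age_at_change)


# The example from the text: one-hour TTL, fetched five minutes before the change.
remaining_minutes = seconds_still_stale(old_ttl=3600, age_at_change=5 * 60) // 60

# Worst case: a cache that fetched the record just before the change.
worst_case_minutes = seconds_still_stale(old_ttl=3600, age_at_change=0) // 60
```

Note that the new TTL does not appear anywhere in the function. Until the old copy expires, the value you just set is irrelevant to that cache.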

The caches that hold the old address

There is no single cache to wait on. The answer can be held at several layers, and the slowest one sets your real cutover time:

  • Recursive resolvers (the client's ISP resolver or a public one) cache the answer for up to the TTL. Most are well behaved, though some clamp very low TTLs up to a minimum of their own.
  • The operating system stub resolver and, on desktops, the browser keep their own short caches on top of that.
  • The application runtime is the one that really hurt us. Some runtimes cache DNS results independently of the record's TTL. The JVM is the classic example: depending on its security settings it can cache a successful lookup effectively forever, so a long-running service resolves your hostname once at startup and never looks again.
  • Connection pools and keep-alive sidestep DNS entirely after the first lookup. A pool resolves the hostname when it opens a connection and then pins that socket to the resolved IP. As long as the connection stays alive and gets reused, the client never re-resolves, no matter what the TTL says.

That last pair explains why killing the old server caused errors instead of a clean failover. The clients still hitting it were not re-reading DNS at all. They had live connections to the old IP, or an IP cached in-process past any TTL, and they kept using it until those connections finally failed. The TTL governs how long caches keep serving a stale answer to new lookups; it does nothing about clients that are not looking up at all.
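To make the pinning behaviour concrete, here is a toy client that resolves once and then reuses the result, which is roughly what a keep-alive connection or a resolve-at-startup runtime does. `PinnedClient`, the `records` table, and the addresses are illustrative assumptions, not any real library's API.

```python
class PinnedClient:
    """Toy client that resolves once when it 'connects' and then reuses that IP."""

    def __init__(self, resolve, host):
        self._resolve = resolve
        self._host = host
        self._ip = None            # stands in for an open keep-alive socket

    def send(self):
        if self._ip is None:       # only the first request triggers a lookup
            self._ip = self._resolve(self._host)
        return self._ip            # every later request goes to the pinned IP


records = {"api.example.com": "192.0.2.10"}   # hypothetical old address
lookups = []                                  # records every DNS lookup made


def resolve(host):
    lookups.append(host)
    return records[host]


client = PinnedClient(resolve, "api.example.com")
first = client.send()
records["api.example.com"] = "192.0.2.20"     # DNS now points at the new host
later = [client.send() for _ in range(3)]     # still the old IP, with no new lookups
```

After the record changes, `lookups` still holds a single entry. No TTL, however short, can influence a client that never asks.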

Doing a cutover that actually drains

Lower the TTL ahead of time, not at the moment of change. Days before, drop the TTL to something small like 60 seconds, then wait at least the old TTL so that every cache has had a chance to pick up the short value. Now, when you flip the record, caches turn over within a minute instead of an hour. Raise it back afterward.
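A rough way to turn that into a schedule, assuming caches honor the TTLs they receive. The function name and the cutover timestamp are hypothetical; the TTL values are the ones from this post.

```python
from datetime import datetime, timedelta


def cutover_plan(cutover_at, old_ttl, new_ttl):
    """Return (latest time to lower the TTL, time by which caches should have turned over).

    Assumes caches honor the TTL they were given. Resolvers that clamp TTLs
    upward, and app-layer caches, can take longer.
    """
    lower_ttl_by = cutover_at - timedelta(seconds=old_ttl)
    caches_turned_over_by = cutover_at + timedelta(seconds=new_ttl)
    return lower_ttl_by, caches_turned_over_by


# Hypothetical cutover time, chosen only for the example.
cutover = datetime(2030, 1, 15, 14, 0, 0)
lower_by, turned_over_by = cutover_plan(cutover, old_ttl=3600, new_ttl=60)
```

With a 14:00 cutover, the TTL has to be lowered by 13:00 at the very latest, and lowering it days earlier costs nothing. The second value is a floor on when to expect traffic to move, not a signal to shut anything down.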

Do not delete the old server when DNS flips. Drain it. Keep the old host serving until you can see that traffic to it has fallen to essentially zero, then shut it down. Decommission on the evidence of your own traffic graphs, not on a guess about how long caches take. If you cannot keep it fully serving, have it briefly redirect or proxy to the new host so stragglers still succeed.
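One way to encode "decommission on evidence" is a check over the old host's recent request counts. This is a sketch under assumptions: the function name, the threshold, the quiet window, and the sample numbers are all invented, and in practice the counts would come from whatever metrics system you already use.

```python
def safe_to_decommission(requests_per_minute, threshold=0, quiet_minutes=30):
    """True when the last `quiet_minutes` samples are all at or below `threshold`.

    requests_per_minute: oldest-to-newest request counts seen by the OLD host.
    """
    if len(requests_per_minute) < quiet_minutes:
        return False               # not enough evidence yet
    recent = requests_per_minute[-quiet_minutes:]
    return all(count <= threshold for count in recent)


# Made-up samples: traffic tails off, but a straggler shows up in the first series.
draining = [120, 40, 9, 2, 0, 0, 1, 0, 0, 0]
drained = [120, 40, 9, 2, 0, 0, 0, 0, 0, 0]

still_busy = safe_to_decommission(draining, quiet_minutes=5)
ok_to_stop = safe_to_decommission(drained, quiet_minutes=5)
```

The quiet window should be at least as long as the slowest cache you expect, including pooled connections, so a single idle minute is not mistaken for a finished drain.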

Put a stable name in front of mutable backends. The cleanest fix is to not change DNS for a backend swap at all. Point clients at a load balancer, an anycast address, or a CNAME to an endpoint whose IP you control and keep, and move backends behind that fixed front. The address clients cache never changes; you reroute on your side, where you have real control over draining, instead of at the DNS layer, where you have none.

Tame the app-layer caches. Set your runtime's DNS cache TTL to something sane (for the JVM, configure networkaddress.cache.ttl rather than leaving it to cache forever), bound how long pooled connections live so they periodically reopen and re-resolve, and cap keep-alive lifetimes for the same reason.
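The connection-lifetime part of that advice can be sketched as a client that drops its pinned address after a maximum age and resolves again. `BoundedLifetimeClient`, `max_age`, and the addresses are illustrative assumptions; real pools expose this under their own setting names, and a real re-resolve still passes through the OS and resolver caches.

```python
class BoundedLifetimeClient:
    """Toy client that drops its pinned IP after max_age seconds and re-resolves."""

    def __init__(self, resolve, host, max_age):
        self._resolve = resolve
        self._host = host
        self._max_age = max_age
        self._ip = None
        self._opened_at = None

    def send(self, now):
        expired = self._ip is not None and now - self._opened_at >= self._max_age
        if self._ip is None or expired:
            # "Reconnect": look the name up again and pin the fresh answer.
            self._ip = self._resolve(self._host)
            self._opened_at = now
        return self._ip


records = {"api.example.com": "192.0.2.10"}   # hypothetical addresses
client = BoundedLifetimeClient(records.get, "api.example.com", max_age=300)

a = client.send(now=0)                        # opens a "connection" to the old IP
records["api.example.com"] = "192.0.2.20"     # DNS flips
b = client.send(now=120)                      # within max_age: still the old IP
c = client.send(now=300)                      # lifetime reached: re-resolves
```

The bound turns "until the connection happens to fail" into a known upper limit, which is what lets you reason about how long a drain will take.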

Low TTL is a tool, not a default

It is tempting to conclude "just always use a 30 second TTL". Do not. Every time a cached entry expires, the next request pays for a fresh recursive lookup before it can even connect, and a very low TTL means that happens constantly, adding latency to requests and load to your DNS. TTLs exist to trade staleness for speed and resilience. The right move is to lower the TTL temporarily around a planned change, then restore a healthy value, and for routine backend changes to avoid touching DNS at all by hiding your servers behind a stable name.
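A back-of-the-envelope sketch of that cost, assuming a single busy cache that re-fetches the record as soon as its copy expires. The function name is hypothetical and the figure is an upper bound per cache, not a measurement.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def max_refetches_per_day(ttl_seconds):
    """Upper bound on how often ONE busy cache re-queries for a record per day."""
    return SECONDS_PER_DAY // ttl_seconds


per_day_at_1h = max_refetches_per_day(3600)
per_day_at_30s = max_refetches_per_day(30)
ratio = per_day_at_30s // per_day_at_1h
```

Going from a one-hour TTL to 30 seconds multiplies the worst-case lookup rate by 120 for every cache in front of you, and each of those lookups is a request that waits on DNS before it can connect.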

Rules of thumb

  • DNS is pull-based. A change becomes visible only when each cache's TTL expires, not when you save the record. "Propagation" is just expiry.
  • Lowering the TTL only helps after the previous TTL has elapsed. Lower it well before a cutover, then wait out the old value.
  • Several layers cache the answer: recursive resolvers, the OS, the browser, and especially the application runtime and connection pools, which can ignore the TTL entirely.
  • Keep-alive connections and pooled sockets pin an IP and do not re-resolve while the connection lives, so a killed server keeps getting traffic regardless of TTL.
  • Never decommission the old host on a timer. Drain it to near-zero traffic by observation, then shut it down.
  • The durable fix is a stable name (load balancer, anycast, CNAME) in front of changeable backends, so a backend swap never requires a DNS change.

