Running Out of Ports: Ephemeral Port Exhaustion Under Outbound Load

Under a load test, our service started throwing "cannot assign requested address" even though every machine had spare CPU, memory, and bandwidth. We were not out of capacity in any way a dashboard tracks. We were out of source ports, because every outbound request opened a fresh connection and left a pile of sockets sitting in TIME_WAIT.

June 27, 2026 · 8 min read · Networking, Scaling

We were load testing an API gateway that fans each incoming request out to a couple of internal services. Throughput climbed nicely, then plateaued, then the errors started: a flood of EADDRNOTAVAIL, "cannot assign requested address". The strange part was that nothing looked busy. CPU was comfortable, memory was fine, the network was nowhere near saturated, and the downstream services were healthy and bored. By every metric we normally watch, the machine had plenty left to give. It was failing anyway, and it was failing on a resource almost nobody has a dashboard for: the local source port.

Every outbound connection burns a source port

A TCP connection is identified by a four-tuple: source IP, source port, destination IP, and destination port. When your service dials a downstream service, the destination IP and port are fixed (that service's address), and your source IP is fixed (your machine). The only part the kernel gets to vary to make each connection unique is the source port, and it picks one from a bounded range called the ephemeral port range. On Linux the default range is 32768 to 60999, which is 28232 ports, or roughly 28000.

So to a single destination IP and port, there is a hard ceiling of roughly 28000 simultaneous outbound connections from one source IP, and once those ports are all in use the kernel literally cannot form another connection. That is the "cannot assign requested address" error, EADDRNOTAVAIL. It is not the remote rejecting you; it is your own machine unable to find a free source port for the new socket. We assumed connection capacity was about CPU or the remote's limits. It was actually about a small integer range of local ports.
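To make the size of that range concrete, here is a minimal sketch that turns the two numbers Linux reports in /proc/sys/net/ipv4/ip_local_port_range into a port count. The function name ephemeralPortCount is ours, purely for illustration.

```javascript
// Sketch: size of the ephemeral port range, given the two numbers Linux
// exposes in /proc/sys/net/ipv4/ip_local_port_range (e.g. "32768\t60999").
function ephemeralPortCount(rangeText) {
  const [low, high] = rangeText.trim().split(/\s+/).map(Number);
  if (!Number.isInteger(low) || !Number.isInteger(high) || low > high) {
    throw new Error(`unexpected port range: ${JSON.stringify(rangeText)}`);
  }
  return high - low + 1; // both ends of the range are usable
}

console.log(ephemeralPortCount("32768\t60999")); // 28232
```

That number is the whole budget for connections from one source IP to one destination address and port.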

TIME_WAIT is what makes it bite so early

Here is the twist that turns "28000 connections" into "we fell over at a few thousand requests per second". A port is not freed the instant you close the connection. The side that closes a TCP connection first leaves it in the TIME_WAIT state, and the socket sits there holding its source port for a fixed timeout, conventionally 2 * MSL (twice the maximum segment lifetime), which Linux hard-codes at 60 seconds. This is not a bug, it is TCP doing its job: TIME_WAIT exists so that any delayed packets from the old connection drain out of the network before that same four-tuple can be reused, otherwise a straggler from a dead connection could be mistaken for data on a new one.

The consequence under load is brutal. If your client opens a new connection for every request and closes it, then every request parks a source port in TIME_WAIT for 60 seconds. At a steady 600 requests per second to one destination, that is 600 times 60, or 36000 ports tied up at once, which is already more than the entire default range of about 28000. The break-even point is only around 470 requests per second. You exhaust the ports long before you exhaust anything that shows up on a normal dashboard, and the throughput at which it happens is far lower than the raw port count suggests.
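The arithmetic is simple enough to sketch. This assumes one new connection per request, a 60-second TIME_WAIT, and the default Linux port range; the function and constant names are illustrative, not from any library.

```javascript
// Steady-state ports parked in TIME_WAIT: arrival rate times hold time.
function portsInTimeWait(requestsPerSecond, timeWaitSeconds = 60) {
  return requestsPerSecond * timeWaitSeconds;
}

const EPHEMERAL_PORTS = 60999 - 32768 + 1; // 28232 with the default range

for (const rps of [100, 400, 600]) {
  const held = portsInTimeWait(rps);
  const verdict = held > EPHEMERAL_PORTS ? "exhausted" : "ok";
  console.log(`${rps} req/s -> ${held} ports in TIME_WAIT (${verdict})`);
}
// 100 req/s -> 6000 ports in TIME_WAIT (ok)
// 400 req/s -> 24000 ports in TIME_WAIT (ok)
// 600 req/s -> 36000 ports in TIME_WAIT (exhausted)
```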

# a machine drowning in TIME_WAIT to one destination
# -H drops the header row so it is not counted or printed
$ ss -Htan state time-wait | wc -l
31044
$ ss -Htan state time-wait dst 10.0.3.7:8080 | head -3
TIME-WAIT  0  0   10.0.2.5:41002   10.0.3.7:8080
TIME-WAIT  0  0   10.0.2.5:41003   10.0.3.7:8080
TIME-WAIT  0  0   10.0.2.5:41006   10.0.3.7:8080
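If you would rather see the breakdown per destination than a single total, a small script can group the ss output by peer address. This is a rough sketch: it assumes the peer address:port is the last whitespace-separated column of each row, and the second destination in the sample (10.0.4.9:6379) is made up for illustration.

```javascript
// Sketch: count TIME_WAIT sockets per destination from `ss -tan` style text.
// Assumes the peer address:port is the last whitespace-separated column.
function countByDestination(ssOutput) {
  const counts = new Map();
  for (const line of ssOutput.split("\n")) {
    const fields = line.trim().split(/\s+/);
    const peer = fields[fields.length - 1];
    // Skip blank lines and the header row; a peer ends in ":<port number>".
    if (!/:\d+$/.test(peer)) continue;
    counts.set(peer, (counts.get(peer) || 0) + 1);
  }
  // Busiest destination first.
  return [...counts.entries()].sort((a, b) => b[1] - a[1]);
}

// Hypothetical sample input; in practice, pipe in real ss output.
const sample = `
Recv-Q Send-Q Local Address:Port  Peer Address:Port
0      0      10.0.2.5:41002      10.0.3.7:8080
0      0      10.0.2.5:41003      10.0.3.7:8080
0      0      10.0.2.5:41006      10.0.3.7:8080
0      0      10.0.2.5:52311      10.0.4.9:6379
`;
console.log(countByDestination(sample));
// [ [ '10.0.3.7:8080', 3 ], [ '10.0.4.9:6379', 1 ] ]
```

One destination dominating the list is the signature of connection churn toward that service.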

The real fix: stop opening a connection per request

The root cause was not the port range being too small, it was that we were churning connections. Each downstream call did a fresh TCP handshake, made one request, and closed, which is the worst possible pattern. The fix is connection reuse: keep a pool of keep-alive connections to each downstream and send many requests over each one. A reused connection does not allocate a new source port and never enters TIME_WAIT between requests, so a few hundred pooled connections can carry tens of thousands of requests per second without touching the port ceiling.

The catch is that connection reuse only happens if every layer is configured for it, and the defaults often are not. In Node.js a plain http.request uses an agent that, depending on version, may not keep connections alive or may cap the pool low, so you have to set up an agent deliberately.

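// reuse connections instead of opening one per request
const http = require('node:http');

// build once at startup and share across all requests
const agent = new http.Agent({
  keepAlive: true,
  maxSockets: 128,        // cap the pool per host
  maxFreeSockets: 64,     // idle sockets kept per host for reuse
});
// host and port are the downstream's address
const req = http.request({ host, port, agent }, (res) => {
  res.resume();           // drain the body so the socket returns to the pool
});
req.end();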

The same trap exists everywhere: many HTTP client libraries open a new connection on each call unless you reuse a single client instance, and database and Redis access should go through a long-lived pool for the identical reason. If you create the client inside your request handler, you are opening a connection per request no matter what the library supports. Build the pool once at startup and share it.
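One way to convince yourself the pool is actually being reused is to count the TCP connections a server accepts. The sketch below is not our gateway code: callDownstream and countConnections are hypothetical helpers, and the "downstream" is a throwaway local server on 127.0.0.1.

```javascript
const http = require("node:http");

// Built once at module load and shared by every request handler.
const downstreamAgent = new http.Agent({ keepAlive: true, maxSockets: 128 });

// Hypothetical helper: GET a path from a downstream, resolve with the body.
function callDownstream(host, port, path, agent = downstreamAgent) {
  return new Promise((resolve, reject) => {
    const req = http.request({ host, port, path, agent }, (res) => {
      let body = "";
      res.setEncoding("utf8");
      res.on("data", (chunk) => { body += chunk; });
      // Once the body is fully read, a keep-alive socket goes back to the pool.
      res.on("end", () => resolve(body));
    });
    req.on("error", reject);
    req.end();
  });
}

// Demo: send sequential requests to a local server and report how many
// TCP connections the server had to accept for them.
async function countConnections(requests, agent) {
  let connections = 0;
  const server = http.createServer((req, res) => res.end("ok"));
  server.on("connection", () => { connections += 1; });
  await new Promise((resolve) => server.listen(0, "127.0.0.1", resolve));
  const { port } = server.address();
  for (let i = 0; i < requests; i += 1) {
    await callDownstream("127.0.0.1", port, "/", agent);
  }
  agent.destroy(); // close pooled sockets so the server can shut down
  await new Promise((resolve) => server.close(resolve));
  return connections;
}
```

With a keep-alive agent, countConnections(5, agent) should report a single connection, while an agent created with keepAlive: false should report one connection per request. That second number is the one that turns into TIME_WAIT sockets.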

The knobs that look like fixes, and the one that actually helps

When you search this error you will find advice to widen the ephemeral range and to enable tcp_tw_reuse. Widening the range with net.ipv4.ip_local_port_range does buy headroom, taking you from roughly 28000 ports to about 64500 with the setting below, and it is worth doing, but it only raises the ceiling; it does not stop you climbing toward it. If the range starts as low as 1024, check that it does not overlap ports your own services listen on. If your real problem is connection churn, a bigger range just delays the same wall.

# more headroom, not a cure
sysctl -w net.ipv4.ip_local_port_range="1024 65535"
# let the kernel reuse TIME_WAIT sockets for new outbound connections
sysctl -w net.ipv4.tcp_tw_reuse=1

tcp_tw_reuse is the genuinely useful knob: it lets the kernel reuse a TIME_WAIT socket for a new outbound connection when TCP timestamps make it safe, which is exactly our case of a client making many connections. It depends on timestamps being enabled on both ends (net.ipv4.tcp_timestamps, on by default on Linux). Note it applies only to outbound connections, and that the old tcp_tw_recycle option, which people confuse it with, was dangerous behind NAT and was removed in Linux 4.12, so do not reach for that one. Treat these settings as a safety margin layered on top of the real fix, not as the fix. The fix is reusing connections.
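To put a number on "more headroom, not a cure", here is a back-of-the-envelope sketch of the highest connection-per-request rate each range can absorb. It assumes a 60-second TIME_WAIT and no tcp_tw_reuse; maxChurnRate is an illustrative name.

```javascript
// Highest connection-per-request rate to one destination before TIME_WAIT
// sockets alone fill the ephemeral range.
function maxChurnRate(lowPort, highPort, timeWaitSeconds = 60) {
  const ports = highPort - lowPort + 1;
  return ports / timeWaitSeconds;
}

console.log(Math.floor(maxChurnRate(32768, 60999))); // 470 (default range)
console.log(Math.floor(maxChurnRate(1024, 65535)));  // 1075 (widened range)
```

Widening the range a little more than doubles the ceiling. A churning client can outgrow that with one traffic spike, which is why connection reuse stays the actual fix.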

Why it hid until the load test

At normal traffic you never come close to the port ceiling, so a connection-per-request pattern works perfectly for years. It only breaks past a threshold of sustained outbound requests to a single destination, which is why it surfaced under a load test and not in production, and why it is so disorienting when it hits: the symptom is a connection failure, but the cause is a slow accumulation of harmless-looking closed sockets. If you ever see EADDRNOTAVAIL or "cannot assign requested address" while the machine is otherwise idle, do not look at CPU or the remote service. Count your sockets in TIME_WAIT to one destination, and you will almost always find the answer there.

Rules of thumb

  • Outbound connections to a single destination are capped by the ephemeral source-port range, roughly 28000 ports by default. Exhaust it and you get "cannot assign requested address", from your own kernel, not the remote.
  • Closing a connection does not free its port immediately. The port sits in TIME_WAIT for about 60 seconds, so port usage at any moment is your connection rate times that window, which exhausts the range far below the raw count.
  • The root fix is connection reuse: keep-alive pools so many requests share each connection, instead of one connection per request.
  • Reuse only works if you share a long-lived client or agent. Creating the client inside the request handler opens a connection per request regardless of library support.
  • Widening ip_local_port_range adds headroom but only delays the wall if connections still churn.
  • Enable tcp_tw_reuse for safe reuse of outbound TIME_WAIT sockets. Never use the removed tcp_tw_recycle, which broke clients behind NAT.
  • When connections fail on an otherwise idle box, count TIME_WAIT sockets to the destination with ss before suspecting CPU or the remote.

Reader Discussion

  1. Highlighted by author
    Evan Whitfield · Eng Director · Story

    "scale vertical until it becomes awkward" — yes. We sunk SIX MONTHS into a sharding project that a $4k/month bigger box would have made unnecessary for two more years. Premature distribution is the bourgeois cousin of premature optimisation.

    Jun 29, 2026 · 2 days later
  2. Monique Laurent · Principal Engineer · From experience

    cell-based architecture is how you sleep at night past a certain scale. blast-radius math is so much friendlier when one cell going down hurts 1/N customers instead of all of them. it's also 2x ops cost — pick your poison.

    Jul 03, 2026 · 6 days later
  3. Quốc Anh 🇸🇬 SG · Cloud Architect · Agrees

    PACELC > CAP. Once you accept that partitions aren't the only trade-off (latency-vs-consistency happens 24/7, partitions are a side quest), the whole topology debate gets clearer. Wish PACELC had CAP's branding budget.

    Jul 01, 2026 · 4 days later
  4. Grace Liu · Senior Engineer · From experience

    idempotency keys are such a small API change for such a huge reliability win. ship them on every new POST endpoint as a default. the day you need them and don't have them is one of the worst days of your career.

    Jun 30, 2026 · 3 days later
  5. Kofi Mensah · Infra Engineer · Asks

    Q: idempotency-key table — keep on primary or shard it? we hit ~120M rows/month and the index is bigger than the data. partition by month + retention helps but I'm curious what others do

    Jul 02, 2026 · 5 days later
  6. Rachel Gold · Staff SRE · Agrees

    the on-call framing throughout this piece is what makes it land. too many infra articles assume you never get paged. those are written by people who never got paged.

    Jun 30, 2026 · 3 days later
  7. Omar Khalil · Senior SWE · Kind words

    this is the third article from this blog I've sent to my team this month. you're cooking. don't switch to crypto.

    Jul 02, 2026 · 5 days later

Worked on something similar? Email ducminhldm@gmail.com — I read every one. The good ones become future posts.
