Your WebSocket Server Leaks Memory at 500 Connections — Here's Why
The memory graph climbed all night while the connection count sat flat at 500. The culprit was not a code leak: it was zombie sockets the garbage collector could never collect, because the server still referenced them. A war story, a reproduction, and the five-line fix.
The pager went off at 02:14. Not because anything had crashed — because a memory alert finally crossed its threshold. I opened the dashboard and saw a graph that made no sense: RAM climbing in a straight diagonal line for six hours, while the active connection count sat flat at almost exactly 500.
500 players. 430 MB and rising. The math was absurd — that is nearly a megabyte of memory per connected user for a game that sends a few hundred bytes per tick. Something was holding onto memory that no living user owned. This is the story of how I found it, why the garbage collector could never save me, and the five-line fix that has kept the graph flat ever since.
The short version: the connection count on your dashboard is a lie. It counts sockets your server thinks are open. The leak lives in the gap between that number and the number of sockets that are actually alive.
The symptom: a flat line and a diagonal one
Here is the exact shape of the problem. Active connections — the number the load balancer and my own metrics reported — stayed flat all evening. Resident memory climbed without pause. Two lines that should move together, moving completely independently.
My first instinct was wrong. I assumed a classic JavaScript leak: an array I kept pushing to, an event listener I never removed, a closure capturing the whole room state. So I took a heap snapshot — and that is where the story gets interesting.
The hunt: the heap snapshot that lied to me
I grabbed two node --inspect heap snapshots twenty minutes apart and diffed them in Chrome DevTools. The delta was thousands of retained WebSocket objects, each dragging along a Buffer and a Set entry. The retainer path pointed straight at the ws library's internal client set.
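# Reproduce locally: open 500 sockets, then SIGKILL the clients
# so they never send a WebSocket close frame
$ node --inspect server.js &   # job %1: the server must be up before the clients
$ node load/open-500.js &      # job %2: the 500 clients
$ sleep 5
$ kill -9 %2                   # clients vanish; no close handshake
# Caveat: on a single host the kernel still sends FIN/RST for a killed process,
# so the server may see 'close'. For a truly silent drop like a 4G hand-off,
# run the clients on another machine and cut its network instead.
# DevTools → Memory → take snapshot, wait 20s, take another, compare
# Retained: 500 × WebSocket → never collected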
That was the moment it clicked. These were not leaked objects in the usual sense — every one of them was still reachable from a GC root. The server was holding hard references to sockets whose humans had been gone for hours. The garbage collector was doing its job perfectly; the references were real. The bug was that I never told the server those connections were dead.
Why the garbage collector can't help you
A connection can die in two very different ways. A graceful close sends a WebSocket close frame, fires your 'close' handler, and lets you run cleanup. An ungraceful death (phone goes from Wi-Fi to 4G, a NAT mapping expires, the OS suspends a backgrounded tab) sends nothing: no close frame, no TCP FIN. The kernel keeps the socket in its table; your 'close' event never fires; your code believes the user is still there.
And because your code still believes it, it keeps a reference. That reference is the entire problem. The retention chain the GC sees runs from a GC root to the server, to its client set (and the room's Set), to the WebSocket, to its Buffer. Every link is alive, so nothing is collectable.
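The chain can be modeled in a few lines. This is a simplified sketch, not the ws library's internals: `server`, `rooms`, `connect`, and `retainersOf` are hypothetical names, and a plain object stands in for the socket.

```javascript
// Simplified model of the retention chain (hypothetical names throughout).
// A plain object stands in for a WebSocket; `server` for the ws server.
const server = { clients: new Set() };          // reachable for as long as the server runs
const rooms = new Map([['lobby', new Set()]]);  // application-level room membership

function connect(id) {
  const ws = { id, sendBuffer: Buffer.alloc(64 * 1024) }; // each socket drags a buffer along
  server.clients.add(ws);
  rooms.get('lobby').add(ws);
  return ws;
}

// List every strong reference that keeps `ws` reachable from a root.
function retainersOf(ws) {
  const paths = [];
  if (server.clients.has(ws)) paths.push('server.clients');
  for (const [name, members] of rooms) {
    if (members.has(ws)) paths.push(`rooms.${name}`);
  }
  return paths;
}

const zombie = connect('player-1');  // the human leaves, but no 'close' ever fires
console.log(retainersOf(zombie));    // both paths still hold the socket

server.clients.delete(zombie);       // dropping one link is not enough
console.log(retainersOf(zombie));    // the room Set still retains it
```

Only when every path in that list is gone does the object become eligible for collection, which is why cleanup has to cover the room membership as well as the server's client set.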
This is why heap-profiling tools can be misleading here. They will faithfully tell you "you have 2,000 live WebSocket objects retaining 80 MB," which is true and useless. They cannot tell you that 1,900 of those humans walked away on a train an hour ago. Only your application knows what should be alive — and the only way it can know is by asking.
The fix: make the server ask "are you still there?"
The fix is a heartbeat. On an interval, mark every socket as presumed-dead and send a ping. A living client's runtime answers with a pong automatically, which flips the flag back. Any socket that did not answer since the last sweep gets terminated and cleaned up. A dead socket is caught within two sweeps: one sends the ping that goes unanswered, the next notices the flag never flipped back. At worst it survives just under two intervals.
// heartbeat.js: the five lines that flattened the graph
function startHeartbeat(server, intervalMs = 30_000) {
  return setInterval(() => {
    server.clients.forEach((ws) => {
      if (ws.isAlive === false) {  // missed the previous round → dead
        leaveRoom(ws);             // cleanup BEFORE terminate: order matters
        return ws.terminate();     // drops the last reference; now collectable
      }
      ws.isAlive = false;          // presume dead until proven otherwise
      ws.ping();                   // a live client auto-replies with pong
    });
  }, intervalMs);
}

// when the socket opens:
server.on('connection', (ws) => {
  ws.isAlive = true;
  ws.on('pong', () => { ws.isAlive = true; }); // proof of life
});

// start sweeping, and stop when the server shuts down
const heartbeat = startHeartbeat(server);
server.on('close', () => clearInterval(heartbeat));
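To watch the sweep logic work without a network, the loop can be run against stand-in sockets. This is a sketch under stated assumptions: `FakeSocket` and `sweep` are hypothetical and only mirror the heartbeat loop, and the fake pong arrives synchronously, whereas a real pong arrives some time later. It shows a silent peer being reaped on the sweep after its unanswered ping.

```javascript
// Simulation of the sweep logic with stand-in sockets (no network, no ws).
// FakeSocket and sweep are hypothetical; they mirror the heartbeat loop.
const { EventEmitter } = require('node:events');

class FakeSocket extends EventEmitter {
  constructor(name, reachable) {
    super();
    this.name = name;
    this.reachable = reachable;  // false models a peer that vanished silently
    this.isAlive = true;
    this.terminated = false;
    this.on('pong', () => { this.isAlive = true; }); // proof of life
  }
  ping() {
    // A live peer answers with pong; a zombie answers with nothing.
    if (this.reachable) this.emit('pong');
  }
  terminate() {
    this.terminated = true;
  }
}

// One pass of the heartbeat: reap sockets that missed the last ping, ping the rest.
function sweep(clients, onDead) {
  for (const ws of [...clients]) {
    if (ws.isAlive === false) {
      onDead(ws);          // application cleanup first (for example, leave the room)
      ws.terminate();
      clients.delete(ws);  // stands in for the server dropping the socket on 'close'
      continue;
    }
    ws.isAlive = false;
    ws.ping();
  }
}

const clients = new Set([new FakeSocket('alice', true), new FakeSocket('bob', false)]);
const reaped = [];

sweep(clients, (ws) => reaped.push(ws.name)); // sweep 1: bob is pinged and stays silent
sweep(clients, (ws) => reaped.push(ws.name)); // sweep 2: bob's flag is still false, so he is reaped

console.log(reaped, clients.size);
```

The live socket survives any number of sweeps because its pong flips the flag back before the next pass.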
Two details that are easy to get wrong, both of which I got wrong the first time:
- Call leaveRoom(ws) before ws.terminate(). If you terminate first, the room's Set still holds the reference and you have moved the leak, not fixed it. Cleanup, then terminate.
- Use ws.terminate(), not ws.close(). close() tries a graceful close handshake, but the peer is already gone, so the socket lingers waiting for a reply that never comes. terminate() rips the socket down immediately.
This is the same heartbeat referenced in the pillar guide, but here you can see why it is non-negotiable, not merely that it is. Without it, every flaky mobile connection is a permanent memory reservation.
One more leak hiding behind the first
After the heartbeat fix, the graph flattened — but per-connection memory was still higher than it should have been under load. The second offender was backpressure. When you call ws.send() on a slow or stalled socket faster than it can drain, Node queues the unsent bytes in user-space. On a zombie socket that drains nothing, that queue grows on every game tick.
// Guard every write: never send blindly into a buffer you didn't check
const { WebSocket } = require('ws'); // for the WebSocket.OPEN constant (assumes CommonJS)
const HIGH_WATER = 16 * 1024; // 16 KB
function safeSend(ws, payload) {
  if (ws.readyState !== WebSocket.OPEN) return;
  if (ws.bufferedAmount > HIGH_WATER) return; // stalled peer: drop stale frame
  ws.send(payload);
}
For game-state snapshots this is safe — a dropped frame is replaced by the next tick. For messages that must arrive (chat, transactions), use a bounded per-connection queue and close the socket when it overflows, rather than letting the buffer grow without limit.
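For the must-arrive case, one possible shape is sketched below. It is an illustration, not a definitive implementation: `BoundedSender`, `maxQueued`, and the choice of close code 1013 are assumptions, and the stub socket at the end exists only to show the overflow path.

```javascript
// Sketch of a bounded per-connection queue for messages that must arrive.
// BoundedSender, maxQueued and the 1013 close code are illustrative assumptions.
const OPEN = 1;                // readyState value of an open WebSocket
const HIGH_WATER = 16 * 1024;  // same 16 KB threshold as safeSend

class BoundedSender {
  constructor(ws, maxQueued = 64) {
    this.ws = ws;
    this.maxQueued = maxQueued;
    this.queue = [];
  }

  // Returns false if the socket is not open or was closed due to overflow.
  send(payload) {
    if (this.ws.readyState !== OPEN) return false;
    this.queue.push(payload);
    this.flush();
    if (this.queue.length > this.maxQueued) {
      this.queue.length = 0;                       // release the backlog
      this.ws.close(1013, 'send queue overflow');  // give up on a peer that cannot keep up
      return false;
    }
    return true;
  }

  // Hand queued messages to the socket while its buffer has room.
  // Call this again on a timer or after a send completes to keep draining.
  flush() {
    while (this.queue.length > 0 && this.ws.bufferedAmount <= HIGH_WATER) {
      this.ws.send(this.queue.shift());
    }
  }
}

// Usage with a stub socket whose buffer never drains (a stalled peer).
const stalled = {
  readyState: OPEN,
  bufferedAmount: HIGH_WATER + 1,
  closedWith: null,
  send() {},
  close(code) { this.closedWith = code; this.readyState = 3; },
};
const sender = new BoundedSender(stalled, 2);
const results = ['a', 'b', 'c'].map((m) => sender.send(m));
console.log(results, stalled.closedWith);
```

The queue caps memory per connection at roughly `maxQueued` messages. For a peer the heartbeat has already judged dead, terminating rather than closing is the better fit.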
Before and after
Same VPS (2 vCPU / 4 GB), same Node.js 20 LTS, same 500-player load held for six hours. The only change between runs is the heartbeat plus the bufferedAmount guard.
| Metric | Before fix | After fix |
|---|---|---|
| RAM per reported connection | 180 KB | 38 KB |
| Resident memory after 6h @ 500 conns | 430 MB (climbing) | 96 MB (flat) |
| Zombie sockets accumulated / hour | ~320 | 0 |
| Max lifetime of a dead socket | unbounded | < 60 s (two 30 s sweeps) |
| p99 broadcast latency | 45 ms | 14 ms |
The p99 latency drop is a bonus most people miss: every broadcast() was iterating the room's full member set, including the zombies. Fewer dead members in the Set means less wasted work on every single tick.
The checklist that would have saved my night
- ⚠ Add a heartbeat. Ping every 30 s; terminate any socket that missed the previous round. This alone fixes the classic leak.
- ⚠ Clean up before you terminate. leaveRoom(ws) then ws.terminate(), never the reverse, or you relocate the leak into the room map.
- ⚠ Graph connection count, not just memory. The leak is invisible until you plot "connections that opened" against "connections that closed." The gap is the bug.
- Guard bufferedAmount on every send. A stalled peer turns blind writes into an unbounded user-space queue.
- Prefer terminate() over close() for dead peers. A graceful close waits for a handshake a zombie will never complete.
- Reproduce with kill -9, not a clean disconnect. Only an ungraceful death reproduces the real bug; a normal close hides it.
- Don't trust the heap snapshot's "live objects." Reachable is not the same as should be alive. The tool can't know who left the train.
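The "graph connection count" item can be sketched as two counters. This is a minimal illustration: `trackConnections` and the `stats` field names are hypothetical, and plain EventEmitters stand in for the server and its sockets, on the assumption that the real server emits 'connection' and each socket emits 'close'.

```javascript
// Sketch: count opens and closes separately so the gap can be graphed.
// trackConnections and the stats field names are hypothetical.
const { EventEmitter } = require('node:events');

function trackConnections(server) {
  const stats = { opened: 0, closed: 0 };
  server.on('connection', (ws) => {
    stats.opened += 1;
    ws.once('close', () => { stats.closed += 1; });
  });
  // Sockets the server still believes are open.
  stats.believedOpen = () => stats.opened - stats.closed;
  return stats;
}

// Stand-in server and sockets for demonstration.
const fakeServer = new EventEmitter();
const stats = trackConnections(fakeServer);

const graceful = new EventEmitter();
const silent = new EventEmitter();
fakeServer.emit('connection', graceful);
fakeServer.emit('connection', silent);
graceful.emit('close');  // clean disconnect: 'close' fires
// `silent` vanished without a close: nothing fires, so it stays counted

console.log(stats.opened, stats.closed, stats.believedOpen());
```

Exporting `believedOpen()` next to an independent count of real users, such as the load balancer's, makes drift between the two visible long before memory does.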
The one-line lesson
A WebSocket leak is almost never a leak in the JavaScript sense. It is a liveness problem wearing a memory problem's costume: your server's idea of who is connected drifts away from reality, and every stale entry is a small, permanent reservation. The fix is not better garbage collection — it is making the server periodically ask the question only it can answer: are you still there?
This is the war story behind the heartbeat in "Building a Production WebSocket Backend That Survives Real Traffic." Next in the series: a head-to-head of Django Channels vs Node.js vs Phoenix for real-time workloads, with the same kind of numbers. If you have hit this leak too — or found a sneakier variant — the comments are open.