One server breaks under pressure — here's how a thousand servers act like one

Load balancing is the reason Google, Zomato, and your bank don't crash when a million people use them at the same moment. It's not magic — it's traffic direction done very, very fast.

What's actually happening here?

A load balancer sits in front of your servers and distributes incoming requests across all of them. When a user makes a request, they never talk to your application servers directly — they talk to the load balancer, which picks one server, forwards the request, gets the response, and returns it to the user. From the user's perspective, it's one service. Behind the scenes, it's a fleet. The load balancer is the reason you can add or remove servers without any user ever noticing.

The problem this solves

A single server has hard physical limits — fixed CPU cores, fixed RAM, fixed network bandwidth. Once traffic exceeds that ceiling, every additional user degrades the experience for everyone. Without a load balancer, scaling means replacing your server with a bigger one — which has its own ceiling, still represents a single point of failure, and requires downtime to swap. With a load balancer, you add servers horizontally. No single machine matters. One dies — the load balancer stops sending it traffic. Ten more are added — the load balancer starts sending them traffic. The system grows without downtime.

How it really works (step by step)

What happens on every incoming request:

Request arrives at the load balancer: DNS resolves your domain to the load balancer's IP, not your app servers' IPs. A proxying load balancer terminates the client's TCP connection itself.
Algorithm selects a backend server: the choice depends on the configured algorithm (see below) and typically takes microseconds.
Load balancer opens a connection to the chosen server, or reuses one from its pool of persistent backend connections. Reusing pooled connections avoids a fresh TCP (and TLS) handshake on every request, which is a large part of why a load balancer adds so little latency even with very large numbers of concurrent connections.
Request forwarded, response received: the load balancer proxies the request to the backend and streams the response back to the client. It may add headers like X-Forwarded-For (original client IP), since the backend otherwise sees only the load balancer's IP.
Health checked continuously: load balancers probe every backend every few seconds. A server that fails its health checks (usually a configured number of consecutive failures) is removed from rotation automatically. No manual intervention required.
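The per-request flow above can be sketched in a few lines of Python. This is a minimal in-memory illustration, not how any particular load balancer is implemented: the `Backend` and `LoadBalancer` classes, the `unhealthy_threshold` parameter, and the plain function standing in for a pooled backend connection are all assumptions made for the example.

```python
import itertools
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Headers = Dict[str, str]
Handler = Callable[[str, Headers], str]


@dataclass
class Backend:
    name: str
    handler: Handler  # hypothetical stand-in for a pooled connection to a real server
    healthy: bool = True
    consecutive_failures: int = 0


class LoadBalancer:
    def __init__(self, backends: List[Backend], unhealthy_threshold: int = 2):
        self.backends = backends
        self.unhealthy_threshold = unhealthy_threshold  # assumed: failures before removal
        self._counter = itertools.count()

    def record_health_check(self, backend: Backend, passed: bool) -> None:
        """Update a backend's health state after one probe."""
        if passed:
            backend.consecutive_failures = 0
            backend.healthy = True
            return
        backend.consecutive_failures += 1
        if backend.consecutive_failures >= self.unhealthy_threshold:
            backend.healthy = False

    def handle(self, client_ip: str, path: str, headers: Headers) -> Tuple[str, str]:
        """Pick a healthy backend (round robin), add X-Forwarded-For, forward."""
        healthy = [b for b in self.backends if b.healthy]
        if not healthy:
            raise RuntimeError("no healthy backends")
        backend = healthy[next(self._counter) % len(healthy)]
        forwarded = dict(headers)
        prior = forwarded.get("X-Forwarded-For")
        # Append to any existing chain so earlier proxies stay visible.
        forwarded["X-Forwarded-For"] = f"{prior}, {client_ip}" if prior else client_ip
        return backend.name, backend.handler(path, forwarded)


def echo_client_ip(path: str, headers: Headers) -> str:
    # The backend only learns the real client IP from the header.
    return f"{path} for {headers['X-Forwarded-For']}"


lb = LoadBalancer([Backend("app-1", echo_client_ip), Backend("app-2", echo_client_ip)])
first = lb.handle("203.0.113.7", "/orders", {})
second = lb.handle("203.0.113.7", "/orders", {})

# Two failed probes in a row take app-1 out of rotation.
lb.record_health_check(lb.backends[0], passed=False)
lb.record_health_check(lb.backends[0], passed=False)
third = lb.handle("203.0.113.7", "/orders", {})
```

After the two failed probes, every request goes to `app-2` until `app-1` passes a health check again. A real proxy would also handle timeouts, retries, and connection reuse, which are left out here.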

The five algorithms and when each is the right choice:

Round Robin: requests go to server 1, 2, 3, 1, 2, 3... Simple, works well when all requests take similar time and all servers are identical.
Weighted Round Robin: server A gets 3× the traffic of server B. Used when servers have different capacity: a 32-core machine should get more traffic than an 8-core one.
Least Connections: new requests go to whichever server currently has the fewest active connections. Essential for workloads where requests take wildly different amounts of time: a mix of 10ms and 30-second requests will cause hot spots under pure round robin.
IP Hash: the client's IP is hashed so the same client always selects the same server. This gives sticky sessions by hash, as long as the set of servers stays the same (a plain modulo hash remaps most clients when a server is added or removed). Used when server-side session state is unavoidable.
Power of Two Choices: pick two servers at random, send to the less loaded one. Mathematically shown to distribute load far more evenly than purely random selection, while being much simpler than tracking least connections globally. Variants are available in NGINX (the random two method) and HAProxy, and Netflix has described using the approach for its edge load balancing.
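To make the differences concrete, here is a rough Python sketch of all five selection strategies. The function names and the `active` connection-count dictionary are illustrative assumptions; production load balancers use more refined variants (for example, smooth weighted round robin rather than the naive expansion shown here).

```python
import hashlib
import itertools
import random
from typing import Dict, Iterator, List


def round_robin(servers: List[str]) -> Iterator[str]:
    """Cycle through the servers in order, forever."""
    return itertools.cycle(servers)


def weighted_round_robin(weights: Dict[str, int]) -> Iterator[str]:
    """Naive weighting: a server with weight 3 appears three times per cycle."""
    expanded = [name for name, weight in weights.items() for _ in range(weight)]
    return itertools.cycle(expanded)


def least_connections(active: Dict[str, int]) -> str:
    """Pick the server with the fewest active connections."""
    return min(active, key=active.get)


def ip_hash(client_ip: str, servers: List[str]) -> str:
    """Map a client IP to a fixed server using a stable hash and modulo."""
    digest = hashlib.sha256(client_ip.encode()).digest()
    return servers[int.from_bytes(digest[:8], "big") % len(servers)]


def power_of_two_choices(active: Dict[str, int], rng: random.Random) -> str:
    """Sample two distinct servers at random and keep the less loaded one."""
    a, b = rng.sample(list(active), 2)
    return a if active[a] <= active[b] else b


servers = ["s1", "s2", "s3"]
rr = round_robin(servers)
rr_order = [next(rr) for _ in range(4)]        # s1, s2, s3, s1

wrr = weighted_round_robin({"big": 3, "small": 1})
wrr_order = [next(wrr) for _ in range(4)]      # big, big, big, small

active = {"s1": 12, "s2": 3, "s3": 7}
least = least_connections(active)              # s2
sticky = ip_hash("198.51.100.4", servers)      # same server on every call
p2c = power_of_two_choices(active, random.Random(0))
```

Note that Power of Two Choices never picks the busiest server here: whichever pair is sampled, `s1` loses the comparison. It needs only two lookups per request instead of a scan over every backend.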

The part most tutorials skip

Layer 4 vs Layer 7 load balancing is a decision that determines what you can and cannot do. A Layer 4 load balancer operates at the TCP level — it sees IP addresses and port numbers, makes a routing decision, and forwards packets without reading the content. It's extremely fast (hardware can do this at line rate — millions of packets per second) but completely blind to what's inside the request.

A Layer 7 load balancer reads the actual HTTP content: headers, cookies, URL paths, request bodies. This lets it do things a Layer 4 balancer cannot: route /api/* to your API servers and /static/* to your static asset servers, inspect auth headers, terminate TLS, apply rate limiting, and route based on the Host header for virtual hosting. Application load balancers work at this layer: AWS ALB is Layer 7 only, while Nginx and HAProxy can run in either Layer 4 or Layer 7 mode. The trade-off: parsing HTTP adds CPU cost. For most applications this is negligible. For raw throughput at terabit scale (DDoS mitigation, CDN edge routing), Layer 4 load balancers handle the outer layer, with Layer 7 behind.
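As a small illustration of content-based routing, the sketch below chooses a backend pool from the Host header and URL path, which is exactly the information a Layer 4 balancer never sees. The rule table, pool names, and the `choose_pool` function are hypothetical and only meant to show the idea.

```python
from typing import List, Optional, Tuple

# (host, path prefix, pool name). A host of None matches any host.
# Rules are checked in order and the first match wins.
RouteRule = Tuple[Optional[str], str, str]

ROUTES: List[RouteRule] = [
    ("api.example.com", "/", "api-pool"),
    (None, "/api/", "api-pool"),
    (None, "/static/", "static-pool"),
    (None, "/", "web-pool"),  # catch-all
]


def choose_pool(host: str, path: str, routes: List[RouteRule] = ROUTES) -> str:
    """Return the pool for a request, based on its Host header and path."""
    for rule_host, prefix, pool in routes:
        if rule_host is not None and rule_host != host.lower():
            continue
        if path.startswith(prefix):
            return pool
    raise LookupError(f"no route for {host}{path}")


api_pool = choose_pool("www.example.com", "/api/users")         # api-pool
static_pool = choose_pool("www.example.com", "/static/app.css") # static-pool
vhost_pool = choose_pool("api.example.com", "/v1/orders")       # api-pool
web_pool = choose_pool("www.example.com", "/checkout")          # web-pool
```

A Layer 4 balancer would have to make the same decision using only source and destination IP addresses and ports.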

Real company doing this right now

Cloudflare's global load balancing routes across 300+ data centres using a combination of anycast routing and health-aware DNS. When you make a request, anycast BGP routing brings you to a nearby Cloudflare edge (nearest in network terms, not always in distance). At that edge, a Layer 7 load balancer inspects your request and routes to the appropriate origin pool based on path, geolocation, and real-time origin health. If one origin region degrades (for example, latency spikes above a configured threshold or error rates climb past a limit such as 1%), Cloudflare's health checks trigger automatic traffic steering to the next healthiest region in seconds, not minutes. By Cloudflare's own figures, its network handles around 55 million HTTP requests per second on average, with no human involvement in failover.

What breaks at scale?

Missing connection draining is the failure mode engineers discover in their first production deployment. When you remove a server from rotation for a deployment or maintenance, a naive load balancer immediately stops sending new requests to it, but existing long-lived connections (WebSockets, streaming responses, large file uploads) are abruptly terminated mid-flight. Users see broken uploads, dropped WebSocket connections, and partial responses. The fix is connection draining: the load balancer marks the server as "draining", stops sending new requests, but waits for in-flight requests to complete naturally before fully removing the server. AWS, GCP, and most production load balancers support configurable drain timeouts, typically 30–300 seconds depending on your longest expected request duration.
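The draining behaviour described above amounts to a small state machine. The sketch below is one possible shape for it, with time passed in explicitly so it is easy to follow; the `Server` class, its method names, and the `tick` polling approach are assumptions for illustration, not any vendor's implementation.

```python
from dataclasses import dataclass
from enum import Enum


class State(Enum):
    ACTIVE = "active"
    DRAINING = "draining"
    REMOVED = "removed"


@dataclass
class Server:
    name: str
    state: State = State.ACTIVE
    in_flight: int = 0
    drain_deadline: float = 0.0

    def accepts_new_requests(self) -> bool:
        return self.state is State.ACTIVE

    def start_draining(self, now: float, drain_timeout: float) -> None:
        """Stop taking new requests, but let in-flight ones finish."""
        self.state = State.DRAINING
        self.drain_deadline = now + drain_timeout

    def request_started(self) -> None:
        if not self.accepts_new_requests():
            raise RuntimeError(f"{self.name} is not accepting new requests")
        self.in_flight += 1

    def request_finished(self) -> None:
        self.in_flight -= 1

    def tick(self, now: float) -> None:
        """Remove a draining server once it is idle or its timeout expires."""
        if self.state is State.DRAINING and (
            self.in_flight == 0 or now >= self.drain_deadline
        ):
            self.state = State.REMOVED


server = Server("app-1")
server.request_started()                       # a long upload begins
server.start_draining(now=0.0, drain_timeout=60.0)
server.tick(now=10.0)
state_mid_upload = server.state                # State.DRAINING, upload still running
server.request_finished()
server.tick(now=12.0)
state_after_upload = server.state              # State.REMOVED
```

The timeout is the safety valve: a connection that never finishes, such as an idle WebSocket, would otherwise block the deployment indefinitely.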

The "aha" moment

A load balancer does not make your system faster — it makes your system scalable. Each individual request still takes the same time. What changes is how many requests you can serve simultaneously.
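A back-of-the-envelope calculation shows the distinction. The sketch assumes each server can work on a fixed number of requests at once and that latency stays constant, which is a simplification; the function name and the numbers are illustrative.

```python
def max_throughput_rps(servers: int, concurrent_per_server: int, latency_seconds: float) -> float:
    """Requests per second the fleet can sustain (Little's law, rearranged)."""
    return servers * concurrent_per_server / latency_seconds


latency = 0.2  # every request takes 200 ms, however many servers there are

one_server = max_throughput_rps(1, 100, latency)    # about 500 requests per second
ten_servers = max_throughput_rps(10, 100, latency)  # about 5000 requests per second
```

Ten servers serve ten times as many requests per second, yet each request still takes 200 ms.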

Your practical takeaway

Use Least Connections instead of Round Robin the moment your request duration varies. Any API that mixes fast and slow endpoints will create hot spots under Round Robin. Least Connections costs almost nothing extra and removes most of that imbalance.
Configure health checks so detection is much faster than your alert thresholds. With a check every 5 seconds, a 3-second timeout, and removal on the first failure, a dead server is out of rotation in under 10 seconds; requiring two or three consecutive failures stretches that to roughly 13 to 18 seconds. Your on-call alert fires at 60 seconds. Apart from requests caught in that short detection window, users never see the outage: the load balancer has already handled it before you wake up.
Base connection drain timeouts on your longest p99 request duration. Find the 99th percentile response time for your slowest endpoints, add a 20% buffer, and use that as your drain window. Too short and you break long requests. Too long and your deployments take forever.
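The arithmetic behind the last two takeaways is simple enough to script. This sketch assumes probes start at a fixed interval, the timeout is no longer than the interval, and the server dies just after passing a probe (the worst case); the function names and the `unhealthy_threshold` parameter are illustrative.

```python
def worst_case_detection_seconds(
    interval: float, timeout: float, unhealthy_threshold: int = 1
) -> float:
    """Longest time between a server dying and its removal from rotation.

    The server may die right after a successful probe, so up to one full
    interval passes before the first failing probe starts. The final failing
    probe then takes up to `timeout` seconds to be declared failed.
    """
    return interval * unhealthy_threshold + timeout


def drain_timeout_seconds(p99_seconds: float, buffer: float = 0.20) -> float:
    """Drain window: slowest endpoint's p99 latency plus a safety buffer."""
    return p99_seconds * (1 + buffer)


single_failure = worst_case_detection_seconds(interval=5, timeout=3)   # 8 seconds
two_failures = worst_case_detection_seconds(5, 3, unhealthy_threshold=2)  # 13 seconds
drain = drain_timeout_seconds(p99_seconds=25.0)  # about 30 seconds
```

Both results sit well under a 60-second alert threshold, which is the point: the load balancer reacts before the pager does.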

Lesson 12 · Stage 4 — Scalability Patterns · System Design Made Easy
