
WhatsApp’s High‑Reliability Architecture for 450 Million Users

This article examines WhatsApp’s high‑reliability architecture that supports 450 million users, detailing its Erlang‑based backend, hardware choices, scaling techniques, performance metrics, monitoring tools, and lessons learned from achieving up to two million concurrent connections on a single server.

Art of Distributed System Architecture Design


Information Sources

The complete WhatsApp architecture is not publicly disclosed; the following information is compiled from talks, interviews, and articles that describe fragments of the system, especially the use of Erlang to achieve millions of concurrent connections on a single server.

1. Statistics

450 million active users, a scale WhatsApp reached faster than any other company.

32 engineers, each supporting ~14 million active users.

50 billion messages per day (inbound plus outbound) across seven platforms.

Zero advertising spend and $8 million in investment; the infrastructure spans hundreds of nodes, thousands of cores, and hundreds of terabytes of RAM.

More than 70 million Erlang messages per second.

In 2011 a single server handled 1 million established TCP sessions; in 2012 that rose to more than 2 million; and on its record day at the end of 2013, WhatsApp processed 18 billion messages.
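The per‑engineer figure above is simply the ratio of the two headline numbers:

```latex
\frac{450 \times 10^{6}\ \text{users}}{32\ \text{engineers}} = 1.40625 \times 10^{7} \approx 14\ \text{million users per engineer}
```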

2. Platform

Backend

Erlang

FreeBSD

Yaws, lighttpd

PHP

Custom BEAM patches (BEAM is the Erlang VM)

Custom XMPP implementation

Frontend

Seven client platforms: iPhone, Android, BlackBerry, Nokia Symbian S60, Nokia S40, Windows Phone, and a seventh platform that the sources do not name

SQLite for local storage

3. Hardware

Dual Westmere Hex‑core servers (24 logical CPUs)

100 GB RAM, SSD storage

Dual NICs for public and private networks
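The article does not say so explicitly, but assume two sockets with Hyper‑Threading enabled. Westmere hex‑core Xeons expose two hardware threads per core, so the 24 logical CPUs break down as:

```latex
2\ \text{sockets} \times 6\ \text{cores per socket} \times 2\ \text{threads per core} = 24\ \text{logical CPUs}
```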

4. Product Focus

Messaging that connects people anywhere in the world without costing them much money.

Privacy: messages are not kept on the servers after delivery; chat history lives only on the device.

5. General Observations

WhatsApp’s server side is almost entirely implemented in Erlang.

Early servers were based on ejabberd (an open‑source XMPP server written in Erlang) and were heavily customized.

Scaling to 50 billion messages per day required prioritizing reliability over monetization.

System health is monitored through the message queue length of each Erlang process; an alert fires when any queue exceeds a threshold.
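The queue‑length health check can be sketched as follows. This is a Python illustration, not WhatsApp's Erlang code. The process labels and the 10,000‑message threshold are hypothetical. In Erlang, each length would come from erlang:process_info(Pid, message_queue_len).

```python
from dataclasses import dataclass
from typing import Dict, List

# Hypothetical alert threshold; the article does not publish WhatsApp's real limits.
DEFAULT_THRESHOLD = 10_000


@dataclass(frozen=True)
class Alert:
    process: str
    queue_len: int
    threshold: int


def check_queues(queue_lengths: Dict[str, int], threshold: int = DEFAULT_THRESHOLD) -> List[Alert]:
    """Return one alert per process whose message backlog exceeds the threshold.

    queue_lengths maps a process label to its current mailbox length, the value
    Erlang reports via erlang:process_info(Pid, message_queue_len).
    """
    return [
        Alert(name, length, threshold)
        for name, length in sorted(queue_lengths.items())
        if length > threshold
    ]


if __name__ == "__main__":
    # Hypothetical snapshot of three processes.
    snapshot = {"msg_router": 120, "offline_store": 25_000, "presence": 9_999}
    for alert in check_queues(snapshot):
        print(f"ALERT {alert.process}: queue={alert.queue_len} > {alert.threshold}")
```

Watching the backlog catches a process that is alive but falling behind, a case CPU metrics alone can miss. In Erlang, a growing mailbox is the usual symptom of that condition.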

Multimedia messages are sent by uploading the image, audio, or video to an HTTP server and then sending a link to the content, along with a Base64‑encoded thumbnail where applicable.
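The message that references uploaded media can be sketched as a small payload carrying the URL and an optional Base64 thumbnail. The JSON field names and the example URL are hypothetical. WhatsApp's actual wire format, a customized XMPP, is not public, and the upload step itself is assumed to have already returned the URL.

```python
import base64
import json
from typing import Optional


def build_media_message(media_url: str, mime_type: str, thumbnail: Optional[bytes] = None) -> str:
    """Build a text payload that points at media already uploaded to an HTTP server.

    Field names are hypothetical stand-ins for the real (unpublished) format.
    """
    payload = {"type": "media", "url": media_url, "mime": mime_type}
    if thumbnail is not None:
        # Base64 lets the binary preview travel inside a text-based message.
        payload["thumb_b64"] = base64.b64encode(thumbnail).decode("ascii")
    return json.dumps(payload)


def read_thumbnail(message: str) -> Optional[bytes]:
    """Decode the thumbnail back to bytes, or return None if the message has none."""
    encoded = json.loads(message).get("thumb_b64")
    return base64.b64decode(encoded) if encoded is not None else None


if __name__ == "__main__":
    msg = build_media_message("https://media.example.com/abc123", "image/jpeg", b"\xff\xd8\xff")
    print(msg)
```

Keeping the media out of the message path means the chat servers relay only a short reference. The bulky bytes go through the HTTP tier.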

Erlang’s hot‑code loading enables rapid feature rollout without restarts.

Clients connect to the server pool over SSL sockets; messages for an offline recipient are queued on the server until the client reconnects and retrieves them.
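A minimal sketch of the store‑and‑forward behavior, assuming messages are removed from the server only after the client acknowledges them. The class and method names are hypothetical and the queue is in memory only; this is not WhatsApp's implementation.

```python
from collections import defaultdict, deque
from typing import DefaultDict, Deque, List


class OfflineQueue:
    """Per-recipient store-and-forward queue (an illustrative sketch)."""

    def __init__(self) -> None:
        self._pending: DefaultDict[str, Deque[str]] = defaultdict(deque)

    def enqueue(self, recipient: str, message: str) -> None:
        """Hold a message for a recipient who is not currently connected."""
        self._pending[recipient].append(message)

    def fetch(self, recipient: str) -> List[str]:
        """Called when the recipient reconnects: return pending messages, oldest first."""
        return list(self._pending.get(recipient, ()))

    def ack(self, recipient: str, count: int) -> None:
        """Delete the oldest `count` messages once the client confirms receipt."""
        queue = self._pending.get(recipient)
        if queue is None:
            return
        for _ in range(min(count, len(queue))):
            queue.popleft()
        if not queue:
            del self._pending[recipient]  # nothing left to keep on the server

    def pending(self, recipient: str) -> int:
        """Number of messages still waiting for this recipient."""
        return len(self._pending.get(recipient, ()))
```

Deleting on acknowledgement rather than on fetch means a connection that drops mid‑delivery does not lose messages. The client simply receives them again on the next reconnect.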

Registration is tied to the phone number: a PIN is generated and sent by SMS, and the user enters it to verify ownership of the number.
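The PIN verification step can be sketched as issuing a short‑lived, single‑use code per phone number. Several details are assumptions chosen for illustration: the 6‑digit format, the 10‑minute expiry, and the in‑memory store. SMS delivery is left out, and the PIN is returned instead.

```python
import hmac
import secrets
import time
from typing import Callable, Dict, Tuple


class PinVerifier:
    """Issue single-use numeric PINs per phone number and check them before expiry."""

    def __init__(self, ttl_seconds: float = 600.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds  # hypothetical 10-minute validity window
        self._clock = clock
        self._issued: Dict[str, Tuple[str, float]] = {}  # phone -> (pin, expiry time)

    def issue(self, phone_number: str) -> str:
        """Create a 6-digit PIN; a real flow would send it by SMS instead of returning it."""
        pin = f"{secrets.randbelow(10**6):06d}"
        self._issued[phone_number] = (pin, self._clock() + self._ttl)
        return pin

    def verify(self, phone_number: str, candidate: str) -> bool:
        """Return True once for a correct, unexpired PIN; the PIN is then consumed."""
        entry = self._issued.get(phone_number)
        if entry is None:
            return False
        pin, expires_at = entry
        if self._clock() > expires_at:
            del self._issued[phone_number]  # expired PINs are discarded
            return False
        # Constant-time comparison avoids leaking how many leading digits matched.
        if hmac.compare_digest(pin.encode(), candidate.encode()):
            del self._issued[phone_number]
            return True
        return False
```

A production flow would also cap the number of wrong guesses per number. That limit is omitted here for brevity.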

Google Push Service is used on Android.

6. Scaling a Single Server to 2 Million Connections

Initial load: 200k concurrent connections per server.

Capacity was provisioned with headroom for traffic spikes caused by events such as football matches and earthquakes.

Goal: reach 1 million connections per server, later 2 million, with dynamic capacity planning.
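The arithmetic behind this kind of capacity planning can be sketched as two small helpers. The 100 GB and 2 million connection figures come from the article. The headroom fraction and the 10 million peak are hypothetical planning inputs, not WhatsApp's numbers.

```python
def memory_per_connection_bytes(ram_bytes: int, connections: int) -> float:
    """Upper bound on RAM per connection, ignoring OS, BEAM, and buffer overheads."""
    return ram_bytes / connections


def servers_needed(peak_connections: int, per_server: int, headroom: float) -> int:
    """Servers needed so that none runs above (1 - headroom) of its capacity at peak.

    headroom is a hypothetical planning knob: 0.25 leaves a quarter of every server
    free to absorb a spike or the load of a failed neighbor.
    """
    if not 0.0 <= headroom < 1.0:
        raise ValueError("headroom must be in [0, 1)")
    usable = int(per_server * (1.0 - headroom))
    if usable <= 0:
        raise ValueError("no usable capacity per server")
    return -(-peak_connections // usable)  # integer ceiling division


if __name__ == "__main__":
    # Article figures: 100 GB of RAM and 2 million connections on one server.
    print(memory_per_connection_bytes(100 * 10**9, 2_000_000))  # 50000.0 (about 50 KB)
    # Hypothetical peak of 10 million connections with 25% headroom per server.
    print(servers_needed(10_000_000, 2_000_000, 0.25))  # 7
```

The roughly 50 KB per connection is a ceiling, not a measurement. It shows why per‑connection memory in the BEAM and the kernel had to stay small for 2 million sessions to fit on one machine.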

7. Tools and Techniques for Enhancing Scalability

System activity reporting tool (wsar) that records OS, hardware, BEAM, and process metrics.

Hardware performance counters (pmcstat) to measure emulator CPU usage.

DTrace, kernel lock counting, and fprof (Erlang's profiler) for debugging and profiling.

Various measurements and synthetic workloads to emulate production traffic.

Hot‑loading of Erlang code to apply changes without downtime.

Patch‑based enhancements to BEAM, Mnesia, and the network stack.
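The measurement‑driven approach these tools support can be sketched as a sar‑style recorder. It polls named metric sources, keeps timestamped snapshots, and derives per‑second rates from counters. This does not show how wsar works internally. The class, the metric names, and the sources are hypothetical stand‑ins for the OS, hardware, and BEAM counters the tool reportedly collects.

```python
import time
from typing import Callable, Dict, List, Tuple

# One recorded sample: (timestamp in seconds, {metric name: value}).
Sample = Tuple[float, Dict[str, float]]


class ActivityRecorder:
    """Poll a set of named metric sources and keep timestamped snapshots."""

    def __init__(self, sources: Dict[str, Callable[[], float]],
                 clock: Callable[[], float] = time.time) -> None:
        self._sources = sources
        self._clock = clock
        self.samples: List[Sample] = []

    def sample(self) -> Sample:
        """Read every source once and append the snapshot to the history."""
        record = (self._clock(), {name: read() for name, read in self._sources.items()})
        self.samples.append(record)
        return record

    def rate(self, name: str) -> float:
        """Per-second rate of a monotonically increasing counter over the last two samples."""
        if len(self.samples) < 2:
            raise ValueError("need at least two samples to compute a rate")
        (t0, before), (t1, after) = self.samples[-2], self.samples[-1]
        return (after[name] - before[name]) / (t1 - t0)


if __name__ == "__main__":
    # time.process_time is a real cumulative CPU-time counter for this process.
    recorder = ActivityRecorder({"cpu_seconds": time.process_time})
    recorder.sample()
    sum(i * i for i in range(200_000))  # burn a little CPU between samples
    time.sleep(0.1)
    recorder.sample()
    print(f"CPU utilization: {recorder.rate('cpu_seconds'):.0%}")
```

Recording raw counters and computing rates afterwards keeps the sampler cheap. It also lets the same history be re‑analyzed later when a new bottleneck is suspected.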

8. Lessons Learned

Optimization is arduous and requires continuous tooling, testing, and data‑driven iteration.

Accurate measurement and bottleneck elimination are essential for scaling.

Erlang proved to be a robust, high‑performance platform, although it required extensive tuning.

Keeping the system simple, avoiding ads, and focusing on user privacy contributed to rapid adoption.

Identity tied to phone numbers simplifies design but imposes constraints.

Gradual, purposeful redundancy ensures availability during staff vacations.
