
WhatsApp Scaling Architecture: Lessons from Two Years of Growth

Over the past two years WhatsApp has dramatically expanded its user base, hardware footprint, and traffic while keeping its engineering team tiny. This article covers the resulting lessons: massive-scale challenges, Erlang-based distributed design, Mnesia database bottlenecks, decoupling strategies, and the operational patches required to keep the service reliable.


Two-Year Leap

WhatsApp's scale today is on a different order from two years ago; this article summarizes the major changes observed over that period.

1. Despite massive growth in hosts, data centers, memory, users, and scalability challenges, the engineering team remains at about ten engineers, each serving roughly 40 million users; the cloud-era separation of development and operations makes this possible.

2. Earlier, the goal was to squeeze as many connections as possible onto a single server; the system has since moved past that phase, though it still controls host count and keeps improving SMP efficiency.

3. The architecture now focuses on throughput, caching, and sharding rather than storing large media formats.

4. Erlang remains the core language for the distributed system, praised throughout.

5. Mnesia, the Erlang database, has become a major source of problems, raising questions about over‑reliance on Erlang.

6. The massive scale introduces numerous issues: maintaining millions of connections, long priority queues, timers, code performance under varied loads, high‑priority message starvation, operation interference, resource failures, and cross‑platform compatibility.

7. Rick Reed's problem-finding and resolution skills are highlighted as exceptional.

Statistics

465 million monthly users

19 billion messages received daily, 40 billion sent

600 million images, 200 million voice clips, 100 million videos

Peak concurrent connections: 147 million (connected phones)

Peak login operations per second: 230 k

Peak inbound/outbound messages per second: 324 k / 712 k

~10 engineers handling both development and operations

Holiday peaks: Christmas Eve outbound traffic hit 146 Gb/s with 360 million video downloads; New Year's Eve saw 2 billion image downloads, with a single image downloaded 32 million times

Stack

Erlang R16B01 (with custom patches)

FreeBSD 9.2

Mnesia (database)

Yaws web server

SoftLayer cloud services and bare‑metal servers

Hardware

~550 servers plus backups

~150 chat servers (≈1 million phones each, 150 million connections at peak)

~250 multimedia servers

Dual Intel Xeon E5-2690v2 (Ivy Bridge) 10-core CPUs per server (40 hyper-threads total)

Database nodes with 512 GB RAM

Standard compute nodes with 64 GB RAM

SSDs for reliability and video storage when needed

Dual‑link GigE (public user‑facing, private backend)

Erlang workloads run across more than 11,000 cores in total

System Overview

Erlang is favored for its SMP scalability and suitability for small engineering teams.

Rapid code updates are possible.

Scalability challenges are discovered and solved before they cause outages; major events (e.g., football matches) act as stress tests.

Classic layout: mobile clients connect to multimedia (MMS) servers and chat servers; chat servers use transient offline storage, route messages through backend control, and access the databases (Account, Profile, Push, Group, etc.).

In-memory Mnesia spans ~2 TB of RAM across 16 shards and holds ~18 billion records, storing only active messages and media.

Server capacity per node has shifted from 2 million to 1 million concurrent connections, with more functions now consolidated onto each server.

Decoupling

Isolate bottlenecks to prevent system‑wide failures.

Avoid tight coupling; separate front‑end and back‑end.

Maintain high throughput during issue resolution.

Use asynchronous processing to minimize latency.

Prevent head‑of‑line (HOL) blocking by separating read/write queues and node‑internal queues.

Employ FIFO models for uncertain latency scenarios.
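The read/write queue separation above is the key to avoiding head-of-line blocking. A minimal sketch of the idea, in Python as a stand-in for WhatsApp's separate Erlang processes (the `run_partitioned` helper and its operation tuples are illustrative, not the production API):

```python
from queue import Queue
from threading import Thread

def worker(q, handler, results):
    """Drain one queue, applying handler to each item; None terminates."""
    while True:
        item = q.get()
        if item is None:
            q.task_done()
            break
        results.append(handler(item))
        q.task_done()

def run_partitioned(ops):
    """Route reads and writes to independent queues with independent
    workers, so a slow write can never head-of-line-block pending reads."""
    read_q, write_q = Queue(), Queue()
    reads, writes = [], []
    t_read = Thread(target=worker, args=(read_q, lambda k: ("read", k), reads))
    t_write = Thread(target=worker, args=(write_q, lambda k: ("write", k), writes))
    t_read.start()
    t_write.start()
    for kind, key in ops:
        (read_q if kind == "read" else write_q).put(key)
    read_q.put(None)   # sentinel: shut workers down
    write_q.put(None)
    t_read.join()
    t_write.join()
    return reads, writes
```

If writes stall (say, offline storage is slow), only the write queue backs up; reads keep flowing, which is exactly the decoupling property the list above describes.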

Parallelism

Task distribution across >11 000 cores using gen_server, gen_factory, and a higher‑level gen_industry for parallel intake.

Service partitioning into 2–32 segments, using pg2 for distributed process groups and shard addressing.

Limit concurrent access to any single ets table or Mnesia fragment to control lock contention.
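The pg2-based shard addressing can be approximated as a stable hash from a key to one of the 2–32 partitions, with the process-group name derived from the shard index. A Python sketch (the hashing scheme and name format here are assumptions for illustration; the real system uses Erlang's pg2 process groups):

```python
import hashlib

def shard_for(key: str, n_shards: int) -> int:
    """Map a key to one of n_shards partitions (2-32 in the article).
    A stable hash ensures every node computes the same shard for a key."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_shards

def group_name(service: str, key: str, n_shards: int) -> str:
    """pg2-style process-group name for the shard that owns this key,
    e.g. 'offline_5' (name format is illustrative)."""
    return f"{service}_{shard_for(key, n_shards)}"
```

Any node can then look up the group's members and dispatch to the owning shard without central coordination.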

Optimization

Added a write‑back cache achieving 98 % hit rate, reducing offline storage bottlenecks.

Patched the BEAM VM for asynchronous file I/O to alleviate mailbox and disk contention.

Isolated large mailboxes from cache to prevent heavy‑user impact.

Increased Mnesia fragments and split account tables into 512 shards (“islands”) to improve access speed.

Addressed hash-bucket overflow, reducing access overhead from a factor of 4 to a factor of 1.
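The write-back cache with its 98% hit rate works by absorbing writes in memory and flushing to the slow backing store only on eviction. A minimal sketch of that pattern, assuming a simple LRU dict (not WhatsApp's implementation, which fronts Mnesia/disk):

```python
from collections import OrderedDict

class WriteBackCache:
    """Write-back cache sketch: writes land in memory and reach the
    backing store only when evicted, keeping the slow store off the
    hot path."""

    def __init__(self, capacity, store):
        self.capacity = capacity
        self.store = store            # backing dict standing in for disk
        self.cache = OrderedDict()    # LRU order: oldest first
        self.dirty = set()
        self.hits = self.misses = 0

    def put(self, key, value):
        self.cache[key] = value
        self.cache.move_to_end(key)
        self.dirty.add(key)
        if len(self.cache) > self.capacity:
            old_key, old_val = self.cache.popitem(last=False)
            if old_key in self.dirty:          # flush dirty entry on eviction
                self.store[old_key] = old_val
                self.dirty.discard(old_key)

    def get(self, key):
        if key in self.cache:
            self.hits += 1
            self.cache.move_to_end(key)
            return self.cache[key]
        self.misses += 1
        return self.store.get(key)             # fall through to slow store
```

With a working set that fits in the cache, nearly all reads are served from memory, which is what the reported 98% hit rate reflects.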

Patches

Multiple timer wheels to reduce lock contention from millions of per‑second timers.

Patched mnesia_tm to apply transaction back-pressure.

Added multiple async_dirty senders for Mnesia.

Optimized ets tables and prevented excessive dump queues.
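The multiple-timer-wheel patch spreads timers over several wheels, each with its own lock, so millions of timer operations per second stop contending on one global lock. The real patch is C inside the BEAM VM; this Python sketch (class name and API invented for illustration) shows only the sharding idea:

```python
import threading

class ShardedTimers:
    """Shard timers across several wheels, each guarded by its own
    lock, so concurrent inserts contend far less than with a single
    global wheel (an illustrative sketch of the BEAM patch's idea)."""

    def __init__(self, n_wheels=8):
        self.wheels = [{} for _ in range(n_wheels)]
        self.locks = [threading.Lock() for _ in range(n_wheels)]

    def _index(self, timer_id):
        return hash(timer_id) % len(self.wheels)

    def start(self, timer_id, deadline):
        i = self._index(timer_id)
        with self.locks[i]:                    # only one wheel is locked
            self.wheels[i][timer_id] = deadline

    def cancel(self, timer_id):
        i = self._index(timer_id)
        with self.locks[i]:
            return self.wheels[i].pop(timer_id, None)

    def expire(self, now):
        """Collect and remove every timer whose deadline has passed."""
        fired = []
        for i, wheel in enumerate(self.wheels):
            with self.locks[i]:
                due = [t for t, d in wheel.items() if d <= now]
                for t in due:
                    del wheel[t]
                fired.extend(due)
        return fired
```

Two timers that hash to different wheels can be started or cancelled concurrently with no shared lock, which is the contention reduction the patch is after.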

Feb 22 Outage

A 210-minute outage occurred shortly after the Facebook acquisition, caused by a backend routing failure compounded by an over-coupled subsystem.

Router failure crippled a LAN, leading to massive node disconnects/reconnects and unprecedented instability.

pg2 generated n³ messages during reconnection, spiking queues to 4 million, prompting a dedicated patch.

Feature Release

Features are rolled out gradually: first under low traffic, then iterated quickly before wider deployment.

Updates are rolling; full BEAM upgrades require node‑by‑node restarts, with hot‑patches being rare.

Tags: distributed systems, operations, scalability, Erlang, Mnesia, WhatsApp
Written by

Art of Distributed System Architecture Design

Introductions to large-scale distributed system architectures; insights and knowledge sharing on large-scale internet system architecture; front-end web architecture overviews; practical tips and experiences with PHP, JavaScript, Erlang, C/C++ and other languages in large-scale internet system development.
