
Why a Thread‑Only Model Struggles to Reach Million‑Level Concurrency on a Single Machine

The article analyzes why relying solely on operating‑system threads cannot easily achieve single‑machine million‑level concurrency, examining thread stack memory misconceptions, kernel‑level context‑switch costs, and how user‑space coroutine scheduling overcomes these limits.

IT Services Circle

The well‑known C10K problem—how a single server can handle ten thousand concurrent connections—spurred the creation of I/O multiplexing mechanisms such as epoll and kqueue.
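As a concrete sketch, Python's standard selectors module wraps exactly these mechanisms (epoll on Linux, kqueue on BSD/macOS). The illustrative snippet below watches a single socket for readability the same way an event loop would watch thousands; the socketpair stands in for a real client connection:

```python
import selectors
import socket

# DefaultSelector picks the best multiplexer available:
# epoll on Linux, kqueue on BSD/macOS, select as a fallback.
sel = selectors.DefaultSelector()

# A socketpair stands in for a real client connection in this sketch.
server_side, client_side = socket.socketpair()
server_side.setblocking(False)

# Register interest in readability; one selector can monitor
# thousands of sockets with a single OS call per loop iteration.
sel.register(server_side, selectors.EVENT_READ)

client_side.sendall(b"hello")

# One iteration of an event loop: block until some socket is ready.
events = sel.select(timeout=1.0)
for key, mask in events:
    data = key.fileobj.recv(1024)
    print(data)  # b'hello'

sel.unregister(server_side)
server_side.close()
client_side.close()
```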

As applications push toward even higher concurrency, the traditional thread model shows its limits, especially when trying to reach a million concurrent tasks on one machine.

Many sources claim that a thread’s stack occupies megabytes while a coroutine’s stack is only kilobytes, implying that threads would exhaust memory at scale. In reality, the stack size reported by the OS is virtual memory: Linux, for example, defaults to an 8 MB virtual stack per thread, which quickly consumes the 4 GB address space on 32‑bit systems (8 MB × 512 ≈ 4 GB). Even on 64‑bit systems, the number of threads is still bounded by kernel parameters such as vm.max_map_count.

If those limits are raised and a thread’s stack touches only about 1 KB of physical memory during execution, its resident footprint is comparable to a coroutine’s, so stack memory itself is not the fundamental bottleneck.

The remaining major issue is context‑switch overhead. Thread switching relies on preemptive scheduling in the kernel: each switch traps into kernel mode, saves and restores a full execution context, and may involve lock contention. For a million‑level thread workload, the CPU time spent on these kernel transitions alone becomes a performance bottleneck.
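A rough way to feel this cost is a ping‑pong micro‑benchmark in which two threads hand control back and forth through kernel‑visible events. The sketch below is illustrative only, not a rigorous measurement, and the number it prints will vary by machine:

```python
import threading
import time

# Each round trip below forces at least two kernel-level context
# switches (wake the peer, then block ourselves), so the measured
# per-round cost is dominated by trips through the kernel scheduler.
N = 10_000
ping, pong = threading.Event(), threading.Event()
rounds = 0

def player():
    global rounds
    for _ in range(N):
        ping.wait()
        ping.clear()
        rounds += 1
        pong.set()

t = threading.Thread(target=player)
t.start()

start = time.perf_counter()
for _ in range(N):
    ping.set()
    pong.wait()
    pong.clear()
elapsed = time.perf_counter() - start
t.join()

print(f"{elapsed / N * 1e6:.1f} us per round trip")
```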

Coroutines, by contrast, perform scheduling entirely in user space. Switching a coroutine requires saving only a few registers, avoiding any kernel trap, which dramatically reduces the cost of a context switch.
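A minimal user‑space scheduler can be sketched with Python generators, where "switching" is nothing more than suspending and resuming a frame:

```python
from collections import deque

trace = []

def task(name, steps):
    # A generator-based coroutine: suspending at `yield` saves only
    # this frame's instruction pointer and locals -- no kernel trap.
    for i in range(steps):
        trace.append(f"{name}{i}")
        yield  # cooperative switch point: hand control to the scheduler

def run(tasks):
    """Round-robin scheduler living entirely in user space."""
    ready = deque(tasks)
    while ready:
        t = ready.popleft()
        try:
            next(t)          # resume the coroutine until its next yield
            ready.append(t)  # still running: requeue it
        except StopIteration:
            pass             # finished: drop it

run([task("A", 2), task("B", 3)])
print(trace)  # ['A0', 'B0', 'A1', 'B1', 'B2']
```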

Traditional threads use preemptive scheduling: the OS can interrupt a thread at any moment. This is necessary for a general‑purpose OS but incurs high switching costs. Coroutines adopt cooperative scheduling, voluntarily yielding control only at well‑defined points:

Executing an I/O operation (e.g., a network request)

Explicitly invoking a yield function

Waiting for a lock or other synchronization primitive
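In Python's asyncio, each of these situations corresponds to an await point. The sketch below shows an I/O‑style wait (asyncio.sleep) and a lock wait acting as the only places where control changes hands:

```python
import asyncio

order = []

async def worker(name, lock):
    order.append(f"{name} start")
    await asyncio.sleep(0)   # I/O-style wait: yields to the event loop
    async with lock:         # waiting on a lock is also a switch point
        order.append(f"{name} locked")

async def main():
    lock = asyncio.Lock()
    # gather schedules A then B; the event loop resumes them in FIFO
    # order at each await, so the interleaving is fully deterministic.
    await asyncio.gather(worker("A", lock), worker("B", lock))

asyncio.run(main())
print(order)  # ['A start', 'B start', 'A locked', 'B locked']
```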

The advantages of cooperative scheduling are:

Predictable switch points: coroutine switches occur at clearly defined locations in the code.

Elimination of unnecessary switches: a switch happens only when the coroutine truly needs to wait, reducing wasted context changes.

Simplified synchronization: many cases can avoid complex lock mechanisms.
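The third point can be made concrete: under cooperative scheduling, a read‑modify‑write sequence that contains no switch point is effectively atomic with respect to other coroutines, so the counter in this asyncio sketch needs no lock:

```python
import asyncio

counter = 0

async def bump(times):
    global counter
    for _ in range(times):
        # No await between the read and the write, so no other
        # coroutine can run in between: this read-modify-write
        # needs no lock.
        counter += 1
        await asyncio.sleep(0)  # switch point *after* the update

async def main():
    # Ten concurrent coroutines, 100 increments each.
    await asyncio.gather(*(bump(100) for _ in range(10)))

asyncio.run(main())
print(counter)  # 1000
```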

Thus, it is not that threads are inherently weak; rather, in I/O‑intensive scenarios, coroutines reshape the rules and make single‑machine million‑level concurrency much more attainable.

Tags: concurrency, high concurrency, coroutines, I/O multiplexing, threads, context switching
Written by

IT Services Circle

Delivering cutting-edge internet insights and practical learning resources. We're a passionate and principled IT media platform.
