
Why a Thread‑Only Model Struggles to Reach Million‑Level Concurrency on a Single Machine

The article analyzes why relying solely on operating‑system threads cannot easily achieve single‑machine million‑level concurrency, examining thread stack memory misconceptions, kernel‑level context‑switch costs, and how user‑space coroutine scheduling overcomes these limits.

IT Services Circle

The well‑known C10K problem—how a single server can handle ten thousand concurrent connections—spurred the creation of I/O multiplexing mechanisms such as epoll and kqueue.
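As a concrete sketch, Python's standard selectors module wraps exactly these mechanisms (epoll on Linux, kqueue on BSD/macOS). The illustrative snippet below watches a single socket for readability the same way an event loop would watch thousands; the socketpair stands in for a real client connection:

```python
import selectors
import socket

# DefaultSelector picks the best multiplexer available:
# epoll on Linux, kqueue on BSD/macOS, select as a fallback.
sel = selectors.DefaultSelector()

# A socketpair stands in for a real client connection in this sketch.
server_side, client_side = socket.socketpair()
server_side.setblocking(False)

# Register interest in readability; one selector can monitor
# thousands of sockets with a single OS call per loop iteration.
sel.register(server_side, selectors.EVENT_READ)

client_side.sendall(b"hello")

# One iteration of an event loop: block until some socket is ready.
events = sel.select(timeout=1.0)
for key, mask in events:
    data = key.fileobj.recv(1024)
    print(data)  # b'hello'

sel.unregister(server_side)
server_side.close()
client_side.close()
```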

As applications push toward even higher concurrency, the traditional thread model shows its limits, especially when trying to reach a million concurrent tasks on one machine.

Many sources claim that a thread’s stack occupies megabytes while a coroutine’s stack is only kilobytes, implying that threads would exhaust memory at scale. In reality, the stack size reported by the OS is virtual memory: Linux, for example, defaults to an 8 MB virtual stack per thread, which quickly consumes the 4 GB address space on 32‑bit systems (8 MB × 512 ≈ 4 GB). Even on 64‑bit systems, the number of threads is still bounded by kernel parameters such as vm.max_map_count.

If those limits are raised and a thread’s stack touches only about 1 KB of physical memory during execution, its resident footprint is comparable to a coroutine’s, so stack memory itself is not the fundamental bottleneck.

The remaining major issue is context‑switch overhead. Thread switching relies on preemptive scheduling in the kernel: each switch traps into kernel mode, saves and restores a full execution context, and may involve lock contention. For a million‑level thread workload, the CPU time spent on these kernel transitions alone becomes a performance bottleneck.
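A rough way to feel this cost is a ping‑pong micro‑benchmark in which two threads hand control back and forth through kernel‑visible events. The sketch below is illustrative only, not a rigorous measurement, and the number it prints will vary by machine:

```python
import threading
import time

# Each round trip below forces at least two kernel-level context
# switches (wake the peer, then block ourselves), so the measured
# per-round cost is dominated by trips through the kernel scheduler.
N = 10_000
ping, pong = threading.Event(), threading.Event()
rounds = 0

def player():
    global rounds
    for _ in range(N):
        ping.wait()
        ping.clear()
        rounds += 1
        pong.set()

t = threading.Thread(target=player)
t.start()

start = time.perf_counter()
for _ in range(N):
    ping.set()
    pong.wait()
    pong.clear()
elapsed = time.perf_counter() - start
t.join()

print(f"{elapsed / N * 1e6:.1f} us per round trip")
```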

Coroutines, by contrast, perform scheduling entirely in user space. Switching a coroutine requires saving only a few registers, avoiding any kernel trap, which dramatically reduces the cost of a context switch.
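A minimal user‑space scheduler can be sketched with Python generators, where "switching" is nothing more than suspending and resuming a frame:

```python
from collections import deque

trace = []

def task(name, steps):
    # A generator-based coroutine: suspending at `yield` saves only
    # this frame's instruction pointer and locals -- no kernel trap.
    for i in range(steps):
        trace.append(f"{name}{i}")
        yield  # cooperative switch point: hand control to the scheduler

def run(tasks):
    """Round-robin scheduler living entirely in user space."""
    ready = deque(tasks)
    while ready:
        t = ready.popleft()
        try:
            next(t)          # resume the coroutine until its next yield
            ready.append(t)  # still running: requeue it
        except StopIteration:
            pass             # finished: drop it

run([task("A", 2), task("B", 3)])
print(trace)  # ['A0', 'B0', 'A1', 'B1', 'B2']
```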

Traditional threads use preemptive scheduling: the OS can interrupt a thread at any moment. This is necessary for a general‑purpose OS but incurs high switching costs. Coroutines adopt cooperative scheduling, voluntarily yielding control only at well‑defined points:

Executing an I/O operation (e.g., a network request)

Explicitly invoking a yield function

Waiting for a lock or other synchronization primitive
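In Python's asyncio, each of these situations corresponds to an await point. The sketch below shows an I/O‑style wait (asyncio.sleep) and a lock wait acting as the only places where control changes hands:

```python
import asyncio

order = []

async def worker(name, lock):
    order.append(f"{name} start")
    await asyncio.sleep(0)   # I/O-style wait: yields to the event loop
    async with lock:         # waiting on a lock is also a switch point
        order.append(f"{name} locked")

async def main():
    lock = asyncio.Lock()
    # gather schedules A then B; the event loop resumes them in FIFO
    # order at each await, so the interleaving is fully deterministic.
    await asyncio.gather(worker("A", lock), worker("B", lock))

asyncio.run(main())
print(order)  # ['A start', 'B start', 'A locked', 'B locked']
```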

The advantages of cooperative scheduling are:

Predictable switch points: coroutine switches occur at clearly defined locations in the code.

Elimination of unnecessary switches: a switch happens only when the coroutine truly needs to wait, reducing wasted context changes.

Simplified synchronization: many cases can avoid complex lock mechanisms.
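The third point can be made concrete: under cooperative scheduling, a read‑modify‑write sequence that contains no switch point is effectively atomic with respect to other coroutines, so the counter in this asyncio sketch needs no lock:

```python
import asyncio

counter = 0

async def bump(times):
    global counter
    for _ in range(times):
        # No await between the read and the write, so no other
        # coroutine can run in between: this read-modify-write
        # needs no lock.
        counter += 1
        await asyncio.sleep(0)  # switch point *after* the update

async def main():
    # Ten concurrent coroutines, 100 increments each.
    await asyncio.gather(*(bump(100) for _ in range(10)))

asyncio.run(main())
print(counter)  # 1000
```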

Thus, it is not that threads are inherently weak; rather, in I/O‑intensive scenarios, coroutines reshape the rules and make single‑machine million‑level concurrency much more attainable.

Tags: concurrency, high concurrency, coroutines, I/O multiplexing, threads, context switching
Written by

IT Services Circle

Delivering cutting-edge internet insights and practical learning resources. We're a passionate and principled IT media platform.
