Optimizing CPython for True Parallel Execution: Implementing a Multi-Interpreter Architecture
This article details a novel approach to overcoming CPython's Global Interpreter Lock by implementing a multi-interpreter architecture that isolates execution states, manages shared variables through thread-specific data, and introduces a subinterpreter pool to significantly enhance multi-core CPU utilization and algorithm execution performance.
In business scenarios, algorithm packages executed via CPython cannot use multiple CPU cores within a single process because of the Global Interpreter Lock (GIL). To address this, we modified CPython to deliver a highly complete implementation of parallel execution, significantly improving the performance of Python algorithm packages within a single process. Completed in 2020, this implementation is the industry's first CPython3 parallel solution that fully maintains compatibility with the Python C API. It costs 7.7% in single-thread performance, while each additional thread reduces execution time by 44% with minimal lock contention.
The GIL prevents true parallel execution of Python bytecode to ensure thread safety. Removing it traditionally requires fine-grained locking, which severely degrades single-thread performance, while isolating interpreter states involves massive refactoring of global variables. To overcome these challenges, we implemented a multi-interpreter architecture based on CPython 3.10. This design transitions from a global interpreter state to isolated interpreter structures, each holding its own execution state and independent GIL. By leveraging Thread Specific Data (TSD) to retrieve interpreter states, we enable parallel execution without modifying function signatures, effectively utilizing modern multi-core CPUs.
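The difference between one global GIL and one GIL per interpreter can be modeled in plain Python. The sketch below is only an analogy, not CPython code: `Interp` and `both_hold_gil_at_once` are hypothetical names, and an ordinary `threading.Lock` stands in for a GIL. It checks whether two threads can be inside their interpreter's lock at the same moment.

```python
import threading


class Interp:
    """Hypothetical stand-in for an interpreter that owns a GIL."""

    def __init__(self, name, gil=None):
        self.name = name
        self.gil = gil if gil is not None else threading.Lock()


def both_hold_gil_at_once(interp_a, interp_b, timeout=1.0):
    """Return True if two threads can hold their GILs at the same moment."""
    barrier = threading.Barrier(2)
    overlapped = []

    def body(interp):
        with interp.gil:
            try:
                # Passing the barrier requires both threads to be inside
                # their `with interp.gil` block simultaneously.
                barrier.wait(timeout=timeout)
                overlapped.append(interp.name)
            except threading.BrokenBarrierError:
                pass  # the other thread never got in: execution was serialized

    threads = [threading.Thread(target=body, args=(i,))
               for i in (interp_a, interp_b)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return len(overlapped) == 2


# Independent GILs: both threads run inside their own lock at once.
overlap_isolated = both_hold_gil_at_once(Interp("a"), Interp("b"))

# One shared GIL (the classic model): the second thread is locked out.
shared_lock = threading.Lock()
overlap_shared = both_hold_gil_at_once(Interp("a", shared_lock),
                                       Interp("b", shared_lock))
```

With separate locks the two threads overlap; with a single shared lock they cannot, which is the serialization the multi-interpreter design is meant to remove.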
A major hurdle was isolating approximately 1,000 shared variables (e.g., free lists, singletons, caches, interned strings) that exist as global variables in CPython. Instead of passing interpreter state pointers through every function, we store these variables within the interpreter state structure (PyInterpreterState) and access them via TSD. This approach minimizes performance overhead while ensuring thread safety. For example, code that needs the Bigint freelist first obtains the current interpreter through a helper that reads the thread state from TSD and follows its interp pointer: static inline PyInterpreterState* _PyInterpreterState_GET(void) { PyThreadState *tstate = _PyThreadState_GET(); return tstate->interp; }
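The same lookup pattern can be illustrated at the Python level. This is a minimal sketch under stated assumptions, not the actual implementation: `InterpreterState`, `bind_interpreter`, `current_interpreter`, and `intern_string` are hypothetical names, and `threading.local` plays the role of Thread Specific Data. The point is that `intern_string` takes no interpreter argument, yet each thread touches only its own interpreter's table.

```python
import threading


class InterpreterState:
    """Hypothetical per-interpreter structure holding former globals."""

    def __init__(self, name):
        self.name = name
        self.bigint_freelist = []  # would be a process-wide global otherwise
        self.interned = {}         # per-interpreter interned strings


# threading.local stands in for Thread Specific Data (TSD).
_tsd = threading.local()


def bind_interpreter(interp):
    """Attach an interpreter to the calling thread."""
    _tsd.interp = interp


def current_interpreter():
    """Look up the calling thread's interpreter without any argument."""
    return _tsd.interp


def intern_string(s):
    # The signature is unchanged: the interpreter comes from thread-local state.
    return current_interpreter().interned.setdefault(s, s)


def run_in_interpreter(interp, words, out):
    bind_interpreter(interp)
    for word in words:
        intern_string(word)
    out[interp.name] = sorted(current_interpreter().interned)


results = {}
interp_a, interp_b = InterpreterState("a"), InterpreterState("b")
threads = [
    threading.Thread(target=run_in_interpreter,
                     args=(interp_a, ["x", "y"], results)),
    threading.Thread(target=run_in_interpreter,
                     args=(interp_b, ["z"], results)),
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Because the state travels with the thread rather than through parameters, existing call sites do not need to change, which is the property the text relies on for C API compatibility.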
Handling shared type objects exposed via the Python C API required careful consideration to maintain backward compatibility. Two kinds of operations on them are not thread-safe: reference count modifications and changes to member variables. We addressed both by making type objects immortal, applying locks only to low-frequency unsafe operations, and marking frequently used descriptor members as immortal. Although this introduces minor memory leaks, they remain manageable due to CPython's module caching mechanism. Additionally, we replaced the default pymalloc memory pool with mimalloc, improving performance by 1-2% while eliminating the need to handle pymalloc's shared state.
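The idea behind immortal objects can be shown with a toy reference-counting model. The source does not say how immortality is encoded, so the sentinel value and the names `Obj`, `incref`, and `decref` below are assumptions made purely for illustration. What the sketch shows is that reference count operations on an immortal object never write to it, so concurrent interpreters cannot race on its count, and that the object is never freed.

```python
# Hypothetical sentinel marking an object as immortal (an assumption).
IMMORTAL_REFCNT = 1 << 62


class Obj:
    """Toy object with a manual reference count."""

    def __init__(self, name, immortal=False):
        self.name = name
        self.refcnt = IMMORTAL_REFCNT if immortal else 1
        self.freed = False


def is_immortal(obj):
    return obj.refcnt == IMMORTAL_REFCNT


def incref(obj):
    if is_immortal(obj):
        return  # no write, so nothing for other threads to race on
    obj.refcnt += 1


def decref(obj):
    if is_immortal(obj):
        return  # never reaches zero, so it is never freed (a bounded leak)
    obj.refcnt -= 1
    if obj.refcnt == 0:
        obj.freed = True


shared_type = Obj("int_type", immortal=True)
ordinary = Obj("temp")

for _ in range(1000):
    incref(shared_type)
    decref(shared_type)
decref(shared_type)  # even an unbalanced decref leaves it alive

incref(ordinary)
decref(ordinary)
decref(ordinary)     # the last reference is dropped, so it is freed
```

The trade-off matches the one described above: skipping the count avoids locking on the hot path, at the cost of never reclaiming the immortal object.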
To enhance usability, we extended the official subinterpreter module by adding capabilities to execute compiled .pyc files and arbitrary functions. The execution model involves creating a subinterpreter from the main interpreter, switching thread-specific states, passing arguments via shared memory, executing code, and returning results. We implemented a SubInterpreterPoolExecutor inspired by Python's concurrent.futures, which manages a pool of threads, each hosting a dedicated subinterpreter, providing a high-level asynchronous execution interface.
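The shape of such a pool can be sketched with standard-library pieces. This is not the SubInterpreterPoolExecutor itself: `FakeSubInterpreter` and `SubInterpreterPoolSketch` are hypothetical names, and the fake interpreter simply calls the function in the worker thread, whereas a real one would switch thread-specific state and run in an isolated interpreter. Only the structure is illustrated: one dedicated interpreter per worker thread, with results delivered through `concurrent.futures.Future`.

```python
import queue
import threading
from concurrent.futures import Future


class FakeSubInterpreter:
    """Placeholder: a real subinterpreter would own isolated state and a GIL."""

    def __init__(self, ident):
        self.ident = ident

    def run(self, fn, args, kwargs):
        return fn(*args, **kwargs)


class SubInterpreterPoolSketch:
    def __init__(self, max_workers=2):
        self._tasks = queue.Queue()
        self._threads = [
            threading.Thread(target=self._worker, args=(i,), daemon=True)
            for i in range(max_workers)
        ]
        for t in self._threads:
            t.start()

    def _worker(self, ident):
        interp = FakeSubInterpreter(ident)  # one interpreter per thread
        while True:
            item = self._tasks.get()
            if item is None:  # shutdown signal
                break
            future, fn, args, kwargs = item
            if not future.set_running_or_notify_cancel():
                continue  # the task was cancelled before it started
            try:
                future.set_result(interp.run(fn, args, kwargs))
            except BaseException as exc:
                future.set_exception(exc)

    def submit(self, fn, *args, **kwargs):
        future = Future()
        self._tasks.put((future, fn, args, kwargs))
        return future

    def shutdown(self):
        for _ in self._threads:
            self._tasks.put(None)
        for t in self._threads:
            t.join()


pool = SubInterpreterPoolSketch(max_workers=2)
futures = [pool.submit(pow, n, 2) for n in range(5)]
squares = [f.result() for f in futures]
pool.shutdown()
```

Keeping each interpreter pinned to one thread means the thread-specific state is set once per worker rather than once per task.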
Given potential compatibility issues with third-party C/C++ extensions and certain Python modules, we developed an external dispatcher module that automatically routes execution to either the main interpreter (for compatibility) or subinterpreters (for performance). For high-performance scenarios requiring direct execution, we provided C APIs like GILGuard to bind execution directly to a specific interpreter, bypassing the overhead of cross-interpreter communication. This comprehensive optimization framework has passed CPython unit tests and is fully deployed in production environments.
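A routing rule of this kind might look like the following sketch. The source does not describe how the dispatcher decides, so the denylist approach, the package name `legacy_ext`, and the `Dispatcher` class are all assumptions for illustration. Two `ThreadPoolExecutor` instances stand in for the main interpreter and the subinterpreter pool.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical denylist: top-level packages assumed unsafe in subinterpreters.
MAIN_ONLY_PACKAGES = frozenset({"legacy_ext"})


class Dispatcher:
    def __init__(self, main_runner, sub_runner, main_only=MAIN_ONLY_PACKAGES):
        self._runners = {"main": main_runner, "sub": sub_runner}
        self._main_only = main_only

    def route(self, fn):
        """Pick "main" for denylisted packages, "sub" for everything else."""
        top_level = (getattr(fn, "__module__", None) or "").split(".")[0]
        return "main" if top_level in self._main_only else "sub"

    def submit(self, fn, *args, **kwargs):
        return self._runners[self.route(fn)].submit(fn, *args, **kwargs)


def safe_task(x):
    return x + 1


def legacy_task(x):
    return x * 2


# Pretend this function comes from an incompatible C extension package.
legacy_task.__module__ = "legacy_ext.core"

with ThreadPoolExecutor(max_workers=1) as main_runner, \
        ThreadPoolExecutor(max_workers=2) as sub_runner:
    dispatcher = Dispatcher(main_runner, sub_runner)
    routes = [dispatcher.route(safe_task), dispatcher.route(legacy_task)]
    outputs = [dispatcher.submit(safe_task, 1).result(),
               dispatcher.submit(legacy_task, 21).result()]
```

Callers use a single `submit` entry point either way, so code that is known to be safe gets the parallel path without the caller having to choose.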
ByteDance Terminal Technology
Official account of ByteDance Terminal Technology, sharing technical insights and team updates.