Refactoring a Decade-Old Query Optimizer: Architecture, DIFF Fixes, Performance Gains, and Stability Improvements
Tencent engineers completely rewrote a ten‑year‑old query optimizer, shrinking the codebase by about 80% and replacing its monolithic thread pool with a tRPC‑Fiber DAG scheduler. The rewrite cut average latency by 28%, lifted throughput by roughly 14%, reduced startup time from 18 minutes to five, saved 40 GB of memory, fixed numerous stability bugs, raised unit‑test coverage above 60%, and brought new‑feature lead time under one day.
Recently, our team took over and completely refactored a Query Optimizer (QO) module that had been in production for more than ten years. The rewrite reduced the codebase by about 80% and brought significant improvements in performance, stability, and observability.
The main motivations for the refactor were legacy tooling that blocked modern C++, excessive memory consumption (114 GB), an 18‑minute service startup time, occasional latency spikes, and low development efficiency (a simple feature took three person‑days).
We analyzed the old code and classified each component into four categories: delete, import as a library, import as a sub‑repository, or rewrite. The new design follows established software principles such as single responsibility, interface segregation, the law of Demeter (least knowledge), and modular encapsulation.
The service architecture was changed from a monolithic thread‑pool model to a tRPC‑Fiber based DAG scheduler. Query processing now performs three tokenization passes (punctuation‑free, punctuation‑aware, and final output) and removes unnecessary RPC calls, greatly increasing parallelism and reducing overall latency.
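To make the scheduling model concrete, here is a minimal sketch of a DAG executor: each node starts as soon as all of its upstream dependencies finish, so independent stages (such as the tokenization passes) run in parallel. The real service uses tRPC‑Fiber; `std::async` and the `Node`/`RunDag` names are stand‑ins for illustration only, not the production API.

```cpp
#include <functional>
#include <future>
#include <map>
#include <string>
#include <vector>

// A DAG node: a named unit of work plus the names of its upstream nodes.
struct Node {
    std::string name;
    std::vector<std::string> deps;   // names of nodes that must finish first
    std::function<void()> work;      // the stage's body
};

// Launch every node asynchronously; each one waits on its dependencies'
// futures before running. Nodes must be listed in topological order so that
// every dependency's future already exists when we look it up.
void RunDag(const std::vector<Node>& nodes) {
    std::map<std::string, std::shared_future<void>> done;
    for (const Node& n : nodes) {
        std::vector<std::shared_future<void>> waits;
        for (const std::string& d : n.deps) waits.push_back(done.at(d));
        done[n.name] = std::async(std::launch::async, [n, waits] {
            for (const auto& w : waits) w.wait();  // block on upstream stages
            n.work();
        }).share();
    }
    for (auto& [name, f] : done) f.wait();  // drain the whole graph
}
```

Compared with a flat thread pool that runs stages sequentially per request, this structure lets every stage begin the moment its inputs are ready, which is where the parallelism gain comes from.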
During the DIFF phase we built an XML‑based diff tool that supports multithreaded execution, multi‑round comparison, floating‑point tolerance, and detailed field‑level reporting. Systematic diff‑location methods include logic‑flow tracing, multi‑stage I/O checks, log comparison, and GDB breakpoint debugging.
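The floating‑point tolerance mentioned above matters because the old and new services can legitimately differ in the last few digits of a score. A minimal sketch of such a comparison (the `FieldEqual` helper and its default tolerances are illustrative assumptions, not the tool's actual code) combines an absolute and a relative threshold:

```cpp
#include <algorithm>
#include <cmath>

// Two numeric fields are considered equal when they agree within either an
// absolute tolerance (for values near zero) or a relative tolerance (for
// large values), so benign floating-point noise is not reported as a diff.
bool FieldEqual(double lhs, double rhs,
                double abs_tol = 1e-6, double rel_tol = 1e-4) {
    double diff = std::fabs(lhs - rhs);
    if (diff <= abs_tol) return true;  // tiny absolute gap: treat as equal
    double scale = std::max(std::fabs(lhs), std::fabs(rhs));
    return diff <= rel_tol * scale;    // gap is small relative to magnitude
}
```

Using both thresholds avoids the classic failure modes: a purely relative check rejects near‑zero values, while a purely absolute check passes large values that differ meaningfully.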
Common sources of diff were identified: external library initialization issues, environment version mismatches, post‑processing inconsistencies, bugs introduced by rewrites, input preprocessing differences, and random variations caused by unstable algorithms.
Stability fixes addressed several coredump scenarios: missing return statements and sprintf buffer overflows that corrupted stack memory; thread‑unsafe tokenization objects replaced with thread‑local instances via a tRPC‑managed thread pool; a bug in tRPC Fiber stack‑size configuration that caused crashes; and misuse of Redis connection‑pool APIs (mixing request‑reply and one‑way calls) that also led to crashes.
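The thread‑local fix for the shared tokenizer follows a standard C++ pattern: give each worker thread its own lazily constructed instance, removing the data race without adding locks on the hot path. The `Tokenizer` class below is a simplified stand‑in for the real tokenization object, shown only to illustrate the pattern.

```cpp
#include <string>
#include <vector>

// Stand-in for a stateful tokenizer: its internal buffer makes a single
// shared instance unsafe to call from multiple threads concurrently.
class Tokenizer {
public:
    std::vector<std::string> Cut(const std::string& text) {
        buffer_.clear();  // mutable state: racy if the object were shared
        std::string word;
        for (char c : text) {
            if (c == ' ') {
                if (!word.empty()) buffer_.push_back(word);
                word.clear();
            } else {
                word.push_back(c);
            }
        }
        if (!word.empty()) buffer_.push_back(word);
        return buffer_;
    }

private:
    std::vector<std::string> buffer_;
};

// One instance per thread, constructed lazily on that thread's first call.
Tokenizer& LocalTokenizer() {
    thread_local Tokenizer instance;
    return instance;
}
```

Each thread pays the construction cost once; after that, calls go to a private instance, so no synchronization is needed.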
The final results showed no functional diff, a 28.4% reduction in average latency (13.01 ms → 9.31 ms), a 16.7% improvement in P99 latency (30 ms → 25 ms), a 14.3% increase in throughput (728 qps → 832 qps), startup time reduced from 18 minutes to 5 minutes, memory usage lowered by 40 GB, unit‑test coverage raised above 60%, and lead time for new features cut from three days to under one day. The success rate rose to 99.99%.
These findings and techniques are shared as technical insights from Tencent engineers.
Tencent Cloud Developer
Official Tencent Cloud community account that brings together developers, shares practical tech insights, and fosters an influential tech exchange community.