Qunar’s Journey to Containerization: Architecture, Challenges, and Solutions
This article details Qunar’s transition to cloud‑native containerization, describing the platform architecture, the technical hurdles encountered—including custom KVM hooks, logging visibility, Java remote debugging, and multi‑cluster performance—and the practical solutions implemented to achieve a stable, scalable production environment.
Background – In recent years container technology has matured, prompting many enterprises, including Qunar in late 2020, to embark on a cloud‑native transformation. By mid‑2022, over 150 applications were running in production containers, with more being onboarded.
Containerization Architecture Overview – The new architecture integrates the PaaS portal entry point, operational tools (watcher, bistoury, qtrace, Loki/ELK), middleware (MQ, config center, qschedule, Dubbo, MySQL SDK), the underlying Kubernetes and OpenStack clusters, and the Noah test‑environment management platform. An architecture diagram appears in the original article.
Challenges Encountered
01 Compatibility with legacy KVM usage and custom preStart/preOnline hooks – Kubernetes natively provides only postStart and preStop lifecycle hooks, which do not match the required semantics. The solution injects a custom preStart script via the container entrypoint, and emulates preOnline in a postStart hook that polls the application's health-check URL and runs the preOnline script once the check succeeds.
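The preOnline emulation described above can be sketched in Python. This is a minimal illustration, not Qunar's actual implementation: the health-check URL, the preOnline command, and the timing values are all assumptions.

```python
import subprocess
import time
import urllib.request


def run_pre_online(check_url, pre_online_cmd, timeout_s=300, interval_s=2):
    """Poll the application's health-check URL; once it answers 200,
    run the preOnline command (e.g. registering the instance for traffic).

    Mirrors the postStart-hook approach described above: Kubernetes has
    no native preOnline hook, so readiness is awaited manually.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(check_url, timeout=2) as resp:
                if resp.status == 200:
                    # App is healthy: now it is safe to run preOnline logic.
                    subprocess.run(pre_online_cmd, check=True)
                    return True
        except OSError:
            pass  # app not up yet; keep polling
        time.sleep(interval_s)
    return False  # never became healthy within the timeout
```

In the real setup this loop would run inside the container's postStart hook, with `pre_online_cmd` pointing at the legacy preOnline script carried over from the KVM workflow.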
02 Lack of real‑time logs during deployment – Kubernetes does not stream the output of lifecycle hooks through its API, so operators saw deployments stall with no feedback. By removing the problematic postStart hook and moving its logic into a sidecar container that shares a volume with the main container, logs, hook output, and pod events become visible in real time.
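The sidecar's job reduces to following a log file on the shared volume and forwarding new lines to its own stdout, where `kubectl logs` can see them. A minimal sketch, assuming the main container's hook scripts append to an agreed-upon file path (the path and polling interval here are illustrative):

```python
import os
import time


def tail_follow(path, emit, poll_s=0.5, stop=lambda: False):
    """Follow a log file on a volume shared with the main container and
    forward each new line via `emit` (e.g. print, so the sidecar's
    stdout is visible through `kubectl logs`)."""
    # Wait for the main container's hook to create the file.
    while not os.path.exists(path):
        if stop():
            return
        time.sleep(poll_s)
    with open(path, "r") as f:
        while not stop():
            line = f.readline()
            if line:
                emit(line.rstrip("\n"))
            else:
                time.sleep(poll_s)  # no new data yet; keep waiting
```

A real sidecar would call `tail_follow("/shared/hook.log", print)` and rely on the pod's restart policy for lifecycle management; `stop` is exposed here mainly to make the loop testable.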
03 Remote debugging of Java applications in containers – When a breakpoint is hit, the JVM pauses, the liveness probe fails, and Kubernetes kills the pod. The team changed the liveness check to `(checkurl == 200) || (socat process alive && java process alive)`, so the probe no longer kills the container while a debug session is active.
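The probe's decision logic can be sketched as follows. This is an illustrative Python rendering of the boolean expression above, not the team's actual probe script; `process_alive` is a Linux-only helper assumed here for the process checks.

```python
import subprocess


def process_alive(pattern):
    """Linux-only sketch: true if `pgrep -f` finds a matching process."""
    return subprocess.run(
        ["pgrep", "-f", pattern], stdout=subprocess.DEVNULL
    ).returncode == 0


def liveness_ok(check_http_200, socat_alive, java_alive):
    """Healthy if the app answers 200, OR a debug session is plausibly
    active (socat tunnel up AND JVM process alive) -- so a JVM paused
    at a breakpoint does not fail the probe and get the pod killed.
    The three checks are passed in as callables to keep the policy
    testable; e.g. socat_alive=lambda: process_alive("socat")."""
    return check_http_200() or (socat_alive() and java_alive())
```

The key design point is the OR: a paused JVM cannot answer the health URL, but the presence of the socat debug tunnel plus a live JVM process is treated as evidence that the hang is intentional.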
04 Multi‑cluster management performance issue with Rancher 2.5 – As the namespace count grew beyond 3,000, Rancher API latency degraded sharply. After confirming the latency was caused by a Rancher bug, the team migrated to KubeSphere, which resolved the bottleneck.
Summary and Outlook – The migration from KVM to containers required careful handling of legacy workflows, hook compatibility, observability, and debugging. Qunar’s cloud‑native journey is just beginning; future work will focus on cluster stability, resource utilization, and chaos engineering practices.
Reference: Kubernetes Container Lifecycle Hooks
Qunar Tech Salon
Qunar Tech Salon is a learning and exchange platform for Qunar engineers and industry peers. It shares cutting-edge technology trends and topics, providing a free forum where mid-to-senior technical professionals can exchange ideas and learn.