Lessons from QMQ: Network and Disk I/O Problems and Their Mitigations
The article analyzes real‑world network and disk I/O issues encountered in Qunar Message Queue (QMQ), explains root causes such as Netty OOM, file‑handle exhaustion, TCP timeout handling, and large‑traffic bursts, and presents practical mitigation strategies for backend systems.
QMQ (Qunar Message Queue) was originally built on MySQL storage and later migrated to a file‑based distributed architecture to handle growing message volumes. The article shares practical experiences from Ctrip’s deployment, focusing on two main problem domains: network and disk I/O.
1. Network Issues
1.1 OOM – An out‑of‑memory alarm on a broker slave was traced to off‑heap (direct) memory growth during Netty message reception: with auto‑read left on and no back‑pressure, inbound messages queued faster than they were consumed, so direct buffers accumulated until the process ran out of memory. Conclusion: check channel.isWritable() before writing with Netty.
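The isWritable() check works because Netty tracks pending outbound bytes against high/low water marks. The stdlib sketch below models that mechanism with a hypothetical BackpressureWriter class (not Netty's API; in Netty the equivalent is channel.isWritable() together with WRITE_BUFFER_WATER_MARK):

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical analogue of Netty's write-buffer water marks: writes are
// refused once pending bytes cross the high mark, and accepted again only
// after flushing drains them below the low mark.
public class BackpressureWriter {
    private final long lowWaterMark;   // resume writes below this
    private final long highWaterMark;  // stop accepting writes above this
    private final AtomicLong pendingBytes = new AtomicLong();
    private volatile boolean writable = true;

    public BackpressureWriter(long low, long high) {
        this.lowWaterMark = low;
        this.highWaterMark = high;
    }

    public boolean isWritable() { return writable; }

    // Returns false (caller defers or drops) instead of queuing unboundedly.
    public boolean write(byte[] msg) {
        if (!writable) return false;
        if (pendingBytes.addAndGet(msg.length) > highWaterMark) {
            writable = false; // caller should also stop reading from the socket
        }
        return true;
    }

    // Called when the OS has flushed bytes to the peer.
    public void onFlushed(long bytes) {
        if (pendingBytes.addAndGet(-bytes) < lowWaterMark) {
            writable = true;
        }
    }
}
```

The key design point mirrors the article's conclusion: the producer of writes observes writability and backs off, rather than letting an invisible off‑heap queue absorb the difference.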
1.2 File‑handle Exhaustion – TCP connections to MetaServer failed because the process ran out of file descriptors (limit 65536). Missing idle detection caused leaked connections. Conclusion: implement bidirectional idle detection.
1.3 Broker Not Removed – When a broker became unreachable, the heartbeat mechanism failed to mark it as non‑read/write, causing routing errors. A redesign added periodic DB scans by all MetaServers to mark lost brokers. Conclusion: consider network partition scenarios in distributed designs.
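The redesigned liveness check can be sketched as follows, with illustrative names (the article only specifies that every MetaServer periodically scans broker heartbeat records in the DB and marks stale brokers non‑read/write):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of a MetaServer liveness scan: brokers report heartbeats into a
// shared store (a Map stands in for the DB table); a periodic scan flips any
// broker whose heartbeat lease has expired to NON_READ_WRITE so routing
// stops directing clients to a partitioned broker.
public class BrokerLivenessScan {
    public enum State { READ_WRITE, NON_READ_WRITE }

    private final Map<String, Long> heartbeats = new HashMap<>(); // brokerId -> last heartbeat ms
    private final Map<String, State> states = new HashMap<>();
    private final long leaseMs;

    public BrokerLivenessScan(long leaseMs) { this.leaseMs = leaseMs; }

    public void heartbeat(String brokerId, long nowMs) {
        heartbeats.put(brokerId, nowMs);
        states.put(brokerId, State.READ_WRITE);
    }

    // The periodic scan each MetaServer runs.
    public void scan(long nowMs) {
        heartbeats.forEach((id, ts) -> {
            if (nowMs - ts > leaseMs) states.put(id, State.NON_READ_WRITE);
        });
    }

    public State stateOf(String brokerId) { return states.get(brokerId); }
}
```

Because every MetaServer runs the scan against the same DB state, a broker partitioned from one MetaServer but not another is still judged consistently by the shared heartbeat records rather than by any single server's view.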
1.4 java.net.SocketTimeoutException – After a network outage, threads blocked on MySQL reads timed out only after ~15 minutes, because the Linux TCP retransmission schedule kept the connection nominally alive and no SO_TIMEOUT was configured. Conclusion: configure SO_TIMEOUT (a socket read timeout) on DataSources.
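The effect of SO_TIMEOUT is easy to demonstrate with plain JDK sockets: a read with a timeout set fails fast with SocketTimeoutException instead of blocking until the kernel gives up on retransmission. JDBC drivers expose the same knob (e.g., MySQL Connector/J's socketTimeout URL parameter), which is what a DataSource-level fix configures.

```java
import java.io.IOException;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

// Demonstrates SO_TIMEOUT: the peer accepts the connection but never sends
// data, so the read would block indefinitely without a timeout.
public class SoTimeoutDemo {
    // Returns true if the read timed out instead of blocking.
    public static boolean readTimesOut(int timeoutMs) throws IOException {
        try (ServerSocket server = new ServerSocket(0);
             Socket client = new Socket("127.0.0.1", server.getLocalPort())) {
            client.setSoTimeout(timeoutMs); // SO_TIMEOUT: bounds every read()
            Socket peer = server.accept();  // connection is up, but silent
            try {
                client.getInputStream().read();
                return false;
            } catch (SocketTimeoutException expected) {
                return true;
            } finally {
                peer.close();
            }
        }
    }
}
```

Without setSoTimeout, the same read would sit for the full kernel retransmission window described in the article.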
1.5 Large Traffic Burst – Sudden spikes caused full GC and OOM because Netty’s decode handler placed messages into an unbounded receive queue, delaying off‑heap reclamation. Mitigations included request‑size checks, rate limiting, bounded queues with timeout discard, and I/O latency monitoring. Conclusion: implement back‑pressure mechanisms.
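The bounded-queue-with-timeout-discard mitigation can be sketched with java.util.concurrent's ArrayBlockingQueue (names are illustrative; the real handler would also release the message's off‑heap buffer on discard):

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

// Sketch: replace the decode handler's unbounded receive queue with a bounded
// one. Under a burst, messages that cannot be enqueued within a deadline are
// dropped and counted, so buffers are reclaimed promptly instead of piling up
// until full GC/OOM.
public class BoundedReceiveQueue {
    private final BlockingQueue<byte[]> queue;
    private long discarded;

    public BoundedReceiveQueue(int capacity) {
        this.queue = new ArrayBlockingQueue<>(capacity);
    }

    // Try to enqueue; on timeout, discard rather than queue forever.
    public boolean enqueue(byte[] msg, long timeoutMs) throws InterruptedException {
        if (queue.offer(msg, timeoutMs, TimeUnit.MILLISECONDS)) return true;
        discarded++; // real handler: release the off-heap buffer here too
        return false;
    }

    public byte[] take() throws InterruptedException { return queue.take(); }
    public long discardedCount() { return discarded; }
}
```

The discard counter doubles as the monitoring signal the article recommends: a rising discard rate is a direct measure of back-pressure engaging.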
2. Disk I/O Issues
2.1 Accumulated Message Pulls – Because all topics share one sequential log file, pulling a long‑accumulated backlog degenerates into many random reads, driving up I/O utilization. Sorting message files and separating hot and cold data (e.g., mirroring cold messages to HBase) were suggested. Conclusion: consider hot‑cold separation.
2.2 Large Messages – Some topics contain messages >100 KB. Enabling producer‑side compression achieved 5‑8× size reduction, reducing disk write volume and I/O pressure. Conclusion: compress large payloads and optimise file encoding.
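Producer-side compression needs nothing beyond the JDK. The sketch below gzips a payload before handing it to the send path; the 5‑8× figure from the article depends on payload redundancy, and the repetitive sample here is only meant to show that large text-like messages shrink substantially (class and sample data are illustrative):

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

// Gzip a message payload on the producer before sending, trading a little
// CPU for much less disk write volume on the broker.
public class MessageCompressor {
    public static byte[] compress(byte[] payload) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(payload);
        }
        return bos.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // ~100 KB payload resembling a large, repetitive JSON-ish message
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 2000; i++) {
            sb.append("{\"orderId\":").append(i)
              .append(",\"status\":\"CONFIRMED\",\"city\":\"Shanghai\"}");
        }
        byte[] raw = sb.toString().getBytes(StandardCharsets.UTF_8);
        byte[] packed = compress(raw);
        System.out.println(raw.length + " -> " + packed.length);
    }
}
```

The consumer side symmetrically decompresses with GZIPInputStream; a flag in the message header can mark compressed payloads so small messages can skip the overhead.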
Finally, the article notes additional real‑world complications (packet loss, TCP retransmission failures, RAID issues, etc.) and outlines future work such as file‑encoding optimisation, page‑cache tuning, consumer pull redirection, and kernel upgrades.
Ctrip Technology