Six‑Step Emergency Plan to Detect, Recover, and Eliminate Message Backlog
In distributed systems, message‑queue backlogs can cripple core services; this article breaks down a six‑step emergency workflow—from alert detection and throttling to temporary scaling, root‑cause analysis, targeted fixes, and final validation—plus long‑term architectural and monitoring strategies, illustrated with real‑world cases and Java code samples.
1. Definition and Impact of Message Backlog
Message backlog occurs when the production rate continuously exceeds the consumption rate, causing unprocessed messages to accumulate on the MQ broker and leading to growing latency. The article classifies backlogs into three severity levels (critical, moderate, minor) based on message count and delay, with critical backlogs (>1 million messages, >30 min delay) threatening core business, user experience, and data integrity.
2. Six‑Step Emergency Procedure (30‑minute window)
Step 1 – Alert Response & Backlog Confirmation (0‑5 min)
Goal: Confirm that the backlog is real, then measure its scale and business impact. Actions:
Receive alerts via monitoring tools (Prometheus + Grafana, ELK) and check three key metrics: total queue size, consumption rate, production rate.
Quantify backlog through the MQ console (e.g., RocketMQ, Kafka Manager) and compute growth rate.
Assess business impact by mapping the affected topic (e.g., order fulfillment) to critical flows such as payment or order creation.
Case 1: At 02:15 a.m., a RocketMQ alarm showed the order‑fulfillment topic swelling from 0 to 860 k messages, with a 28‑minute delay (production ≈ 3000 msg/s, consumption ≈ 120 msg/s), classified as critical.
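The growth rate mentioned in Step 1 can be derived from two queue-size readings taken a known interval apart. The following is a minimal sketch, not part of the original procedure: the class name, method names, and the sample readings are assumptions for illustration only.

```java
import java.time.Duration;

/** Hypothetical helper: estimates backlog growth from two queue-size samples. */
final class BacklogGrowth {
    private BacklogGrowth() {}

    /** Messages per second gained (positive) or drained (negative) between two samples. */
    static double growthRate(long earlierSize, long laterSize, Duration elapsed) {
        if (elapsed.isZero() || elapsed.isNegative()) {
            throw new IllegalArgumentException("elapsed must be positive");
        }
        return (laterSize - earlierSize) / (elapsed.toMillis() / 1000.0);
    }

    /** Seconds until the queue reaches thresholdSize at the given rate, or -1 if it is not growing. */
    static double secondsUntil(long currentSize, long thresholdSize, double growthPerSecond) {
        if (currentSize >= thresholdSize) return 0;
        if (growthPerSecond <= 0) return -1;
        return (thresholdSize - currentSize) / growthPerSecond;
    }

    public static void main(String[] args) {
        // Illustrative readings (assumed, not incident data): 800k -> 860k over 60 s
        double rate = growthRate(800_000, 860_000, Duration.ofSeconds(60));
        System.out.println("growth msg/s: " + rate);                                    // 1000.0
        System.out.println("seconds to 1M: " + secondsUntil(860_000, 1_000_000, rate)); // 140.0
    }
}
```

Knowing how long the queue has before it crosses the next severity threshold helps decide whether throttling can wait for confirmation or has to start immediately.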
Step 2 – Traffic Throttling & Degradation (5‑10 min)
Goal: Cut off the backlog source while protecting core services.
Apply producer‑side rate limiting (e.g., Guava RateLimiter.create(100.0)) to non‑core producers; keep core producers but enable batch merging.
Pause non‑core producers (marketing notifications, log sync) or switch them to local cache/DB sync.
Stop delivery of retry‑queue messages to avoid retry storms.
    // Guava token bucket (com.google.common.util.concurrent.RateLimiter):
    // at most 100 non-core messages per second.
    private final RateLimiter rateLimiter = RateLimiter.create(100.0);

    public boolean sendNonCoreMessage(String topic, String message) {
        // tryAcquire() never blocks: if no permit is available, the message is dropped.
        if (!rateLimiter.tryAcquire()) {
            log.warn("Non-core message throttled: {}", message);
            return false;
        }
        try {
            rocketMQTemplate.send(topic, MessageBuilder.withPayload(message).build());
            return true;
        } catch (Exception e) {
            log.error("Failed to send non-core message", e);
            return false;
        }
    }

After throttling, production dropped from 3000 msg/s to 80 msg/s and backlog growth fell from 700 msg/s to 10 msg/s.
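The batch merging suggested for core producers can be pictured as a small buffer that groups payloads before handing them to the MQ client. This is a rough sketch under stated assumptions: `BatchMerger`, its batch size, and the `sender` callback are hypothetical names, and the callback stands in for whatever batch-send call your client provides.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical batch merger: buffers payloads and hands them to a sender in groups. */
final class BatchMerger {
    private final int batchSize;
    private final Consumer<List<String>> sender; // e.g. wraps one MQ send call per batch
    private final List<String> buffer = new ArrayList<>();

    BatchMerger(int batchSize, Consumer<List<String>> sender) {
        if (batchSize <= 0) throw new IllegalArgumentException("batchSize must be positive");
        this.batchSize = batchSize;
        this.sender = sender;
    }

    /** Adds one payload; sends a batch as soon as batchSize payloads are buffered. */
    synchronized void add(String payload) {
        buffer.add(payload);
        if (buffer.size() >= batchSize) flush();
    }

    /** Sends whatever is buffered; call it on a timer and at shutdown so stragglers are not stuck. */
    synchronized void flush() {
        if (buffer.isEmpty()) return;
        sender.accept(new ArrayList<>(buffer)); // copy so the sender owns its own list
        buffer.clear();
    }
}
```

Merging trades a little latency for fewer broker round trips, which is why it suits core traffic that must keep flowing during the incident.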
Step 3 – Temporary Scaling to Accelerate Consumption (10‑15 min)
Goal: Boost consumption capacity to drain existing backlog.
Scale consumer instances (e.g., from 3 to 15) ensuring the instance count does not exceed the number of topic partitions.
Increase consumer thread‑pool size (core and max threads) to match processing capability.
Enable batch pulling and processing (e.g., batch size 50) to raise throughput.
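    @Bean
    public ExecutorService consumerExecutor() {
        return new ThreadPoolExecutor(
            20,                    // core threads (up from 10)
            200,                   // max threads for bursts
            60, TimeUnit.SECONDS,  // idle timeout for threads above the core size
            // Threads beyond the 20 core threads are created only once this queue is
            // full, so with 10,000 slots the pool seldom grows toward 200.
            new LinkedBlockingQueue<>(10000),
            // When pool and queue are both saturated, the submitting thread runs the
            // task itself, which acts as back-pressure.
            new ThreadPoolExecutor.CallerRunsPolicy()
        );
    }

    // NOTE: verify these attributes against your rocketmq-spring version. ConsumeMode is
    // commonly limited to CONCURRENTLY and ORDERLY; if BATCH or batchMaxSize is not
    // available, set the batch size on the underlying DefaultMQPushConsumer with
    // setConsumeMessageBatchMaxSize(50) instead.
    @RocketMQMessageListener(
        topic = "order_fulfillment",
        consumerGroup = "fulfillment_group",
        consumeMode = ConsumeMode.BATCH,
        consumeThreadMax = 200,
        batchMaxSize = 50)
    public class OrderFulfillmentConsumer implements RocketMQListener<List<String>> {
        @Override
        public void onMessage(List<String> messages) {
            batchProcess(messages);
        }
    }

Scaling the consumer cluster to 18 instances and raising the thread pool to 20 core / 200 max threads lifted consumption from 120 msg/s to 1800 msg/s, quickly reducing the backlog.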
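Before scaling, it helps to estimate how many instances will actually do work and how long the drain should take. The sketch below is an illustration only: the helper names and the figure of 24 partitions are assumptions, while the backlog and rates are the ones quoted for Case 1.

```java
/** Hypothetical capacity helper for planning a temporary consumer scale-out. */
final class DrainEstimator {
    private DrainEstimator() {}

    /** Instances beyond the partition (queue) count sit idle, so cap the request. */
    static int effectiveInstances(int requestedInstances, int partitions) {
        return Math.min(requestedInstances, partitions);
    }

    /** Seconds to drain the backlog, or -1 if consumption does not outpace production. */
    static double drainSeconds(long backlog, double consumeRate, double produceRate) {
        double net = consumeRate - produceRate;
        if (net <= 0) return -1;
        return backlog / net;
    }

    public static void main(String[] args) {
        // 24 partitions is an assumed figure; backlog and rates are those quoted for Case 1.
        System.out.println(effectiveInstances(30, 24));      // 24
        System.out.println(drainSeconds(860_000, 1800, 80)); // 500.0
    }
}
```

At 1800 msg/s consumed against 80 msg/s produced, an 860 k backlog needs roughly 500 seconds of sustained throughput, which gives a concrete target to check the dashboards against.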
Step 4 – Root‑Cause Diagnosis (15‑20 min, parallel with Step 3)
Goal: Identify the underlying issue to prevent recurrence.
Consumer‑side checks (highest priority): thread dumps (jstack) for deadlocks or long‑running operations, code bugs, resource saturation.
Producer‑side checks: traffic spikes, oversized messages, duplicate production.
Broker‑side checks: node health, disk I/O, queue load balance.
Case 1 root cause: jstack revealed 20 consumer threads blocked on CompletableFuture.get() waiting for an inventory service whose latency spiked from 80 ms to 900 ms; the consumer pool was fixed at 10 threads, preventing scaling.
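The same signal that jstack exposed can also be checked from inside the JVM with the standard `java.lang.management` API. This is a minimal sketch, assuming the check runs in the consumer process itself; the class and method names are hypothetical, and it is a complement to a thread dump rather than a replacement.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

/** Hypothetical in-process check that mirrors what one looks for in a jstack dump. */
final class BlockedThreadScanner {
    private BlockedThreadScanner() {}

    /** Number of threads involved in a monitor or ownable-synchronizer deadlock. */
    static int deadlockedThreadCount() {
        long[] ids = ManagementFactory.getThreadMXBean().findDeadlockedThreads();
        return ids == null ? 0 : ids.length;
    }

    /** Counts live threads whose stack contains a call to className.methodName. */
    static int countThreadsIn(String className, String methodName) {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        int count = 0;
        for (ThreadInfo info : mx.dumpAllThreads(false, false)) {
            if (info == null) continue; // defensive: skip threads that ended mid-dump
            for (StackTraceElement frame : info.getStackTrace()) {
                if (frame.getClassName().equals(className)
                        && frame.getMethodName().equals(methodName)) {
                    count++;
                    break; // count each thread once
                }
            }
        }
        return count;
    }
}
```

Calling `countThreadsIn("java.util.concurrent.CompletableFuture", "get")` reports how many threads are currently parked in a blocking `get()`; a count close to the pool size points at the pattern described in Case 1.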
Step 5 – Targeted Fixes (20‑25 min)
Goal: Resolve the identified problem and restore normal flow.
Consumer‑side: eliminate blocking calls (replace get() with asynchronous handling), roll back or patch buggy releases, expand resources.
Producer‑side: enforce message size limits (<8 KB), deduplicate, add rate limiting for spikes.
Broker‑side: restart faulty nodes, rebalance queues, tune flush strategies.
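    @Override
    public void onMessage(OrderMessage message) {
        // Hand the work to the pool instead of blocking the listener thread on get().
        // Caveat: the listener returns before processing finishes, so the broker may
        // treat the message as consumed; failures must be routed explicitly.
        CompletableFuture.supplyAsync(() -> fulfillmentService.process(message), consumerExecutor)
            .thenAccept(result -> {
                if (!result) retryOrSendToDLQ(message);
            })
            .exceptionally(e -> {
                log.error("Consume failed", e);
                retryOrSendToDLQ(message); // do not drop the message on an exception
                return null; // thenAccept yields CompletableFuture<Void>, so the fallback is null
            });
    }

After refactoring, consumption stabilized at 2200 msg/s and the backlog drained rapidly.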
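The producer-side fixes (size limit and deduplication) can be combined into one guard that runs before each send. The following is a sketch under assumptions: `ProducerGuard` and its in-memory ID set are hypothetical, and only the 8 KB limit comes from the text.

```java
import java.nio.charset.StandardCharsets;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical pre-send guard: rejects oversized payloads and repeated message IDs. */
final class ProducerGuard {
    static final int MAX_PAYLOAD_BYTES = 8 * 1024; // the 8 KB limit from the text

    // In-memory and unbounded for brevity; a real deployment would need expiry
    // or a shared store so the set does not grow forever.
    private final Set<String> seenIds = ConcurrentHashMap.newKeySet();

    /** Returns true if the message may be sent, false if it is oversized or a duplicate. */
    boolean accept(String messageId, String payload) {
        if (payload.getBytes(StandardCharsets.UTF_8).length >= MAX_PAYLOAD_BYTES) {
            return false; // payload must stay under 8 KB
        }
        return seenIds.add(messageId); // add() returns false when the ID was already present
    }
}
```

The size check runs first so that a rejected oversized message does not consume an ID, which leaves the producer free to resend a trimmed version under the same ID.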
Step 6 – Business Validation & Recovery (25‑30 min)
Goal: Confirm zero backlog, data consistency, and resume normal production.
Verify queue size is zero and latency < 50 ms.
Cross‑check core business data (order status, inventory, payment) against producer and consumer logs.
Gradually lift producer throttling, restore non‑core workloads, and de‑scale temporary consumer instances.
Monitor for 30 min to ensure no relapse.
At minute 28, the 860 k backlog cleared, latency fell to 42 ms, and the order‑fulfillment service returned to normal.
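The cross-check between producer and consumer logs reduces to comparing two sets of message IDs. Here is a minimal sketch, assuming the IDs have already been extracted from the logs into collections; the `Reconciler` class and its methods are hypothetical.

```java
import java.util.Collection;
import java.util.HashSet;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical reconciliation of producer-side and consumer-side message IDs. */
final class Reconciler {
    private Reconciler() {}

    /** IDs that were produced but never consumed (sorted for stable reports). */
    static Set<String> missing(Collection<String> producedIds, Collection<String> consumedIds) {
        Set<String> diff = new TreeSet<>(producedIds);
        diff.removeAll(new HashSet<>(consumedIds));
        return diff;
    }

    /** IDs that appear more than once on the consumer side, a hint of duplicate processing. */
    static Set<String> consumedMoreThanOnce(Collection<String> consumedIds) {
        Set<String> seen = new HashSet<>();
        Set<String> duplicates = new TreeSet<>();
        for (String id : consumedIds) {
            if (!seen.add(id)) duplicates.add(id);
        }
        return duplicates;
    }
}
```

Both result sets should be empty before throttling is lifted; a non-empty `missing` set identifies orders to replay, and a non-empty duplicate set identifies orders whose downstream state needs an idempotency check.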
3. Long‑Term Root‑Cause Strategies
Strategy 1 – Architectural Optimisation
Isolate core and non‑core traffic with separate topics and consumer groups.
Reserve 30‑50 % excess consumer capacity for traffic spikes.
Design peak‑shaving mechanisms (local cache, batch processing) for flash‑sale scenarios.
Configure dead‑letter queues and bounded retry policies (e.g., 3 attempts).
Case 2: A flash‑sale platform introduced RabbitMQ for peak‑shaving, isolated order topics, and reserved 50 % capacity; during a major sale, no backlog occurred and latency stayed under 30 ms.
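A bounded retry policy with dead-letter routing can be reduced to a loop with an attempt cap. The sketch below is illustrative only: `BoundedRetry` is a hypothetical class, and the in-memory list stands in for a real dead-letter queue on the broker.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

/** Hypothetical bounded-retry wrapper with an in-memory stand-in for a dead-letter queue. */
final class BoundedRetry<T> {
    private final int maxAttempts;
    private final Predicate<T> handler; // returns true when the message was processed
    private final List<T> deadLetters = new ArrayList<>();

    BoundedRetry(int maxAttempts, Predicate<T> handler) {
        if (maxAttempts <= 0) throw new IllegalArgumentException("maxAttempts must be positive");
        this.maxAttempts = maxAttempts;
        this.handler = handler;
    }

    /** Tries the handler up to maxAttempts times; parks the message in the DLQ if all attempts fail. */
    boolean process(T message) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                if (handler.test(message)) return true;
            } catch (RuntimeException e) {
                // treat an exception like a failed attempt and move on to the next one
            }
        }
        deadLetters.add(message);
        return false;
    }

    List<T> deadLetters() {
        return deadLetters;
    }
}
```

With `maxAttempts` set to 3, a message that keeps failing costs at most three handler calls before it leaves the main flow, so a single poison message cannot stall the queue.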
Strategy 2 – Monitoring & Alerting
Track core MQ metrics (total messages, growth rate, consumption delay, production/consumption rates, broker health).
Set tiered alerts matching the three backlog levels (SMS, phone, full‑team paging).
Integrate tracing tools (SkyWalking, Pinpoint) to pinpoint slow SQL or RPC calls.
    @Component
    @Slf4j
    public class MqMonitorService {
        @Autowired private RabbitAdmin rabbitAdmin;
        @Autowired private AlertService alertService;
        @Value("${mq.monitor.interval:30000}") private long monitorInterval;
        @Value("${mq.alert.threshold.level1:1000000}") private long level1Threshold;
        @Value("${mq.alert.threshold.level2:100000}") private long level2Threshold;

        @Scheduled(fixedRateString = "${mq.monitor.interval:30000}")
        public void monitorQueue() {
            try {
                String queueName = "order_fulfillment";
                // getQueueProperties returns java.util.Properties keyed by RabbitAdmin constants.
                Properties props = rabbitAdmin.getQueueProperties(queueName);
                if (props == null) { log.warn("Queue {} not found", queueName); return; }
                // QUEUE_MESSAGE_COUNT covers ready messages only; unacknowledged
                // messages are not included, so nothing needs to be subtracted.
                Object count = props.get(RabbitAdmin.QUEUE_MESSAGE_COUNT);
                long backlog = count instanceof Number ? ((Number) count).longValue() : 0L;
                log.info("Queue {} backlog: {}", queueName, backlog);
                if (backlog >= level1Threshold) {
                    alertService.sendAlert(AlertLevel.HIGH, "Queue " + queueName + " critical backlog: " + backlog);
                } else if (backlog >= level2Threshold) {
                    alertService.sendAlert(AlertLevel.MEDIUM, "Queue " + queueName + " moderate backlog: " + backlog);
                }
            } catch (Exception e) {
                log.error("MQ monitoring failed", e);
            }
        }
    }

Strategy 3 – Operational Governance
Enforce gray‑release and load‑testing before full deployment; forbid on‑the‑fly consumer config changes.
Centralise MQ, producer, and consumer configuration with approval workflow.
Conduct quarterly backlog‑recovery drills covering different severity scenarios.
Hold post‑incident reviews to capture lessons and close the loop.
Case 3: A financial platform suffered a 45 k backlog after a developer mistakenly reduced consumer core threads from 20 to 2; after instituting change‑approval and regular drills, no similar incidents recurred.
4. Common Pitfalls
Scaling consumer instances beyond the number of topic partitions yields no throughput gain.
Unlimited retry loops exacerbate backlog; set sensible retry limits and dead‑letter routing.
Ignoring producer throttling while scaling consumers leads to persistent backlog growth.
Skipping data validation after recovery can leave hidden inconsistencies.
5. Conclusion
The emergency workflow—rapid loss‑prevention, precise diagnosis, and thorough remediation—can resolve a severe MQ backlog within 30 minutes, while long‑term strategies of architectural balance, proactive monitoring, and disciplined operations prevent recurrence. Teams should tailor the steps to their own QPS profiles, rehearse regularly, and continuously refine the process to keep distributed systems resilient.
Architecture & Thinking
🍭 Frontline tech director and chief architect at top-tier companies 🥝 Years of deep experience in internet, e‑commerce, social, and finance sectors 🌾 Committed to publishing high‑quality articles covering core technologies of leading internet firms, application architecture, and AI breakthroughs.
