Common Intermittent Issues in Backend Development and Their Case Studies
The article examines various intermittent problems that surface only in production environments—such as concurrency bugs, cache inconsistencies, dirty data, boundary‑value failures, and resource constraints—provides concrete code examples for each, and shares practical lessons to help developers diagnose and prevent these elusive issues.
During daily development, many developers encounter intermittent problems that only appear in production environments, even though the same code runs flawlessly in local, test, pre‑release, or gray‑release stages.
These intermittent problems usually arise only under specific conditions and from a combination of factors, which makes them hard to reproduce and often more damaging than bugs that fail consistently.
1. Scenario List
Concurrent Access, Asynchronous Programming, Resource Contention
Using non‑thread‑safe collections in parallel streams can lead to incorrect results when data volume is large.
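For comparison with the unsafe snippet that follows, here is a minimal sketch of one thread-safe way to split results in a parallel stream: let a collector build the lists instead of mutating shared ones. The class name, the `process` method, and the use of `Integer` in place of the article's `XXXDO` type are assumptions made for illustration only.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class SafeParallelPartition {
    // Hypothetical stand-in for the real per-item operation: even numbers "succeed".
    public static boolean process(int value) {
        return value % 2 == 0;
    }

    // The collector merges per-thread partial results, so no shared list is mutated.
    public static Map<Boolean, List<Integer>> partition(List<Integer> dataList) {
        return dataList.parallelStream()
                .collect(Collectors.partitioningBy(SafeParallelPartition::process));
    }

    public static void main(String[] args) {
        List<Integer> dataList = IntStream.range(0, 100_000)
                .boxed()
                .collect(Collectors.toList());
        Map<Boolean, List<Integer>> result = partition(dataList);
        System.out.println("success=" + result.get(true).size()
                + ", fail=" + result.get(false).size());
    }
}
```

If the shared lists must stay, wrapping them with `Collections.synchronizedList` is another option, at the cost of lock contention on every `add`.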
List<XXXDO> dataList = /* result set fetched from the DB */;
// Non-thread-safe collections shared by the parallel stream's worker threads
List<XXXDO> successList = new ArrayList<>();
List<XXXDO> failList = new ArrayList<>();
dataList.parallelStream().forEach(
    vo -> {
        // ...
        if (/* operation succeeded */) {
            successList.add(vo);
        } else {
            failList.add(vo);
        }
    }
);
Cache‑Related, Cache Consistency
Improper handling of local or distributed caches can cause short‑lived inconsistencies that are easily missed during testing.
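One common way to narrow the inconsistency window is to invalidate the cached entry whenever the underlying data is written. The sketch below is only an illustration under simplifying assumptions: two in-memory maps stand in for the database and a local cache, and the class and method names are hypothetical.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class CacheAsideSketch {
    // Hypothetical stand-ins: one map plays the database, the other a local cache.
    private final Map<String, String> database = new ConcurrentHashMap<>();
    private final Map<String, String> cache = new ConcurrentHashMap<>();

    // Read path: serve from the cache, loading from the "database" only on a miss.
    public String read(String key) {
        return cache.computeIfAbsent(key, database::get);
    }

    // Write path: update the source of truth first, then drop the cached copy
    // so the next read reloads the fresh value.
    public void write(String key, String value) {
        database.put(key, value);
        cache.remove(key);
    }

    public static void main(String[] args) {
        CacheAsideSketch store = new CacheAsideSketch();
        store.write("user:1", "Alice");
        System.out.println(store.read("user:1"));
        store.write("user:1", "Bob");
        System.out.println(store.read("user:1"));
    }
}
```

With a distributed cache, the write and the invalidation are separate network calls, so this pattern shortens the stale window rather than removing it; a short expiry on top is a usual safety net.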
Dirty Data, Data Skew
Dirty data often triggers anomalies such as a selectOne query returning multiple rows.
com.baomidou.mybatisplus.core.exceptions.MybatisPlusException: One record is expected, but the query result is multiple records
Boundary Values, Timeouts, Rate Limiting
Long service chains, transformed exceptions, and swallowed logs increase the difficulty of troubleshooting boundary‑value failures and rate‑limit issues.
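When an exception has to be converted on its way up a long call chain, keeping the original as the cause and recording the offending input makes a boundary-value failure much easier to trace. The following is a rough sketch; `OrderServiceException`, `parseQuantity`, and `loadQuantity` are hypothetical names invented for the example.

```java
public class ExceptionChainSketch {
    // Hypothetical domain exception used only for this illustration.
    public static class OrderServiceException extends RuntimeException {
        public OrderServiceException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    // Hypothetical downstream call; it fails for input outside the int range.
    public static int parseQuantity(String raw) {
        return Integer.parseInt(raw);
    }

    public static int loadQuantity(String raw) {
        try {
            return parseQuantity(raw);
        } catch (NumberFormatException e) {
            // Convert the exception, but keep the cause and the offending input.
            throw new OrderServiceException("Invalid quantity: '" + raw + "'", e);
        }
    }

    public static void main(String[] args) {
        try {
            // One past Integer.MAX_VALUE: a typical boundary value.
            loadQuantity("2147483648");
        } catch (OrderServiceException e) {
            System.out.println(e.getMessage());
            System.out.println("Caused by: " + e.getCause().getClass().getName());
        }
    }
}
```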
Server & Hardware Issues
Improper graceful shutdown, unhandled thread‑pool tasks, or hardware failures (e.g., disk full) can leave dirty data or cause service crashes.
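A graceful shutdown of a thread pool usually means: stop accepting new tasks, wait a bounded time for running ones, and only then interrupt what is left. The sketch below uses the standard `ExecutorService` API; the class name, the helper method, and the timeout values are assumptions chosen for illustration.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class GracefulShutdownSketch {
    // Returns true if every submitted task finished within the timeout.
    public static boolean shutdownGracefully(ExecutorService executor,
                                             long timeout, TimeUnit unit) {
        executor.shutdown(); // stop accepting new tasks; queued tasks still run
        try {
            if (executor.awaitTermination(timeout, unit)) {
                return true;
            }
            executor.shutdownNow(); // timed out: interrupt the remaining tasks
            return false;
        } catch (InterruptedException e) {
            executor.shutdownNow();
            Thread.currentThread().interrupt(); // preserve the interrupt status
            return false;
        }
    }

    public static void main(String[] args) {
        ExecutorService executor = Executors.newFixedThreadPool(2);
        for (int i = 0; i < 4; i++) {
            int id = i;
            executor.execute(() -> System.out.println("task " + id + " done"));
        }
        boolean clean = shutdownGracefully(executor, 5, TimeUnit.SECONDS);
        System.out.println("clean shutdown: " + clean);
    }
}
```

In a service, a helper like this would typically be called from a JVM shutdown hook or the framework's lifecycle callback so that redeployments do not cut tasks off midway.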
Program Code Problems
Incompatible data structures, a missing ThreadLocal.remove() call, or modifying shared member variables without proper synchronization can all lead to unpredictable behavior.
// In the normal case, remove() gets executed
try {
    ...
} finally {
    threadLocalUser.remove();
}
Asynchronous Dependencies
Submitting tasks to a thread pool without waiting for completion may return before results are populated, especially with large data sets.
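For contrast with the fire-and-forget snippet that follows, here is a sketch of one way to wait for every task before returning, using the JDK's `CompletableFuture` rather than the `ThreadUtil` helper from the article. The `callApi` method, the class name, and the `Integer` payload are hypothetical stand-ins.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.stream.Collectors;

public class AwaitAsyncResults {
    // Hypothetical stand-in for the remote API call: non-negative values "succeed".
    public static boolean callApi(int value) {
        return value >= 0;
    }

    public static List<Integer> collectSuccesses(List<Integer> dataList,
                                                 ExecutorService executor) {
        // Each task returns its own result, so worker threads never touch a shared list.
        List<CompletableFuture<Boolean>> futures = dataList.stream()
                .map(v -> CompletableFuture.supplyAsync(() -> callApi(v), executor))
                .collect(Collectors.toList());

        List<Integer> successList = new ArrayList<>();
        for (int i = 0; i < futures.size(); i++) {
            // join() blocks until the task has finished, so nothing is read too early.
            if (futures.get(i).join()) {
                successList.add(dataList.get(i));
            }
        }
        return successList;
    }

    public static void main(String[] args) {
        ExecutorService executor = Executors.newFixedThreadPool(4);
        List<Integer> result = collectSuccesses(Arrays.asList(-2, -1, 0, 1, 2), executor);
        executor.shutdown();
        System.out.println("successes: " + result);
    }
}
```

Collecting results on the calling thread also sidesteps the second problem in the unsafe version, namely that its `ArrayList` instances are written from several threads at once.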
List<XXXDO> dataList = /* result set fetched from the DB */;
List<XXXDO> successList = new ArrayList<>();
List<XXXDO> failList = new ArrayList<>();
for (XXXDO vo : dataList) {
    ThreadUtil.execute(() -> {
        // Call the API with vo
        if (/* operation succeeded */) {
            successList.add(vo);
        } else {
            failList.add(vo);
        }
    });
}
// May return here before the tasks have produced their results
Concurrent Modification
Non‑atomic operations like counter++ under high concurrency can produce incorrect counts.
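As a point of comparison for the unsafe example that follows, the sketch below swaps the plain `int` for an `AtomicInteger`, whose `incrementAndGet` performs the read-modify-write as a single atomic step. The class and method names are assumptions for illustration.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class AtomicCounterSketch {
    public static int countWithTwoThreads(int incrementsPerThread) {
        AtomicInteger counter = new AtomicInteger();
        Runnable work = () -> {
            for (int i = 0; i < incrementsPerThread; i++) {
                counter.incrementAndGet(); // atomic, unlike counter++
            }
        };
        Thread thread1 = new Thread(work);
        Thread thread2 = new Thread(work);
        thread1.start();
        thread2.start();
        try {
            thread1.join();
            thread2.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting for workers", e);
        }
        return counter.get();
    }

    public static void main(String[] args) {
        System.out.println("Counter: " + countWithTwoThreads(1000)); // Counter: 2000
    }
}
```

For heavily contended counters, `LongAdder` is a common alternative, and a `synchronized` block works when several fields must change together.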
public class UnsafeConcurrencyExample {
    private static int counter = 0;

    // join() can throw InterruptedException, so main must declare it
    public static void main(String[] args) throws InterruptedException {
        Thread thread1 = new Thread(() -> {
            for (int i = 0; i < 1000; i++) {
                counter++; // not atomic: read, add, write
            }
        });
        Thread thread2 = new Thread(() -> {
            for (int i = 0; i < 1000; i++) {
                counter++;
            }
        });
        thread1.start();
        thread2.start();
        thread1.join();
        thread2.join();
        // May print less than 2000 when increments interleave
        System.out.println("Counter: " + counter);
    }
}
Data Inconsistency
Cache expiration times that are too long can cause stale data to be served alongside fresh data, leading to inconsistent responses.
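// Assumes Guava's cache API, which matches the calls used below
import com.google.common.cache.Cache;
import com.google.common.cache.CacheBuilder;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;

public class CacheExample {
    private static Cache<String, Object> cache = CacheBuilder.newBuilder()
            .expireAfterWrite(10, TimeUnit.MINUTES) // entries may be stale for up to 10 minutes
            .build();

    public static void main(String[] args) throws ExecutionException {
        String key = "data";
        // Loads from the DB only when the key is absent or expired
        Object data = cache.get(key, () -> fetchDataFromDB());
        System.out.println("Data: " + data);
    }

    private static Object fetchDataFromDB() {
        System.out.println("Fetching data from DB...");
        return "Data from DB";
    }
}
Graceful Shutdown Missing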
Tasks running in a thread pool that are interrupted by redeployments or crashes can leave dirty data.
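import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class SimpleThreadPool {
    private final ExecutorService executor;

    public SimpleThreadPool(int threads) {
        executor = Executors.newFixedThreadPool(threads);
    }

    public void execute(Runnable task) {
        executor.execute(task);
    }

    public void shutdown() {
        // Stops accepting new tasks but does not wait for running ones to finish
        executor.shutdown();
    }
}
Network Bandwidth & DDoS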
Insufficient bandwidth or DDoS attacks can cause timeouts and degrade user experience, highlighting the need for load testing and monitoring.
Memory Leaks
Repeated creation of singleton‑like objects, unbounded MQ retries, and lack of back‑pressure can cause memory to balloon across a cluster.
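One simple form of back-pressure is to bound the work queue so that producers slow down instead of piling tasks up in memory. The sketch below uses the JDK's `ThreadPoolExecutor` with a fixed-capacity queue and `CallerRunsPolicy`; the class name, the helper methods, and the sizes are assumptions for illustration, and it does not model MQ retries specifically.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class BoundedExecutorSketch {
    public static ThreadPoolExecutor newBoundedExecutor(int threads, int queueCapacity) {
        return new ThreadPoolExecutor(
                threads, threads, 0L, TimeUnit.MILLISECONDS,
                // The queue can never hold more than queueCapacity pending tasks.
                new ArrayBlockingQueue<>(queueCapacity),
                // When the queue is full, the submitting thread runs the task itself,
                // which slows the producer down instead of growing memory.
                new ThreadPoolExecutor.CallerRunsPolicy());
    }

    // Submits taskCount trivial tasks and returns how many completed.
    public static int runAll(int taskCount, int threads, int queueCapacity) {
        ThreadPoolExecutor executor = newBoundedExecutor(threads, queueCapacity);
        AtomicInteger completed = new AtomicInteger();
        for (int i = 0; i < taskCount; i++) {
            executor.execute(completed::incrementAndGet);
        }
        executor.shutdown();
        try {
            executor.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return completed.get();
    }

    public static void main(String[] args) {
        System.out.println("completed: " + runAll(100, 1, 2));
    }
}
```

The same idea applies to retries: a capped attempt count plus a dead-letter destination keeps a failing message from being re-queued without limit.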
3. Summary
Write clean, thread‑safe code; many bugs stem from low‑level mistakes.
Consider boundary values; they are often overlooked.
Maintain comprehensive logs to simplify troubleshooting.
Avoid indiscriminate exception conversion; handle errors properly.
Perform load testing; large data volumes expose concurrency and performance issues.
Never swallow exceptions; hidden errors become “ghost” problems.
Implement full‑stack monitoring; a single node failure can cascade across the system.
Ensure graceful shutdown and resource cleanup.
Intermittent problems are usually the result of insufficient attention to detail; repeated debugging builds experience that makes future investigations easier.
Selected Java Interview Questions