Designing a High‑Performance Message Notification System
This article explains how to design and implement a high‑performance, scalable message notification system, covering service partitioning, system architecture, first‑time and retry message handling, idempotency, dynamic routing, thread‑pool management, stability measures such as traffic surge handling, resource isolation, monitoring, and elastic scaling.
1 Service Partitioning
The discussion proceeds in four parts:
Service partitioning
System design
Stability assurance
Summary
2 System Design
2.1 First Message Sending
When a message‑sending request arrives, it can be processed via RPC or MQ. RPC avoids message loss, while MQ provides asynchronous decoupling and traffic shaping.
2.1.1 Idempotency Handling
To prevent duplicate processing, a Redis‑based idempotency check is used: a unique Redis key is set with a short TTL; if the same key already exists and the payload matches, the request is considered duplicate.
private boolean isDuplicate(MessageDto messageDto) {
    String redisKey = getRedisKey(messageDto);
    boolean isDuplicate = false;
    try {
        // SETNX with a 30-minute TTL: fails if the key already exists
        if (!RedisUtils.setNx(redisKey, messageDto, 30 * 60L)) {
            isDuplicate = true;
        }
        if (isDuplicate) {
            // Key exists: only treat the request as a duplicate if the payload matches
            MessageDto oldDTO = RedisUtils.getObject(redisKey);
            if (Objects.equals(messageDto, oldDTO)) {
                log.info("duplicate message detected, key={}", redisKey);
            } else {
                isDuplicate = false;
            }
        }
    } catch (Exception e) {
        // Fail open: a Redis error should not block message delivery
        log.error("idempotency check failed, key={}", redisKey, e);
        isDuplicate = false;
    }
    return isDuplicate;
}

2.1.2 Problem Service Dynamic Detector
The routing component distinguishes normal and problematic services. When a service shows high latency or failures within a configured time window, Sentinel marks it as problematic and routes subsequent requests to an exception executor, isolating it from normal traffic.
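The routing decision itself can be sketched as a small component. ChannelRouter, the detector callbacks, and the pool sizes below are illustrative assumptions, not the article's actual classes:

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class ChannelRouter {
    // Channels currently flagged as problematic by the detector
    private final Set<String> problemChannels = ConcurrentHashMap.newKeySet();
    // Normal traffic gets the large pool; problem traffic is isolated in a small one
    private final ExecutorService normalExecutor = Executors.newFixedThreadPool(8);
    private final ExecutorService exceptionExecutor = Executors.newFixedThreadPool(2);

    /** Called by the detector when a channel crosses its failure/latency threshold. */
    void markProblem(String channel) {
        problemChannels.add(channel);
    }

    /** Called when the channel's metrics recover within the time window. */
    void markRecovered(String channel) {
        problemChannels.remove(channel);
    }

    /** Problem channels are routed to the isolated pool; everything else stays normal. */
    ExecutorService route(String channel) {
        return problemChannels.contains(channel) ? exceptionExecutor : normalExecutor;
    }
}
```

Keeping the flag set separate from the executors means the detector can flip a channel between the two pools without touching in-flight tasks.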
2.1.3 Sentinel Sliding Window Implementation (Ring Buffer)
Sentinel uses a ring-buffer-based sliding window to count requests and failures. The window index is calculated as (currentTime / bucketSize) % windowCount, and the start time of each bucket is updated when the window rolls over.
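A minimal sketch of such a ring buffer, following the index formula above. SlidingWindow and WindowBucket are illustrative names, not Sentinel's internal classes:

```java
import java.util.concurrent.atomic.AtomicLong;

class WindowBucket {
    volatile long windowStart;                    // start time of this bucket (ms)
    final AtomicLong total = new AtomicLong();    // requests seen in this bucket
    final AtomicLong failed = new AtomicLong();   // failures seen in this bucket
}

class SlidingWindow {
    private final int windowCount;     // number of buckets in the ring
    private final long bucketSizeMs;   // time span covered by one bucket
    private final WindowBucket[] ring;

    SlidingWindow(int windowCount, long bucketSizeMs) {
        this.windowCount = windowCount;
        this.bucketSizeMs = bucketSizeMs;
        this.ring = new WindowBucket[windowCount];
        for (int i = 0; i < windowCount; i++) {
            ring[i] = new WindowBucket();
        }
    }

    /** Locate the bucket for the given time, resetting it when the ring rolls over. */
    WindowBucket currentBucket(long timeMs) {
        int idx = (int) ((timeMs / bucketSizeMs) % windowCount);
        long windowStart = timeMs - timeMs % bucketSizeMs;
        WindowBucket bucket = ring[idx];
        if (bucket.windowStart != windowStart) {
            synchronized (bucket) {               // only one thread resets the slot
                if (bucket.windowStart != windowStart) {
                    bucket.windowStart = windowStart;
                    bucket.total.set(0);
                    bucket.failed.set(0);
                }
            }
        }
        return bucket;
    }

    void record(long timeMs, boolean success) {
        WindowBucket bucket = currentBucket(timeMs);
        bucket.total.incrementAndGet();
        if (!success) {
            bucket.failed.incrementAndGet();
        }
    }
}
```

Because the buckets are reused in place, the structure holds a fixed amount of memory no matter how long the process runs.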
2.1.4 Dynamic Thread‑Pool Adjustment
Two separate thread pools are created: one for normal services and another for problematic services. The pool size is initially set based on CPU cores, and runtime metrics are used to adjust the size dynamically without changing the queue length.
ThreadPoolExecutor pool = new ThreadPoolExecutor(
        poolSize, poolSize,
        0L, TimeUnit.MILLISECONDS,
        new LinkedBlockingQueue<>());

The executor checks its queue size; if it exceeds a threshold, new tasks are rejected and placed into MQ for later processing, preventing overload.
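The off-loading step can be sketched like this. OverflowGuard and the MqProducer interface are stand-ins for the system's actual MQ client, and the threshold is an assumption:

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

class OverflowGuard {
    /** Stand-in for the real MQ client used for asynchronous persistence. */
    interface MqProducer {
        void send(Runnable task);
    }

    private final ThreadPoolExecutor pool;
    private final int queueThreshold;
    private final MqProducer mqProducer;

    OverflowGuard(ThreadPoolExecutor pool, int queueThreshold, MqProducer mqProducer) {
        this.pool = pool;
        this.queueThreshold = queueThreshold;
        this.mqProducer = mqProducer;
    }

    /** Returns true if the task ran in-process, false if it was off-loaded to MQ. */
    boolean trySubmit(Runnable task) {
        if (pool.getQueue().size() >= queueThreshold) {
            mqProducer.send(task);   // persist to MQ for delayed consumption
            return false;
        }
        pool.execute(task);
        return true;
    }
}
```

Checking the queue size before submission, rather than relying on a RejectedExecutionHandler, lets the guard make the MQ decision while the task is still cheap to redirect.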
2.2 Retry Message Sending
Messages that fail or exceed capacity are retried using a distributed scheduled‑task framework with shard‑broadcast. The framework ensures that each retry respects the same idempotency and routing rules.
public void init() {
    // One scheduler thread per registered task
    ScheduledExecutorService scheduledService =
            new ScheduledThreadPoolExecutor(taskRepository.size());
    for (Map.Entry<String, TaskHandler> entry : taskRepository.entrySet()) {
        final String taskName = entry.getKey();
        final TaskHandler handler = entry.getValue();
        scheduledService.scheduleAtFixedRate(new Runnable() {
            @Override
            public void run() {
                try {
                    if (handler.isBusy()) {
                        return; // skip this round if the handler is still working
                    }
                    handleTask(taskName, handler);
                } catch (Throwable e) {
                    logger.error(taskName + " task handler fail!", e);
                }
            }
        }, 30, 5, TimeUnit.SECONDS); // initial delay 30s, then every 5s
    }
}

Task execution is guarded by a distributed lock to avoid duplicate processing across nodes.
public void handleTask(String taskName, TaskHandler handler) {
    Lock lock = LockFactory.getLock(taskName);
    List<Task> taskList = null;
    if (lock.tryLock()) {
        try {
            // Only the node holding the lock fetches this task's batch
            taskList = getTaskList(taskName, handler);
        } finally {
            lock.unlock(); // unlock only when the lock was actually acquired
        }
    }
    if (taskList == null) {
        return;
    }
    handler.handleData(taskList);
}

2.2.1 ES and MySQL Data Synchronization
Message send records are stored in MySQL, while searchable logs are indexed in Elasticsearch (ES). Synchronization uses the update_time field as a version stamp; only records with a newer update_time than the one stored in ES are applied, ensuring consistency.
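The version check can be sketched as follows. SyncRecord and its field names are illustrative, assuming update_time is carried as a millisecond timestamp:

```java
class SyncRecord {
    final String id;
    final long updateTime;   // millisecond timestamp acting as the version stamp

    SyncRecord(String id, long updateTime) {
        this.id = id;
        this.updateTime = updateTime;
    }
}

class EsSynchronizer {
    /**
     * Apply the MySQL row to ES only if it is newer than what ES already holds;
     * a null esUpdateTime means the document does not exist in ES yet.
     */
    static boolean shouldApply(SyncRecord incoming, Long esUpdateTime) {
        return esUpdateTime == null || incoming.updateTime > esUpdateTime;
    }
}
```

Rejecting equal-or-older stamps makes the sync idempotent: replaying the same change stream any number of times leaves ES in the same state.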
3 Stability Assurance
3.1 Traffic Surge
Two‑level degradation is applied. For gradual spikes, the thread pool marks itself busy, and messages are off‑loaded to MQ for asynchronous persistence. For sudden spikes, Sentinel immediately routes excess traffic to MQ, where delayed consumption processes it when resources become available.
3.2 Resource Isolation for Problem Services
Problematic services are executed in a separate thread pool, preventing long‑running or failing calls from starving normal services.
3.3 Protection of Third‑Party Services
Third‑party integrations are wrapped with rate‑limiting and circuit‑breaker logic to avoid cascading failures.
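A minimal circuit-breaker sketch for such wrappers, assuming a consecutive-failure threshold and a fixed open interval; the class and both thresholds are illustrative, not a specific library's API:

```java
class CircuitBreaker {
    private final int failureThreshold;   // consecutive failures before opening
    private final long openMillis;        // how long to stay open before probing
    private int consecutiveFailures;
    private long openedAt = -1;           // -1 means the circuit is closed

    CircuitBreaker(int failureThreshold, long openMillis) {
        this.failureThreshold = failureThreshold;
        this.openMillis = openMillis;
    }

    /** Fail fast while open; allow a probe once openMillis has elapsed. */
    synchronized boolean allowRequest(long nowMs) {
        if (openedAt < 0) {
            return true;                  // closed: pass the call through
        }
        if (nowMs - openedAt >= openMillis) {
            openedAt = -1;                // half-open: let traffic probe the service
            consecutiveFailures = 0;
            return true;
        }
        return false;                     // open: reject without calling out
    }

    synchronized void recordSuccess() {
        consecutiveFailures = 0;
    }

    synchronized void recordFailure(long nowMs) {
        if (++consecutiveFailures >= failureThreshold) {
            openedAt = nowMs;             // trip the breaker
        }
    }
}
```

Failing fast while the breaker is open is what stops a slow third party from tying up threads and cascading the failure upstream.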
3.4 Middleware Fault Tolerance
The design anticipates middleware outages (e.g., MQ downtime) by providing fallback mechanisms and graceful degradation.
3.5 Comprehensive Monitoring System
A full‑stack monitoring solution tracks request latency, error rates, thread‑pool usage, and resource saturation, enabling early detection and rapid response.
3.6 Dual‑Active Deployment and Elastic Scaling
Services are deployed across multiple data centers with active‑active configuration and can scale elastically based on real‑time metrics, balancing cost and performance.
4 Summary
Effective message notification systems require careful service partitioning, robust system design, and comprehensive stability measures such as traffic shaping, resource isolation, monitoring, and elastic scaling. No single solution fits all scenarios; architects must tailor designs to specific business requirements.