Design and Implementation of a Serverless Delayed Queue on AWS Using Kafka, SQS, and DynamoDB

This article presents the design and implementation of a serverless delayed message queue on AWS. It reviews existing approaches (RabbitMQ, ActiveMQ, RocketMQ, Redis) and proposes a hybrid Kafka‑SQS‑DynamoDB architecture, together with performance optimizations and operational results.

Ctrip Technology

Jun 29, 2023

Background : As cloud migration progresses, many applications on AWS require delayed‑queue functionality, but Kafka, the chosen message broker, lacks native support for delayed messages.

Requirements : Variable delay times (5 minutes to 7 days), low message volume (≤ 1 billion per day, ≤ 1 MB each), no loss (ordering optional), delay error ≤ 2 seconds, and high peak production (up to 10 million messages at once).

Goals : Achieve low cloud cost, low operational and development cost, high stability, and minimal delay error.

Product Selection : The team evaluated RabbitMQ, Apache ActiveMQ, SQS, and Redis. RabbitMQ and ActiveMQ have no serverless deployment option on AWS, and SQS supports a maximum delay of 15 minutes, far short of the required 7 days.

Solution Research : Analyzed industry approaches (TTL‑based RabbitMQ, timer‑based ActiveMQ, level‑based RocketMQ, Redis ZSET polling and key‑expiration). Each had trade‑offs in simplicity, error, and potential message loss.
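The sorted-set polling approach mentioned above can be illustrated with a small sketch. It is not the article's code: `DelayIndex` is a hypothetical in-memory stand-in for a Redis sorted set scored by due time, and the comments only name the Redis commands that would typically play the same role.

```python
import heapq
import itertools


class DelayIndex:
    """In-memory stand-in for a sorted set scored by due time (hypothetical helper)."""

    def __init__(self):
        self._heap = []
        self._seq = itertools.count()  # tie-breaker: equal due times pop in insert order

    def add(self, message, due_at):
        # Rough Redis counterpart: ZADD <key> <due_at> <message>
        heapq.heappush(self._heap, (due_at, next(self._seq), message))

    def pop_due(self, now):
        # Rough Redis counterpart: ZRANGEBYSCORE <key> -inf <now>, then ZREM each member
        due = []
        while self._heap and self._heap[0][0] <= now:
            due.append(heapq.heappop(self._heap)[2])
        return due


index = DelayIndex()
index.add("order-1 timeout", due_at=100.0)
index.add("order-2 timeout", due_at=160.0)
index.add("order-3 timeout", due_at=100.0)

first_poll = index.pop_due(now=100.5)   # polled half a second after the due time
second_poll = index.pop_due(now=150.0)  # nothing is due yet
third_poll = index.pop_due(now=160.0)
```

In this style of design the delay error is bounded by the polling interval, and a message that is popped but not yet delivered when the poller crashes can be lost, which matches the trade-offs noted above.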

Implementation Scheme : The team adopted a hybrid design combining Kafka, SQS, and DynamoDB. Normal messages go directly to their Kafka topics, while delayed messages are sent to a dedicated delay‑topic. A Service (deployed via ECS + Fargate) consumes the delay‑topic and routes messages with a delay under 15 minutes to SQS and messages with longer delays to DynamoDB. A timer‑driven Scheduler scans DynamoDB and moves eligible messages to SQS, and the Service finally emits them to the target Kafka topic. SQS FIFO queues with deduplication ensure that only a single Scheduler execution takes effect, which allows multiple replicas and avoids a single point of failure.
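The routing rule and the Scheduler scan described above can be sketched as follows. This is a minimal illustration under stated assumptions: `DelayedMessage`, `Stores`, `route`, and `scheduler_scan` are hypothetical names, and plain Python lists stand in for the SQS queue and the DynamoDB table so that the example runs without AWS access.

```python
from dataclasses import dataclass, field

SQS_MAX_DELAY_S = 15 * 60  # SQS accepts a per-message delay of at most 900 seconds


@dataclass
class DelayedMessage:
    target_topic: str
    payload: str
    due_at: int  # epoch seconds at which the message should reach target_topic


@dataclass
class Stores:
    """In-memory stand-ins for the SQS queue and the DynamoDB table (hypothetical)."""
    sqs: list = field(default_factory=list)       # (delay_seconds, message) pairs
    dynamodb: list = field(default_factory=list)  # messages parked for later


def route(msg, now, stores):
    """Service step: short delays go to SQS, longer ones are parked in DynamoDB."""
    remaining = max(0, msg.due_at - now)
    if remaining < SQS_MAX_DELAY_S:
        stores.sqs.append((remaining, msg))
        return "sqs"
    stores.dynamodb.append(msg)
    return "dynamodb"


def scheduler_scan(now, stores):
    """Scheduler step: move parked messages whose remaining delay now fits SQS."""
    still_parked, moved = [], 0
    for msg in stores.dynamodb:
        remaining = max(0, msg.due_at - now)
        if remaining < SQS_MAX_DELAY_S:
            stores.sqs.append((remaining, msg))
            moved += 1
        else:
            still_parked.append(msg)
    stores.dynamodb = still_parked
    return moved


stores = Stores()
now = 1_000_000
short = DelayedMessage("orders", "reminder", due_at=now + 5 * 60)
long = DelayedMessage("orders", "expiry", due_at=now + 2 * 3600)

routes = [route(short, now, stores), route(long, now, stores)]
moved_early = scheduler_scan(now + 3600, stores)           # 1 h left: stays parked
moved_late = scheduler_scan(now + 2 * 3600 - 600, stores)  # 10 min left: moves to SQS
```

Handing the final leg to the SQS delay keeps the DynamoDB scan coarse-grained: the scan only has to find messages that will fall due within the next 15 minutes, and the queue delay supplies the precise timing.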

Performance Optimizations : Increased Kafka partition count, scaled Service replicas, dynamically adjusted DynamoDB WCU/RCU, tuned ECS task scaling, and throttled message ingestion to smooth DynamoDB load.
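One common way to throttle ingestion so that bursts do not exceed provisioned write capacity is a token bucket. The article does not say which mechanism was used, so the sketch below is an assumption for illustration only. `TokenBucket` and its parameters are hypothetical, and the clock is passed in explicitly to keep the example deterministic.

```python
class TokenBucket:
    """Allows `rate` writes per second on average, with bursts up to `capacity`."""

    def __init__(self, rate, capacity, now=0.0):
        self.rate = rate
        self.capacity = capacity
        self.tokens = float(capacity)
        self.last = now

    def try_acquire(self, now):
        # Refill in proportion to elapsed time, capped at the bucket capacity.
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


bucket = TokenBucket(rate=2, capacity=4)

# A burst of 10 messages arrives at t=0: only the bucket capacity passes immediately.
burst = [bucket.try_acquire(now=0.0) for _ in range(10)]
accepted_in_burst = sum(burst)

# One second later two tokens have been refilled, so two more writes pass.
later = [bucket.try_acquire(now=1.0) for _ in range(3)]
```

When `try_acquire` returns `False`, a consumer could pause reading from the delay‑topic rather than drop the message, so that Kafka absorbs the burst while DynamoDB writes proceed at a steady rate.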

Practical Results : After six months in production, delay error stays within 2 seconds with near‑100 % success rate; peak delayed‑message throughput reaches 500 msg/s; DynamoDB read/write throttling is negligible; Kafka backlog stays under 60 k messages; timer‑FIFO pipeline processes ~1 msg/min per replica.
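The timer‑FIFO pipeline relies on FIFO deduplication so that several replicas can fire timers without triggering duplicate scans. One way this can work, shown here as an assumption rather than the article's actual scheme, is for every replica to derive the same deduplication ID from the current time bucket. `tick_dedup_id` and `FifoQueueStandIn` are hypothetical. With a real SQS FIFO queue the ID would be passed as `MessageDeduplicationId`, and SQS discards repeats within its 5-minute deduplication interval.

```python
def tick_dedup_id(now_s, interval_s=60):
    """Return an ID that is identical for every replica firing in the same interval."""
    return f"scan-{int(now_s // interval_s) * interval_s}"


class FifoQueueStandIn:
    """Tiny in-memory model of FIFO deduplication (hypothetical, not the SQS API)."""

    def __init__(self):
        self.messages = []
        self._seen = set()

    def send(self, body, dedup_id):
        if dedup_id in self._seen:
            return False  # duplicate ID: the message is dropped
        self._seen.add(dedup_id)
        self.messages.append(body)
        return True


queue = FifoQueueStandIn()

# Three replicas fire their timers a few seconds apart within the same minute.
results = [queue.send("scan", tick_dedup_id(t)) for t in (120, 123, 131)]

# The next minute produces a new ID, so one more trigger is accepted.
next_tick = queue.send("scan", tick_dedup_id(181))
```

Because any replica can produce the accepted trigger, losing one replica does not stop the schedule, while the queue still sees roughly one trigger per interval.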

Conclusion : The fully serverless, cloud‑native architecture delivers low‑cost, low‑maintenance delayed‑queue capabilities that meet stringent latency and reliability requirements.
