Red‑Blue Adversarial Testing for a Big Data Platform: Process, Benefits, and Best Practices
This article outlines the red‑blue adversarial testing process for a big‑data platform during the Double‑Eleven promotion, detailing its purpose, benefits, step‑by‑step execution, common issues, and recommendations to improve system reliability and security.
During major promotional events, such as Double‑Eleven, the big‑data platform conducts comprehensive preparation work including full‑link stress testing, disaster recovery drills, degradation drills, rate limiting, monitoring inspections, and chaos engineering (red‑blue confrontation). The red‑blue exercise has become increasingly important as platform complexity grows.
Red‑blue confrontation is a common security exercise that discovers and rectifies deep‑level security risks in both internal and external network assets while ensuring stable business operation. It integrates threat monitoring, emergency response, and protection capabilities to conduct realistic attacks and defenses, improving both technical and management aspects of security.
In this context, the blue team simulates attackers and the red team defends, testing system resilience and high availability under controlled conditions.
Benefits of Red‑Blue Confrontation
Ensures monitoring alerts are effective, timely, and accurate.
Enhances system reliability by identifying potential failure points.
Reduces risk by uncovering vulnerabilities that could be exploited.
Provides economical testing that simulates production scenarios without endangering the live environment.
Red‑Blue Practice Steps
1. Exercise Announcement : Organize a kickoff meeting, set the timeline, assign contacts for the real‑time and offline services, and notify business users via email or instant messaging.
2. Personnel Assignment and Task Distribution : Designate a main responsible person, real‑time and offline contacts for both attack (blue) and defense (red) sides, backup personnel, and monitoring staff.
3. Pre‑Exercise Scenario Collection : Define application scope (prefer L0/L1), gather attack scenarios for both real‑time and offline services, and list target applications.
Example URLs for application lookup: http://XXX.jd.com/dashboard/4/node/XXX and http://XXX.jd.com/health
Collected fault scenarios include high CPU, memory, and disk usage; network latency and packet loss; process termination; MySQL/JimDB latency; and cluster‑level issues such as HDFS queue saturation or ZooKeeper node failure.
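The collected scenarios can be kept as a small structured catalog so that both teams work from the same list. The following is only a minimal sketch: the `FaultScenario` type, the `Scope` grouping, and the scenario identifiers are hypothetical names chosen for illustration, not part of the platform described here.

```python
from dataclasses import dataclass
from enum import Enum


class Scope(Enum):
    HOST = "host"              # single-machine resource and process faults
    DEPENDENCY = "dependency"  # middleware such as MySQL or JimDB
    CLUSTER = "cluster"        # cluster-level components


@dataclass(frozen=True)
class FaultScenario:
    name: str                          # hypothetical identifier
    scope: Scope
    needs_manual_restart: bool = False  # e.g. process termination


# Hypothetical catalog built from the fault types listed above.
CATALOG = [
    FaultScenario("high_cpu", Scope.HOST),
    FaultScenario("high_memory", Scope.HOST),
    FaultScenario("high_disk_usage", Scope.HOST),
    FaultScenario("network_latency", Scope.HOST),
    FaultScenario("network_packet_loss", Scope.HOST),
    FaultScenario("process_termination", Scope.HOST, needs_manual_restart=True),
    FaultScenario("mysql_latency", Scope.DEPENDENCY),
    FaultScenario("jimdb_latency", Scope.DEPENDENCY),
    FaultScenario("hdfs_queue_saturation", Scope.CLUSTER),
    FaultScenario("zookeeper_node_failure", Scope.CLUSTER),
]


def by_scope(catalog, scope):
    """Return the scenarios in `catalog` that belong to `scope`."""
    return [s for s in catalog if s.scope is scope]


cluster_faults = by_scope(CATALOG, Scope.CLUSTER)
```

Grouping by scope makes it easy to check that an exercise covers host, dependency, and cluster faults rather than only one kind.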
4. Red‑Blue Exercise Process
Before the attack, the main responsible person sends a notification (template shown below) to the group.
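@All members
[Important Notice]
From 17:30 to 21:30 today, the big‑data platform (real‑time + offline) will run a red‑blue adversarial exercise, with surprise fault injections at unannounced times. Please post your handling progress in this group as you go. The exercise has three phases: problem discovery, root‑cause diagnosis, and fault handling.
Post a message as soon as each phase (problem discovery, fault diagnosis, fault handling) is confirmed. Do not wait until the end to send a summary!
1. Problem discovery
[Problem Discovered]
Product / service name:
(1) Received a phone or Dongdong (咚咚) alert; alert content: xxx
or (2) The radar dashboard turned red (screenshot xx); investigation started
2. Cause analysis
[Fault Diagnosis]
Product / service name: the cause of issue xx has been identified; brief description of the cause.
3. Fault handling
[Fault Handling]
Product / service name: issue xx has been handled and service has recovered; attach an alert‑recovery or monitoring screenshot.
The blue team creates and executes tasks on the chaos engineering platform based on the collected scenarios.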
Key points: avoid revealing exact attack times, prefer production‑level applications for realism, and be aware of limitations such as kernel bugs that prevent certain network fault simulations.
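One way to keep attack times unpredictable is to draw them at random inside the announced window. The sketch below assumes the 17:30 to 21:30 window from the notification template. The function name, the number of attacks, the minimum gap between attacks, and the calendar date are hypothetical choices made for illustration.

```python
import random
from datetime import datetime, timedelta


def random_attack_times(start, end, count, min_gap, rng):
    """Pick `count` times in [start, end], each at least `min_gap` apart."""
    if count < 1:
        raise ValueError("count must be at least 1")
    # Shrink the window by the total gap, draw sorted offsets, re-add the gaps.
    window = (end - start) - min_gap * (count - 1)
    if window < timedelta(0):
        raise ValueError("window too short for the requested attacks")
    offsets = sorted(rng.uniform(0, window.total_seconds()) for _ in range(count))
    return [
        start + timedelta(seconds=off) + min_gap * i
        for i, off in enumerate(offsets)
    ]


# Arbitrary example date; only the time-of-day window comes from the template.
window_start = datetime(2024, 1, 1, 17, 30)
window_end = datetime(2024, 1, 1, 21, 30)
schedule = random_attack_times(
    window_start, window_end, count=5,
    min_gap=timedelta(minutes=20), rng=random.Random(),
)
```

Only the blue team should see the resulting schedule. The minimum gap leaves the red team time to recover from one fault before the next one arrives.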
5. Red Team Defense and System Recovery : After attacks, the red team receives alerts, follows predefined runbooks to remediate, and may need to manually restart services for scenarios like process termination.
6. Result Collection and Review : The main responsible person reviews outcomes, documents issues, and gathers feedback. Common problems include delayed response, incomplete handling, and missed alerts due to misconfigured or disabled monitoring rules.
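The three phase messages from the notification template give the reviewer timestamps to work with. As a rough sketch, each attack can be recorded with its injection time and the time of each phase message, then checked for the common problems named above. The `AttackRecord` type, the `review` function, and the five‑minute detection limit are assumptions made for illustration, not values from the exercise.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class AttackRecord:
    service: str
    injected_at: datetime                     # when the blue team started the fault
    discovered_at: Optional[datetime] = None  # "problem discovered" message
    diagnosed_at: Optional[datetime] = None   # "fault diagnosis" message
    recovered_at: Optional[datetime] = None   # "fault handling" message


def review(record, detect_limit=timedelta(minutes=5)):
    """Return issue labels for one attack; an empty list means no issue found."""
    if record.discovered_at is None:
        # Nobody reported the fault: the alert was missed or never fired.
        return ["missed alert"]
    issues = []
    if record.discovered_at - record.injected_at > detect_limit:
        issues.append("delayed response")
    if record.diagnosed_at is None or record.recovered_at is None:
        issues.append("incomplete handling")
    return issues


t0 = datetime(2024, 1, 1, 18, 0)  # arbitrary example time
example = AttackRecord("example-service", t0, discovered_at=t0 + timedelta(minutes=10))
example_issues = review(example)
```

Here `example_issues` contains both "delayed response" and "incomplete handling", because discovery took ten minutes and no diagnosis or recovery message was recorded.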
7. Post‑Exercise Retrospective : Organize a review meeting with architects, assess alarm levels, extend attack durations, promote regular chaos experiments, and refine emergency response procedures.
8. Platform Improvement Suggestions : Enable batch creation of chaos tasks, provide APIs for regular chaos experiments, and integrate MDC/UMP alert views within the platform.
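To illustrate the batch‑creation suggestion, the sketch below expands a list of target applications and fault types into one task per combination. The payload fields (`app`, `fault`, `duration_s`) and the application names are hypothetical. The chaos platform's real API is not described in this article, so no endpoint or client call is shown.

```python
import json
from itertools import product


def build_batch(apps, faults, duration_s):
    """Build one task description per (application, fault) pair."""
    return [
        {"app": app, "fault": fault, "duration_s": duration_s}
        for app, fault in product(apps, faults)
    ]


# Hypothetical inputs; a real batch would come from the scenario collection step.
batch = build_batch(
    apps=["example-app-a", "example-app-b"],
    faults=["high_cpu", "process_termination"],
    duration_s=300,
)

# Serialized form that a batch-creation API could accept (format is an assumption).
payload = json.dumps(batch, indent=2)
```

Generating tasks from data rather than creating them by hand also makes it straightforward to rerun the same set as a regular chaos experiment.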
Conclusion
Through this red‑blue exercise, the big‑data platform improved its resilience to failures, lowered the likelihood of production faults, strengthened developers' incident‑handling skills, and established an efficient, repeatable testing methodology.
JD Retail Technology
Official platform of JD Retail Technology, delivering insightful R&D news and a deep look into the lives and work of technologists.