Prepare Your E‑Commerce System for Mega‑Sales: Proactive Prevention & Rapid Response
This article outlines a comprehensive PDCA‑based methodology for e‑commerce platforms to proactively prevent issues, quickly detect anomalies, and execute rapid decisions during large‑scale promotions, covering system goal definition, performance evaluation, capacity planning, SLA management, and team/process maturity.
1. Proactive Prevention
To avoid problems both during major promotions and in normal operation, JD adopts a PDCA‑based approach: Plan, Do, Check, Act. The method emphasizes analyzing system responsibilities, setting primary business goals, evaluating capacity, addressing bottlenecks, and iterating on solutions.
1.1 Determine Primary Preparation Goal for Each System
Order transaction system – guarantee that customers can place orders; non‑essential checks can be deferred.
OFC – keep production orders available without affecting capacity; allocate resources to POP orders when needed.
Logistics production system – satisfy worker operation requirements; non‑critical data can be processed asynchronously.
1.2 System Performance Evaluation
1.2.1 Performance Requirement Assessment
The company first estimates overall business volume, then each team breaks that estimate down into throughput targets for its own systems. Throughput is measured in transactions per second (TPS). Capacity planning considers business growth, per‑object resource consumption, data retention time, and the backlog that is acceptable during peaks.
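The breakdown above can be sketched as a few lines of arithmetic. This is a minimal illustration, not the method used at JD: the function names, the peak-share model, the 30% headroom, and every number in the usage note are assumptions chosen for the example.

```python
import math


def peak_tps(daily_orders: int, peak_share: float, peak_window_s: int) -> float:
    """Target TPS if `peak_share` of the day's orders arrive within `peak_window_s` seconds."""
    return daily_orders * peak_share / peak_window_s


def servers_needed(target_tps: float, per_server_tps: float, headroom: float = 0.3) -> int:
    """Servers required to carry `target_tps` while keeping `headroom` of each server spare."""
    usable_tps = per_server_tps * (1 - headroom)
    return math.ceil(target_tps / usable_tps)


def storage_bytes(objects_per_day: int, bytes_per_object: int, retention_days: int) -> int:
    """Storage implied by per-object resource consumption and data retention time."""
    return objects_per_day * bytes_per_object * retention_days


def backlog_after_peak(arrival_tps: int, capacity_tps: int, peak_seconds: int) -> int:
    """Requests left queued when arrivals exceed capacity for the whole peak."""
    return max(0, (arrival_tps - capacity_tps) * peak_seconds)
```

For instance, with a hypothetical 10,000,000 orders per day, 20% of them in the busiest hour, and servers measured at 50 TPS each, `servers_needed(peak_tps(10_000_000, 0.2, 3600), 50)` gives 16 servers. Comparing `backlog_after_peak` against the backlog the business can tolerate shows whether that capacity is enough or whether more must be added.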
1.2.2 System Performance Evaluation and Verification
Online monitoring via UMP for response time and throughput.
Worker input/output statistics.
Offline interface stress testing.
Online read‑interface stress testing.
Online write‑interface stress testing with test tags to avoid polluting production data.
Reduce the number of servers in the online cluster so that real traffic pushes per‑server load toward its limit.
Cross‑system “army drill” during promotions to validate end‑to‑end concurrency.
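One way to picture the test-tag approach for online write-interface stress testing is sketched below. It assumes a hypothetical `Order` record carrying a `tag` field and an in-memory `OrderStore`; the source does not describe how tags are actually carried or stored, so the names and the routing rule are illustrative only.

```python
from dataclasses import dataclass, field
from typing import List

TEST_TAG = "stress-test"  # hypothetical marker attached by the load generator


@dataclass
class Order:
    order_id: str
    amount_cents: int
    tag: str = ""

    @property
    def is_test(self) -> bool:
        return self.tag == TEST_TAG


@dataclass
class OrderStore:
    """In-memory stand-in for a database; tagged writes go to a separate area."""
    production: List[Order] = field(default_factory=list)
    shadow: List[Order] = field(default_factory=list)

    def save(self, order: Order) -> None:
        target = self.shadow if order.is_test else self.production
        target.append(order)

    def purge_test_data(self) -> int:
        """Drop everything written by stress tests and report how much was removed."""
        removed = len(self.shadow)
        self.shadow.clear()
        return removed


def notify_warehouse(order: Order, outbox: List[str]) -> bool:
    """Downstream side effects are skipped for tagged traffic."""
    if order.is_test:
        return False
    outbox.append(order.order_id)
    return True
```

The write path itself is exercised under load, while the tag keeps test rows out of production data and stops them from triggering real downstream work.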
1.3 SLA Confirmation
Upstream and downstream systems must agree on SLA metrics. If a dependency cannot meet its SLA, the services that rely on it cannot meet theirs either. The SLA therefore covers degradation options, timeout settings, and escalation procedures.
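The timeout and degradation parts of such an agreement can be sketched on the caller's side as follows. This is an illustrative sketch using Python's standard `concurrent.futures`; `call_with_sla`, `slow_dependency`, and the timeout values are hypothetical and not taken from the source.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

_pool = ThreadPoolExecutor(max_workers=4)


def call_with_sla(fn, timeout_s, fallback):
    """Run `fn` with a time budget; return (value, degraded).

    If the dependency is too slow or raises, the fallback value is
    served instead and `degraded` is True so the caller can report it.
    """
    future = _pool.submit(fn)
    try:
        return future.result(timeout=timeout_s), False
    except FutureTimeout:
        # cancel() only helps if the task has not started; a running
        # call keeps occupying its worker thread until it finishes.
        future.cancel()
        return fallback(), True
    except Exception:
        return fallback(), True


def slow_dependency():
    """Hypothetical downstream call that exceeds its agreed response time."""
    time.sleep(0.3)
    return "live stock level"
```

Calling `call_with_sla(slow_dependency, 0.05, lambda: "cached stock level")` returns the cached value with `degraded` set to True. In a real escalation procedure that flag would typically also feed an alert.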
1.4 System Refactoring
Key actions to improve processing capability:
Hardware upgrade – higher‑performance servers.
Horizontal scaling – ensure each layer can scale, add clusters as needed.
System decomposition – split tightly‑coupled functions into independent subsystems.
Maintain response speed – see Section 1.4.2.
Programming techniques – parallelize serial code, batch processing.
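The two programming techniques named in the last item can be sketched as below. This is a minimal illustration that assumes the work is I/O-bound (for example remote calls), where threads help; `in_batches` and `fetch_in_parallel` are hypothetical helper names.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import islice


def in_batches(items, size):
    """Yield lists of at most `size` items so each round trip carries many records."""
    it = iter(items)
    while chunk := list(islice(it, size)):
        yield chunk


def fetch_in_parallel(fetch, keys, max_workers=8):
    """Issue independent calls concurrently instead of one after another.

    Results come back in the same order as `keys`.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch, keys))
```

Batching reduces the number of round trips, while parallel calls reduce the total wait when each call is independent. Both shift load onto the downstream system, so the batch size and worker count should stay within what that system has agreed to handle.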
1.4.2 Maintain Response Speed
During promotions, the goal is to keep the original response time or accept limited degradation. Methods include:
Review code, SQL, middleware, and hardware for bottlenecks.
Introduce caching at appropriate layers (page, request‑response, object, variable).
Optimize external dependencies by analyzing and streamlining non‑core flows.
Decompose the system to isolate heavy components.
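As an illustration of object-level caching from the list above, the sketch below keeps loaded values for a fixed time before reloading them. The class name, the time-to-live policy, and the injectable clock are assumptions made for the example; the source does not specify a cache design.

```python
import time


class TTLCache:
    """Tiny object cache: entries are reused until their time-to-live expires."""

    def __init__(self, ttl_s, clock=time.monotonic):
        self._ttl_s = ttl_s
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._entries = {}   # key -> (expires_at, value)

    def get_or_load(self, key, loader):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]  # still fresh: skip the expensive load
        value = loader(key)
        self._entries[key] = (now + self._ttl_s, value)
        return value
```

A short time-to-live bounds how stale a value can get, which matters for data such as prices or stock during a promotion. This sketch is not thread-safe and never evicts entries, so a production cache would need locking and a size limit.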
1.4.3 Ensure Availability
1.5 Processing Capability Confirmation
The same evaluation and verification methods apply to confirm that the system can handle the expected load.
1.6 Pre‑Promotion System Health Check
Just as athletes have a check‑up before a competition, the system undergoes a comprehensive health check before the promotion to confirm that it is ready. Key inspection items are listed in the accompanying diagram.
2. Timely Problem Detection
Rapid detection relies on a complete monitoring and alerting system covering business, application, hardware, and network layers. The focus here is on application‑level monitoring.
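A minimal sketch of an application-level check is shown below: it keeps a sliding window of response times and raises an alert when a chosen percentile exceeds a threshold. The class, the window size, and the thresholds are hypothetical; the source's own monitoring system (UMP) is not modeled here.

```python
from collections import deque


class LatencyMonitor:
    """Sliding-window percentile check over the most recent response times."""

    def __init__(self, window, threshold_ms, percentile=95):
        self._samples = deque(maxlen=window)  # oldest samples drop off automatically
        self._threshold_ms = threshold_ms
        self._percentile = percentile         # integer, 0 to 100

    def record(self, elapsed_ms):
        self._samples.append(elapsed_ms)

    def percentile_ms(self):
        """Return the configured percentile of the window, or None if it is empty."""
        if not self._samples:
            return None
        ordered = sorted(self._samples)
        index = min(len(ordered) - 1, len(ordered) * self._percentile // 100)
        return ordered[index]

    def should_alert(self):
        value = self.percentile_ms()
        return value is not None and value > self._threshold_ms
```

Alerting on a percentile rather than the average keeps a handful of slow requests from being hidden by many fast ones, while a few isolated outliers still do not page anyone.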
3. Fast Decision‑Making and Execution
When incidents occur, having predefined emergency plans and rehearsals enables decisive actions to contain impact.
3.1 Emergency Plans
3.1.1 Risk Analysis
Common issues include service overload, dependency failures, and resource exhaustion.
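For the service-overload risk, one widely used mitigation is to limit the request rate at the entry point. The source does not say which mechanism its plans use; the token bucket below is a generic sketch, and the class name and parameters are assumptions.

```python
import time


class TokenBucket:
    """Admit at most `rate_per_s` requests per second, with bursts up to `burst`."""

    def __init__(self, rate_per_s, burst, clock=time.monotonic):
        self._rate_per_s = rate_per_s
        self._capacity = burst
        self._tokens = float(burst)  # start full so an initial burst is allowed
        self._clock = clock
        self._last = clock()

    def allow(self):
        """Return True if the request may proceed, False if it should be rejected."""
        now = self._clock()
        refill = (now - self._last) * self._rate_per_s
        self._tokens = min(self._capacity, self._tokens + refill)
        self._last = now
        if self._tokens >= 1:
            self._tokens -= 1
            return True
        return False
```

Rejected requests can be answered with a quick "busy, try again" response, which keeps the work already admitted within the capacity that was verified before the promotion.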
3.1.2 Important Elements of a Plan
3.1.3 Common Plans and Handling Methods
3.2 Fast Decision
Clear division of responsibilities.
Assess business impact.
Verify system health via logs and metrics.
Check machine load, response time, and throughput.
Monitor alerts.
Conduct regular drills and improve coordination.
3.3 Fast Execution
Prepare configuration interfaces and collect necessary information in advance.
Set up deployment tasks beforehand.
Provide training and practice.
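A prepared configuration interface can be as simple as a set of runtime switches that operators flip during an incident. The sketch below is an assumption about what such an interface might look like; `SwitchBoard`, the `recommendations` switch, and `render_product_page` are hypothetical names.

```python
import threading


class SwitchBoard:
    """Named on/off switches that can be changed at runtime without a deploy."""

    def __init__(self, defaults):
        self._lock = threading.Lock()
        self._flags = dict(defaults)

    def set(self, name, enabled):
        with self._lock:
            if name not in self._flags:
                # Reject unknown names so a typo during an incident is noticed.
                raise KeyError(name)
            self._flags[name] = bool(enabled)

    def enabled(self, name):
        with self._lock:
            return self._flags[name]


def render_product_page(product_id, switches):
    """Core content is always served; the non-core block obeys its switch."""
    page = {"product_id": product_id, "price": "core data"}
    if switches.enabled("recommendations"):
        page["recommendations"] = ["hypothetical related item"]
    return page
```

Declaring every switch up front, with its default, doubles as the list of degradation options that the emergency plan can reference and that drills can rehearse.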
4. Mature and Stable Team
Rapid business changes require teams that own both development and operation, enabling continuous improvement and stable performance.
5. Process and Standards
Large‑scale promotions involve many teams; therefore, standardized processes and clear documentation are essential. A sample checklist is shown below.
6. Summary
Proactively prevent issues by following the PDCA model and embedding non‑functional design early.
Enhance early detection capabilities to intervene before problems surface.
Accelerate decision‑making and problem resolution to minimize impact.
Use promotion‑level traffic forecasts for system evaluation, conduct stricter checks, perform cross‑system drills, and enforce on‑site duty during the promotion.
Successful execution relies on a mature team and well‑defined processes.
Efficient Ops
This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.