Online Monitoring Practices for DSP Advertising: Shifting Testing to Production
This article discusses the concept of test right‑shift—moving testing to post‑release production—by detailing a four‑layer online monitoring system for a DSP advertising platform, including interface‑level, UI‑level, revenue, and daily key‑metric monitoring, and shares real‑world incident examples.
Many readers are familiar with the ideas of "test left‑shift" and "test right‑shift". Test right‑shift moves testing to after product release, meaning monitoring in the production environment and obtaining real‑time user feedback. This article introduces an online monitoring approach based on that concept.
Consider a real incident. One evening, a program fault in an advertising service mistakenly blocked many channels. Because the fault occurred during a low‑traffic period, the daily revenue alarm, set at a 20% threshold, did not fire. The next day, during peak hours, product engineers noticed that revenue and request volume had dropped by 50%. A rapid investigation and rollback restored normal operation, but only after 16 hours and a significant loss.
For a DSP business that handles billions of ad requests daily, even a one‑minute anomaly during peak hours can result in substantial financial impact. Without effective monitoring, such a critical revenue stream becomes a hidden time bomb, as no one can guarantee continuous system stability or perform constant manual observation.
Based on this background, the team implemented a four‑layer monitoring system:
1. Interface‑level monitoring
Leveraging the existing Ialert monitoring system, each server is monitored at the interface level with business‑logic assertions. If an interface fails an assertion more than three times, an SMS alert is triggered. This ensures immediate awareness of any server‑side anomalies.
However, because the DSP workflow involves long upstream and downstream chains ending with material delivery to the ADX for bidding, upstream issues may not be detected by interface‑level monitoring alone.
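The "more than three failed assertions" rule from interface‑level monitoring can be sketched as a small counter. This is only an illustration, not Ialert's actual API: the class name `InterfaceMonitor`, the `send_alert` callback, and the choice to count consecutive failures (the text does not say whether failures must be consecutive) are all assumptions.

```python
from collections import defaultdict
from typing import Callable, Dict

FAILURE_LIMIT = 3  # alert once an interface has failed MORE than three times


class InterfaceMonitor:
    """Hypothetical sketch: counts consecutive assertion failures per interface."""

    def __init__(self, send_alert: Callable[[str], None], limit: int = FAILURE_LIMIT):
        self.send_alert = send_alert  # e.g. a function that sends an SMS (assumed)
        self.limit = limit
        self.failures: Dict[str, int] = defaultdict(int)

    def record(self, interface: str, response: dict,
               assertion: Callable[[dict], bool]) -> None:
        """Run one business-logic assertion against one probe response."""
        try:
            ok = bool(assertion(response))
        except Exception:
            ok = False  # an assertion that raises counts as a failure

        if ok:
            self.failures[interface] = 0  # assumption: a success resets the streak
            return

        self.failures[interface] += 1
        # Fire exactly once, when the streak first exceeds the limit.
        if self.failures[interface] == self.limit + 1:
            self.send_alert(
                f"{interface}: {self.failures[interface]} consecutive failed assertions"
            )
```

A business‑logic assertion here is any function over the response, for example checking that a bid response has status 200 and contains at least one ad. Alerting once per streak, rather than on every failure, is one way to keep SMS volume down.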
2. UI‑level monitoring
Since the ultimate goal is proper ad display and click‑through, the team introduced UI‑level monitoring using PhantomJS to analyze the DOM of the final ad page, verify correct rendering, and simulate clicks to ensure navigation to the intended landing page. A pre‑built test page that reliably shows ads was used, and monitoring focused on specific ad slots, checking that images, text, and target URLs are correct.
This approach works well for detecting rendering or click‑through issues on specific channels, but it may still miss problems like accidental channel blocking that do not affect the UI of the monitored slots.
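The slot checks described above (image, text, and target URL) can be illustrated on an already rendered HTML string. This sketch deliberately swaps PhantomJS for Python's standard `html.parser`, so it covers only the DOM assertions and not rendering or click simulation. The slot id, URLs, and function names are hypothetical.

```python
from html.parser import HTMLParser
from typing import List

# Tags without a closing tag; they must not change the nesting depth.
VOID_TAGS = {"img", "br", "hr", "input", "meta", "link"}


class AdSlotParser(HTMLParser):
    """Collects images, links, and text found inside one ad slot element."""

    def __init__(self, slot_id: str):
        super().__init__()
        self.slot_id = slot_id
        self.depth = 0          # > 0 while the parser is inside the slot
        self.found = False
        self.images: List[str] = []
        self.links: List[str] = []
        self.text: List[str] = []

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if self.depth == 0:
            if attr.get("id") == self.slot_id:
                self.found = True
                self.depth = 1
            return
        if tag == "img" and attr.get("src"):
            self.images.append(attr["src"])
        if tag == "a" and attr.get("href"):
            self.links.append(attr["href"])
        if tag not in VOID_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if self.depth > 0 and tag not in VOID_TAGS:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0 and data.strip():
            self.text.append(data.strip())


def check_ad_slot(html: str, slot_id: str, expected_url: str) -> List[str]:
    """Return a list of problems; an empty list means the slot looks healthy."""
    parser = AdSlotParser(slot_id)
    parser.feed(html)
    parser.close()
    if not parser.found:
        return [f"slot {slot_id!r} not found"]
    problems = []
    if not parser.images:
        problems.append("no image with a src")
    if not parser.text:
        problems.append("no ad text")
    if expected_url not in parser.links:
        problems.append("landing URL missing or wrong")
    return problems
```

In a real setup the HTML would come from a headless browser after scripts have run, and a non‑empty problem list would feed the same alerting channel as the other layers.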
3. Revenue monitoring
Revenue is the most direct metric for ad business health. The company's BA system can generate revenue data every five minutes. The monitoring solution queries this data at the same interval, compares the current value with the value for the same period on the previous day and in the previous week, and triggers an SMS alert if the deviation exceeds a predefined threshold.
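The comparison logic can be sketched as follows. The function names are hypothetical, the three revenue values are assumed to have been fetched from the BA system already, and the 0.2 default is only a placeholder because the article does not state the threshold used for this monitor.

```python
from typing import List, Optional


def deviation(current: float, baseline: float) -> Optional[float]:
    """Relative change versus a baseline, or None if the baseline is unusable."""
    if baseline <= 0:
        return None
    return (current - baseline) / baseline


def revenue_alerts(current: float, previous_day: float, last_week: float,
                   threshold: float = 0.2) -> List[str]:
    """Compare one five-minute revenue value against both baselines.

    threshold is a placeholder value, not the team's real setting.
    """
    alerts = []
    baselines = (
        ("previous day", previous_day),
        ("same period last week", last_week),
    )
    for label, baseline in baselines:
        change = deviation(current, baseline)
        # Deviation in either direction is treated as abnormal.
        if change is not None and abs(change) > threshold:
            alerts.append(f"revenue {change:+.0%} vs {label}")
    return alerts
```

For example, `revenue_alerts(50, 100, 100)` returns two messages, one per baseline, each reporting a 50% drop. Because every five‑minute window is compared with the same window on earlier days, a drop during a low‑traffic period is measured against an equally low baseline instead of being diluted in a daily total.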
4. Daily key‑metric monitoring
In addition to revenue, other critical metrics such as request count, successful bids, clicks, CPM, and CPC are tracked via the business monitoring platform. Daily email reports summarize these metrics and show week‑over‑week and day‑over‑day comparisons, helping teams quickly spot abnormal trends.
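A daily report body with day‑over‑day (DoD) and week‑over‑week (WoW) columns might be assembled like this. It is a minimal sketch: the metric keys and the plain‑text layout are assumptions, and the three input dictionaries stand in for whatever the business monitoring platform actually exports.

```python
from typing import Dict, Optional

# Assumed keys for the metrics named in the article.
METRICS = ("requests", "successful_bids", "clicks", "cpm", "cpc")


def pct_change(current: float, baseline: Optional[float]) -> str:
    """Format the relative change, or 'n/a' when the baseline is missing or zero."""
    if not baseline:
        return "n/a"
    return f"{(current - baseline) / baseline:+.1%}"


def build_report(today: Dict[str, float],
                 yesterday: Dict[str, float],
                 last_week: Dict[str, float]) -> str:
    """Build a fixed-width text table suitable for an email body."""
    lines = [f"{'metric':<16}{'today':>14}{'DoD':>9}{'WoW':>9}"]
    for name in METRICS:
        value = today[name]
        lines.append(
            f"{name:<16}{value:>14,.2f}"
            f"{pct_change(value, yesterday.get(name)):>9}"
            f"{pct_change(value, last_week.get(name)):>9}"
        )
    return "\n".join(lines)
```

Placing both comparisons side by side helps separate a one‑day anomaly from a weekly pattern: a metric that is down against yesterday but flat against last week may simply reflect the day of the week.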
Monitoring effectiveness
After the four‑layer monitoring system was deployed, several incidents were caught and resolved quickly. In one case, a server‑status anomaly caused by an operational mistake triggered repeated SMS alerts, and developers fixed the issue within 15 minutes. In another, massive abnormal traffic on a weekend exhausted resources, but the alerts enabled rapid remediation and prevented prolonged downtime.
Conclusion
The DSP team’s experience shows that online monitoring still generates noise from false positives and upstream data issues. Even so, reducing false alerts while increasing sensitivity shortens problem‑discovery time and significantly improves the reliability of the service once testing has shifted right.
360 Tech Engineering
Official tech channel of 360, building the most professional technology aggregation platform for the brand.