
Sentinel Monitoring System: Real‑Time Business Log Monitoring and Incident Detection for an Airline Ticket Platform

The Sentinel system was built to provide real‑time, zero‑modification monitoring of airline ticket business services by consuming Tianwang logs through a Storm cluster, offering flexible rule configuration, addressing performance pitfalls, and planning future enhancements such as custom monitoring scripts and visual dashboards.

Tongcheng Travel Technology Center

Background

For a long time the airline ticket business system ran without any monitoring. Developers learned of problems only when customer service relayed user feedback, so issues often took days to detect and resolve, were frequently impossible to reproduce by then, and caused significant losses during high‑risk incidents.

Solution

To avoid repeat incidents, the team created the Sentinel system, a monitoring solution that detects anomalies in the business system as they occur. Sentinel builds on the existing Tianwang log infrastructure, which already writes logs to MQ: it consumes those logs through a Storm real‑time computation cluster, so the business services need no changes.

The design includes a powerful log rule configuration engine that can filter logs by Tianwang dimensions, extract text using cursor‑based slicing, and apply awk‑like expressions, covering roughly 99% of user needs.

Two illustrative screenshots show the cursor‑based extraction and manual extraction interfaces.
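To make the rule engine's three steps concrete, here is a minimal sketch in Python of dimension filtering, cursor‑based slicing, and an awk‑like field pick. It is not Sentinel's implementation: the `Rule` class, its field names, and the log record layout (a dict with a `message` key) are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Rule:
    """Hypothetical rule: dimension filters plus a cursor-based extraction."""
    dimensions: dict = field(default_factory=dict)  # e.g. {"module": "order"}
    start_marker: str = ""             # slice starts just after this text
    end_marker: str = ""               # slice ends just before this text ("" = end of line)
    separator: Optional[str] = None    # awk-like field separator (None = whitespace)
    field_index: Optional[int] = None  # 1-based like awk's $1, $2; None = whole slice


def matches(rule: Rule, log: dict) -> bool:
    """A log matches when every configured dimension has the expected value."""
    return all(log.get(key) == value for key, value in rule.dimensions.items())


def slice_by_cursor(text: str, start_marker: str, end_marker: str) -> Optional[str]:
    """Return the text between the two markers, or None if a marker is missing."""
    start = text.find(start_marker)
    if start == -1:
        return None
    start += len(start_marker)
    if not end_marker:
        return text[start:]
    end = text.find(end_marker, start)
    if end == -1:
        return None
    return text[start:end]


def extract(rule: Rule, log: dict) -> Optional[str]:
    """Filter by dimensions, slice by cursor, then optionally pick one field."""
    if not matches(rule, log):
        return None
    piece = slice_by_cursor(log.get("message", ""), rule.start_marker, rule.end_marker)
    if piece is None or rule.field_index is None:
        return piece
    fields = piece.split(rule.separator)
    if 1 <= rule.field_index <= len(fields):
        return fields[rule.field_index - 1]
    return None
```

For example, with a made‑up log line `pay failed [code=E1001|carrier=MU] retry` tagged `module=order`, a rule with markers `[` and `]`, separator `|`, and `field_index=1` would yield `code=E1001`, while a log from any other module would be skipped.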

Pitfalls Encountered

During trial runs, several machines built up massive consumption backlogs. The team first suspected hardware performance; after ruling that out, they found that a sampling component backed by a blocking queue was blocking threads under high concurrency. Refactoring this logic to asynchronous processing reduced the backlog but did not eliminate it: the HTTP timeout used when forwarding to Kafka was set to 3 minutes, which was far too long. Shortening the timeout dramatically improved consumption throughput.
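The first fix, moving sampling off the consuming thread, can be sketched as follows. This is an illustration in Python rather than the team's code (Storm topologies usually run on the JVM): the `AsyncSampler` class, its capacity, and the drop‑when‑full policy are assumptions. The point is only that the hot path never waits on the queue.

```python
import queue
import threading


class AsyncSampler:
    """Hypothetical sampler: the consumer thread never blocks on the queue."""

    def __init__(self, handler, capacity=1000):
        self._queue = queue.Queue(maxsize=capacity)  # bounded, so memory stays capped
        self._handler = handler
        self._lock = threading.Lock()
        self.dropped = 0  # samples discarded because the queue was full
        self.failed = 0   # samples whose handler raised
        worker = threading.Thread(target=self._drain, daemon=True)
        worker.start()

    def submit(self, sample) -> bool:
        """Called on the hot path: enqueue without waiting, drop if full."""
        try:
            self._queue.put_nowait(sample)
            return True
        except queue.Full:
            with self._lock:
                self.dropped += 1
            return False

    def _drain(self):
        """Background worker: slow handling happens here, off the hot path."""
        while True:
            sample = self._queue.get()
            try:
                self._handler(sample)
            except Exception:
                with self._lock:
                    self.failed += 1
            finally:
                self._queue.task_done()

    def flush(self):
        """Block until every queued sample has been handled (useful in tests)."""
        self._queue.join()
```

Dropping samples under pressure is a deliberate trade‑off in this sketch: sampling data is lossy by nature, so losing a few samples is usually cheaper than stalling log consumption. The same reasoning applies to the forwarding timeout, since a call that hangs holds its thread for as long as the timeout allows.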

The post‑incident analysis highlighted how strongly per‑message consumption latency affects overall system performance when logs are processed at high throughput.
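A back‑of‑the‑envelope model shows why a long timeout hurts so much. The sketch below assumes a fixed pool of consumer threads that each handle one message at a time, with a small fraction of forwarding calls hanging until the timeout fires. Only the 3‑minute timeout comes from the incident; the thread count, normal latency, and hang fraction in the usage note are hypothetical.

```python
def effective_throughput(threads: int, normal_latency_s: float,
                         timeout_s: float, hang_fraction: float) -> float:
    """Upper bound on messages per second for a pool of synchronous consumers.

    Mean latency is a weighted average of the normal case and the case where
    a call hangs until the timeout; throughput is then threads / mean latency.
    """
    mean_latency = (1 - hang_fraction) * normal_latency_s + hang_fraction * timeout_s
    return threads / mean_latency
```

With 8 threads, 5 ms normal latency, and 1% of calls hanging, a 180‑second timeout caps throughput at about 4.4 messages per second, while a 3‑second timeout allows about 229. Under these assumed numbers the rare hanging calls, not the normal ones, determine throughput.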

Future Planning

Since Sentinel’s launch, over a dozen business systems have been integrated, achieving timely anomaly detection. Upcoming iterations will expose monitoring data statistics APIs, support custom rule scripts, and provide front‑end visualizations for monitoring data.
