
How InnoLive Cut Monitoring Costs by 86% with Nightingale

This article details InnoLive's migration from Open‑Falcon to the Nightingale monitoring platform, describing the pain points of their previous system, the selection process, deployment architecture, collection practices, and the substantial cost and performance benefits achieved.

Inke Technology

Background

InnoLive (映客直播) needed a robust, efficient monitoring system for its live‑streaming services. The existing Open‑Falcon solution suffered from high machine resource consumption, limited alerting capabilities, and cumbersome data collection, prompting a search for a better alternative.

Key Pain Points of Open‑Falcon

Heavy resource usage : High disk I/O on SSDs and sustained memory pressure, requiring frequent scaling.

Unmet new requirements : Inability to view raw historical data beyond a few hours, limited multi‑metric calculations, and slow dashboard loading.

Alerting limitations : Few alert functions, no support for composite or ratio alerts, and rigid rule matching.

Collection difficulties : No easy integration with third‑party systems such as Kafka or Kubernetes.

Selection Criteria

The team evaluated candidates on three criteria: machine resource consumption, completeness of the data‑collection ecosystem, and fit with business needs. After testing, they chose Nightingale (夜莺监控) v5, which covers in a single product the roles usually filled by a Prometheus + Alertmanager + Grafana stack.

About Nightingale

Nightingale is an open‑source, cloud‑native monitoring system that integrates data collection, visualization, alerting, and analysis. It offers out‑of‑the‑box enterprise‑grade capabilities and tight integration with cloud‑native ecosystems.

Deployment Architecture

The platform is split into domestic and overseas clusters across four data centers (A, B, C, D). Core components include:

n9e‑webapi : Configuration management for alerts, silences, subscriptions, scripts, and permissions.

n9e‑server : Alert engine that evaluates PromQL queries against time‑series stores.

thanos : Receives and stores three months of raw data before archiving to OSS.

victoriametrics : High‑performance TSDB handling the majority of series with modest CPU, memory, and disk usage.

prometheus : Deployed in overseas data centers for metric collection, with careful reload handling.
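
To make the alert engine's role concrete, here is a minimal sketch of one evaluation cycle: send a PromQL expression to a Prometheus‑compatible instant‑query endpoint and keep the series above a threshold. It is not Nightingale's implementation. The host name, the metric names, and the helper functions are assumptions for illustration; only the `/api/v1/query` path and the response layout come from the Prometheus HTTP API. The expression is a ratio of two rates, the kind of alert the old system could not express.

```python
import json
import urllib.parse
import urllib.request

# Hypothetical address of a Prometheus-compatible query endpoint (assumption).
QUERY_BASE = "http://tsdb.example.internal:8428"

# Hypothetical metric names; the ratio form is the point, not the names.
ERROR_RATIO = (
    'sum(rate(http_requests_total{status=~"5.."}[5m]))'
    " / "
    "sum(rate(http_requests_total[5m]))"
)


def build_query_url(base, promql):
    """Return the instant-query URL of the Prometheus HTTP API for a PromQL expression."""
    query = urllib.parse.urlencode({"query": promql})
    return base.rstrip("/") + "/api/v1/query?" + query


def firing_series(response, threshold):
    """Return (labels, value) pairs from an instant-vector response above threshold."""
    if response.get("status") != "success":
        raise ValueError("query failed: %r" % response.get("error"))
    hits = []
    for sample in response["data"]["result"]:
        _timestamp, raw = sample["value"]  # [unix_time, "number as string"]
        value = float(raw)
        if value > threshold:
            hits.append((sample["metric"], value))
    return hits


def evaluate(base, promql, threshold, timeout=5.0):
    """Run one cycle: query the endpoint, parse the JSON, return series above threshold."""
    url = build_query_url(base, promql)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return firing_series(json.load(resp), threshold)
```

Calling `evaluate(QUERY_BASE, ERROR_RATIO, 0.05)` would return the series whose error ratio exceeds 5%, provided the endpoint is reachable. A real engine adds scheduling, "for" durations, and notification routing on top of this loop.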

Data Collection Practices

Machine metrics : Collected via Telegraf wrapped in n9e‑agent for unified deployment.

Business metrics : Reported directly through an SDK and forwarded via the Open‑Falcon /v1/push API to Nightingale.

Middleware metrics : Scraped by Prometheus exporters deployed per data center.

Kubernetes metrics : Captured by existing K8s Prometheus instances and remotely written to n9e‑server.
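
As an illustration of the business‑metric path, the sketch below builds data points in the Open‑Falcon push format (a JSON array of objects with `endpoint`, `metric`, `timestamp`, `step`, `value`, `counterType`, and `tags`) and posts them to a `/v1/push` URL. The host name, the metric name, and the `biz` tag are hypothetical, and the team's actual SDK is not shown in the article; treat this as a sketch of the wire format rather than of their code.

```python
import json
import time
import urllib.request

# Hypothetical receiver address; only the /v1/push path comes from the article.
PUSH_URL = "http://monitor.example.internal/v1/push"


def falcon_item(endpoint, metric, value, tags=None, step=60, timestamp=None):
    """Build one data point in the Open-Falcon push format."""
    # Open-Falcon carries tags as a single "k1=v1,k2=v2" string.
    pairs = sorted((tags or {}).items())
    tag_str = ",".join("%s=%s" % (key, val) for key, val in pairs)
    return {
        "endpoint": endpoint,
        "metric": metric,
        "timestamp": int(time.time()) if timestamp is None else int(timestamp),
        "step": step,  # reporting interval in seconds
        "value": value,
        "counterType": "GAUGE",
        "tags": tag_str,
    }


def push(items, url=PUSH_URL, timeout=3.0):
    """POST a list of data points as a JSON array and return the HTTP status code."""
    request = urllib.request.Request(
        url,
        data=json.dumps(items).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return resp.status
```

A caller would batch points, for example `push([falcon_item("host-1", "room.online", 42, tags={"biz": "live"})])`, so that one HTTP request carries many samples.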

Results and Benefits

The migrated platform now handles over 500 million series (actual ~250 million) with stable performance. Key gains include:

86% reduction in machine costs, shrinking the fleet from ~80 nodes to just over 20.

Lower collection overhead: metrics from Kafka, ZooKeeper, Consul, and similar middleware are now gathered through Prometheus exporters without custom code.

More flexible alerting and dashboarding that meet current business demands.

Improved performance: VictoriaMetrics runs on standard cloud disks with modest resource usage, whereas the legacy Open‑Falcon graph component required high‑end SSDs.
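
The node counts and the cost figure above measure different things, and a short calculation shows how they relate. This is a sketch under stated assumptions: "~80" is taken as 80 nodes, "just over 20" as 21, and cost is modeled as node count times a uniform per‑node price. None of these are given exactly in the article.

```python
def reduction(before, after):
    """Fractional reduction when going from `before` to `after`."""
    return 1.0 - after / before


def implied_per_node_cost_ratio(cost_reduction, nodes_before, nodes_after):
    """New per-node cost divided by old per-node cost.

    Assumes total cost = node count x one uniform per-node price (a simplification).
    """
    remaining_cost = 1.0 - cost_reduction
    remaining_nodes = nodes_after / nodes_before
    return remaining_cost / remaining_nodes


# 80 and 21 are assumed readings of "~80" and "just over 20".
node_cut = reduction(80, 21)
per_node = implied_per_node_cost_ratio(0.86, 80, 21)
print("node count reduction: %.0f%%" % (node_cut * 100))
print("implied new/old per-node cost: %.2f" % per_node)
```

Under these assumptions the node count falls by roughly 74%, so an 86% cost reduction implies that each remaining node also costs about half as much as before. That would be consistent with moving from high‑end SSD machines to standard cloud disks.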

Implementation Tips

During rollout, the team made several adaptations to ease developer adoption:

Optimized Telegraf intervals (e.g., CPU and disk I/O every 15 s, disk metrics every 60 s).

Enabled inputs.exec for custom scripts and added log‑keyword monitoring via scripts or mtail.

Automated business‑line tagging for metrics to improve query speed and prevent cross‑team alerts.

Extended alerts with SMS and phone notifications via a notify.py wrapper.

Automated dashboard imports per business line and synchronized user accounts with the permission system, creating dedicated dev and SRE teams.
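
The log‑keyword tip can be sketched as a small script for Telegraf's exec input. The script below counts keyword occurrences in a log file and prints the result in InfluxDB line protocol, which Telegraf can parse when the exec input is configured with `data_format = "influx"`. The script name, the measurement name `log_keyword`, and the tag names are hypothetical; the article does not show the team's actual scripts.

```python
import sys


def escape_tag(value):
    """Escape the characters that are special in line-protocol tag values."""
    return value.replace(",", "\\,").replace("=", "\\=").replace(" ", "\\ ")


def count_keywords(lines, keywords):
    """Count, per keyword, how many lines contain it."""
    counts = {keyword: 0 for keyword in keywords}
    for line in lines:
        for keyword in keywords:
            if keyword in line:
                counts[keyword] += 1
    return counts


def to_line_protocol(measurement, path, counts):
    """Render one line per keyword: measurement,file=...,keyword=... count=Ni"""
    rendered = []
    for keyword in sorted(counts):
        rendered.append(
            "%s,file=%s,keyword=%s count=%di"
            % (measurement, escape_tag(path), escape_tag(keyword), counts[keyword])
        )
    return rendered


def main(argv):
    # Hypothetical usage: log_keywords.py /var/log/app.log ERROR PANIC
    path, keywords = argv[1], argv[2:]
    with open(path, errors="replace") as handle:
        counts = count_keywords(handle, keywords)
    print("\n".join(to_line_protocol("log_keyword", path, counts)))


# Does nothing unless a file path and at least one keyword are given.
if __name__ == "__main__" and len(sys.argv) >= 3:
    main(sys.argv)
```

This version rescans the whole file on every run, so each sample is the total number of matching lines currently in the file. Tracking a read offset between runs, or using mtail as the article mentions, would avoid the repeated scan on large logs.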

Conclusion

Adopting Nightingale solved Open‑Falcon's high cost and limited functionality issues while supporting future growth. The authors express gratitude to the Nightingale community and look forward to continued feature enhancements.
