How New Oriental Standardized Its Observability System to Cut Costs and Boost Efficiency
At the 21st GOPS Global Operations Conference, New Oriental's senior operations manager Qi Chen detailed the demand, technical, and focus pressures that drove a phased, full‑process observability standardization, leveraging OpenTelemetry, Telegraf, Loki and CMDB tagging to achieve cost reduction and higher stability.
Conference Background
On October 26‑27, 2023, the 21st GOPS Global Operations Conference was held in Shanghai, featuring over 80 experts from institutions such as the China Academy of Information and Communications Technology, Agricultural Bank, and major telecom and finance sectors discussing DevOps, AIOps, SRE, and security.
Presentation Overview
Qi Chen, Senior Operations Manager at New Oriental Technology Education Group, presented “Helping Reduce Costs and Increase Efficiency: Standardizing New Oriental’s Observability System.”
Key Pressures
1. Demand Pressure
Management was not standardized, and teams wanted fewer, consolidated alarm channels.
Need to integrate OpenTelemetry data across the organization.
Developers require automated ingestion of cloud‑provided data, automated configuration management, and unified internal data queries.
2. Technical Pressure
The existing observability stack was fragmented and complex, leaving many teams without clear migration paths.
3. Focus Pressure
Balancing SRE responsibilities with infrastructure management required a standardized, platform‑based approach to maintain both business and observability stability while reducing costs.
Standardization Process
In response to these three pressures, the team implemented a phased, end‑to‑end standardization. Monitoring was divided into four layers (user experience, application business, services, and infrastructure), and appropriate objects, metrics, and tools were selected for each layer. After consolidating data in OpenTelemetry, the team introduced event data and CMDB tagging to enable an observability query panel and alerting system centered on the application ID.
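To make the application‑ID‑centric idea concrete, here is a minimal sketch of how telemetry records might be enriched with CMDB tags and grouped per application. It is an illustration under stated assumptions, not the implementation presented in the talk: the `Signal` type, the `CMDB` dictionary, the application ID `app-1001`, and the tag names are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Signal:
    """One telemetry record; the field names are illustrative assumptions."""
    kind: str          # "metric", "log", "trace", or "event"
    name: str
    app_id: str        # the application ID used as the join key
    tags: dict = field(default_factory=dict)


# Hypothetical CMDB export: application ID -> ownership and environment tags.
CMDB = {
    "app-1001": {"team": "course-platform", "env": "prod", "tier": "service"},
}


def enrich(signal, cmdb):
    """Return a copy of the signal with CMDB tags merged in.

    Tags already present on the signal take precedence over CMDB tags.
    """
    merged = {**cmdb.get(signal.app_id, {}), **signal.tags}
    return Signal(signal.kind, signal.name, signal.app_id, merged)


def by_app(signals, cmdb):
    """Group enriched signals by application ID, as a query panel might."""
    panel = {}
    for s in signals:
        panel.setdefault(s.app_id, []).append(enrich(s, cmdb))
    return panel
```

With every signal carrying the same application ID and CMDB tags, a single lookup by ID can return metrics, logs, and events together, and alert rules can be routed to the owning team from the tags rather than from per‑tool configuration.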
Before and After Monitoring Architecture
Before the transformation, data collection across the hybrid cloud was limited to on‑premise pulls. The stack relied on legacy tools such as Zabbix, Nagios, and Cacti, together with a complex Thanos architecture that was costly to maintain and an expensive Elasticsearch‑based logging setup.
After transformation, a unified platform aggregates data via OpenTelemetry, leverages Loki for log persistence, and eliminates redundant legacy stacks.
Collection Layer – General Solution
The new collection framework uses Telegraf with three data‑type categories:
SNMP‑based collection for network devices and physical servers.
Plugin‑based collection via exporters.
JSON‑based collection for databases, big‑data services, and application metrics.
All outputs are funneled into the OpenTelemetry pipeline, while Loki handles hardware alerts and audit logs.
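The JSON‑based category and the split between OpenTelemetry and Loki can be sketched roughly as follows. This is a simplified illustration, not Telegraf's actual parser or the team's configuration: the function names, the record kinds `hardware_alert` and `audit_log`, and the destination labels are assumptions made for the example.

```python
def flatten(payload, prefix=""):
    """Flatten a nested JSON object into dotted metric names.

    Only numeric leaves become metric points; strings and booleans are
    skipped here (a real collector might turn them into tags instead).
    """
    points = {}
    for key, value in payload.items():
        name = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            points.update(flatten(value, name))
        elif isinstance(value, (int, float)) and not isinstance(value, bool):
            points[name] = float(value)
    return points


# Hypothetical record kinds that go to Loki; everything else goes to the
# OpenTelemetry pipeline, mirroring the split described above.
LOKI_KINDS = {"hardware_alert", "audit_log"}


def route(kind):
    """Pick a destination label for a record kind."""
    return "loki" if kind in LOKI_KINDS else "otel"
```

For example, a database status payload such as `{"db": {"connections": 12, "replication": {"lag_s": 0.5}}}` would yield the points `db.connections` and `db.replication.lag_s`, both routed to the OpenTelemetry pipeline.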
Application Log Collection
The ELK stack remains for application logs, but the collection method shifted from side‑car Filebeat to DaemonSet deployments within Docker/container environments, streamlining configuration and scaling.
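A rough sketch of what the DaemonSet approach implies, assuming the container environment is Kubernetes (the source does not name the orchestrator). The manifest below is built as a plain Python dictionary; the image reference, namespace, label, host path, and limit values are hypothetical placeholders, not the values used at New Oriental.

```python
def filebeat_daemonset(image, cpu_limit, memory_limit):
    """Build a minimal DaemonSet manifest for a node-level log collector."""
    labels = {"app": "filebeat"}  # hypothetical label
    return {
        "apiVersion": "apps/v1",
        "kind": "DaemonSet",
        "metadata": {"name": "filebeat", "namespace": "logging"},
        "spec": {
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "containers": [{
                        "name": "filebeat",
                        "image": image,
                        # Explicit limits keep one collector from starving
                        # the workloads that share its node.
                        "resources": {
                            "limits": {"cpu": cpu_limit, "memory": memory_limit},
                        },
                        "volumeMounts": [{
                            "name": "varlog",
                            "mountPath": "/var/log",
                            "readOnly": True,
                        }],
                    }],
                    # Assumed location of container logs on the host.
                    "volumes": [{"name": "varlog", "hostPath": {"path": "/var/log"}}],
                },
            },
        },
    }


def collector_count(mode, pods, nodes):
    """Number of collector instances: one per pod (sidecar) or one per node."""
    if mode == "sidecar":
        return pods
    if mode == "daemonset":
        return nodes
    raise ValueError(f"unknown mode: {mode}")
```

On a hypothetical cluster with 300 application pods across 20 nodes, the sidecar model runs 300 collectors while the DaemonSet model runs 20, which is why configuration and resource limits become easier to manage in one place.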
Benefits Achieved
Standardization reduced the server count by 35%, delivering significant cloud cost savings. Stability also improved: tuning the resource limits of the log collectors prevented both resource over‑consumption and log loss.