
How New Oriental Standardized Its Observability System to Cut Costs and Boost Efficiency

At the 21st GOPS Global Operations Conference, New Oriental's senior operations manager Qi Chen detailed the demand, technical, and focus pressures that drove a phased, full‑process observability standardization, leveraging OpenTelemetry, Telegraf, Loki and CMDB tagging to achieve cost reduction and higher stability.

Efficient Ops

Conference Background

On October 26-27, 2023, the 21st GOPS Global Operations Conference was held in Shanghai. More than 80 experts from institutions such as the China Academy of Information and Communications Technology and Agricultural Bank, and from the wider telecom and finance sectors, discussed DevOps, AIOps, SRE, and security.

Presentation Overview

Qi Chen, Senior Operations Manager at New Oriental Technology Education Group, presented “Helping Reduce Costs and Increase Efficiency: Standardizing New Oriental’s Observability System.”

Key Pressures

1. Demand Pressure

Management was not standardized, and alert channels needed to be consolidated.

Need to integrate OpenTelemetry data across the organization.

Developers require automated ingestion of cloud‑provided data, automated configuration management, and unified internal data queries.

2. Technical Pressure

The existing observability stack was fragmented and complex, leaving many teams without clear migration paths.

3. Focus Pressure

Balancing SRE responsibilities with infrastructure management required a standardized, platform‑based approach to maintain both business and observability stability while reducing costs.

Standardization Process

Based on these three pressures, the team carried out a phased, end-to-end standardization. Monitoring was divided into four layers (user experience, application business, services, and infrastructure), and appropriate monitoring objects, metrics, and tools were selected for each. Once data had been consolidated in OpenTelemetry, event data and CMDB tagging were introduced, enabling an observability query panel and alerting system centered on the application ID.
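The application-ID-centric idea can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `CMDB` dictionary, its field names (`app_id`, `team`, `env`), and the `Signal` record are hypothetical, and the talk does not describe how New Oriental actually stores or joins this data.

```python
from dataclasses import dataclass, field

# Hypothetical CMDB extract: host -> ownership attributes (field names are assumptions).
CMDB = {
    "10.0.0.5": {"app_id": "order-service", "team": "payments", "env": "prod"},
    "10.0.0.9": {"app_id": "course-search", "team": "search", "env": "prod"},
}

@dataclass
class Signal:
    kind: str   # "metric", "log", "trace", or "event"
    host: str
    body: str
    tags: dict = field(default_factory=dict)

def enrich(signal, cmdb=CMDB):
    """Attach CMDB tags; unknown hosts get app_id='unknown' so they remain queryable."""
    attrs = cmdb.get(signal.host, {"app_id": "unknown"})
    return Signal(signal.kind, signal.host, signal.body, {**signal.tags, **attrs})

def group_by_app(signals):
    """Index enriched signals by application ID, the key a panel or alert rule would filter on."""
    by_app = {}
    for s in map(enrich, signals):
        by_app.setdefault(s.tags["app_id"], []).append(s)
    return by_app

signals = [
    Signal("metric", "10.0.0.5", "cpu=0.7"),
    Signal("event", "10.0.0.5", "deploy v2"),
    Signal("log", "10.9.9.9", "disk warning"),
]
grouped = group_by_app(signals)
```

With tags attached at ingestion time, a metric and a deployment event from the same host end up under the same application ID, which is what makes a single per-application view possible.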

Before and After Monitoring Architecture

Before the transformation, data collection in the hybrid-cloud environment was limited to on-premises pulls. The stack mixed legacy tools such as Zabbix, Nagios, and Cacti with a complex Thanos architecture that was costly to maintain, and logging on Elasticsearch was expensive.

After transformation, a unified platform aggregates data via OpenTelemetry, leverages Loki for log persistence, and eliminates redundant legacy stacks.
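As a rough illustration of what persisting a log line in Loki involves, the sketch below builds the JSON body accepted by Loki's HTTP push endpoint (`/loki/api/v1/push`). The label names and the sample log line are hypothetical, and nothing is actually sent; the talk does not say which client or agent New Oriental uses to ship logs.

```python
import json
import time
from typing import Optional

def loki_push_payload(labels: dict, lines: list, now_ns: Optional[int] = None) -> str:
    """Build a Loki push body: one stream, identified by its labels, with timestamped lines.

    Loki expects each entry as [timestamp in nanoseconds as a string, log line].
    """
    ts = str(now_ns if now_ns is not None else time.time_ns())
    payload = {
        "streams": [
            {"stream": labels, "values": [[ts, line] for line in lines]}
        ]
    }
    return json.dumps(payload)

# Hypothetical labels; a fixed timestamp keeps the example deterministic.
body = loki_push_payload(
    {"app_id": "order-service", "source": "audit"},
    ["user=alice action=login"],
    now_ns=1700000000000000000,
)
```

Loki indexes only the labels, not the log text, so keeping the label set small (for example an application ID and a source) is the usual way to hold storage and query costs down.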

Collection Layer – General Solution

The new collection framework uses Telegraf with three data‑type categories:

SNMP‑based collection for network devices and physical servers.

Plugin‑based collection via exporters.

JSON‑based collection for databases, big‑data services, and application metrics.

All outputs are funneled into the OpenTelemetry pipeline, while Loki handles hardware alerts and audit logs.
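The three collection types can be pictured as adapters that all emit the same record shape before export. This is only a conceptual sketch in Python, not Telegraf configuration: the function names, the record fields, and the sample device names are all hypothetical, and the exporter parser handles only unlabelled `name value` lines.

```python
def from_snmp(device, oid_name, value):
    """SNMP-style reading from a network device or physical server."""
    return {"name": oid_name, "value": float(value),
            "attrs": {"source": "snmp", "target": device}}

def from_exporter(target, line):
    """Exporter-style reading; handles only unlabelled 'name value' lines."""
    name, raw = line.split()
    return {"name": name, "value": float(raw),
            "attrs": {"source": "exporter", "target": target}}

def from_json(target, doc, key):
    """JSON-style reading; pulls one numeric field out of a status document."""
    return {"name": key, "value": float(doc[key]),
            "attrs": {"source": "json", "target": target}}

def collect():
    """Gather one reading of each type into a single list bound for one export path."""
    return [
        from_snmp("core-switch-1", "ifInErrors", 3),
        from_exporter("node-7", "node_load1 0.42"),
        from_json("mysql-2", {"threads_connected": 18}, "threads_connected"),
    ]

records = collect()
```

Because every record carries the same keys, whatever sits downstream needs only one code path regardless of how the reading was obtained.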

Application Log Collection

The ELK stack is still used for application logs, but collection moved from sidecar Filebeat containers to DaemonSet deployments in the Docker/container environment, which simplifies configuration and scaling.
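One way to see why this shift simplifies scaling is to count collector instances. The sketch assumes "DaemonSet" refers to the Kubernetes workload kind, which schedules one pod per eligible node, while a sidecar adds one collector container to every application pod. The pod counts are made-up numbers, not figures from the talk.

```python
def collector_instances(pods_per_node):
    """Compare how many log collectors run under each deployment model.

    pods_per_node: number of application pods on each node (hypothetical input).
    """
    return {
        "sidecar": sum(pods_per_node),    # one collector per application pod
        "daemonset": len(pods_per_node),  # one collector per node
    }

counts = collector_instances([12, 8, 15])
print(counts)  # {'sidecar': 35, 'daemonset': 3}
```

Under the per-node model the collector count tracks cluster size rather than application count, so there are fewer configurations to manage and fewer resource limits to tune.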

Benefits Achieved

Standardization reduced the server count by 35%, delivering significant cloud cost savings. Stability also improved: resource limits for the log collectors were tuned to prevent both resource over-consumption and log loss.

Tags: monitoring, cloud native, Observability, DevOps, OpenTelemetry, SRE, Cost Reduction
Written by

Efficient Ops

This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.
