How Smart Operations (AIOps) Can Bridge Industry and Academia
At APMCon 2017, Tsinghua professor Pei Dan outlined the research challenges of intelligent operations, emphasizing the need to define clear algorithmic problems, foster open‑source collaboration between industry and academia, and build a problem‑library to accelerate AIOps adoption across enterprises.
China's Application Performance Management Conference (APMCon 2017) was held on August 10‑11 in Beijing. The event, co‑organized by Tingyun, Jikebang and InfoQ, focused on "Driving Application Architecture Optimization and Innovation" and aimed to promote the growth of APM in China.
Preface
Pei Dan, an associate professor of Computer Science at Tsinghua University and an expert in intelligent operations algorithms, delivered a talk titled "Research Problems in Intelligent Operations" at the main forum of APMCon 2017. He highlighted the importance of industry‑academia collaboration and the practical path for intelligent operations.
Pei began by explaining why the topic is relevant: intelligent operations (AIOps) is hot, but its practical adoption faces two main challenges. Industry has abundant data and applications but lacks mature algorithms and experience, while academia possesses strong algorithmic knowledge but lacks real‑world data and domain understanding.
To advance the field, he proposes defining concrete research problems with clear inputs and outputs, enabling both sides to work efficiently. He advocates open‑source data sharing so that academic researchers can contribute algorithms, and industry can benefit from those solutions.
1. Development History of Intelligent Operations
Operations evolved from manual work to automated scripts, then to DevOps, and finally to intelligent operations. Along the way, decision‑making shifted from human analysis to rules encoded in scripts, and now to machine‑learning‑driven analysis.
According to Gartner, AIOps deployment was below 5% in 2016 and is expected to reach 25% globally by 2019, indicating a bright future.
2. Industry‑Academia Collaboration
Current collaboration is at "1.0" – one‑to‑one exchanges that are inefficient and slow. Pei suggests moving to "2.0" – an open‑source, community‑driven model where data, algorithms, and problem definitions are shared publicly.
He cites examples such as Hadoop, TensorFlow, arXiv, GitHub, ImageNet, and AI clouds that have accelerated progress through openness.
The proposed solution is to build an "Intelligent Operations Problem Library" that collects well‑defined research problems, associated datasets, and baseline algorithms. Industry would provide anonymized data and cloud resources; academia would develop and publish algorithms that can be directly applied.
3. Defining Research Problems in Intelligent Operations
Research problems should have clear, obtainable inputs, feasible outputs, a concrete technical roadmap, and be understandable by researchers outside the operations domain.
3.1 Basic Modules
KPI Bottleneck Analysis: Given a wide table of KPI values and related attributes, identify attribute combinations that degrade KPI performance (e.g., page‑load time). Algorithms include decision trees, clustering trees, and hierarchical clustering.
Fault Prediction: Predict failures (e.g., switch faults) hours in advance using logs. Techniques involve hidden Markov models, SVMs, and random forests.
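To make the input and output of KPI bottleneck analysis concrete, here is a minimal sketch. It does not use the tree-based methods named above; it swaps in a plain exhaustive search over small attribute combinations and ranks them by how far their mean KPI sits above the overall mean. The function name `rank_bottlenecks`, its parameters, and the sample rows are all hypothetical and not from the talk.

```python
from collections import defaultdict
from itertools import combinations
from statistics import mean


def rank_bottlenecks(rows, attributes, kpi, max_size=2, min_support=2):
    """Rank attribute-value combinations by mean KPI excess over the overall mean.

    rows: list of dicts (the "wide table"); kpi: name of the KPI column,
    where larger values are worse (e.g. page-load time in milliseconds).
    """
    overall = mean(row[kpi] for row in rows)
    groups = defaultdict(list)
    for row in rows:
        for size in range(1, max_size + 1):
            for attrs in combinations(attributes, size):
                key = tuple((a, row[a]) for a in attrs)
                groups[key].append(row[kpi])
    # Keep only combinations seen often enough to be meaningful.
    ranked = [
        (key, mean(values) - overall, len(values))
        for key, values in groups.items()
        if len(values) >= min_support
    ]
    ranked.sort(key=lambda item: item[1], reverse=True)
    return ranked


# Hypothetical sample data: page-load time per request.
rows = [
    {"isp": "A", "browser": "X", "load_ms": 300},
    {"isp": "A", "browser": "Y", "load_ms": 320},
    {"isp": "B", "browser": "X", "load_ms": 310},
    {"isp": "B", "browser": "Y", "load_ms": 1500},
    {"isp": "B", "browser": "Y", "load_ms": 1600},
    {"isp": "A", "browser": "X", "load_ms": 290},
]
top = rank_bottlenecks(rows, ["isp", "browser"], "load_ms")
print(top[0])
```

On this toy table the combination of ISP B and browser Y ranks first. Exhaustive search grows quickly with the number of attributes, which is one reason tree-based methods are attractive on real wide tables.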
3.2 Root‑Cause Analysis ("Butcher the Cow")
When an application issue occurs, the goal is to trace the cause through a fault propagation chain. Key steps are anomaly detection and constructing the propagation graph.
Anomaly Detection: Methods range from simple KPI trend prediction (ARIMA, EWMA, Holt‑Winters, RNN) to machine‑learning approaches (ensemble, transfer learning, deep learning). Reducing labeling effort is a major challenge; techniques such as similarity‑based detection (DTW, MK) and clustering (DBSCAN, K‑medoids) help.
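As an illustration of the simplest family above, the sketch below flags points that deviate strongly from an EWMA forecast. It is one possible formulation, not the method presented in the talk: the function name `ewma_anomalies`, the parameters `alpha`, `threshold`, and `warmup`, and the sample series are assumptions chosen for the example.

```python
def ewma_anomalies(series, alpha=0.3, threshold=3.0, warmup=5):
    """Return indices whose value is far from the EWMA forecast.

    A point is flagged when its residual exceeds `threshold` times an
    exponentially weighted estimate of the residual standard deviation.
    The first `warmup` points are never flagged.
    """
    forecast = series[0]
    variance = 0.0
    anomalies = []
    for i in range(1, len(series)):
        value = series[i]
        residual = value - forecast
        std = variance ** 0.5
        if i >= warmup and std > 0 and abs(residual) > threshold * std:
            anomalies.append(i)
            continue  # skip the update so the outlier does not shift the baseline
        forecast = alpha * value + (1 - alpha) * forecast
        variance = alpha * residual ** 2 + (1 - alpha) * variance
    return anomalies


# Hypothetical KPI samples with one obvious spike at index 8.
kpi = [100, 102, 99, 101, 100, 102, 98, 101, 250, 100, 102]
flagged = ewma_anomalies(kpi)
print(flagged)
```

This needs no labels, which is its appeal, but it cannot handle seasonality; that is where methods such as Holt‑Winters or learned models come in.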
Fault Propagation Chain: Building a graph where events trigger subsequent events enables root‑cause tracing. Algorithms include frequent pattern mining (FP‑Growth, Apriori), random forests, and statistical correlation analyses (Pearson, J‑Measure).
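One way the statistical correlation step could feed a propagation graph is to look for the time lag at which one KPI best correlates with another and treat a strong lagged correlation as a candidate edge. The sketch below assumes this approach; `pearson`, `best_lag`, `max_lag`, and the two sample series are hypothetical. Correlation alone does not establish causation, so such edges are only candidates for further checking.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0  # a constant series carries no correlation signal
    return cov / sqrt(vx * vy)


def best_lag(cause, effect, max_lag=3):
    """Return (lag, r) maximizing the correlation of cause[t] with effect[t + lag]."""
    best = (0, pearson(cause, effect))
    for lag in range(1, max_lag + 1):
        r = pearson(cause[:-lag], effect[lag:])
        if r > best[1]:
            best = (lag, r)
    return best


# Hypothetical KPIs: API errors follow database latency spikes one step later.
db_latency = [10, 10, 50, 10, 10, 10, 60, 10, 10, 10]
api_errors = [0, 0, 0, 4, 0, 0, 0, 5, 0, 0]
lag, r = best_lag(db_latency, api_errors)
print(lag, round(r, 3))
```

Here the best lag is one step, suggesting a candidate edge from database latency to API errors in the propagation graph.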
3.3 "Second‑Best" Solutions
When data is insufficient for full root‑cause analysis, simpler yet useful problems are tackled:
Intelligent Circuit‑Breaker: Detect whether a recent deployment caused a performance issue using CUSUM, SST, or DID algorithms.
Alarm Aggregation: Consolidate numerous alerts into a manageable set using hierarchical analysis or fault‑propagation‑based clustering.
Fault Localization: Estimate the likely failure region from performance metrics, employing random forests, logistic regression, Markov chains, or Dirichlet processes.
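For the intelligent circuit‑breaker item above, a one‑sided CUSUM test can be sketched as follows. It standardizes post‑deployment samples against the pre‑deployment baseline and accumulates upward drift until a limit is crossed. This is a textbook‑style formulation under stated assumptions, not the implementation from the talk; `cusum_after_deploy`, `deploy_index`, `slack`, `limit`, and the sample data are hypothetical.

```python
from statistics import mean, stdev


def cusum_after_deploy(kpi, deploy_index, slack=0.5, limit=5.0):
    """Return the first index >= deploy_index where upward CUSUM exceeds `limit`.

    The baseline mean and standard deviation come from samples before
    `deploy_index` (at least two are required). Returns None when no
    sustained upward shift is detected.
    """
    baseline = kpi[:deploy_index]
    mu, sigma = mean(baseline), stdev(baseline)
    if sigma == 0:
        sigma = 1.0  # assumption: treat a perfectly flat baseline as unit scale
    s = 0.0
    for i in range(deploy_index, len(kpi)):
        z = (kpi[i] - mu) / sigma
        s = max(0.0, s + z - slack)  # accumulate only drift beyond the slack
        if s > limit:
            return i
    return None


# Hypothetical response times (ms); a deployment happens at index 7.
before = [200, 205, 195, 202, 198, 201, 199]
regressed = before + [204, 205, 206, 205, 206]
healthy = before + [201, 199, 202, 200]

alarm_at = cusum_after_deploy(regressed, deploy_index=7)
no_alarm = cusum_after_deploy(healthy, deploy_index=7)
print(alarm_at, no_alarm)
```

The regression here is small (a few milliseconds per sample), so no single point looks alarming; the alarm fires only once the accumulated drift crosses the limit, which is the property that makes CUSUM suited to judging deployments.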
4. Summary and Outlook
Intelligent operations hold great promise due to abundant data and clear application scenarios, representing an untapped AI gold mine. Realizing this potential requires deeper industry‑academia cooperation, open data, shared problem libraries, and competitive platforms such as the upcoming "Intelligent Operations Algorithm Competition" hosted by Pei's NetMan lab.
By lowering the entry barrier for both sides, the community can accelerate AIOps research and deployment, much like ImageNet sparked the deep‑learning renaissance.