
How Smart Operations (AIOps) Can Bridge Industry and Academia

At APMCon 2017, Tsinghua professor Pei Dan outlined the research challenges of intelligent operations, emphasizing the need to define clear algorithmic problems, foster open‑source collaboration between industry and academia, and build a problem‑library to accelerate AIOps adoption across enterprises.

Efficient Ops

Aug 22, 2017


China's Application Performance Management Conference (APMCon 2017) was held on August 10‑11 in Beijing. The event, co‑organized by Tingyun, Jikebang and InfoQ, focused on "Driving Application Architecture Optimization and Innovation" and aimed to promote the growth of APM in China.

Preface

Pei Dan, an associate professor of Computer Science at Tsinghua University and an expert in intelligent operations algorithms, delivered a talk titled "Research Problems in Intelligent Operations" at the main forum of APMCon 2017. He highlighted the importance of industry‑academia collaboration and the practical path for intelligent operations.

Pei began by explaining why the topic matters: intelligent operations (AIOps) is attracting wide attention, but its practical adoption faces two main challenges. Industry has abundant data and applications but lacks mature algorithms and experience, while academia has strong algorithmic knowledge but lacks real‑world data and domain understanding.

To advance the field, he proposes defining concrete research problems with clear inputs and outputs, enabling both sides to work efficiently. He advocates open‑source data sharing so that academic researchers can contribute algorithms, and industry can benefit from those solutions.

1. Development History of Intelligent Operations

The evolution of operations moved from manual work to automated scripts, then to DevOps, and finally to intelligent operations. The decision‑making process shifted from human analysis to script‑based automation, and now to machine‑learning‑driven analysis.

According to Gartner, AIOps deployment was below 5% in 2016 and is expected to reach 25% globally by 2019, indicating a bright future.

2. Industry‑Academia Collaboration

Current collaboration is at "1.0" – one‑to‑one exchanges that are inefficient and slow. Pei suggests moving to "2.0" – an open‑source, community‑driven model where data, algorithms, and problem definitions are shared publicly.

He cites examples such as Hadoop, TensorFlow, arXiv, GitHub, ImageNet, and AI clouds that have accelerated progress through openness.

The proposed solution is to build an "Intelligent Operations Problem Library" that collects well‑defined research problems, associated datasets, and baseline algorithms. Industry would provide anonymized data and cloud resources; academia would develop and publish algorithms that can be directly applied.

3. Defining Research Problems in Intelligent Operations

A research problem should have clear, obtainable inputs, feasible outputs, and a concrete technical roadmap, and it should be understandable to researchers outside the operations domain.

3.1 Basic Modules

KPI Bottleneck Analysis: Given a wide table of KPI values and related attributes, identify attribute combinations that degrade KPI performance (e.g., page‑load time). Algorithms include decision trees, clustering trees, and hierarchical clustering.
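To make the input and output of this problem concrete, here is a minimal sketch. It does not implement a decision tree. Instead it uses an exhaustive group-by over attribute combinations as a simple stand-in, which is only practical for small tables. The function name `slow_combinations`, the column names, and the sample rows are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations
from statistics import mean


def slow_combinations(rows, attrs, kpi, max_size=2, min_support=2):
    """Rank attribute-value combinations by how far their mean KPI
    exceeds the overall mean (larger excess = worse bottleneck)."""
    overall = mean(r[kpi] for r in rows)
    results = []
    for size in range(1, max_size + 1):
        for subset in combinations(attrs, size):
            groups = defaultdict(list)
            for r in rows:
                key = tuple((a, r[a]) for a in subset)
                groups[key].append(r[kpi])
            for key, values in groups.items():
                # Ignore combinations backed by too few rows.
                if len(values) >= min_support:
                    results.append((mean(values) - overall, key, len(values)))
    results.sort(key=lambda item: item[0], reverse=True)
    return results


# Hypothetical wide table: one row per page view.
rows = [
    {"isp": "A", "browser": "X", "load_ms": 900},
    {"isp": "A", "browser": "X", "load_ms": 950},
    {"isp": "A", "browser": "Y", "load_ms": 300},
    {"isp": "A", "browser": "Y", "load_ms": 320},
    {"isp": "B", "browser": "X", "load_ms": 310},
    {"isp": "B", "browser": "X", "load_ms": 290},
    {"isp": "B", "browser": "Y", "load_ms": 300},
    {"isp": "B", "browser": "Y", "load_ms": 280},
]
ranking = slow_combinations(rows, ["isp", "browser"], "load_ms")
worst_excess, worst_key, worst_count = ranking[0]
print(worst_key, worst_count, worst_excess)
```

In this toy table the combination of ISP A and browser X ranks first, even though neither attribute alone is as slow. Finding such interactions is what the tree-based methods are meant to do at scale.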

Fault Prediction: Predict failures (e.g., switch faults) hours in advance using logs. Techniques involve hidden Markov models, SVMs, and random forests.
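Before any of these classifiers can be trained, the logs have to be turned into labeled examples. The sketch below shows one possible framing, which is an assumption and not something the talk specified: count log templates per fixed time window, and label a window positive if a failure occurs within a prediction horizon after it ends. The function `build_examples`, the template names, and the timestamps are hypothetical.

```python
from collections import Counter


def build_examples(log_events, failure_times, window=3600, horizon=7200):
    """Turn (timestamp, template) log events into (features, label) pairs.

    features: count of each log template inside one window.
    label: 1 if a failure occurs within `horizon` seconds after the
           window ends, else 0.
    """
    if not log_events:
        return []
    start = min(t for t, _ in log_events)
    end = max(t for t, _ in log_events)
    examples = []
    w_start = start
    while w_start <= end:
        w_end = w_start + window
        counts = Counter(tpl for t, tpl in log_events if w_start <= t < w_end)
        label = int(any(w_end <= f < w_end + horizon for f in failure_times))
        examples.append((dict(counts), label))
        w_start = w_end
    return examples


# Hypothetical switch log (seconds since start) and one recorded failure.
log_events = [(0, "link_flap"), (100, "link_flap"),
              (3700, "crc_error"), (7300, "link_flap")]
failure_times = [9000]
examples = build_examples(log_events, failure_times)
labels = [label for _, label in examples]
print(examples)
```

The resulting feature dictionaries could then be vectorized and passed to a random forest or SVM. That step is omitted here.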

3.2 Root‑Cause Analysis ("Butcher the Cow")

When an application issue occurs, the goal is to trace the cause through a fault propagation chain. Key steps are anomaly detection and constructing the propagation graph.

Anomaly Detection: Methods range from simple KPI trend prediction (ARIMA, EWMA, Holt‑Winters, RNN) to machine‑learning approaches (ensemble, transfer learning, deep learning). Reducing labeling effort is a major challenge; techniques such as similarity‑based detection (DTW, MK) and clustering (DBSCAN, K‑medoids) help.
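As an illustration of the simplest end of this range, the following sketch flags KPI points that deviate from an EWMA forecast by more than k exponentially weighted standard deviations. The parameter values, the `warmup` period, and the choice to exclude flagged points from the baseline are assumptions made for this example, not details from the talk.

```python
def ewma_anomalies(series, alpha=0.3, k=3.0, warmup=5):
    """Return indices whose value deviates from the EWMA forecast
    by more than k exponentially weighted standard deviations."""
    mean = series[0]
    var = 0.0
    flagged = []
    for i in range(1, len(series)):
        x = series[i]
        err = x - mean          # forecast error against the current EWMA
        std = var ** 0.5
        if i >= warmup and std > 0 and abs(err) > k * std:
            flagged.append(i)
            continue            # keep the outlier out of the baseline
        incr = alpha * err
        mean += incr
        var = (1 - alpha) * (var + err * incr)
    return flagged


# Hypothetical KPI samples with one obvious spike at index 8.
kpi = [10, 11, 9, 10, 11, 10, 9, 10, 30, 10, 11]
anomalies = ewma_anomalies(kpi)
print(anomalies)
```

Skipping the update on flagged points keeps a single spike from inflating the variance. The trade-off is that a genuine level shift would keep being flagged until the detector is reset.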

Fault Propagation Chain: Building a graph where events trigger subsequent events enables root‑cause tracing. Algorithms include frequent pattern mining (FP‑Growth, Apriori), random forests, and statistical correlation analyses (Pearson, J‑Measure).
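One simple way to sketch the correlation-based variant is a lagged Pearson correlation. If event A at time t correlates strongly with event B at time t + lag, a candidate edge A -> B is added to the graph. The event names, the lag, and the threshold below are hypothetical. Correlation alone does not establish causation, so such edges are only candidates for further checking.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def propagation_edges(series, lag=1, threshold=0.8):
    """Return candidate edges (a, b, r) where a at time t correlates
    with b at time t + lag. Requires lag >= 1."""
    if lag < 1:
        raise ValueError("lag must be at least 1")
    edges = []
    for a, xs in series.items():
        for b, ys in series.items():
            if a == b:
                continue
            r = pearson(xs[:-lag], ys[lag:])
            if r >= threshold:
                edges.append((a, b, round(r, 3)))
    return edges


# Hypothetical per-minute event indicators (1 = event fired in that minute).
series = {
    "db_slow":     [0, 1, 0, 0, 1, 0, 0, 1, 0, 0],
    "api_timeout": [0, 0, 1, 0, 0, 1, 0, 0, 1, 0],
    "cpu_noise":   [1, 0, 1, 0, 1, 0, 1, 0, 1, 0],
}
edges = propagation_edges(series)
edge_names = [(a, b) for a, b, _ in edges]
print(edges)
```

Here `api_timeout` always follows `db_slow` by one minute, so only that direction passes the threshold.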

3.3 "Second‑Best" Solutions

When data is insufficient for full root‑cause analysis, simpler yet useful problems are tackled:

Intelligent Circuit‑Breaker: Detect whether a recent deployment caused a performance issue using CUSUM, SST (singular spectrum transform), or DID (difference‑in‑differences) algorithms.
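Assuming DID here refers to difference-in-differences, the idea can be sketched in a few lines. The KPI change on servers that received the deployment is compared with the change on servers that did not, so that shifts affecting both groups (for example, a traffic peak) cancel out. The latency samples and the rollback threshold below are hypothetical.

```python
from statistics import mean


def did_effect(treated_before, treated_after, control_before, control_after):
    """Difference-in-differences estimate of the deployment's effect:
    (change in treated group) minus (change in control group)."""
    treated_change = mean(treated_after) - mean(treated_before)
    control_change = mean(control_after) - mean(control_before)
    return treated_change - control_change


def should_roll_back(effect, threshold):
    """Trip the circuit-breaker when the estimated effect is too large."""
    return effect > threshold


# Hypothetical latency samples (ms) before and after a deployment.
effect = did_effect(
    treated_before=[100, 102, 98],
    treated_after=[150, 148, 152],
    control_before=[101, 99, 100],
    control_after=[105, 104, 106],
)
trip = should_roll_back(effect, threshold=20)
print(effect, trip)
```

Both groups slowed down, but the upgraded servers slowed by 45 ms more than the control group, which is the part attributable to the deployment under this model.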

Alarm Aggregation: Consolidate numerous alerts into a manageable set using hierarchical analysis or fault‑propagation‑based clustering.
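One possible reading of hierarchical analysis is to roll alerts up a known topology. When enough sibling nodes alert at once, they are replaced by a single alert on their parent. The sketch below assumes each alert carries a location path such as (datacenter, rack, host). The function `aggregate`, the paths, and the `min_children` threshold are hypothetical.

```python
from collections import defaultdict


def aggregate(alert_paths, min_children=2):
    """Replace groups of alerting siblings with one alert on their parent,
    repeating until no further roll-up is possible."""
    current = set(alert_paths)
    changed = True
    while changed:
        changed = False
        by_parent = defaultdict(set)
        for path in current:
            if len(path) > 1:
                by_parent[path[:-1]].add(path)
        for parent, children in by_parent.items():
            if len(children) >= min_children:
                current -= children
                current.add(parent)
                changed = True
    return sorted(current)


# Hypothetical alerts identified by (datacenter, rack, host).
alerts = [
    ("dc1", "rack1", "host1"),
    ("dc1", "rack1", "host2"),
    ("dc1", "rack1", "host3"),
    ("dc1", "rack2", "host9"),
    ("dc2", "rack5", "host4"),
]
summary = aggregate(alerts)
print(summary)
```

Five raw alerts collapse to three: one rack-level alert for `rack1` and two host-level alerts that had no alerting siblings.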

Fault Localization: Estimate the likely failure region from performance metrics, employing random forests, logistic regression, Markov chains, or Dirichlet processes.

4. Summary and Outlook

Intelligent operations hold great promise due to abundant data and clear application scenarios, representing an untapped AI gold mine. Realizing this potential requires deeper industry‑academia cooperation, open data, shared problem libraries, and competitive platforms such as the upcoming "Intelligent Operations Algorithm Competition" hosted by Pei's NetMan lab.

By lowering the entry barrier for both sides, the community can accelerate AIOps research and deployment, much like ImageNet sparked the deep‑learning renaissance.


