
How Large Language Models Can Transform Ops Fault Handling: A Practical Guide

This article outlines a typical operations incident workflow, identifies four key stages where large language models can assist, discusses implementation challenges, introduces the Ops framework and Copilot design, and shares practical examples and a real‑world case to help engineers adopt AI‑driven fault management.

This content originates from an internal sharing session and recent work summary.

1. Common Fault Handling Process

The diagram above shows a typical ops incident handling flow.

Key timestamps along the timeline are:

Fault occurrence

Fault detection

Fault response

Fault localization

Fault recovery

From occurrence to detection depends on metric collection and alert intervals (e.g., 15 s collection, 1 min detection). Detection to response varies by time of day; during off‑hours response may take hours, while in working hours it can be minutes.

Response to localization requires identifying the root cause; this depends heavily on the engineer’s experience. Newcomers may need hours, while seasoned ops can pinpoint issues in minutes.

Localization to recovery involves fixing the issue and restoring the service level objective (SLO). Some problems (e.g., application bugs) require developer involvement after ops have identified the cause.

These five stages form the core incident workflow; subsequent steps such as SLO observation, post‑mortem, optimization, and chaos testing are beyond the basic handling process.
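As a concrete illustration of this timeline (the timestamps below are made up for the example, not taken from a real incident), the five points decompose total recovery time into per-stage durations:

```python
from datetime import datetime

# Hypothetical timestamps for the five stages of one incident.
timeline = {
    "occurrence":   datetime(2024, 5, 1, 10, 0, 0),
    "detection":    datetime(2024, 5, 1, 10, 1, 0),   # collection + alert interval
    "response":     datetime(2024, 5, 1, 10, 6, 0),   # engineer acknowledges the alert
    "localization": datetime(2024, 5, 1, 10, 26, 0),  # root cause identified
    "recovery":     datetime(2024, 5, 1, 10, 36, 0),  # SLO restored
}

stages = list(timeline)
# Duration of each stage transition, in minutes.
durations = {
    f"{a} -> {b}": (timeline[b] - timeline[a]).total_seconds() / 60
    for a, b in zip(stages, stages[1:])
}
total_minutes = (timeline["recovery"] - timeline["occurrence"]).total_seconds() / 60
```

Shrinking any single interval shortens the whole incident; the sections below discuss which intervals a large model can realistically compress.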

2. Stages Where Large Models Can Contribute

According to the analysis, large models can intervene at four points: discovery, response, localization, and handling.

2.1 Discovery

When a fault has just been discovered, no human has responded yet. An AI agent that reacts to alerts automatically could achieve the fastest possible response and significantly reduce mean time to recovery (MTTR).

However, early intervention is difficult because it requires an AI agent that can automatically ingest alerts, collect metrics, call platform APIs, and even log into machines to attempt remediation.

Implementing such an agent is non‑trivial; underestimating the complexity of real‑world operations is a common mistake.
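A minimal sketch of what such an alert-driven agent loop might look like; all names (`handle_alert`, `collect_metrics`, `run_playbook`, `PLAYBOOKS`) are hypothetical placeholders, not part of any real framework:

```python
# Minimal sketch of an alert-driven remediation agent. The callables are
# injected so the ingestion, metric, and execution backends stay abstract.

PLAYBOOKS = {"disk-pressure": "cluster-clear-disk"}  # known failure patterns

def handle_alert(alert, collect_metrics, run_playbook):
    """Ingest one alert, gather context, and attempt remediation."""
    context = collect_metrics(alert["target"])          # pull related metrics
    if alert["kind"] in PLAYBOOKS:                      # known failure pattern
        return run_playbook(PLAYBOOKS[alert["kind"]], context)
    return {"status": "escalate", "context": context}   # hand off to a human
```

The hard part is not this loop but everything behind the injected callables: reliable alert ingestion, safe credential handling, and remediation that does not make things worse.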

2.2 Response

During response, a large model that performs preliminary analysis can narrow the fault scope, accelerating subsequent localization.

Pre‑analysis relies on a well‑maintained knowledge base of past incidents; sufficient root‑cause data is essential for the model to be effective.
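As an illustration of this kind of pre-analysis, a naive keyword-overlap ranking over a past-incident knowledge base might look like the sketch below (the incidents and scoring are invented for the example; a production system would more likely use embeddings or a model call):

```python
# Sketch: rank past incidents by keyword overlap with the alert text and
# return their recorded root causes to narrow the fault scope.

def preanalyze(alert_text, knowledge_base):
    """Rank past incidents by keyword overlap with the alert text."""
    alert_words = set(alert_text.lower().split())
    scored = [
        (len(alert_words & set(entry["symptoms"].lower().split())), entry)
        for entry in knowledge_base
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [entry["root_cause"] for score, entry in scored if score > 0]

kb = [
    {"symptoms": "node disk pressure evicted pods", "root_cause": "log files filled /var"},
    {"symptoms": "api latency spike upstream timeout", "root_cause": "overloaded backend"},
]
```

However simple the retrieval, it only works if past incidents were actually recorded with their root causes, which is why the knowledge base matters more than the matching technique.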

2.3 Localization

Observability data now includes events, metrics, logs, and traces, increasing the number of data sources to query.

A large model can shorten the time needed to query these sources and, based on keywords, retrieve relevant documentation and suggest remediation steps.
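A sketch of fanning one keyword out across the four source types; each fetcher below is a toy stand-in for a real events/metrics/logs/traces backend:

```python
# Sketch: query every observability source with one keyword and keep only
# the sources that returned something, so the engineer sees a narrowed view.

def localize(keyword, sources):
    """Query every observability source and collect non-empty results."""
    findings = {}
    for name, query in sources.items():
        hits = query(keyword)
        if hits:
            findings[name] = hits
    return findings

# Illustrative fetchers; real ones would call Prometheus, Loki, Jaeger, etc.
sources = {
    "events":  lambda kw: [f"Pod evicted: {kw}"] if kw == "disk" else [],
    "metrics": lambda kw: ["node_fs_usage > 90%"] if kw == "disk" else [],
    "logs":    lambda kw: [],
    "traces":  lambda kw: [],
}
```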

2.4 Handling

The model can also execute remediation actions such as restarting a Deployment, restarting Kubelet, adjusting routing, or moving a node—typically a single command or API call.

Automating these actions through the model saves considerable time.
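Since each action is typically a single command, a thin mapping layer is often enough. A sketch follows; the command strings are illustrative, and the `dry_run` default keeps a human confirmation step between the model's proposal and execution:

```python
import subprocess

# Sketch: each remediation action maps to one command template, mirroring
# the "single command or API call" observation above. Templates illustrative.
ACTIONS = {
    "restart-deployment": ["kubectl", "rollout", "restart", "deployment/{name}", "-n", "{ns}"],
    "restart-kubelet":    ["systemctl", "restart", "kubelet"],
    "cordon-node":        ["kubectl", "cordon", "{name}"],
}

def build_command(action, **params):
    """Fill the command template for one remediation action."""
    return [part.format(**params) for part in ACTIONS[action]]

def remediate(action, dry_run=True, **params):
    cmd = build_command(action, **params)
    if dry_run:                      # let the model propose, a human confirm
        return cmd
    return subprocess.run(cmd, check=True)
```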

2.5 Summary

While large models can participate in every stage, earlier intervention is harder and later stages are easier to implement.

Early‑stage faults span a broad scope that large models struggle to capture; human expertise still excels in flexibility and on‑the‑fly learning.

The practical strategy is therefore to let the model handle the later stages first, accumulate documentation and cases, and iteratively shift its involvement earlier as the knowledge base grows.

3. Challenges When Using Large Models for Fault Handling

3.1 Converting Text to Ops Actions

Large models typically output text, images, or video. Translating this output into concrete commands or operational actions is the first hurdle.

3.2 Unstable Information Extraction

Determinism is crucial for automation, yet large models are inherently nondeterministic. Common failure modes include misunderstood intent, incorrect output format, and missing parameters.

Typical mitigations include:

Prompt engineering

Retry mechanisms

Model fine‑tuning

Beyond these, application‑level design can also mitigate instability.
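One common application-level guard is to validate the shape of the model's output and retry on failure. A sketch assuming the model is asked for JSON with `pipeline` and `params` keys; `call_model` is a hypothetical stand-in for the actual model client:

```python
import json

# Sketch of an application-level guard against nondeterministic model
# output: parse, validate the required keys, and retry with a firmer prompt.

REQUIRED_KEYS = {"pipeline", "params"}

def extract_with_retry(call_model, prompt, max_attempts=3):
    """Call the model until its reply parses and contains the required keys."""
    for attempt in range(max_attempts):
        try:
            reply = json.loads(call_model(prompt))
            if isinstance(reply, dict) and REQUIRED_KEYS <= reply.keys():
                return reply
        except (json.JSONDecodeError, TypeError):
            pass                      # malformed output; fall through to retry
        prompt += "\nReturn only JSON with keys: pipeline, params."
    raise ValueError("model output stayed invalid after retries")
```

Validation plus retry does not make the model deterministic, but it converts silent misbehavior into either a valid result or an explicit failure the pipeline can handle.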

3.3 Rapid Scenario Integration

Fast validation and rapid iteration should be second nature to engineers. By abstracting atomic operations and composing them into pipelines, a wide range of scenarios can be covered, such as:

Alert handling

Daily ops assistance
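The atomic-operation/pipeline idea can be sketched as small context-transforming steps composed into scenario pipelines; the step names echo the examples above, but the implementation is purely illustrative:

```python
# Sketch: atomic operations are functions over a context dict; a pipeline
# is just their composition, so scenarios reuse the same building blocks.

def make_pipeline(*steps):
    """Compose atomic operations into one callable pipeline."""
    def run(context):
        for step in steps:
            context = step(context)
        return context
    return run

# Atomic operations: each takes and returns a context dict.
collect_disk_usage = lambda ctx: {**ctx, "usage": 92}
clear_old_logs     = lambda ctx: {**ctx, "cleared": ctx["usage"] > 90}
notify             = lambda ctx: {**ctx, "notified": True}

# Two scenarios share the same atoms in different compositions.
alert_handling = make_pipeline(collect_disk_usage, clear_old_logs, notify)
daily_check    = make_pipeline(collect_disk_usage, notify)
```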

4. Key Technology – Ops Overview

Each domain may need an "Ops" project that provides large‑model‑driven capabilities.

OpsObject – stores operation objects via CRD, manages clusters and hosts.

Core – implements file distribution and script execution.

Task – packages and composes operations, providing lightweight orchestration.

Tools – offers three external entry points.

4.1 Example – Viewing Objects

The UI shows cluster node count, certificate expiry, node configuration, and GPU status.

4.2 Example – Opscli

shell – execute scripts on hosts.

file – transfer files between hosts, S3, or image registries.

task – orchestrate multiple shell/file operations.

Supports node‑level execution using only kubeconfig credentials.

Supports ServiceAccount (SA) authentication in kubectl.

4.3 Example – Web UI

Server – provides API endpoints.

Web – simple management UI.

4.4 Example – Task

Task defines a reusable template.

<code>apiVersion: crd.chenshaowen.com/v1
kind: Task
metadata:
  name: cron-clear-disk
  namespace: ops-system
spec:
  desc: cron to create clear disk
  selector:
    managed-by: ops
  typeRef: host
  steps:
    - name: clear > 100M log
      content: find /var/log -type f -name "*.log" -size +100M -exec rm -f {} \; 2>/dev/null || true
    - name: clear jfs cache
      content: |
        find /data/jfs/cache2/mem -maxdepth 1 -type d -atime +15 -exec rm -rf {} + 2>/dev/null || true
        find /var/lib/jfs/cache -maxdepth 1 -type d -atime +15 -exec rm -rf {} + 2>/dev/null || true
        find /var/lib/jfs/cache2 -maxdepth 1 -type d -atime +15 -exec rm -rf {} + 2>/dev/null || true
</code>

Running the task via TaskRun:

<code>apiVersion: crd.chenshaowen.com/v1
kind: TaskRun
metadata:
  name: cron-clear-disk
  namespace: ops-system
spec:
  ref: cron-clear-disk
</code>

5. Copilot Design

Copilot is the current production form for using large models to handle ops incidents. It interacts via dialogue, first tackling later stages of incident handling and gradually moving earlier.

5.1 Key Steps

Ops project provides operational capabilities to Copilot.

Pipeline system offers scenario integration.

Step 1: the model selects an appropriate pipeline. Step 2: the model extracts parameters from the incident context. This resembles a function_call, except that it invokes a pipeline instead of a function.
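These two steps can be sketched as follows, with simple keyword scoring and `key=value` parsing standing in for the actual model calls; the pipeline registry and its fields are invented for the example:

```python
# Sketch of the two-step "pipeline call": step 1 picks a pipeline from a
# fixed option list, step 2 extracts its declared parameters from the
# incident text. Keyword scoring here is a stand-in for a model call.

PIPELINES = {
    "cluster-clear-disk": {"keywords": {"disk", "pressure"}, "params": ["node"]},
    "es-log-analysis":    {"keywords": {"log", "error"},     "params": ["index"]},
}

def select_pipeline(incident_text):
    """Step 1: choose the pipeline whose keywords best match the incident."""
    words = set(incident_text.lower().split())
    return max(PIPELINES, key=lambda name: len(words & PIPELINES[name]["keywords"]))

def extract_params(incident_text, pipeline):
    """Step 2: pull declared parameters out of key=value tokens."""
    wanted = PIPELINES[pipeline]["params"]
    pairs = dict(tok.split("=", 1) for tok in incident_text.split() if "=" in tok)
    return {k: pairs[k] for k in wanted if k in pairs}
```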

5.2 Pipeline Design

The pipeline aims to be easy for the model to recognize, extensible to cover more scenarios, and composable so the model can assemble new pipelines.

We have defined 95 tasks and 20 pipelines, all as CR objects describable in YAML.

Example input to the model:

<code>Please select the most appropriate option to classify the intention of the user.
Don't ask any more questions, just select the option.
Must be one of the following options:
- xxx-es-log-analysis(...)
- xxx-grafana-alert-node-disk-pressure(...)
- cluster-clear-disk(...)
- ...
</code>

5.3 Variable Design

Variables include default values, descriptions, regex, required flag, enums, examples, and fixed values. Priority order: task fixed > pipeline fixed > runtime extracted.

Well‑designed variables improve parameter accuracy, increase task success rate, and protect sensitive information.

Sample variable definition sent to the model:

<code>apiVersion: crd.chenshaowen.com/v1
kind: TaskRun
metadata:
  name: cron-clear-disk
  namespace: ops-system
spec:
  crontab: 0 0 * * *
  ref: cron-clear-disk
</code>
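The priority order above can be sketched as layered dictionary merges, plus a small validator for the required/enum flags; the variable names and rules below are illustrative, not the actual CRD schema:

```python
# Sketch of the variable-resolution order described above:
# task fixed > pipeline fixed > runtime extracted.

def resolve_variables(task_fixed, pipeline_fixed, runtime_extracted):
    """Merge variable layers so fixed values always win over extracted ones."""
    resolved = dict(runtime_extracted)   # lowest priority
    resolved.update(pipeline_fixed)      # overrides runtime-extracted values
    resolved.update(task_fixed)          # highest priority
    return resolved

def validate(resolved, spec):
    """Check required flags and enum membership from the variable spec."""
    for name, rules in spec.items():
        if rules.get("required") and name not in resolved:
            raise ValueError(f"missing required variable: {name}")
        if "enum" in rules and name in resolved and resolved[name] not in rules["enum"]:
            raise ValueError(f"invalid value for {name}")
    return resolved
```

Keeping fixed values at the top of the priority order is also what protects sensitive parameters: the model's extracted values can never override them.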

6. Proactive Fault Discovery – Turning the Flywheel

Passive waiting for incidents leads to slow data accumulation; proactive inspection can surface potential issues before they cause outages.

Inspection covers device, driver, and system layers, and newly added nodes automatically join the inspection set.

7. Typical Case

AI accelerator cards often overheat, causing frequent failures that traditionally require on‑site repair.

Now, simply @‑mention Copilot in IM to trigger remediation.

Resolution time dropped from tens of minutes to a few minutes, and security improved because engineers no longer expose access keys (AK/SK) manually.

8. Summary

Incident handling timeline and the stages where large models can participate.

Large models can engage in discovery, response, localization, and remediation.

Start with the stage closest to resolution and gradually move the AI agent earlier.

Explore the Ops project for practical implementation.

A function_call is one approach; pipelines or workflows are equally viable.

Precise variable definitions are critical when building LLM‑driven applications.

Tags: Automation, Operations, Large Language Models, Incident Management, AI Ops
Written by Ops Development Stories

Maintained by a like‑minded team, covering both operations and development. Topics span Linux ops, DevOps toolchain, Kubernetes containerization, monitoring, log collection, network security, and Python or Go development. Team members: Qiao Ke, wanger, Dong Ge, Su Xin, Hua Zai, Zheng Ge, Teacher Xia.

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.