How AI‑Powered Ops‑Nexus Transforms Intelligent Operations for 100k+ Servers
This article details the design, technology choices, functional modules, core implementation, performance optimizations, and future roadmap of Ops‑Nexus, an AI‑driven intelligent operations platform that streamlines alarm analysis, log processing, and host health checks for large‑scale monitoring environments.
Design Background
The operations field is shifting from automation to intelligence, with large language models (LLMs) increasingly applied to alarm analysis, log inspection, and host health assessment. Managing over 100,000 machines creates three main challenges: alarm overload, low efficiency in extracting key issues from massive logs, and manual host health checks prone to omissions. To address these, the AI‑based smart operations module Ops‑Nexus was designed as a central platform that delivers structured outputs for alarms, logs, and host diagnostics, driven by LLM‑powered streaming responses.
Technical Selection
The system follows a layered modular architecture consisting of an interaction layer and a core engine layer.
Interaction layer: built with Spring WebFlux for non‑blocking APIs, enabling high‑concurrency performance and real‑time streaming responses.
Core engine: utilizes Spring AI ChatClient and PromptTemplate for dynamic prompt generation, allowing flexible AI model interactions based on business needs.
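Dynamic prompt generation comes down to filling named placeholders in a template before the text is sent to the model. The sketch below illustrates that idea in plain Java; it is not Spring AI's PromptTemplate, and the class name PromptRenderer, the template text, and the variable values are all hypothetical.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical stand-in for a prompt template: fills {name} placeholders from a map. */
final class PromptRenderer {
    private static final Pattern PLACEHOLDER = Pattern.compile("\\{(\\w+)\\}");

    private PromptRenderer() {}

    /** Replaces every {name} with vars.get(name); fails fast on a missing variable. */
    static String render(String template, Map<String, String> vars) {
        Matcher m = PLACEHOLDER.matcher(template);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            String value = vars.get(m.group(1));
            if (value == null) {
                throw new IllegalArgumentException("Missing prompt variable: " + m.group(1));
            }
            // quoteReplacement keeps '$' and '\' in the value from being treated as group references
            m.appendReplacement(out, Matcher.quoteReplacement(value));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String template = "Analyze host {host} using these metrics: {metrics}";
        System.out.println(render(template, Map.of("host", "web-01", "metrics", "cpu=93%")));
    }
}
```

Failing fast on a missing variable is a deliberate choice here: a prompt that silently ships with an unfilled placeholder tends to produce confusing model output that is hard to trace back.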
Functional Requirements and System Design
2.1 Functional Overview
The platform supports three primary functions:
Alarm Analysis: Input alarm data → root‑cause inference + repair suggestions.
Log Analysis: Input log content → anomaly detection + analysis report.
Host Health Check: Input host name → retrieve metrics/processes/SMART data + health report.
2.2 System Architecture Diagram
Core Module Implementation Details
3.1 Request Routing and Type Identification
Use the enum OpsServiceType to determine the request type (ALERT, HOST, LOG, UNKNOWN).
    ALERT("Alarm analysis"),
    HOST("Host health check"),
    LOG("Log analysis"),
    UNKNOWN("Unknown");
3.2 Data Collection Module
A tool‑calling mechanism retrieves real‑time data such as alarm data, host metrics, process lists, and SMART data.
Supports field filtering (e.g., retain only processes with CPU ≥ 10%).
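The field filtering mentioned above can be sketched as a simple stream filter applied before the data reaches the model. The ProcessInfo record and its field names are assumptions for illustration, not the platform's actual schema; only the CPU ≥ 10% rule comes from the text.

```java
import java.util.Comparator;
import java.util.List;

/** Hypothetical process record; field names are illustrative only. */
record ProcessInfo(String name, int pid, double cpuPercent) {}

final class ProcessFilter {
    private ProcessFilter() {}

    /** Keeps processes whose CPU usage is at or above the threshold, busiest first. */
    static List<ProcessInfo> retainBusy(List<ProcessInfo> processes, double minCpuPercent) {
        return processes.stream()
                .filter(p -> p.cpuPercent() >= minCpuPercent)
                .sorted(Comparator.comparingDouble(ProcessInfo::cpuPercent).reversed())
                .toList();
    }

    public static void main(String[] args) {
        List<ProcessInfo> all = List.of(
                new ProcessInfo("sshd", 812, 0.3),
                new ProcessInfo("nginx", 1440, 10.0),
                new ProcessInfo("java", 2301, 85.5));
        // Only nginx and java pass the 10% threshold; sshd is dropped.
        retainBusy(all, 10.0).forEach(System.out::println);
    }
}
```

Dropping idle processes before prompt assembly keeps the payload small, which matters when every field sent to the model costs tokens.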
Tool methods exposed to the model via @Tool (guanceUrl, apiInvoke, and invokeHostApi belong to the surrounding class and are not shown):
    @Tool(description = "method named: getMetricByHostName. Get metric information for the specified host", returnDirect = true)
    public String getMetricByHostName(@ToolParam(description = "Host name") String host) {
        String url = guanceUrl + GUANCE_GET_METRIC_TYPE;
        String params = String.format(METRICS, host);
        return apiInvoke(url, params);
    }

    @Tool(name = "getLogByHostName", description = "Get exception logs for the specified host")
    public String getLogByHostName(@ToolParam(description = "Host name") String host) {
        return invokeHostApi(guanceUrl + GUANCE_GET_EXCEPTION_LOG, host);
    }

    @Tool(name = "getSmartDataByHostName", description = "Get SMART data for the specified host")
    public String getSmartDataByHostName(@ToolParam(description = "Host name") String host) {
        return invokeHostApi(guanceUrl + GUANCE_GET_EXCEPTION_SMARTDATA, host);
    }

    @Tool(name = "getProcessInfoByHostName", description = "Get process information for the specified host")
    public String getProcessInfoByHostName(@ToolParam(description = "Host name") String host) {
        return invokeHostApi(guanceUrl + GUANCE_GET_PROCESS_INFO, host);
    }
3.3 Prompt Engineering (Host Health Report Example)
Define a standardized prompt template with dynamic parameters such as {host} and {metrics}.
    public static final String HOST_ANALYSIS = """
        You are an intelligent IT operations expert. Based on user input and host name, retrieve the following data:
        - Metrics: <metrics> (tool: getMetricByHostName)
        - Logs: <sysLog> (tool: getLogByHostName)
        - SMART data: <smartData> (tool: getSmartDataByHostName)
        - Processes: <processList> (tool: getProcessInfoByHostName)
        Combine <userInput>, <metrics>, <processList> for multi‑dimensional analysis and generate a structured health report.
        """;
3.4 Functional Implementation and Demo
Core processing flow (simplified):
    Flux<Record.ProcessStep> executeProcess(String sessionId, String userInput, String serviceType) {
        ProcessContext context = sinkHandler.getOrCreateContext(sessionId);
        ChatClient chatClient = chatClientFactory.initChatClient();
        // 1. Initialize the processing steps.
        // 2. Identify the service type.
        // 3. Extract the host name via the LLM.
        // 4. Route to the specific handler (ALERT, HOST, LOG).
        // 5. Emit streaming results and mark completion.
    }
Demo screenshots illustrate host health checks, log analysis, and alarm analysis, showing detailed diagnostic reports and actionable suggestions.
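The routing step in the flow above can be sketched as a lookup followed by an exhaustive switch. The enum constants mirror OpsServiceType from the article (labels omitted for brevity); the fromName helper, the OpsRouter class, and the placeholder handler results are assumptions made for illustration.

```java
import java.util.Locale;

/** Trimmed copy of the article's request types; fromName is an illustrative addition. */
enum OpsServiceType {
    ALERT, HOST, LOG, UNKNOWN;

    /** Maps free-form input such as " host " to a constant, falling back to UNKNOWN. */
    static OpsServiceType fromName(String raw) {
        if (raw == null) {
            return UNKNOWN;
        }
        try {
            return valueOf(raw.trim().toUpperCase(Locale.ROOT));
        } catch (IllegalArgumentException e) {
            return UNKNOWN;
        }
    }
}

/** Hypothetical dispatcher; real handlers would stream results instead of returning a string. */
final class OpsRouter {
    private OpsRouter() {}

    static String route(String serviceType, String userInput) {
        // The switch covers every constant, so the compiler flags any newly added type.
        return switch (OpsServiceType.fromName(serviceType)) {
            case ALERT -> "alert-handler: " + userInput;
            case HOST -> "host-handler: " + userInput;
            case LOG -> "log-handler: " + userInput;
            case UNKNOWN -> "unsupported request type";
        };
    }

    public static void main(String[] args) {
        System.out.println(route("host", "check web-01"));
        System.out.println(route("dns", "resolve example"));
    }
}
```

Falling back to UNKNOWN rather than throwing lets the caller answer an unrecognized request with a clear message instead of failing the whole stream.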
Performance Optimization and Stability Assurance
4.1 Bottleneck Analysis
LLM inference latency due to large data retrieval.
Context isolation under high concurrency.
Buffering for streaming output to avoid data loss.
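One common way to keep session state isolated under concurrency is a registry keyed by session ID, in the spirit of the getOrCreateContext call shown earlier. The sketch below is an assumption about how such a registry might look; SessionContext and SessionContextRegistry are hypothetical names, and the article's actual ProcessContext is not shown.

```java
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.CopyOnWriteArrayList;

/** Hypothetical per-session state holding the steps emitted so far. */
final class SessionContext {
    private final String sessionId;
    private final List<String> steps = new CopyOnWriteArrayList<>();

    SessionContext(String sessionId) {
        this.sessionId = sessionId;
    }

    String sessionId() {
        return sessionId;
    }

    void addStep(String step) {
        steps.add(step);
    }

    /** Returns an immutable snapshot so callers cannot mutate session state. */
    List<String> steps() {
        return List.copyOf(steps);
    }
}

/** Hands out exactly one context per session ID, even when requests race. */
final class SessionContextRegistry {
    private final ConcurrentMap<String, SessionContext> contexts = new ConcurrentHashMap<>();

    SessionContext getOrCreate(String sessionId) {
        // computeIfAbsent is atomic per key: concurrent callers get the same instance.
        return contexts.computeIfAbsent(sessionId, SessionContext::new);
    }

    /** Removes the context once the stream completes, so the map does not grow unbounded. */
    void release(String sessionId) {
        contexts.remove(sessionId);
    }

    int activeSessions() {
        return contexts.size();
    }
}
```

Releasing the context when a stream completes or times out matters as much as creating it; otherwise abandoned sessions accumulate in memory.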
4.2 Optimization Measures
Prompt compression and data trimming to reduce token consumption.
Cache frequently accessed host data using Redis.
Set appropriate timeouts and interruption mechanisms for stability.
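Two of the measures above, data trimming and timeouts, can be sketched with the JDK alone. This is a minimal illustration under stated assumptions: LlmCallGuards is a hypothetical class, the character budget is only a rough proxy for a token budget, and the timeout bounds how long the caller waits without interrupting the underlying work.

```java
import java.time.Duration;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

/** Hypothetical guards applied around a model call. */
final class LlmCallGuards {
    private static final String MARKER = "...[truncated]";

    private LlmCallGuards() {}

    /** Cuts tool output down to a character budget and marks the cut. */
    static String trim(String payload, int maxChars) {
        if (payload.length() <= maxChars) {
            return payload;
        }
        return payload.substring(0, maxChars) + MARKER;
    }

    /**
     * Runs the call asynchronously and returns the fallback if it does not finish in time.
     * The background task is not cancelled; it keeps running until it completes on its own.
     */
    static String callWithTimeout(Supplier<String> call, Duration timeout, String fallback) {
        return CompletableFuture.supplyAsync(call)
                .completeOnTimeout(fallback, timeout.toMillis(), TimeUnit.MILLISECONDS)
                .join();
    }

    public static void main(String[] args) {
        System.out.println(trim("cpu=93% mem=71% disk=88% load=12.4", 15));
        System.out.println(callWithTimeout(() -> "report ready", Duration.ofSeconds(2), "analysis timed out"));
    }
}
```

A production setup would likely pair the timeout with real cancellation of the in-flight request, and would measure the budget in tokens using the model's tokenizer rather than in characters.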
Future Evolution
5.1 Functional Expansion
Support additional ops scenarios such as network topology analysis and configuration change tracking.
Introduce Retrieval‑Augmented Generation (RAG) for FAQ knowledge.
Scheduled daily reports for trend monitoring.
5.2 Architectural Optimization
Adopt an Agent mode with a dynamic decision engine for self‑healing actions (e.g., process anomaly handling).
Conclusion
Ops‑Nexus is an intelligent analysis engine for operations, unifying alarm, log, and host diagnostics. It demonstrates solid engineering practices in prompt engineering, streaming output, and model decision making, and will continue to expand its capabilities and architectural resilience.
360 Zhihui Cloud Developer
360 Zhihui Cloud is an enterprise open service platform that aims to "aggregate data value and empower an intelligent future," leveraging 360's extensive product and technology resources to deliver platform services to customers.