Operations 12 min read

Automating Easysearch Cluster Alerts and Root‑Cause Analysis with AIOps – Full Implementation Guide

This article walks through a practical AIOps solution that replaces brittle keyword rules for Easysearch Elasticsearch clusters with a three‑step pipeline—Filebeat log ingestion, Flask‑driven LLM analysis, and automated email alerts plus ES feedback—detailing configuration, code, pitfalls, and suitability.

Mingyi World Elasticsearch
Mingyi World Elasticsearch
Mingyi World Elasticsearch
Automating Easysearch Cluster Alerts and Root‑Cause Analysis with AIOps – Full Implementation Guide

Why Rule‑Based Alerts Fail

Traditional Easysearch monitoring relies on static keyword rules (e.g., matching "OutOfMemoryError" for severe alerts or "circuit_breaking" for warnings). After a couple of weeks operators encounter three major issues: excessive false positives from deprecation warnings and transient network glitches, missed alerts when error messages change format, and alerts that provide no root‑cause insight, leaving operators to manually parse _cluster/health and logs.

Overall Approach: Three‑Step AIOps Pipeline

The solution eliminates keyword matching by letting a large language model read logs, assess real risk, and suggest root causes, so humans only review conclusions.

Step 1 – Log Ingestion

Filebeat ships logs to an Elasticsearch index named easysearch-logs-{{%+yyyy.MM.dd}}. The .env file defines two required fields: message stores the raw log text. @timestamp stores the UTC ingestion time (converted to UTC+8 for display).

Sample Filebeat snippet:

output.elasticsearch:
  index: "easysearch-logs-{{%+yyyy.MM.dd}}"
setup.template.pattern: "easysearch-logs-*"

Corresponding .env settings:

ES_INDEX=easysearch-logs-*
ES_SCHEME=https
ES_VERIFY_CERTS=false   # self‑signed cert in internal network

Step 2 – Feeding Logs to the Model

The Flask service periodically pulls recent error/warn logs. Because the default scan only covers the last five minutes, a three‑level fallback strategy is used:

1️⃣ Scan recent minutes for errors/warns.

2️⃣ If none, expand to the full time range.

3️⃣ If still empty, fetch the latest log sample for the model.

The fallback logic is implemented in the _search_es_logs function, which only returns an HTTP‑level failure when ES is unreachable; otherwise it follows a "no‑anomaly, skip analysis" branch.

Cluster health information ( _cluster/health) is bundled with the logs so the model can distinguish node failures from shard migrations. Example prompt construction:

health_text = json.dumps(cluster_health, ensure_ascii=False, indent=2)
user_content = f"""## Cluster health
{health_text}

## Recent log snippets (total {len(logs)} entries)
{log_text}

Please analyze the above logs, decide if an alert is needed, and return a JSON result."""

The prompt hard‑codes several judgment principles (focus on business impact, differentiate "sporadic errors" from "continuous degradation", ignore harmless deprecation warnings, and prioritize disk watermark, OOM/GC pressure, node loss, shard issues, circuit‑breaker trips). The model is forced to output structured JSON via response_format: json_object and a low temperature ( temperature=0.1).

Step 3 – Using the Analysis Results

If the model returns need_alert: true, an email is sent via a 163.com SMTP server (using an app‑specific authorization code). The alert payload is also written back to Elasticsearch for persistence, using four dedicated indices: es-aiops-config – stores UI‑saved connection and alert settings. es-aiops-alerts – records every analysis result, including non‑alerted cases. es-aiops-scan-logs – logs each scan step in detail. es-aiops-stats – holds cumulative statistics ( stats-current-v1) and per‑scan snapshots ( snapshot-{scan_id}).

This creates a closed data loop: logs enter ES, analysis results return to ES, and operators can trace the entire investigation without external storage.

Scanning Modes

The code distinguishes manual scans (triggered from the web UI) and scheduled scans (looping with a sleep interval, enabled by AUTO_SCAN=true). Both invoke the same run_scan() function, ensuring consistent behavior.

Quick Start

# 1. Deploy Filebeat on each node, output to easysearch-logs-*
# 2. Copy .env.example to .env and fill in ES credentials, DeepSeek key, and 163 SMTP auth code
# 3. Install dependencies and start the service
pip install -r requirements.txt
python app.py
# 4. Verify via the web UI at http://127.0.0.1:5001
#    • Save config → data appears in es-aiops-config
#    • Click "Immediate Scan" → data appears in es-aiops-alerts and es-aiops-scan-logs
#    • Check "Statistics History" for snapshots

Common Pitfalls

Timezone : @timestamp is stored in UTC; display conversion to UTC+8 must be applied consistently. Do not extract timestamps from the log message itself.

Pagination : The first page can merge log and node file results; subsequent pages must use pure ES pagination. Avoid slicing already‑paginated data in Python.

Log Deletion : Only allow deletion of documents in whitelisted indices; system indices (starting with .) are rejected. Use refresh=wait_for to make deletions visible immediately.

Config Persistence : Prefer storing configuration in the es-aiops-config index; fall back to local .aiops_config.json if ES write fails. Empty password fields are left unchanged to avoid overwriting saved secrets.

Who Should Use This

Suitable for single‑cluster, internal deployments with moderate log volume where a small ops team wants to minimize rule writing and focus on model‑generated recommendations. Not suitable for multi‑tenant SaaS platforms or sub‑second streaming alerting, which require Flink‑based AIOps pipelines.

Future Enhancements (Priority Order)

Alert deduplication – merge similar summaries within N minutes and throttle email frequency.

API authentication – front‑end Flask behind Nginx Basic Auth, keep SMTP credentials in ES.

Prompt customization per business – maintain separate SYSTEM_PROMPT for search‑oriented vs log‑oriented clusters.

One‑Line Takeaway

When rule‑based monitoring can’t catch anomalies, let a large language model evaluate the logs and provide root‑cause recommendations.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

elasticsearchDeepSeekFlaskAIOpsLog Monitoringfilebeat
Mingyi World Elasticsearch
Written by

Mingyi World Elasticsearch

The leading WeChat public account for Elasticsearch fundamentals, advanced topics, and hands‑on practice. Join us to dive deep into the ELK Stack (Elasticsearch, Logstash, Kibana, Beats).

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.