Comprehensive EFLK (Elasticsearch, Filebeat, Logstash, Kibana) Deep Inspection and Monitoring Guide
This comprehensive guide details a step‑by‑step deep‑inspection and monitoring strategy for an Elasticsearch‑Filebeat‑Logstash‑Kibana (EFLK) stack, covering cluster health, node and shard metrics, index status, query profiling, Filebeat, Logstash and Kibana validation, DSL query examples, and a Python script for automated metric collection.
Ensuring the stable operation of an Elasticsearch‑Filebeat‑Logstash‑Kibana (EFLK) stack is critical for both operations and big‑data environments. This guide presents a thorough deep‑inspection plan covering health checks, performance metrics, shard status, index health, and query profiling for each component.
1. Elasticsearch Deep Inspection
1.1 Cluster Health Check
Use the GET _cluster/health API to retrieve overall cluster health. Key fields to monitor are status (green/yellow/red), number_of_nodes, active_primary_shards, active_shards, and unassigned_shards.
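For automated checks, the health response can be evaluated in code. The sketch below is a minimal illustration (the function name and the conditions it flags are our own, not part of the API) that raises warnings from a parsed _cluster/health payload:

```python
def assess_cluster_health(health: dict) -> list[str]:
    """Flag common warning conditions in a parsed _cluster/health response."""
    warnings = []
    status = health.get("status")
    if status != "green":
        warnings.append(f"cluster status is {status}")
    if health.get("unassigned_shards", 0) > 0:
        warnings.append(f"{health['unassigned_shards']} unassigned shards")
    if health.get("number_of_pending_tasks", 0) > 0:
        warnings.append(f"{health['number_of_pending_tasks']} pending tasks")
    return warnings

# Example with a trimmed-down health payload:
sample = {
    "status": "yellow",
    "number_of_nodes": 3,
    "active_primary_shards": 10,
    "active_shards": 18,
    "unassigned_shards": 2,
}
print(assess_cluster_health(sample))
# ['cluster status is yellow', '2 unassigned shards']
```

In practice the payload would come from the cluster health API call rather than a literal dict; the point is that alerting rules reduce to a few dictionary lookups.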
1.2 Node Performance Monitoring
Query node statistics with GET _nodes/stats. Important metrics include indices.docs.count, indices.store.size_in_bytes, jvm.mem.heap_used_percent, os.cpu.percent, and fs.total.available_in_bytes.
1.3 Shard Status Monitoring
List shard details via GET _cat/shards?v&h=index,shard,prirep,state,unassigned.reason. Pay special attention to the unassigned.reason field. If shards are unassigned they can be reallocated with the command below; note that allocate_stale_primary with "accept_data_loss": true may discard recent writes, so use it only when the original primary cannot be recovered:

```
POST /_cluster/reroute
{
  "commands": [
    {
      "allocate_stale_primary": {
        "index": "your-index",
        "shard": 0,
        "node": "node-name",
        "accept_data_loss": true
      }
    }
  ]
}
```
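When triaging, it helps to extract just the unassigned shards from the _cat/shards output. The parser below is an illustrative sketch assuming the column order from the command above; the ?v header row fails the state check and is skipped naturally:

```python
def find_unassigned_shards(cat_shards_output: str) -> list[dict]:
    """Parse _cat/shards output (columns: index, shard, prirep, state,
    unassigned.reason) and return the unassigned shards with their reasons."""
    unassigned = []
    for line in cat_shards_output.strip().splitlines():
        fields = line.split()
        if len(fields) >= 4 and fields[3] == "UNASSIGNED":
            unassigned.append({
                "index": fields[0],
                "shard": int(fields[1]),
                "prirep": fields[2],
                "reason": fields[4] if len(fields) > 4 else None,
            })
    return unassigned

# Example with sample output lines:
sample = """logs-2024.05.01 0 p STARTED
logs-2024.05.01 0 r UNASSIGNED NODE_LEFT
logs-2024.05.02 1 p UNASSIGNED INDEX_CREATED"""
print(find_unassigned_shards(sample))
```

The resulting list maps directly onto the index/shard/node parameters that a reroute command needs.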
1.4 Index Status Inspection
Check index health using GET _cat/indices?v&h=index,health,status,pri,rep,docs.count,store.size. Monitor health, status, pri, rep, docs.count, and store.size.
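The _cat/indices output can likewise be scanned in code. The helpers below are illustrative sketches (names are our own): one converts _cat size strings such as 1.2gb to bytes, the other lists indices whose health is not green, assuming the column order from the command above without a header row:

```python
import re

UNITS = {"b": 1, "kb": 1024, "mb": 1024**2, "gb": 1024**3, "tb": 1024**4}

def parse_store_size(size: str) -> float:
    """Convert a _cat size string such as '1.2gb' to bytes."""
    m = re.fullmatch(r"([\d.]+)(b|kb|mb|gb|tb)", size)
    if not m:
        raise ValueError(f"unrecognized size: {size}")
    return float(m.group(1)) * UNITS[m.group(2)]

def unhealthy_indices(cat_indices_output: str) -> list[str]:
    """Return names of indices whose health column is yellow or red
    (columns: index, health, status, pri, rep, docs.count, store.size)."""
    bad = []
    for line in cat_indices_output.strip().splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[1] in ("yellow", "red"):
            bad.append(fields[0])
    return bad

# Example with sample output lines:
sample = """logs-2024.05.01 green  open 1 1 1000 1.2gb
logs-2024.05.02 yellow open 1 1  500 600mb"""
print(unhealthy_indices(sample))
# ['logs-2024.05.02']
```

Converting sizes to bytes makes it easy to sort indices by disk footprint or alert when an index crosses a threshold.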
1.5 Cluster Performance Analysis (Profile Query)
Enable the profile parameter in search requests to obtain per‑phase execution times, e.g.:
```
GET /your-index/_search?pretty
{
  "profile": true,
  "query": {
    "match": { "field": "value" }
  }
}
```
The response highlights the most time‑consuming stages, helping to pinpoint bottlenecks.
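Because a profile response is deeply nested, flattening it makes the bottleneck easier to spot. The sketch below (function name and structure handling are our own) walks the query tree of a parsed profile section and returns the slowest components by time_in_nanos:

```python
def slowest_query_components(profile: dict, top_n: int = 3) -> list[tuple[str, int]]:
    """Walk the query tree of a parsed 'profile' section and return the
    top_n (query type, time_in_nanos) pairs, slowest first."""
    timings = []

    def walk(node):
        timings.append((node.get("type", "?"), node.get("time_in_nanos", 0)))
        for child in node.get("children", []):
            walk(child)

    for shard in profile.get("shards", []):
        for search in shard.get("searches", []):
            for query in search.get("query", []):
                walk(query)
    return sorted(timings, key=lambda t: t[1], reverse=True)[:top_n]

# Example with a trimmed-down profile section:
sample = {"shards": [{"searches": [{"query": [
    {"type": "BooleanQuery", "time_in_nanos": 500000,
     "children": [
         {"type": "TermQuery", "time_in_nanos": 350000},
         {"type": "TermQuery", "time_in_nanos": 120000},
     ]}
]}]}]}
print(slowest_query_components(sample))
```

In this trimmed example the BooleanQuery dominates, and most of its time is spent in one of its TermQuery children.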
2. Filebeat Inspection
2.1 Configuration Check
Verify that Filebeat is installed and running (systemctl status filebeat) and that /etc/filebeat/filebeat.yml contains correct input sources and output destinations (Elasticsearch or Logstash).
2.2 Log Examination
Tail the Filebeat log (tail -f /var/log/filebeat/filebeat) to check for connection errors or permission issues.
2.3 Configuration Test
Validate the configuration with filebeat test config, verify connectivity to the configured output with filebeat test output, and run Filebeat in the foreground for debugging (filebeat -e).
3. Logstash Inspection
3.1 Process Check
Confirm Logstash is active via systemctl status logstash .
3.2 Pipeline Configuration Review
Inspect pipeline files under /etc/logstash/conf.d/ to ensure proper input, filter, and output sections.
3.3 Log Review
Tail the Logstash logs (tail -f /var/log/logstash/logstash-plain.log) and look for connection failures or Grok parsing errors.
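Such log reviews can be partially automated. The scanner below is a minimal sketch (the patterns are illustrative, not exhaustive) that counts suspicious lines in logstash-plain.log by category:

```python
import re

# Illustrative patterns; extend to match the failures seen in your pipelines.
PATTERNS = {
    "connection": re.compile(r"Connection refused|connect timed out", re.I),
    "grok": re.compile(r"_grokparsefailure|grok.*(fail|error)", re.I),
}

def scan_logstash_log(lines) -> dict:
    """Count log lines matching each pattern category."""
    counts = {name: 0 for name in PATTERNS}
    for line in lines:
        for name, pat in PATTERNS.items():
            if pat.search(line):
                counts[name] += 1
    return counts

# Example with sample log lines:
sample = [
    "[2024-05-01T10:00:00,123][WARN ][logstash.outputs.elasticsearch] Connection refused",
    "[2024-05-01T10:00:01,456][INFO ][logstash.agent] Pipelines running",
]
print(scan_logstash_log(sample))
# {'connection': 1, 'grok': 0}
```

In practice the lines would come from open('/var/log/logstash/logstash-plain.log') or a log-shipping pipeline, and nonzero counts would feed an alert.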
4. Kibana Inspection
4.1 Process Status
Check Kibana service health with systemctl status kibana .
4.2 Configuration Validation
Review kibana.yml (located at /etc/kibana/kibana.yml for package installs, or config/kibana.yml under the installation directory), ensuring elasticsearch.hosts points to the cluster and, if external access is required, server.host is set to 0.0.0.0.
4.3 Log Examination
Inspect the Kibana logs (e.g., tail -f logs/kibana.log, or journalctl -u kibana on systemd installs) for connection issues or startup failures.
4.4 UI Verification
Log into Kibana and confirm that Discover, Dashboards, and Visualizations load correctly.
5. DSL Query Examples
5.1 Slow‑Query Log
Find the slowest queries from the last day (this assumes slow-log events are indexed as documents with a took field):
```
GET /_search
{
  "query": {
    "range": {
      "@timestamp": { "gte": "now-1d/d", "lt": "now/d" }
    }
  },
  "sort": [ { "took": { "order": "desc" } } ],
  "size": 10
}
```
5.2 Error Log Search
Search for error messages across Filebeat indices:
```
GET /filebeat-*/_search
{
  "query": {
    "match": { "message": "error" }
  }
}
```
5.3 Node‑Specific Logs
Retrieve logs for a particular node:
```
GET /_search
{
  "query": {
    "term": {
      "host.name": { "value": "node-1" }
    }
  }
}
```
6. Enterprise Automation: Python Metrics Collector
A Python script (shown below) uses the elasticsearch client to gather cluster health, node CPU, load average, memory, JVM heap, and disk usage, then writes the data as JSON to a log file.
```python
import json
import configparser
import warnings
from datetime import datetime

from elasticsearch import Elasticsearch

warnings.filterwarnings("ignore")


def init_es_client(config_path='./conf/config.ini'):
    """Initialize and return an Elasticsearch client from an INI config file."""
    cfg = configparser.ConfigParser()
    cfg.read(config_path)
    es_host = cfg.get('elasticsearch', 'ES_HOST')
    es_user = cfg.get('elasticsearch', 'ES_USER')
    es_password = cfg.get('elasticsearch', 'ES_PASSWORD')
    return Elasticsearch(
        hosts=[es_host],
        basic_auth=(es_user, es_password),
        verify_certs=False,
        ca_certs='conf/http_ca.crt',
    )


LOG_FILE = 'elasticsearch_metrics.log'
es = init_es_client()


def get_cluster_health():
    return es.cluster.health().body


def get_node_stats():
    return es.nodes.stats().body


def get_cluster_metrics():
    """Collect cluster health plus per-node CPU, memory, heap, and disk usage."""
    metrics = {'cluster_health': get_cluster_health(), 'nodes': {}}
    node_stats = get_node_stats().get('nodes', {})
    for nid, info in node_stats.items():
        name = info.get('name')
        fs_total = info['fs']['total']
        metrics['nodes'][name] = {
            'cpu_usage': info['os']['cpu']['percent'],
            'load_average': info['os']['cpu'].get('load_average', {}).get('1m'),
            'memory_used': info['os']['mem']['used_percent'],
            'heap_used': info['jvm']['mem']['heap_used_percent'],
            'disk_available': fs_total['available_in_bytes'] / (1024 ** 3),  # GiB
            'disk_total': fs_total['total_in_bytes'] / (1024 ** 3),          # GiB
            'disk_usage_percent': 100 - (fs_total['available_in_bytes'] * 100
                                         / fs_total['total_in_bytes']),
        }
    return metrics


def log_metrics():
    """Append a timestamped JSON snapshot of cluster metrics to the log file."""
    metrics = get_cluster_metrics()
    ts = datetime.now().strftime('%Y-%m-%d %H:%M:%S')
    with open(LOG_FILE, 'a') as f:
        f.write(f"Timestamp: {ts}\n")
        f.write(json.dumps(metrics, indent=4))
        f.write('\n\n')


if __name__ == "__main__":
    log_metrics()
    print("Elasticsearch cluster metrics logged successfully.")
```
The script can be scheduled with cron to run daily at 06:00:
```
0 6 * * * /usr/bin/python3 /home/user/scripts/es_metrics.py >> /home/user/scripts/es_metrics_cron.log 2>&1
```
7. Conclusion
By following this deep‑inspection framework and automating metric collection, teams can maintain full visibility into the health and performance of the EFLK stack, quickly detect issues, and ensure reliable operation. Integrating monitoring tools such as Prometheus, Zabbix, or Grafana further enhances observability.
Sohu Tech Products
A knowledge-sharing platform for Sohu's technology products. As a leading Chinese internet brand with media, video, search, and gaming services and over 700 million users, Sohu continuously drives tech innovation and practice. We’ll share practical insights and tech news here.