Performance Comparison of Elasticsearch and ClickHouse for Log Analytics
This article compares Elasticsearch and ClickHouse as log analytics solutions. It covers their architectures, deployment configurations, and query capabilities, benchmarks several query types on both, and finds that ClickHouse generally outperforms Elasticsearch in query speed, most clearly on aggregations.
Elasticsearch is a real‑time distributed search and analytics engine built on Lucene, often used together with Logstash and Kibana (the ELK stack) for end‑to‑end log analysis. ClickHouse, developed by Yandex, is a column‑oriented relational DBMS designed for OLAP workloads and has become very popular in the big‑data space.
In recent years many companies (e.g., Ctrip, Kuaishou) have begun migrating log workloads from Elasticsearch to ClickHouse due to performance advantages.
Architecture and Design Comparison
Elasticsearch relies on inverted indexes and Bloom filters to solve search problems, using sharding and replica mechanisms for scalability and high availability. Its nodes can be classified as:
Client Node: handles API requests and data access; does not store data.
Data Node: stores and indexes data.
Master Node: coordinates the cluster; does not store data.
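To make the inverted-index idea mentioned above more concrete, here is a minimal Python sketch. It is not Lucene's implementation: the function names `build_inverted_index` and `search_term`, the whitespace tokenizer, and the sample messages are all assumptions made for illustration.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lowercased token to the sorted ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for token in text.lower().split():
            index[token].add(doc_id)
    return {token: sorted(ids) for token, ids in index.items()}


def search_term(index, term):
    """Look the token up directly instead of scanning every document."""
    return index.get(term.lower(), [])


# Hypothetical log messages, not taken from the benchmark data.
docs = [
    "pretty pretty good",
    "disk is almost full",
    "that was pretty fast",
]
index = build_inverted_index(docs)
hits = search_term(index, "pretty")
print(hits)
```

A term lookup becomes a single dictionary access, which is why exact-term searches are the natural strength of this structure. The cost is paid at write time, when every token of every document has to be indexed.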
ClickHouse follows an MPP architecture with each node responsible for a portion of the data processing. It stores data column‑wise, uses compression, sparse indexes, and SIMD instructions for fast computation, and relies on ZooKeeper for node coordination.
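The sparse index mentioned above can be sketched roughly as follows. This is a simplified illustration, not ClickHouse's code: data is kept sorted by the key, only the first key of each block of rows (a granule) is indexed, and a range query uses binary search over those marks to decide which granules to read. The names `build_sparse_index` and `granules_for_range` and the tiny granule size are assumptions chosen to keep the example readable.

```python
from bisect import bisect_left, bisect_right

GRANULE_SIZE = 4  # illustrative only; ClickHouse's default index_granularity is 8192


def build_sparse_index(sorted_keys, granule_size=GRANULE_SIZE):
    """Keep only the first key of every granule (block of consecutive rows)."""
    return sorted_keys[::granule_size]


def granules_for_range(marks, lo, hi):
    """Return the indexes of granules that may contain keys in [lo, hi]."""
    # Step back one granule: keys >= lo can start inside the granule before the mark.
    first = max(bisect_left(marks, lo) - 1, 0)
    last = bisect_right(marks, hi) - 1
    return list(range(first, last + 1))


keys = list(range(100, 120))  # 20 rows, already sorted by key
marks = build_sparse_index(keys)
selected = granules_for_range(marks, 105, 109)

# Only the selected granules are read, then filtered row by row.
rows_read = [
    k
    for g in selected
    for k in keys[g * GRANULE_SIZE:(g + 1) * GRANULE_SIZE]
]
matches = [k for k in rows_read if 105 <= k <= 109]
print(selected, len(rows_read), matches)
```

The index holds one entry per granule rather than one per row, so it stays small enough to keep in memory. The trade-off is that some rows outside the requested range are read and discarded.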
Test Stacks
Four Docker‑Compose stacks were built:
ES stack: a single‑node Elasticsearch container and a Kibana container.
ClickHouse stack: a single‑node ClickHouse container, with TabixUI as a client.
Data import stack: Vector.dev (similar to Fluentd) generates syslog data and feeds both stacks.
Test control stack: Jupyter notebooks that use the Python SDKs for Elasticsearch and ClickHouse to run queries.
Deployment files:
ES stack:

version: '3.7'
services:
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:7.4.0
    container_name: elasticsearch
    environment:
      - xpack.security.enabled=false
      - discovery.type=single-node
    ulimits:
      memlock:
        soft: -1
        hard: -1
      nofile:
        soft: 65536
        hard: 65536
    cap_add:
      - IPC_LOCK
    volumes:
      - elasticsearch-data:/usr/share/elasticsearch/data
    ports:
      - 9200:9200
      - 9300:9300
    deploy:
      resources:
        limits:
          cpus: '4'
          memory: 4096M
        reservations:
          memory: 4096M
  kibana:
    container_name: kibana
    image: docker.elastic.co/kibana/kibana:7.4.0
    environment:
      - ELASTICSEARCH_HOSTS=http://elasticsearch:9200
    ports:
      - 5601:5601
    depends_on:
      - elasticsearch
volumes:
  elasticsearch-data:
    driver: local

ClickHouse stack:

version: "3.7"
services:
  clickhouse:
    container_name: clickhouse
    image: yandex/clickhouse-server
    volumes:
      - ./data/config:/var/lib/clickhouse
    ports:
      - "8123:8123"
      - "9000:9000"
      - "9009:9009"
      - "9004:9004"
    ulimits:
      nproc: 65535
      nofile:
        soft: 262144
        hard: 262144
    healthcheck:
      test: ["CMD", "wget", "--spider", "-q", "localhost:8123/ping"]
      interval: 30s
      timeout: 5s
      retries: 3
    deploy:
      resources:
        limits:
          cpus: '4'
          memory: 4096M
        reservations:
          memory: 4096M
  tabixui:
    container_name: tabixui
    image: spoonest/clickhouse-tabix-web-client
    environment:
      - CH_NAME=dev
      - CH_HOST=127.0.0.1:8123
      - CH_LOGIN=default
    ports:
      - "18080:80"
    depends_on:
      - clickhouse
    deploy:
      resources:
        limits:
          cpus: '0.1'
          memory: 128M
        reservations:
          memory: 128M

A ClickHouse table for syslog data was created:
CREATE TABLE default.syslog(
    application String,
    hostname String,
    message String,
    mid String,
    pid String,
    priority Int16,
    raw String,
    timestamp DateTime('UTC'),
    version Int16
) ENGINE = MergeTree()
    PARTITION BY toYYYYMMDD(timestamp)
    ORDER BY timestamp
    TTL timestamp + toIntervalMonth(1);

The Vector pipeline configuration (vector.toml) defines sources, transforms, and sinks that generate 100k syslog records and send them to both Elasticsearch and ClickHouse:
[sources.in]
type = "generator"
format = "syslog"
interval = 0.01
count = 100000
[transforms.clone_message]
type = "add_fields"
inputs = ["in"]
fields.raw = "{{ message }}"
[transforms.parser]
type = "regex_parser"
inputs = ["clone_message"]
field = "message"
patterns = ['^<(?P<priority>\d*)>(?P<version>\d) (?P<timestamp>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{3}Z) (?P<hostname>\w+\.\w+) (?P<application>\w+) (?P<pid>\d+) (?P<mid>ID\d+) - (?P<message>.*)$']
[transforms.coercer]
type = "coercer"
inputs = ["parser"]
types.timestamp = "timestamp"
types.version = "int"
types.priority = "int"
[sinks.out_console]
type = "console"
inputs = ["coercer"]
target = "stdout"
encoding.codec = "json"
[sinks.out_clickhouse]
type = "clickhouse"
inputs = ["coercer"]
host = "http://host.docker.internal:8123"
table = "syslog"
encoding.only_fields = ["application","hostname","message","mid","pid","priority","raw","timestamp","version"]
encoding.timestamp_format = "unix"
[sinks.out_es]
type = "elasticsearch"
inputs = ["coercer"]
endpoint = "http://host.docker.internal:9200"
index = "syslog-%F"
compression = "none"
healthcheck.enabled = true

Query Comparison
Both systems were queried using equivalent statements (JSON DSL for ES, SQL for ClickHouse) covering match‑all, single‑field, multi‑field, term, range, exists, regex, and aggregation scenarios. Example queries:
# ES match_all
{ "query": { "match_all": {} } }
# ClickHouse
SELECT * FROM syslog;

# ES term query
{ "query": { "term": { "message": "pretty" } } }
# ClickHouse
SELECT * FROM syslog WHERE lowerUTF8(raw) LIKE '%pretty%';

Performance tests were run ten times per query using Python SDKs, and response time distributions were plotted.
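The article does not include the notebook code, so the following is only a sketch of how such a repeated-run measurement could look. The helper names `time_query` and `summarize` are hypothetical, and `fake_query` is a stand-in workload so that the example runs without a server.

```python
import statistics
import time


def time_query(run_query, runs=10):
    """Call run_query `runs` times and return each wall-clock duration in milliseconds."""
    durations_ms = []
    for _ in range(runs):
        start = time.perf_counter()
        run_query()
        durations_ms.append((time.perf_counter() - start) * 1000.0)
    return durations_ms


def summarize(durations_ms):
    """Reduce the samples to a few statistics that describe the distribution."""
    return {
        "min": min(durations_ms),
        "median": statistics.median(durations_ms),
        "max": max(durations_ms),
    }


def fake_query():
    """Placeholder workload; a real test would issue a query through a client SDK."""
    sum(range(10_000))


samples = time_query(fake_query, runs=10)
stats = summarize(samples)
print(stats)
```

In a real comparison the callable passed to `time_query` would wrap one query against one of the two stacks, and the same harness would be reused for both so that the measurements stay comparable. Keeping all ten samples, rather than only an average, is what makes it possible to plot a distribution.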
The results show that ClickHouse outperforms Elasticsearch in most of the tested query types, especially aggregations, where columnar storage provides a clear advantage. Even for regex and term queries, ClickHouse remains competitive.
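The columnar advantage for aggregations can be illustrated with a toy Python sketch. It shows only the access pattern, not real storage or I/O, and the sample records are invented. The aggregation corresponds roughly to a query such as `SELECT priority, count() FROM syslog GROUP BY priority`, which is an assumed example and not one of the article's listed queries.

```python
from collections import Counter

# Row-oriented layout: each record keeps all of its fields together.
rows = [
    {"hostname": "a.example", "priority": 3, "message": "x" * 200},
    {"hostname": "b.example", "priority": 5, "message": "y" * 200},
    {"hostname": "a.example", "priority": 3, "message": "z" * 200},
    {"hostname": "c.example", "priority": 7, "message": "w" * 200},
]

# Column-oriented layout: one contiguous list per field.
columns = {name: [row[name] for row in rows] for name in rows[0]}


def count_by_priority_rows(rows):
    """Walks every full record, including fields the query never uses."""
    return Counter(row["priority"] for row in rows)


def count_by_priority_columns(columns):
    """Reads only the 'priority' column; the other columns are never touched."""
    return Counter(columns["priority"])


row_result = count_by_priority_rows(rows)
column_result = count_by_priority_columns(columns)
print(column_result)
```

Both functions return the same counts. On disk, the column layout means the large `message` values are never read for this query, and a column of similar small integers also tends to compress well.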
Conclusion
This article demonstrates that ClickHouse delivers superior performance for log analytics workloads compared to Elasticsearch, explaining why many organizations are transitioning to ClickHouse for such scenarios. While Elasticsearch offers richer query features, the basic queries tested here highlight ClickHouse’s efficiency.
Architecture Digest
Focusing on Java backend development, covering application architecture from top-tier internet companies (high availability, high performance, high stability), big data, machine learning, Java architecture, and other popular fields.