How We Re‑engineered Our Log Platform to Cut Costs by 60% with ClickHouse
This article details the redesign of Inke's logging infrastructure, from an ELK-based stack to a ClickHouse-powered architecture, covering the motivations, key requirements, component choices, configuration examples, performance optimizations, and the resulting cost and storage benefits.
Background Introduction
Logs are essential for online troubleshooting and observability, and a log platform must deliver stability, performance, cost-effectiveness, usability, and scalability. The existing ELK-based system, with 8 ES clusters, more than 100 machines, and 50+ Logstash nodes, struggled with rapid data growth, slowing processing, storage shortages, and high maintenance costs, prompting the search for a new architecture.
Comparison of Mainstream Log Platforms
Key Requirements for the New Log Platform
Support efficient aggregation queries across regions and tenants.
Reduce cost while handling ten times the current scale, and improve reliability and operability.
Enable transparent migration from ELK without extensive changes and retain Kibana‑like interaction.
Provide a high‑performance collector and parallel processing to boost ingestion speed.
New Architecture
Log Collection – Log‑Pilot for Filebeat
Log‑Pilot runs in Kubernetes to collect container logs, offering easy deployment, multi‑source support, real‑time viewing, multiple outputs, and declarative configuration. It simplifies configuration but currently lacks active maintenance.
Log Parsing – Vector
Vector is a high‑performance observability data pipeline written in Rust. It collects, transforms, and routes logs, metrics, and traces, offering low resource usage, a custom DSL, and extensible plugins, making it ideal for large‑scale data streams.
<code># Sources
[sources.my_source_id]
type = "kafka"
bootstrap_servers = "10.x.x.1:9092,10.x.x.2:9092,10.x.x.3:9092"
group_id = "consumer-group-name"
topics = [ "^(prefix1|prefix2)-.+" ]
# Transforms (optional)
[transforms.my_transform_id]
type = "remap"
inputs = ["my_source_id"]
source = ". = parse_key_value!(.message)"
# Sinks – console output
[sinks.print]
type = "console"
inputs = ["my_transform_id"]
encoding.codec = "json"
# Sinks – ClickHouse
[sinks.my_sink_id]
type = "clickhouse"
inputs = ["my_transform_id"]
endpoint = "http://127.0.0.1:8123"
database = "default"
table = "table"
auth.strategy = "basic"
auth.user = "user"
auth.password = "password"
compression = "gzip"
skip_unknown_fields = true
</code>
Important Points for Writing Vector Data to ClickHouse
Vector’s Kafka source relies on consumer-group rebalancing to spread topic partitions across instances, keeping data distribution roughly even.
Set an appropriate batch size and flush interval (e.g., 100k records or every 10 s) to limit the number of parts created and avoid “Too many parts” errors.
Use distributed tables to split data across servers for higher throughput and reliability.
Choose suitable partition keys to avoid excessive partitions.
Define primary keys and indexes to maintain order and improve queryability.
<code>batch.max_bytes = 2000000000 # max bytes per batch
batch.max_events = 100000 # max events per batch
batch.timeout_secs = 10 # max wait time for a batch
</code>
Log Storage – ClickHouse
Reasons for Choosing ClickHouse
Higher write throughput compared to Elasticsearch.
Powerful single‑node large‑query capability.
Lower server cost.
More stable with lower operational overhead.
SQL syntax is simpler than the ES query DSL, reducing the learning curve.
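As an illustration of that last point, a typical log aggregation that would require a nested ES DSL request body is a single SQL statement in ClickHouse. The table and field names below are hypothetical, not the production schema:
<code>-- Hypothetical example: error counts per service over the last hour
SELECT service, count() AS errors
FROM logs.app_log
WHERE level = 'ERROR' AND timestamp >= now() - INTERVAL 1 HOUR
GROUP BY service
ORDER BY errors DESC
LIMIT 10;
</code>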
ClickHouse Cluster Planning
Consider data volume, ingestion rate, and real‑time requirements.
Assess query load, complexity, frequency, concurrency, and performance needs.
Plan for reliability, fault tolerance, monitoring, and maintenance.
Table Design Guidelines
Create indexes on frequently queried fields.
Select partition keys based on business scenarios.
Use appropriate MergeTree engine and sort keys aligned with queries.
Choose compression algorithms (e.g., LZ4 vs. ZSTD) balancing storage and query speed.
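Putting these guidelines together, a local log table might look like the following sketch. All names, the skip-index parameters, and the codec choices are illustrative, assuming a daily partitioning scheme:
<code>-- Illustrative local table applying the guidelines above
CREATE TABLE IF NOT EXISTS logs.app_log_local ON CLUSTER cluster
(
    timestamp DateTime CODEC(Delta, ZSTD(1)),
    service   LowCardinality(String),
    level     LowCardinality(String),
    message   String CODEC(ZSTD(3)),
    -- skip index on a frequently searched field
    INDEX idx_message message TYPE tokenbf_v1(30720, 2, 0) GRANULARITY 1
)
ENGINE = MergeTree()
PARTITION BY toYYYYMMDD(timestamp)   -- one partition per day, avoids partition explosion
ORDER BY (service, level, timestamp) -- sort key aligned with common query filters
SETTINGS index_granularity = 8192;
</code>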
Creating Distributed Tables
<code>-- Create local table
CREATE TABLE [IF NOT EXISTS] db.local_table_name ON CLUSTER cluster (
name1 type1,
name2 type2,
...
INDEX index_name1 expr1 TYPE type1(...) GRANULARITY value1,
INDEX index_name2 expr2 TYPE type2(...) GRANULARITY value2
) ENGINE = engine_name()
[PARTITION BY expr]
[ORDER BY expr]
[PRIMARY KEY expr]
[SETTINGS name=value, ...];
</code>
<code>-- Create distributed table
CREATE TABLE db.d_table_name ON CLUSTER cluster AS db.local_table_name
    ENGINE = Distributed(cluster, db, local_table_name [, sharding_key]);
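
-- A concrete instantiation of the template above (hypothetical names),
-- sharding rows randomly across the cluster's shards:
CREATE TABLE logs.d_app_log ON CLUSTER cluster AS logs.app_log_local
    ENGINE = Distributed(cluster, logs, app_log_local, rand());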
</code>
Visualization Analysis Platform
The team built a custom log visualization and query platform, similar to Kibana/SLS, to minimize migration cost and to integrate with monitoring, alerting, and distributed tracing.
Provides query syntax highlighting, time‑distribution preview, and log snippet previews.
Monitoring and Alerting
ClickHouse exposes performance metrics (query time, memory, disk usage, connections) that can be scraped by Prometheus and visualized with Grafana.
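Beyond the Prometheus/Grafana path, the same signals can be inspected directly from ClickHouse’s built-in system tables. A sketch of a slow-query check (the thresholds are arbitrary):
<code>-- Recent queries slower than 5 s, from the built-in query log
SELECT event_time, query_duration_ms, memory_usage, query
FROM system.query_log
WHERE type = 'QueryFinish'
  AND query_duration_ms > 5000
  AND event_time >= now() - INTERVAL 1 DAY
ORDER BY query_duration_ms DESC
LIMIT 20;
</code>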
Results
Integrating ClickHouse for server and Nginx logs cut total logging costs by 60% while storing 30% more log volume compared to the previous ELK setup.
Future Plans
Support SQL‑based query services.
Fine-tune queries with PREWHERE/WHERE clauses and data-skipping index strategies.
Implement hot‑cold tiered storage to improve retention and reduce cost.
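The first two plans map onto standard ClickHouse features; the statements below are a sketch with hypothetical table and volume names, assuming a storage policy with a 'cold' volume is configured, not the final design:
<code>-- PREWHERE reads the filter column first, fetching remaining columns only for matches
SELECT timestamp, message
FROM logs.app_log_local
PREWHERE level = 'ERROR'
WHERE timestamp >= now() - INTERVAL 1 DAY;

-- Hot-cold tiering: move parts older than 7 days to a slower volume via TTL
ALTER TABLE logs.app_log_local
    MODIFY TTL timestamp + INTERVAL 7 DAY TO VOLUME 'cold';
</code>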
Summary
Migrating logs from Elasticsearch to ClickHouse saves server resources and lowers overall operational cost.
Optimized log query performance unlocks greater value for log analytics.
Nevertheless, Elasticsearch remains indispensable for certain use cases.
Inke Technology
Official account of Inke Technology