
How We Re‑engineered Our Log Platform to Cut Costs by 60% with ClickHouse

This article details the redesign of a company’s logging infrastructure—from an ELK‑based solution to a ClickHouse‑powered architecture—highlighting the motivations, key requirements, component choices, configuration examples, performance optimizations, and the resulting cost and storage benefits.

Inke Technology

Background Introduction

Logs are essential for online troubleshooting and observability, requiring stability, performance, cost‑effectiveness, usability, and scalability. The existing ELK‑based system, with 8 ES clusters, over 100 machines, and 50+ Logstash nodes, faced data growth, slower processing, storage shortages, and high maintenance costs, prompting a search for a new architecture.

Comparison of Mainstream Log Platforms

Key Requirements for the New Log Platform

Support efficient aggregation queries across regions and tenants.

Reduce cost while handling ten times the current scale, and improve reliability and operability.

Enable transparent migration from ELK without extensive changes and retain Kibana‑like interaction.

Provide a high‑performance collector and parallel processing to boost ingestion speed.

New Architecture

Log Collection – Log‑Pilot for Filebeat

Log‑Pilot runs in Kubernetes to collect container logs, offering easy deployment, multi‑source support, real‑time viewing, multiple outputs, and declarative configuration. It simplifies configuration but currently lacks active maintenance.

Log Parsing – Vector

Vector is a high‑performance observability data pipeline written in Rust. It collects, transforms, and routes logs, metrics, and traces, offering low resource usage, a custom DSL, and extensible plugins, making it ideal for large‑scale data streams.

<code># Sources
[sources.my_source_id]
  type = "kafka"
  bootstrap_servers = "10.x.x.1:9092,10.x.x.2:9092,10.x.x.3:9092"
  group_id = "consumer-group-name"
  topics = [ "^(prefix1|prefix2)-.+" ]

# Transforms (optional)
[transforms.my_transform_id]
  type = "remap"
  inputs = ["my_source_id"]
  source = ". = parse_key_value!(.message)"

# Sinks – console output
[sinks.print]
  type = "console"
  inputs = ["my_transform_id"]
  encoding.codec = "json"

# Sinks – ClickHouse
[sinks.my_sink_id]
  type = "clickhouse"
  inputs = ["my_transform_id"]
  endpoint = "http://127.0.0.1:8123"
  database = "default"
  table = "table"
  auth.strategy = "basic"
  auth.user = "user"
  auth.password = "password"
  compression = "gzip"
  skip_unknown_fields = true
</code>

Important Points for Writing Vector Data to ClickHouse

Rely on Vector's Kafka consumer-group balancing to spread partitions, and hence data, roughly evenly across instances.

Set appropriate batch size and write frequency (e.g., 100 k records or every 10 s) to limit parts and avoid “Too many parts” errors.

Use distributed tables to split data across servers for higher throughput and reliability.

Choose suitable partition keys to avoid excessive partitions.

Define primary keys and indexes to maintain order and improve queryability.

<code># Batch settings for the ClickHouse sink ([sinks.my_sink_id])
batch.max_bytes = 2000000000   # max bytes per batch
batch.max_events = 100000      # max events per batch
batch.timeout_secs = 10        # max wait time for a batch
</code>
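To verify that the batch settings are keeping part counts under control, the `system.parts` table can be inspected directly. The query below is a generic sketch, not from the original article; it flags the partitions currently accumulating the most active parts:

```sql
-- Count active parts per partition to spot "Too many parts" risk early
SELECT database, table, partition, count() AS active_parts
FROM system.parts
WHERE active
GROUP BY database, table, partition
ORDER BY active_parts DESC
LIMIT 10;
```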

Log Storage – ClickHouse

Reasons for Choosing ClickHouse

Higher write throughput compared to Elasticsearch.

Powerful single‑node large‑query capability.

Lower server cost.

More stable with lower operational overhead.

SQL is simpler than the ES query DSL, reducing the learning curve.
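As an illustration of that last point, a log search that would take a nested bool/must/range query in ES DSL is a few lines of SQL in ClickHouse (table and column names here are hypothetical, not from the article):

```sql
-- Recent error logs containing "timeout" (illustrative schema)
SELECT timestamp, level, message
FROM logs.nginx_access
WHERE event_date = today()
  AND level = 'ERROR'
  AND message LIKE '%timeout%'
ORDER BY timestamp DESC
LIMIT 100;
```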

ClickHouse Cluster Planning

Consider data volume, ingestion rate, and real‑time requirements.

Assess query load, complexity, frequency, concurrency, and performance needs.

Plan for reliability, fault tolerance, monitoring, and maintenance.

Table Design Guidelines

Create indexes on frequently queried fields.

Select partition keys based on business scenarios.

Use appropriate MergeTree engine and sort keys aligned with queries.

Choose compression algorithms (e.g., LZ4 vs. ZSTD) balancing storage and query speed.
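The codec trade-off can be made per column. The sketch below is an illustrative schema (not from the article) showing the common pattern: fast-decompressing defaults for hot columns, heavier ZSTD for bulky payload columns:

```sql
-- Per-column codecs: ZSTD compresses better for bulky, rarely-read payloads;
-- LZ4 (the default) decompresses faster for hot columns
CREATE TABLE logs.app_log_local
(
    timestamp DateTime CODEC(Delta, ZSTD(1)),
    level     LowCardinality(String),
    message   String CODEC(ZSTD(3))
)
ENGINE = MergeTree
ORDER BY timestamp;
```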

Creating Distributed Tables

<code>-- Create local table
CREATE TABLE [IF NOT EXISTS] db.local_table_name ON CLUSTER cluster (
    name1 type1,
    name2 type2,
    ...
    INDEX index_name1 expr1 TYPE type1(...) GRANULARITY value1,
    INDEX index_name2 expr2 TYPE type2(...) GRANULARITY value2
) ENGINE = engine_name()
[PARTITION BY expr]
[ORDER BY expr]
[PRIMARY KEY expr]
[SETTINGS name=value, ...];
</code>
<code>-- Create distributed table
CREATE TABLE db.d_table_name ON CLUSTER cluster AS db.local_table_name
ENGINE = Distributed(cluster, db, local_table_name [, sharding_key]);
</code>
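A concrete instantiation of the two templates above might look like the following. All names (cluster, database, ZooKeeper path, columns) are illustrative assumptions, not the article's actual schema:

```sql
-- Illustrative nginx log table: daily partitions, token bloom-filter index
CREATE TABLE IF NOT EXISTS logs.nginx_local ON CLUSTER log_cluster
(
    event_date Date,
    timestamp  DateTime,
    host       LowCardinality(String),
    status     UInt16,
    url        String,
    message    String,
    INDEX idx_url url TYPE tokenbf_v1(4096, 3, 0) GRANULARITY 4
)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/nginx_local', '{replica}')
PARTITION BY toYYYYMMDD(event_date)
ORDER BY (host, timestamp);

-- Distributed table over the local tables, sharded randomly
CREATE TABLE logs.nginx_all ON CLUSTER log_cluster AS logs.nginx_local
ENGINE = Distributed(log_cluster, logs, nginx_local, rand());
```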

Visualization Analysis Platform

The team built a custom log visualization and query platform, modeled on Kibana/SLS, to minimize migration cost and to integrate with monitoring, alerting, and distributed tracing.

It provides query syntax highlighting, time-distribution preview, and log snippet previews.

Monitoring and Alerting

ClickHouse exposes performance metrics (query time, memory, disk usage, connections) that can be scraped by Prometheus and visualized with Grafana.
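Beyond a Prometheus exporter, the same metrics can be queried directly from ClickHouse's system tables, which is handy for ad-hoc checks. A generic sketch (not from the article):

```sql
-- Point-in-time gauges (e.g. open connections, memory)
SELECT metric, value FROM system.metrics
WHERE metric LIKE '%Connection%';

-- Cumulative counters (e.g. total queries executed)
SELECT event, value FROM system.events
WHERE event = 'Query';
```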

Results

Integrating ClickHouse for server and Nginx logs cut total logging costs by 60% while storing 30% more log volume compared to the previous ELK setup.

Future Plans

Support SQL‑based query services.

Fine-tune queries with PREWHERE/WHERE clauses and data-skipping index strategies.

Implement hot‑cold tiered storage to improve retention and reduce cost.
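ClickHouse supports the hot-cold plan natively through table TTLs bound to storage volumes. The sketch below assumes a storage policy named `hot_cold` (with a `cold` volume) has been configured in the server's storage configuration; the table name is the hypothetical one used earlier:

```sql
-- Move parts to the cold volume after 7 days, delete after 30
ALTER TABLE logs.nginx_local ON CLUSTER log_cluster
    MODIFY TTL timestamp + INTERVAL 7 DAY TO VOLUME 'cold',
               timestamp + INTERVAL 30 DAY DELETE;
```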

Summary

Migrating logs from Elasticsearch to ClickHouse saves server resources and lowers overall operational cost.

Optimized log query performance unlocks greater value for log analytics.

Nevertheless, Elasticsearch remains indispensable for certain use cases.

Big Data · Observability · Kubernetes · ClickHouse · Logging · Vector