
Elasticsearch vs ClickHouse: Performance, Cost, and Deployment Guide

This article compares Elasticsearch and ClickHouse in terms of write throughput, query speed, and server cost, then provides a step‑by‑step deployment guide for a private data pipeline using Zookeeper, Kafka, FileBeat, and ClickHouse, along with common issues and their solutions.


The author explains the need for a private data‑analysis platform to address security and cost concerns of SaaS services, and chooses ClickHouse as a more efficient alternative to Elasticsearch for log storage and analysis.

Elasticsearch vs ClickHouse

ClickHouse offers significantly higher write throughput (50–200 MB/s per server, over 600,000 records/s, more than 5× Elasticsearch) and far fewer write rejections, resulting in more reliable ingestion.

Query performance is also superior: with data in the page cache, ClickHouse can scan at 2–30 GB/s, making queries 5–30× faster than Elasticsearch.

Because ClickHouse compresses data far more aggressively (1/3 to 1/30 of Elasticsearch's disk usage), it reduces storage, I/O, and CPU consumption, cutting server costs roughly in half.
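To gauge the compression ratio on your own ClickHouse cluster, you can query the system.parts system table; this is a standard diagnostic query, shown here as a sketch to be run against your own databases:

```sql
-- Compressed vs. uncompressed size per table, over all active data parts
SELECT
    table,
    formatReadableSize(sum(data_compressed_bytes))   AS compressed,
    formatReadableSize(sum(data_uncompressed_bytes)) AS uncompressed,
    round(sum(data_uncompressed_bytes) / sum(data_compressed_bytes), 2) AS ratio
FROM system.parts
WHERE active
GROUP BY table
ORDER BY ratio DESC;
```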

Cost Analysis

A cost comparison on Alibaba Cloud (Aliyun) without discounts shows ClickHouse to be markedly cheaper than Elasticsearch for the same workload.

Environment Deployment

1. Zookeeper Cluster Deployment

# install JDK 8
yum install java-1.8.0-openjdk-devel.x86_64
# configure JAVA_HOME and related environment variables in /etc/profile
# sync the system clock
yum install ntpdate
ntpdate asia.pool.ntp.org

mkdir -p /usr/zookeeper/data
mkdir -p /usr/zookeeper/logs
wget --no-check-certificate https://mirrors.tuna.tsinghua.edu.cn/apache/zookeeper/zookeeper-3.7.1/apache-zookeeper-3.7.1-bin.tar.gz
tar -zvxf apache-zookeeper-3.7.1-bin.tar.gz -C /usr/zookeeper

export ZOOKEEPER_HOME=/usr/zookeeper/apache-zookeeper-3.7.1-bin
export PATH=$ZOOKEEPER_HOME/bin:$PATH

cd $ZOOKEEPER_HOME/conf
cp zoo_sample.cfg zoo.cfg
vi zoo.cfg

tickTime=2000
initLimit=10
syncLimit=5
dataDir=/usr/zookeeper/data
dataLogDir=/usr/zookeeper/logs
clientPort=2181
server.1=zk1:2888:3888
server.2=zk2:2888:3888
server.3=zk3:2888:3888

# create myid on each node
echo "1" > /usr/zookeeper/data/myid
# on other nodes use 2 and 3 respectively
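Hard-coding the id on each host is error-prone; a small sketch (the function name derive_myid is hypothetical) that derives the id from the node's position in the server list instead:

```shell
# Derive this node's myid from its position in the ordered server list,
# so the same script can run unchanged on zk1, zk2, and zk3.
derive_myid() {
  local host="$1"; shift
  local id=1
  for h in "$@"; do
    if [ "$h" = "$host" ]; then
      echo "$id"
      return 0
    fi
    id=$((id + 1))
  done
  return 1   # host not found in the list
}

# Example: on host zk2 this prints 2
derive_myid zk2 zk1 zk2 zk3
```

In practice you would call it as `derive_myid "$(hostname)" zk1 zk2 zk3 > /usr/zookeeper/data/myid`.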

cd $ZOOKEEPER_HOME/bin
sh zkServer.sh start

2. Kafka Cluster Deployment

mkdir -p /usr/kafka
chmod -R 777 /usr/kafka
wget --no-check-certificate https://mirrors.tuna.tsinghua.edu.cn/apache/kafka/3.2.0/kafka_2.12-3.2.0.tgz
tar -zvxf kafka_2.12-3.2.0.tgz -C /usr/kafka

# broker configuration (example)
broker.id=1
listeners=PLAINTEXT://ip:9092
socket.send.buffer.bytes=102400
socket.receive.buffer.bytes=102400
socket.request.max.bytes=104857600
log.dir=/usr/kafka/logs
num.partitions=5
num.recovery.threads.per.data.dir=3
offsets.topic.replication.factor=2
transaction.state.log.replication.factor=3
transaction.state.log.min.isr=3
log.retention.hours=168
log.segment.bytes=1073741824
log.retention.check.interval.ms=300000
zookeeper.connect=zk1:2181,zk2:2181,zk3:2181
zookeeper.connection.timeout.ms=30000
group.initial.rebalance.delay.ms=0
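Since only broker.id and the listener address differ between nodes, the per-broker file can be stamped out from a template; a minimal sketch (render_broker_config is a hypothetical helper, and the property list is abbreviated to the node-specific entries):

```shell
# Generate a per-broker server.properties; only broker.id and listeners
# vary per node, the rest is shared boilerplate.
render_broker_config() {
  local id="$1" host="$2" out="$3"
  cat > "$out" <<EOF
broker.id=${id}
listeners=PLAINTEXT://${host}:9092
log.dir=/usr/kafka/logs
zookeeper.connect=zk1:2181,zk2:2181,zk3:2181
EOF
}

# Example: render the config for broker 2 on host kafka2
render_broker_config 2 kafka2 /tmp/server-2.properties
grep '^broker.id=' /tmp/server-2.properties
```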

nohup /usr/kafka/kafka_2.12-3.2.0/bin/kafka-server-start.sh /usr/kafka/kafka_2.12-3.2.0/config/server.properties > /usr/kafka/logs/kafka.log 2>&1 &
/usr/kafka/kafka_2.12-3.2.0/bin/kafka-server-stop.sh
$KAFKA_HOME/bin/kafka-topics.sh --list --bootstrap-server ip:9092
$KAFKA_HOME/bin/kafka-console-consumer.sh --bootstrap-server ip:9092 --topic test --from-beginning
$KAFKA_HOME/bin/kafka-topics.sh --create --bootstrap-server ip:9092 --replication-factor 2 --partitions 3 --topic xxx_data

3. FileBeat Deployment

sudo rpm --import https://packages.elastic.co/GPG-KEY-elasticsearch
# create elastic.repo in /etc/yum.repos.d/
[elastic-8.x]
name=Elastic repository for 8.x packages
baseurl=https://artifacts.elastic.co/packages/8.x/yum
gpgcheck=1
gpgkey=https://artifacts.elastic.co/GPG-KEY-elasticsearch
enabled=1
autorefresh=1
type=rpm-md

yum install filebeat
systemctl enable filebeat
chkconfig --add filebeat

# filebeat.yml (excerpt)
filebeat.inputs:
- type: log
  enabled: true
  paths:
    - /root/logs/xxx/inner/*.log
  json:
    keys_under_root: true
output.kafka:
  hosts: ["kafka1:9092", "kafka2:9092", "kafka3:9092"]
  topic: 'xxx_data_clickhouse'
  compression: gzip
processors:
  - drop_fields:
      fields: ["input", "agent", "ecs", "log", "metadata", "timestamp"]

nohup ./filebeat -e -c /etc/filebeat/filebeat.yml > /usr/filebeat/filebeat.log &
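Because keys_under_root expects every log line to be a single JSON object, it is worth sanity-checking a sample line locally before pointing Filebeat at the files; a quick sketch using Python's stdlib json.tool (the sample fields are made up):

```shell
# Verify that an application log line parses as one JSON object per line,
# which is what the json.keys_under_root option assumes.
line='{"log_uuid":"abc-123","event_name":"login","activity_id":7}'
echo "$line" | python3 -m json.tool > /dev/null && echo "valid JSON"
```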

4. ClickHouse Deployment

# Verify SSE4.2 support
grep -q sse4_2 /proc/cpuinfo && echo "SSE 4.2 supported" || echo "SSE 4.2 not supported"

mkdir -p /data/clickhouse
# add host entries for clickhouse nodes
10.190.85.92 bigdata-clickhouse-01
10.190.85.93 bigdata-clickhouse-02

# performance tuning
echo 'performance' | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
echo 0 | tee /proc/sys/vm/overcommit_memory
echo 'never' | tee /sys/kernel/mm/transparent_hugepage/enabled

# install ClickHouse repository
yum install yum-utils
rpm --import
yum-config-manager --add-repo
yum -y install clickhouse-server clickhouse-client

# adjust config.xml (e.g. set the logger level to information)
# logs: /var/log/clickhouse-server/clickhouse-server.log
# error log: /var/log/clickhouse-server/clickhouse-server.err.log

clickhouse-server --version
clickhouse-client --password

sudo clickhouse stop
sudo clickhouse start

ClickHouse Table Creation and Common Issues

Creating a Kafka engine table:

CREATE TABLE default.kafka_clickhouse_inner_log ON CLUSTER clickhouse_cluster (
    log_uuid   String,
    date_partition   UInt32,
    event_name   String,
    activity_name   String,
    activity_type   String,
    activity_id   UInt16
) ENGINE = Kafka SETTINGS
    kafka_broker_list = 'kafka1:9092,kafka2:9092,kafka3:9092',
    kafka_topic_list = 'data_clickhouse',
    kafka_group_name = 'clickhouse_xxx',
    kafka_format = 'JSONEachRow',
    kafka_row_delimiter = '\n',
    kafka_num_consumers = 1;

Issue 1 – direct SELECT on Kafka engine is blocked:

Direct select is not allowed. To enable use setting stream_like_engine_allow_direct_select.(QUERY_NOT_ALLOWED)

Solution:

clickhouse-client --stream_like_engine_allow_direct_select 1 --password xxxxx
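Alternatively, the setting can be enabled for a single query via a SETTINGS clause; note that selecting from a Kafka engine table consumes the messages it reads:

```sql
SELECT *
FROM default.kafka_clickhouse_inner_log
LIMIT 5
SETTINGS stream_like_engine_allow_direct_select = 1;
```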

Issue 2 – creating a local replicated table fails because macro 'shard' is missing.

DB::Exception: No macro 'shard' in config while processing substitutions ... (SYNTAX_ERROR)

Solution: define distinct macros on each node, e.g.:

<macros>
    <shard>01</shard>
    <replica>example01-01-1</replica>
</macros>

Issue 3 – replica already exists error when creating the replicated table.

DB::Exception: Replica ... already exists. (REPLICA_IS_ALREADY_EXIST)

Solution: delete the corresponding Zookeeper node and recreate the table.
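The local table default.bi_inner_log_local referenced by the errors above (and by the distributed table in the next step) is never shown in the article; a sketch assuming ReplicatedMergeTree and mirroring the Kafka table's columns (the Zookeeper path, partition key, and sorting key are assumptions):

```sql
CREATE TABLE default.bi_inner_log_local ON CLUSTER clickhouse_cluster (
    log_uuid   String,
    date_partition   UInt32,
    event_name   String,
    activity_name   String,
    activity_type   String,
    activity_id   UInt16
) ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/bi_inner_log_local', '{replica}')
PARTITION BY date_partition
ORDER BY (event_name, activity_id);
```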

Creating a distributed (cluster) table:

CREATE TABLE default.bi_inner_log_all ON CLUSTER clickhouse_cluster AS default.bi_inner_log_local
ENGINE = Distributed(clickhouse_cluster, default, bi_inner_log_local, xxHash32(log_uuid));

Issue 4 – authentication failure when querying the distributed table.

Authentication failed: password is incorrect or there is no user with such name. (AUTHENTICATION_FAILED)

Solution: add the correct user and password entries for each replica in the cluster configuration:

<remote_servers>
    <clickhouse_cluster>
        <shard>
            <internal_replication>true</internal_replication>
            <replica>
                <host>ip1</host>
                <port>9000</port>
                <user>default</user>
                <password>xxxx</password>
            </replica>
        </shard>
        ...
    </clickhouse_cluster>
</remote_servers>

Creating a materialized view to sync Kafka data to the distributed table:

CREATE MATERIALIZED VIEW default.view_bi_inner_log ON CLUSTER clickhouse_cluster TO default.bi_inner_log_all AS
SELECT
    log_uuid,
    date_partition,
    event_name,
    activity_name,
    activity_type,
    activity_id
FROM default.kafka_clickhouse_inner_log;
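Once the view is attached, new Kafka messages should start landing in the distributed table; a simple sanity check (assuming date_partition holds a day key) is to count recent rows:

```sql
SELECT date_partition, count() AS rows
FROM default.bi_inner_log_all
GROUP BY date_partition
ORDER BY date_partition DESC
LIMIT 7;
```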

After resolving these issues, the end‑to‑end data flow from log generation through Kafka, FileBeat, and ClickHouse works smoothly, providing a cost‑effective, high‑performance analytics pipeline.

Conclusion: when troubleshooting, consult the official documentation or each tool's --help output; working through problems systematically is what builds real expertise.

Tags: performance, big data, data pipeline, deployment, Elasticsearch, ClickHouse, cost analysis
Written by

Code Ape Tech Column

Former Ant Group P8 engineer, pure technologist, sharing full‑stack Java, job interview and career advice through a column. Site: java-family.cn
