
Introduction to ClickHouse and Step‑by‑Step Cluster Deployment Guide

This article provides a comprehensive overview of ClickHouse, covering its columnar OLAP architecture, key features such as data compression, vectorized processing, distributed query handling, and SQL support, followed by detailed step‑by‑step instructions for deploying a multi‑node ClickHouse cluster with MergeTree and ReplicatedMergeTree engines, configuration files, and Java MyBatis integration.

YunZhu Net Technology Team

ClickHouse is a column‑oriented DBMS built for online analytical processing (OLAP). It achieves high performance through aggressive data compression, disk‑based storage that can also leverage SSDs and memory, multi‑core parallel query execution, and native support for distributed query processing across shards and replicas.

Key Features

Data Compression: Uses both generic and type‑specific codecs to reduce storage size while balancing CPU usage.

Disk Storage: Designed to run on traditional disks with a lower cost per GB, yet can exploit SSDs and RAM when available.

Multi‑core Parallelism: Uses all available server resources to process large queries in parallel.

Distributed Processing: Data can be sharded; each shard keeps replicas for fault tolerance, and queries run transparently across all shards.

SQL Support: Implements a declarative SQL dialect covering most familiar constructs (GROUP BY, ORDER BY, JOIN, subqueries in FROM and IN clauses, etc.). Correlated (dependent) subqueries and window functions are not supported.

Vector Engine: Processes data in vectors (column segments) rather than row by row to maximize CPU efficiency.

Real‑time Data Ingestion: Primary‑key‑ordered storage in MergeTree allows continuous, lock‑free incremental inserts.

Indexes: The sparse primary‑key index lets point and range look‑ups on the sort key return within milliseconds.

Online Queries: Low‑latency query execution without pre‑aggregation or other pre‑processing.

Approximate Computations: Offers approximate aggregate functions (e.g. uniq, quantile), sample‑based queries, and aggregation over a random subset of keys for faster results when exact precision is not required.
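As a sketch of the approximate‑computation features (the `hits` table and its columns are hypothetical), the following queries contrast exact and approximate aggregation:

```sql
-- Exact distinct count: accurate but memory-hungry on large tables.
SELECT uniqExact(UserID) FROM hits;

-- Approximate distinct count: uses a fixed-size sketch, far cheaper.
SELECT uniq(UserID) FROM hits;

-- Approximate 95th-percentile of a metric column.
SELECT quantile(0.95)(LoadTimeMs) FROM hits;

-- Aggregate over a 10% sample (requires a SAMPLE BY clause
-- in the table definition), then scale the result back up.
SELECT count() * 10 AS estimated_total FROM hits SAMPLE 0.1;
```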

Cluster Deployment

The guide explains how to set up a three‑node ClickHouse cluster using the MergeTree and ReplicatedMergeTree engines.

MergeTree Engine

MergeTree stores data in ordered parts, supports partitioning, replication, and sampling, and merges parts in the background for efficient inserts.
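A minimal MergeTree table illustrating these capabilities might look as follows (table and column names are illustrative, not taken from the original article):

```sql
-- Monthly partitions, primary-key ordering, and a sampling key.
-- intHash32(user_id) appears in ORDER BY so it can serve as SAMPLE BY.
CREATE TABLE events_local
(
    event_date Date,
    event_time DateTime,
    user_id    UInt64,
    message    String
)
ENGINE = MergeTree()
PARTITION BY toYYYYMM(event_date)
ORDER BY (event_date, intHash32(user_id))
SAMPLE BY intHash32(user_id)
SETTINGS index_granularity = 8192;
```

Inserts land as new sorted parts; background merges combine them, which is what keeps writes cheap.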

ReplicatedMergeTree Engine

Extends MergeTree with built‑in replication at the table level, enabling automatic data synchronization across replicas.
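A replicated variant of the same table could be declared as below; the ZooKeeper path and replica name conventionally come from per‑instance `{shard}` and `{replica}` macros defined in each server's configuration (an assumption here, since the article does not show the macros section):

```sql
CREATE TABLE events_replicated
(
    event_date Date,
    event_time DateTime,
    user_id    UInt64,
    message    String
)
ENGINE = ReplicatedMergeTree(
    '/clickhouse/tables/{shard}/events',  -- ZooKeeper path, shared by all replicas of a shard
    '{replica}'                           -- unique name of this replica
)
PARTITION BY toYYYYMM(event_date)
ORDER BY (event_date, intHash32(user_id));
```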

Test Cluster Planning

Three physical machines (Ubuntu 18.04) are used, each running two ClickHouse instances (ports 9000 and 9001) to form three shards with two replicas each.

Reference: https://clickhouse.tech/docs/zh/introduction/distinctive-features

Configuration Files

Configuration files reside in /etc/clickhouse-server. The base file config.xml is copied to config1.xml for the second instance (port 9001). A shared include file clickhouse_self.xml defines the cluster topology and the ZooKeeper nodes.

<yandex>
  <clickhouse_remote_servers>
    <cluster_3shards_1replicas>
      <shard>
        <internal_replication>true</internal_replication>
        <replica>
          <host>172.16.1.x1</host>
          <port>9000</port>
          <user>default</user>
          <password>admin123</password>
        </replica>
        <replica>
          <host>172.16.1.x2</host>
          <port>9001</port>
          <user>default</user>
          <password>admin123</password>
        </replica>
      </shard>
      ... (additional shards and replicas) ...
    </cluster_3shards_1replicas>
  </clickhouse_remote_servers>
  <zookeeper-servers>
    <node index="1">
      <host>172.16.1.x1</host>
      <port>2181</port>
    </node>
    <node index="2">
      <host>172.16.1.x2</host>
      <port>2181</port>
    </node>
    <node index="3">
      <host>172.16.1.x3</host>
      <port>2181</port>
    </node>
  </zookeeper-servers>
</yandex>

Starting the Cluster

Each ClickHouse instance is then started; if startup fails, check the server error log — file‑ownership and permission problems on the data and log directories are the most common cause.

Viewing Cluster Information

Use ClickHouse client commands to inspect cluster status and configuration.
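For example, the system tables expose the topology and replication state (queries run from clickhouse-client; the cluster name matches the configuration above):

```sql
-- Cluster topology as this server sees it.
SELECT cluster, shard_num, replica_num, host_name, port
FROM system.clusters
WHERE cluster = 'cluster_3shards_1replicas';

-- Replication health of local replicated tables.
SELECT database, table, is_leader, total_replicas, active_replicas
FROM system.replicas;
```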

Creating Replicated and Distributed Tables

Tables are created once on a single node using distributed DDL (ON CLUSTER); ClickHouse propagates the statement to every node in the cluster.
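A sketch of the two-table pattern (a replicated local table plus a Distributed table over it); the `log_event` names echo the LogEvent entity used later with MyBatis, and the `{shard}`/`{replica}` macros are assumed to be configured per instance:

```sql
-- Create the replicated local table on every node in one statement.
CREATE TABLE log_event_local ON CLUSTER cluster_3shards_1replicas
(
    event_date Date,
    user_id    UInt64,
    message    String
)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/log_event', '{replica}')
PARTITION BY toYYYYMM(event_date)
ORDER BY (event_date, user_id);

-- Distributed table that fans queries out to all shards;
-- rand() spreads direct writes evenly across shards.
CREATE TABLE log_event_all ON CLUSTER cluster_3shards_1replicas
AS log_event_local
ENGINE = Distributed(cluster_3shards_1replicas, default, log_event_local, rand());
```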

Java MyBatis Integration

Example steps show how to insert data into a local table on node 1 and query the distributed table using MyBatis. The pom file, MyBatis configuration, LogEvent definition, Mapper interface, and XML mapping are illustrated.

After data is inserted on one node, replication — coordinated through ZooKeeper — synchronizes it to the other replica of the shard, preserving availability even if a node fails.
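The behavior can be checked directly from clickhouse-client before wiring up MyBatis (table names follow the illustrative two-table pattern above):

```sql
-- On node 1: write into the local (replicated) table.
INSERT INTO log_event_local (event_date, user_id, message)
VALUES ('2021-06-01', 42, 'hello clickhouse');

-- On any node: read through the distributed table; the row becomes
-- visible everywhere once replication has caught up.
SELECT * FROM log_event_all WHERE user_id = 42;
```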
