Big Data 10 min read

Design and Optimization Strategies for Querying 100K Records from Tens of Millions Using ClickHouse, Elasticsearch, HBase, and RediSearch

This article examines a business requirement to filter up to 100,000 items from a pool of tens of millions, presenting and evaluating four technical solutions—multithreaded ClickHouse pagination, Elasticsearch scroll‑scan, an ES‑HBase hybrid, and RediSearch + RedisJSON—along with performance data and implementation details.

Architecture Digest
Architecture Digest
Architecture Digest
Design and Optimization Strategies for Querying 100K Records from Tens of Millions Using ClickHouse, Elasticsearch, HBase, and RediSearch

Introduction: the business need is to select no more than 100 K records from a pool of tens of millions, then sort and disperse them according to configurable weight rules.

Initial design proposes four schemes: (1) multithreaded ClickHouse (CK) pagination, (2) Elasticsearch scroll‑scan deep pagination, (3) an ES + HBase combined query, and (4) a RediSearch + RedisJSON combined query.

CK pagination implementation uses multithreading; the core code is shown below:

int pageSize = this.getPageSize();
int pageCnt = totalNum / this.getPageSize() + 1;
List
> result = Lists.newArrayList();
List
>>> futureList = new ArrayList<>(pageCnt);
for (int i = 1; i <= pageCnt; i++) {
    SelectionQueryCondition selectionQueryCondition = buildSelectionQueryCondition(selectionQueryRuleData);
    selectionQueryCondition.setPageSize(pageSize);
    selectionQueryCondition.setPage(i);
    futureList.add(selectionQueryEventPool.submit(new QuerySelectionDataThread(selectionQueryCondition)));
}
for (Future
>> future : futureList) {
    List
> queryRes = future.get(20, TimeUnit.SECONDS);
    if (CollectionUtils.isNotEmpty(queryRes)) {
        result.addAll(queryRes);
    }
}

CK pagination relies on LIMIT #{limitStart}, #{limitEnd} , which leads to performance issues on deep pagination (worst‑case 10–18 s for a 10 M pool).

Elasticsearch scroll‑scan deep pagination is introduced; several ES paging methods are listed (from + size, scroll, scroll‑scan, search‑after). Performance charts show ES excels when the result set is below ~30 K but degrades for larger sets.

The ES + HBase hybrid uses ES solely for filtering (returning only sku_id and doc_id ) and then fetches full fields from HBase via rowKey lookup, achieving 3–6 s worst‑case latency on a 10 M pool.

RediSearch + RedisJSON is described with benchmark data: indexing 5.6 M documents takes 221 s (vs 349 s for ES), throughput reaches 12.5 K ops/s (vs 3.1 K ops/s for ES), and latency is 8 ms (vs 10 ms for ES). Additional read/write tests show RedisJSON outperforms MongoDB and ES by large margins.

Conclusion: the four approaches are compared—multithreaded CK (10–18 s), single‑thread ES scroll‑scan (no clear gain), ES + HBase (3–6 s), and RediSearch + RedisJSON (to be measured). The article highlights trade‑offs between system complexity, storage duplication, and query latency.

ElasticsearchQuery OptimizationClickHouseHBaselarge-scale dataRedisJSONRediSearch
Architecture Digest
Written by

Architecture Digest

Focusing on Java backend development, covering application architecture from top-tier internet companies (high availability, high performance, high stability), big data, machine learning, Java architecture, and other popular fields.

0 followers
Reader feedback

How this landed with the community

login Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.