Big Data 10 min read

Design and Optimization Strategies for Querying 100K Records from Tens of Millions Using ClickHouse, Elasticsearch, HBase, and RediSearch

This article examines a business requirement to filter up to 100,000 items from a pool of tens of millions, presenting and evaluating four technical solutions—multithreaded ClickHouse pagination, Elasticsearch scroll‑scan, an ES‑HBase hybrid, and RediSearch + RedisJSON—along with performance data and implementation details.

Architecture Digest

Jun 7, 2022

Design and Optimization Strategies for Querying 100K Records from Tens of Millions Using ClickHouse, Elasticsearch, HBase, and RediSearch

Introduction: the business need is to select no more than 100 K records from a pool of tens of millions, then sort and disperse them according to configurable weight rules.

Initial design proposes four schemes: (1) multithreaded ClickHouse (CK) pagination, (2) Elasticsearch scroll‑scan deep pagination, (3) an ES + HBase combined query, and (4) a RediSearch + RedisJSON combined query.

CK pagination implementation uses multithreading; the core code is shown below:

int pageSize = this.getPageSize();
int pageCnt = totalNum / this.getPageSize() + 1;
List<Map<String,Object>> result = Lists.newArrayList();
List<Future<List<Map<String,Object>>>> futureList = new ArrayList<>(pageCnt);
for (int i = 1; i <= pageCnt; i++) {
    SelectionQueryCondition selectionQueryCondition = buildSelectionQueryCondition(selectionQueryRuleData);
    selectionQueryCondition.setPageSize(pageSize);
    selectionQueryCondition.setPage(i);
    futureList.add(selectionQueryEventPool.submit(new QuerySelectionDataThread(selectionQueryCondition)));
}
for (Future<List<Map<String,Object>>> future : futureList) {
    List<Map<String,Object>> queryRes = future.get(20, TimeUnit.SECONDS);
    if (CollectionUtils.isNotEmpty(queryRes)) {
        result.addAll(queryRes);
    }
}

CK pagination relies on LIMIT #{limitStart}, #{limitEnd}, which leads to performance issues on deep pagination (worst‑case 10–18 s for a 10 M pool).

Elasticsearch scroll‑scan deep pagination is introduced; several ES paging methods are listed (from + size, scroll, scroll‑scan, search‑after). Performance charts show ES excels when the result set is below ~30 K but degrades for larger sets.

The ES + HBase hybrid uses ES solely for filtering (returning only sku_id and doc_id) and then fetches full fields from HBase via rowKey lookup, achieving 3–6 s worst‑case latency on a 10 M pool.

RediSearch + RedisJSON is described with benchmark data: indexing 5.6 M documents takes 221 s (vs 349 s for ES), throughput reaches 12.5 K ops/s (vs 3.1 K ops/s for ES), and latency is 8 ms (vs 10 ms for ES). Additional read/write tests show RedisJSON outperforms MongoDB and ES by large margins.

Conclusion: the four approaches are compared—multithreaded CK (10–18 s), single‑thread ES scroll‑scan (no clear gain), ES + HBase (3–6 s), and RediSearch + RedisJSON (to be measured). The article highlights trade‑offs between system complexity, storage duplication, and query latency.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

query optimization HBase large‑scale data RedisJSON RediSearch

Written by

Architecture Digest

Focusing on Java backend development, covering application architecture from top-tier internet companies (high availability, high performance, high stability), big data, machine learning, Java architecture, and other popular fields.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.