Design and Optimization Strategies for Querying 100K Records from Tens of Millions Using ClickHouse, Elasticsearch, HBase, and RediSearch
This article examines a business requirement to filter up to 100,000 items from a pool of tens of millions, presenting and evaluating four technical solutions—multithreaded ClickHouse pagination, Elasticsearch scroll‑scan, an ES‑HBase hybrid, and RediSearch + RedisJSON—along with performance data and implementation details.
Introduction: the business need is to select no more than 100 K records from a pool of tens of millions, then sort and disperse them according to configurable weight rules.
Initial design proposes four schemes: (1) multithreaded ClickHouse (CK) pagination, (2) Elasticsearch scroll‑scan deep pagination, (3) an ES + HBase combined query, and (4) a RediSearch + RedisJSON combined query.
CK pagination implementation uses multithreading; the core code is shown below:
int pageSize = this.getPageSize();
int pageCnt = totalNum / this.getPageSize() + 1;
List
> result = Lists.newArrayList();
List
>>> futureList = new ArrayList<>(pageCnt);
for (int i = 1; i <= pageCnt; i++) {
SelectionQueryCondition selectionQueryCondition = buildSelectionQueryCondition(selectionQueryRuleData);
selectionQueryCondition.setPageSize(pageSize);
selectionQueryCondition.setPage(i);
futureList.add(selectionQueryEventPool.submit(new QuerySelectionDataThread(selectionQueryCondition)));
}
for (Future
>> future : futureList) {
List
> queryRes = future.get(20, TimeUnit.SECONDS);
if (CollectionUtils.isNotEmpty(queryRes)) {
result.addAll(queryRes);
}
}CK pagination relies on LIMIT #{limitStart}, #{limitEnd} , which leads to performance issues on deep pagination (worst‑case 10–18 s for a 10 M pool).
Elasticsearch scroll‑scan deep pagination is introduced; several ES paging methods are listed (from + size, scroll, scroll‑scan, search‑after). Performance charts show ES excels when the result set is below ~30 K but degrades for larger sets.
The ES + HBase hybrid uses ES solely for filtering (returning only sku_id and doc_id ) and then fetches full fields from HBase via rowKey lookup, achieving 3–6 s worst‑case latency on a 10 M pool.
RediSearch + RedisJSON is described with benchmark data: indexing 5.6 M documents takes 221 s (vs 349 s for ES), throughput reaches 12.5 K ops/s (vs 3.1 K ops/s for ES), and latency is 8 ms (vs 10 ms for ES). Additional read/write tests show RedisJSON outperforms MongoDB and ES by large margins.
Conclusion: the four approaches are compared—multithreaded CK (10–18 s), single‑thread ES scroll‑scan (no clear gain), ES + HBase (3–6 s), and RediSearch + RedisJSON (to be measured). The article highlights trade‑offs between system complexity, storage duplication, and query latency.
Architecture Digest
Focusing on Java backend development, covering application architecture from top-tier internet companies (high availability, high performance, high stability), big data, machine learning, Java architecture, and other popular fields.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.