Build a Scalable, Cost‑Effective Log Retrieval System Without Elasticsearch
This article explains how to design a high‑performance, low‑cost log retrieval architecture that avoids Elasticsearch by partitioning logs into time‑based chunks, indexing only metadata, using multi‑tier storage (local, remote, archive), and orchestrating queries through GD‑Search, Local‑Search, Remote‑Search and Log‑Manager components.
Background
Logs are the primary way to observe services: they are essential for understanding runtime status, reviewing historical behavior, and diagnosing errors. With the rise of microservices, a dedicated log service is needed for collection, transmission, and retrieval; the open‑source ELK stack is a common solution.
Requirement Scenario
Peak write pressure of tens of millions of log entries per second.
Real‑time requirement: logs must be searchable within 1 second (3 seconds at peak).
Cost pressure: retain half a year of logs at PB scale.
Elasticsearch Shortcomings
Write performance: Updating inverted indexes for each log entry creates a bottleneck under massive write loads.
Operational cost: Maintaining indexes, shards, and caches consumes significant CPU, memory, and disk space; index bloat further raises costs.
Unstructured log support: Non‑standard logs require extra parsing logic to build indexes.
Because of these limitations, a pure Elasticsearch solution would need a cluster with tens of thousands of cores and still struggle with write and query efficiency.
Log Retrieval Design
The design addresses the above challenges with three key ideas:
1. Log Chunking
Logs are written to files grouped by instance, type, time, and level. No parsing or indexing is performed on the raw log text. Chunking eliminates heavy indexing overhead and allows write speed to be limited only by disk I/O.
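The grouping rule can be sketched as a pure function from a log record's attributes to a chunk file path. This is a minimal illustration, assuming a 5‑minute time window and a flat directory layout; the article does not specify the exact scheme.

```python
from datetime import datetime, timezone

def chunk_path(instance: str, log_type: str, level: str, ts: datetime,
               window_minutes: int = 5) -> str:
    """Map a log record to its chunk file.

    Logs are grouped by instance, type, level, and a fixed time window;
    raw text is appended as-is, so writes are pure sequential disk I/O.
    The 5-minute window and path layout are illustrative assumptions.
    """
    bucket = ts.replace(minute=ts.minute - ts.minute % window_minutes,
                        second=0, microsecond=0)
    return (f"{instance}/{log_type}/{level}/"
            f"{bucket.strftime('%Y%m%d-%H%M')}.chunk")

ts = datetime(2023, 5, 1, 10, 37, 12, tzinfo=timezone.utc)
print(chunk_path("svc-a-pod-3", "access", "INFO", ts))
# svc-a-pod-3/access/INFO/20230501-1035.chunk
```

Because the write path never parses or tokenizes the log text, adding a field to the log format requires no schema change on the storage side.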
2. Metadata Index
When a log chunk is created, its metadata (service name, timestamp, instance, log type, etc.) is stored in a lightweight index (Chunk Index). Queries first locate relevant chunks via this metadata, then retrieve the raw logs directly.
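A chunk's index entry is small and structured, so locating candidates is a cheap metadata filter rather than a full‑text lookup. The field names below are illustrative, not the system's actual schema:

```python
from dataclasses import dataclass

@dataclass
class ChunkMeta:
    service: str
    instance: str
    log_type: str
    start_ts: int   # epoch seconds of the first log in the chunk
    end_ts: int     # epoch seconds of the last log in the chunk
    path: str       # location of the raw chunk file
    size_bytes: int

def candidate_chunks(index, service, t_from, t_to):
    """Return chunks whose time range overlaps the query window."""
    return [m for m in index
            if m.service == service
            and m.start_ts <= t_to and m.end_ts >= t_from]

index = [
    ChunkMeta("svc-a", "pod-1", "access", 100, 400, "/data/c1", 1 << 20),
    ChunkMeta("svc-a", "pod-1", "access", 400, 700, "/data/c2", 1 << 20),
    ChunkMeta("svc-b", "pod-9", "access", 100, 400, "/data/c3", 1 << 20),
]
print([m.path for m in candidate_chunks(index, "svc-a", 350, 450)])
# ['/data/c1', '/data/c2']
```

Only the chunks that survive this filter are scanned, which is what keeps query cost proportional to the queried time window rather than total data volume.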
3. Log Lifecycle & Data Sinking
Logs follow a three‑tier storage hierarchy:
Local storage (NVMe SSD) – real‑time and short‑term queries (hours).
Remote storage (object storage) – medium‑term queries (days‑weeks).
Archive storage – long‑term queries (months‑years).
Chunks are first written to local disks, then compressed and moved to remote storage, and finally archived. Compression ratios of ~10:1 reduce storage cost dramatically.
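The sinking policy can be sketched as a function of chunk age plus a compression step before upload. The retention windows below are assumptions chosen to match the hours/days‑weeks/months‑years tiers above, not the system's actual configuration:

```python
import gzip
import time

LOCAL_RETENTION_S  = 6 * 3600    # hours on local NVMe (assumed)
REMOTE_RETENTION_S = 14 * 86400  # days-weeks on object storage (assumed)

def tier_for(chunk_end_ts: int, now: int) -> str:
    """Decide which storage tier a chunk belongs to, by age."""
    age = now - chunk_end_ts
    if age < LOCAL_RETENTION_S:
        return "local"
    if age < REMOTE_RETENTION_S:
        return "remote"
    return "archive"

def compress_chunk(raw: bytes) -> bytes:
    """Chunks are compressed before leaving local disk; repetitive
    log text commonly compresses around 10:1. gzip stands in here
    for whatever codec the real system uses."""
    return gzip.compress(raw)

now = int(time.time())
print(tier_for(now - 60, now),
      tier_for(now - 7 * 86400, now),
      tier_for(now - 90 * 86400, now))
# local remote archive
```

In practice the move to remote storage is also triggered by local disk pressure, not only by age, as the Log‑Manager section below describes.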
Log Retrieval Architecture
The service consists of several stateless modules:
GD‑Search: query scheduler that parses and optimizes queries, determines the range of chunks from the Chunk Index, and generates a distributed query plan.
Local‑Search: executes queries on chunks located in local storage.
Remote‑Search: fetches required chunks from remote storage, decompresses them locally, and then performs the same search as Local‑Search.
Log‑Manager: manages the lifecycle of local chunks, compressing and uploading them when disk pressure or retention limits are reached.
Log‑Ingester: subscribes to Kafka, splits incoming logs by time and metadata, writes them to appropriate chunks, and updates the Chunk Index.
Chunk Index: stores chunk metadata; implemented with Redis for fast in‑memory lookups.
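The core of GD‑Search's planning step is routing each candidate chunk to the executor that can reach it. A minimal sketch, assuming each index entry carries a tier tag (the field names are illustrative):

```python
def plan_query(chunks):
    """GD-Search-style planning: split candidate chunks by storage
    tier so Local-Search scans local chunks in place while
    Remote-Search fetches and decompresses remote/archived ones."""
    plan = {"local": [], "remote": []}
    for c in chunks:
        target = "local" if c["tier"] == "local" else "remote"
        plan[target].append(c["path"])
    return plan

chunks = [
    {"path": "/nvme/c1", "tier": "local"},
    {"path": "s3://logs/c2.gz", "tier": "remote"},
    {"path": "s3://logs/c3.gz", "tier": "remote"},
]
print(plan_query(chunks))
# {'local': ['/nvme/c1'], 'remote': ['s3://logs/c2.gz', 's3://logs/c3.gz']}
```

Because every module is stateless, each sub‑plan can be fanned out to any available Local‑Search or Remote‑Search worker.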
Retrieval Strategy
Users can set a limit on the number of log lines returned; the service stops scanning once the limit is satisfied. GD‑Search also checks the total size of candidate chunks and rejects queries that would exceed a predefined threshold.
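Both safeguards fit in a few lines: a pre‑scan size check that rejects oversized queries, and an early exit once enough lines have matched. The threshold value and chunk fields below are assumptions for illustration:

```python
MAX_SCAN_BYTES = 10 << 30   # reject queries touching >10 GiB (assumed threshold)

def search(chunks, match, limit):
    """Scan chunks in order, stopping as soon as `limit` lines match.
    Before scanning anything, reject queries whose candidate chunks
    are too large in total, as GD-Search does."""
    if sum(c["size"] for c in chunks) > MAX_SCAN_BYTES:
        raise ValueError("query touches too much data; narrow the time range")
    hits = []
    for c in chunks:
        for line in c["lines"]:
            if match(line):
                hits.append(line)
                if len(hits) >= limit:
                    return hits      # early exit: remaining chunks untouched
    return hits

chunks = [{"size": 1024, "lines": ["ok", "ERROR x", "ERROR y", "ERROR z"]}]
print(search(chunks, lambda l: "ERROR" in l, limit=2))
# ['ERROR x', 'ERROR y']
```

The early exit matters most for "show me the latest N errors" queries, which dominate interactive debugging and rarely need to scan more than a few chunks.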
Performance Overview
Write: a single core can handle ~20,000 logs/s; distributed scaling provides virtually unlimited throughput.
Query: 1 TB of logs on local storage can be searched within 3 seconds; the same amount on remote storage takes about 10 seconds.
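A quick back‑of‑the‑envelope check connects the per‑core write figure to the cluster sizes discussed next. Assuming a peak of 30 million logs/s (a stand‑in for "tens of millions") and a 1.5x headroom factor, which is my assumption rather than the article's:

```python
import math

LOGS_PER_CORE_PER_S = 20_000   # single-core write throughput from the text

def cores_needed(peak_logs_per_s: int, headroom: float = 1.5) -> int:
    """Back-of-the-envelope capacity sizing for the write path."""
    return math.ceil(peak_logs_per_s * headroom / LOGS_PER_CORE_PER_S)

print(cores_needed(30_000_000))
# 2250
```

A few thousand cores for the write path is consistent with the cost comparison below, versus the tens of thousands estimated for a pure Elasticsearch deployment.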
Cost Advantages
Because no full‑text index is built, only a few thousand cores are needed to sustain tens of millions of writes per second and support hundreds of QPS queries. Storage cost is reduced by using cheap archive storage for cold data and by achieving a 10:1 compression ratio compared to Elasticsearch index bloat.
(Author: Zuoyebang Infrastructure Team – Lü Yalin, Mo Renpeng)