Build a Scalable, Cost‑Effective Log Retrieval System Without Elasticsearch
This article explains how to design a high‑performance, low‑cost log retrieval architecture that avoids Elasticsearch by partitioning logs into time‑based chunks, indexing only metadata, using multi‑tier storage (local, remote, archive), and orchestrating queries through GD‑Search, Local‑Search, Remote‑Search and Log‑Manager components.
Background
Logs are the primary way to observe services: they are essential for understanding runtime status, reviewing historical behavior, and diagnosing errors. With the rise of microservices, a dedicated log service is needed for collection, transmission, and retrieval; the open‑source ELK stack is a common solution.
Requirement Scenario
Peak write pressure of tens of millions of log entries per second.
Real‑time requirement: logs must be searchable within 1 second (3 seconds at peak).
Cost pressure: retain half a year of logs at PB scale.
Elasticsearch Shortcomings
Write performance: Updating inverted indexes for each log entry creates a bottleneck under massive write loads.
Operational cost: Maintaining indexes, shards, and caches consumes significant CPU, memory, and disk space; index bloat further raises costs.
Unstructured log support: Non‑standard logs require extra parsing logic to build indexes.
Because of these limitations, a pure Elasticsearch solution would need a cluster with tens of thousands of cores and still struggle with write and query efficiency.
Log Retrieval Design
The design addresses the above challenges with three key ideas:
1. Log Chunking
Logs are written to files grouped by instance, type, time, and level. No parsing or indexing is performed on the raw log text. Chunking eliminates heavy indexing overhead and allows write speed to be limited only by disk I/O.
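The grouping rule can be sketched as a pure function from a log record's attributes to a chunk file path. This is a minimal illustration, assuming a 5‑minute time window and a flat directory layout; the article does not specify the exact scheme.

```python
from datetime import datetime, timezone

def chunk_path(instance: str, log_type: str, level: str, ts: datetime,
               window_minutes: int = 5) -> str:
    """Map a log record to its chunk file.

    Logs are grouped by instance, type, level, and a fixed time window;
    raw text is appended as-is, so writes are pure sequential disk I/O.
    The 5-minute window and path layout are illustrative assumptions.
    """
    bucket = ts.replace(minute=ts.minute - ts.minute % window_minutes,
                        second=0, microsecond=0)
    return (f"{instance}/{log_type}/{level}/"
            f"{bucket.strftime('%Y%m%d-%H%M')}.chunk")

ts = datetime(2023, 5, 1, 10, 37, 12, tzinfo=timezone.utc)
print(chunk_path("svc-a-pod-3", "access", "INFO", ts))
# svc-a-pod-3/access/INFO/20230501-1035.chunk
```

Because the write path never parses or tokenizes the log text, adding a field to the log format requires no schema change on the storage side.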
2. Metadata Index
When a log chunk is created, its metadata (service name, timestamp, instance, log type, etc.) is stored in a lightweight index (Chunk Index). Queries first locate relevant chunks via this metadata, then retrieve the raw logs directly.
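A chunk's index entry is small and structured, so locating candidates is a cheap metadata filter rather than a full‑text lookup. The field names below are illustrative, not the system's actual schema:

```python
from dataclasses import dataclass

@dataclass
class ChunkMeta:
    service: str
    instance: str
    log_type: str
    start_ts: int   # epoch seconds of the first log in the chunk
    end_ts: int     # epoch seconds of the last log in the chunk
    path: str       # location of the raw chunk file
    size_bytes: int

def candidate_chunks(index, service, t_from, t_to):
    """Return chunks whose time range overlaps the query window."""
    return [m for m in index
            if m.service == service
            and m.start_ts <= t_to and m.end_ts >= t_from]

index = [
    ChunkMeta("svc-a", "pod-1", "access", 100, 400, "/data/c1", 1 << 20),
    ChunkMeta("svc-a", "pod-1", "access", 400, 700, "/data/c2", 1 << 20),
    ChunkMeta("svc-b", "pod-9", "access", 100, 400, "/data/c3", 1 << 20),
]
print([m.path for m in candidate_chunks(index, "svc-a", 350, 450)])
# ['/data/c1', '/data/c2']
```

Only the chunks that survive this filter are scanned, which is what keeps query cost proportional to the queried time window rather than total data volume.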
3. Log Lifecycle & Data Sinking
Logs follow a three‑tier storage hierarchy:
Local storage (NVMe SSD) – real‑time and short‑term queries (hours).
Remote storage (object storage) – medium‑term queries (days‑weeks).
Archive storage – long‑term queries (months‑years).
Chunks are first written to local disks, then compressed and moved to remote storage, and finally archived. Compression ratios of ~10:1 reduce storage cost dramatically.
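The sinking policy can be sketched as a function of chunk age plus a compression step before upload. The retention windows below are assumptions chosen to match the hours/days‑weeks/months‑years tiers above, not the system's actual configuration:

```python
import gzip
import time

LOCAL_RETENTION_S  = 6 * 3600    # hours on local NVMe (assumed)
REMOTE_RETENTION_S = 14 * 86400  # days-weeks on object storage (assumed)

def tier_for(chunk_end_ts: int, now: int) -> str:
    """Decide which storage tier a chunk belongs to, by age."""
    age = now - chunk_end_ts
    if age < LOCAL_RETENTION_S:
        return "local"
    if age < REMOTE_RETENTION_S:
        return "remote"
    return "archive"

def compress_chunk(raw: bytes) -> bytes:
    """Chunks are compressed before leaving local disk; repetitive
    log text commonly compresses around 10:1. gzip stands in here
    for whatever codec the real system uses."""
    return gzip.compress(raw)

now = int(time.time())
print(tier_for(now - 60, now),
      tier_for(now - 7 * 86400, now),
      tier_for(now - 90 * 86400, now))
# local remote archive
```

In practice the move to remote storage is also triggered by local disk pressure, not only by age, as the Log‑Manager section below describes.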
Log Retrieval Architecture
The service consists of several stateless modules:
GD‑Search: query scheduler that parses and optimizes queries, determines the range of chunks from the Chunk Index, and generates a distributed query plan.
Local‑Search: executes queries on chunks located in local storage.
Remote‑Search: fetches required chunks from remote storage, decompresses them locally, and then performs the same search as Local‑Search.
Log‑Manager: manages the lifecycle of local chunks, compressing and uploading them when disk pressure or retention limits are reached.
Log‑Ingester: subscribes to Kafka, splits incoming logs by time and metadata, writes them to appropriate chunks, and updates the Chunk Index.
Chunk Index: stores chunk metadata; implemented with Redis for fast in‑memory lookups.
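The core of GD‑Search's planning step is routing each candidate chunk to the executor that can reach it. A minimal sketch, assuming each index entry carries a tier tag (the field names are illustrative):

```python
def plan_query(chunks):
    """GD-Search-style planning: split candidate chunks by storage
    tier so Local-Search scans local chunks in place while
    Remote-Search fetches and decompresses remote/archived ones."""
    plan = {"local": [], "remote": []}
    for c in chunks:
        target = "local" if c["tier"] == "local" else "remote"
        plan[target].append(c["path"])
    return plan

chunks = [
    {"path": "/nvme/c1", "tier": "local"},
    {"path": "s3://logs/c2.gz", "tier": "remote"},
    {"path": "s3://logs/c3.gz", "tier": "remote"},
]
print(plan_query(chunks))
# {'local': ['/nvme/c1'], 'remote': ['s3://logs/c2.gz', 's3://logs/c3.gz']}
```

Because every module is stateless, each sub‑plan can be fanned out to any available Local‑Search or Remote‑Search worker.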
Retrieval Strategy
Users can set a limit on the number of log lines returned; the service stops scanning once the limit is satisfied. GD‑Search also checks the total size of candidate chunks and rejects queries that would exceed a predefined threshold.
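Both safeguards fit in a few lines: a pre‑scan size check that rejects oversized queries, and an early exit once enough lines have matched. The threshold value and chunk fields below are assumptions for illustration:

```python
MAX_SCAN_BYTES = 10 << 30   # reject queries touching >10 GiB (assumed threshold)

def search(chunks, match, limit):
    """Scan chunks in order, stopping as soon as `limit` lines match.
    Before scanning anything, reject queries whose candidate chunks
    are too large in total, as GD-Search does."""
    if sum(c["size"] for c in chunks) > MAX_SCAN_BYTES:
        raise ValueError("query touches too much data; narrow the time range")
    hits = []
    for c in chunks:
        for line in c["lines"]:
            if match(line):
                hits.append(line)
                if len(hits) >= limit:
                    return hits      # early exit: remaining chunks untouched
    return hits

chunks = [{"size": 1024, "lines": ["ok", "ERROR x", "ERROR y", "ERROR z"]}]
print(search(chunks, lambda l: "ERROR" in l, limit=2))
# ['ERROR x', 'ERROR y']
```

The early exit matters most for "show me the latest N errors" queries, which dominate interactive debugging and rarely need to scan more than a few chunks.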
Performance Overview
Write: a single core can handle ~20,000 logs/s; distributed scaling provides virtually unlimited throughput.
Query: 1 TB of logs on local storage can be searched within 3 seconds; the same amount on remote storage takes about 10 seconds.
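A quick back‑of‑the‑envelope check connects the per‑core write figure to the cluster sizes discussed next. Assuming a peak of 30 million logs/s (a stand‑in for "tens of millions") and a 1.5x headroom factor, which is my assumption rather than the article's:

```python
import math

LOGS_PER_CORE_PER_S = 20_000   # single-core write throughput from the text

def cores_needed(peak_logs_per_s: int, headroom: float = 1.5) -> int:
    """Back-of-the-envelope capacity sizing for the write path."""
    return math.ceil(peak_logs_per_s * headroom / LOGS_PER_CORE_PER_S)

print(cores_needed(30_000_000))
# 2250
```

A few thousand cores for the write path is consistent with the cost comparison below, versus the tens of thousands estimated for a pure Elasticsearch deployment.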
Cost Advantages
Because no full‑text index is built, only a few thousand cores are needed to sustain tens of millions of writes per second and support hundreds of QPS queries. Storage cost is reduced by using cheap archive storage for cold data and by achieving a 10:1 compression ratio compared to Elasticsearch index bloat.
(Author: Zuoyebang Infrastructure Team – Lü Yalin, Mo Renpeng)