
Design and Implementation of JD Daojia Log System Based on Loki

This document details the motivation, architecture, components, query language, and deployment of a Loki‑based log collection and analysis platform for JD Daojia, comparing it with ELK, describing ingestion, real‑time and historical log handling, technical challenges, configuration examples, and future scaling plans.

Dada Group Technology

1. Background

With rapid business growth, the existing ELK‑based log system cannot meet JD Daojia's storage and query requirements. ELK relies on full‑text indexing, causing data size to balloon and consuming excessive compute resources during writes, which is inefficient for write‑heavy, read‑light log workloads. Log collection also requires manual configuration.

After evaluating popular log solutions, Loki was selected as a lightweight alternative to ELK.

2. Loki Log System

2.1 Loki Architecture

Loki is an open‑source, horizontally scalable, highly available, multi‑tenant log aggregation system from Grafana Labs. It stores logs without full‑text indexing, using label‑based indexing for efficient storage and retrieval.

Key components:

loki : the main server that stores logs and handles queries.

promtail : a client tailored for Loki that collects logs and forwards them to the server.

Grafana : UI for visualizing logs (or a custom front‑end can be built).

Components

Distributor : receives log streams from promtail, batches and compresses them (gzip), and forwards each stream to the appropriate ingesters based on a hash of the stream.

Ingester : builds and stores chunks of logs; when a chunk reaches a size or time threshold it is flushed to the backend storage.

Querier : processes read requests, selects matching chunks using label selectors, and merges results from ingesters and the long‑term store.
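The distributor's placement step can be sketched in a few lines. This is only an illustration of the idea (the same stream always lands on the same ingester); real Loki uses a consistent-hash token ring rather than a simple modulo, and the function name here is hypothetical:

```python
import hashlib

# Simplified sketch: deterministically map a log stream (tenant + label set)
# to one of N ingesters. Loki itself uses a token ring, not modulo hashing.
def pick_ingester(tenant_id, labels, ingesters):
    key = tenant_id + ",".join(f"{k}={labels[k]}" for k in sorted(labels))
    digest = int.from_bytes(hashlib.sha256(key.encode()).digest(), "big")
    return ingesters[digest % len(ingesters)]
```

Because the key is built from the sorted label set, every chunk of a given stream accumulates on the same ingester, which is what lets chunks be built contiguously.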

2.2 Loki Read/Write Flow

Write Path

Distributor receives an HTTP request containing a log stream.

The stream is hashed to determine the target ingester.

Distributor forwards the stream to the chosen ingester (and its replicas).

Ingester creates a new chunk or appends to an existing one.

Distributor sends an HTTP response back to the client.
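The HTTP request in step 1 carries a JSON body in the shape Loki's push API (`POST /loki/api/v1/push`) expects: a list of streams, each with its label set and timestamped lines. A minimal sketch of building that body:

```python
import json
import time

# Build the JSON body for Loki's push API (POST /loki/api/v1/push).
# Each stream carries its label set plus [timestamp, line] pairs,
# with timestamps as unix-nanosecond strings.
def build_push_payload(labels, lines):
    ts = str(time.time_ns())
    return json.dumps({
        "streams": [{
            "stream": labels,
            "values": [[ts, line] for line in lines],
        }]
    })
```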

Read Path

Querier receives an HTTP request from Grafana or a custom front‑end.

Querier asks ingesters for in‑memory data.

If ingesters have no data, Querier reads from the long‑term store.

Querier de‑duplicates and merges results, returning them via HTTP.
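The merge-and-deduplicate step exists because replication means the same entry can come back from several ingesters as well as from the store. A minimal sketch, assuming each source yields time-sorted `(timestamp, line)` tuples:

```python
import heapq

# Merge time-sorted (timestamp, line) entries from ingesters and the
# long-term store, dropping duplicates introduced by replication.
def merge_results(*sources):
    merged, seen = [], set()
    for entry in heapq.merge(*sources):
        if entry not in seen:
            seen.add(entry)
            merged.append(entry)
    return merged
```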

2.3 Loki Query Language

Log streams are selected with label selectors inside {} , e.g., {app="mysql",name="mysql-backup"} . Supported operators include = , != , =~ (regex match), and !~ (regex not match).

After selecting streams, filter expressions can be applied:

|= line : include lines containing the string.

!= line : exclude lines containing the string.

|~ line : include lines matching the regex.

!~ line : exclude lines matching the regex.
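The semantics of these four filter operators can be reproduced locally, which is a handy way to reason about a query before running it. A sketch (the function is hypothetical, not part of any Loki client library):

```python
import re

# Local illustration of LogQL line-filter semantics on a list of lines.
def filter_lines(lines, op, pattern):
    if op == "|=":   # include lines containing the string
        return [l for l in lines if pattern in l]
    if op == "!=":   # exclude lines containing the string
        return [l for l in lines if pattern not in l]
    if op == "|~":   # include lines matching the regex
        return [l for l in lines if re.search(pattern, l)]
    if op == "!~":   # exclude lines matching the regex
        return [l for l in lines if not re.search(pattern, l)]
    raise ValueError(f"unknown filter operator: {op}")
```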

3. JD Daojia Application Log System

The operations team built a log analysis platform on top of Loki.

3.1 Architecture

Frontend UI is provided by Grafana; a custom Python/Flask page integrates it into the operations management console. Backend storage uses Cassandra for horizontal scalability. Promtail automatically discovers and collects log files based on user‑defined configurations.

3.2 Log Ingestion

Users select an application and host, then specify log file paths. A SaltStack client writes these paths into Promtail’s configuration on the target host. Promtail’s file‑discovery mechanism picks up new logs and pushes them to the Loki server.
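What the SaltStack step writes into Promtail's file-discovery target file might look like the sketch below. This is a hypothetical rendering helper matching the `logpath.yaml` format shown in section 3.6, not the platform's actual code:

```python
# Hypothetical sketch: render one file_sd target entry for Promtail's
# logpath.yaml, labeling the host and pointing __path__ at the new log file.
def render_target(host, log_label, path):
    return (
        "- targets:\n"
        "    - localhost\n"
        "  labels:\n"
        f"    host: {host}\n"
        f"    log: {log_label}\n"
        f"    __path__: {path}\n"
    )
```

Because Promtail re-reads this file on its `refresh_interval`, appending an entry like this is enough for collection to start without restarting the client.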

3.3 Real‑time Logs

Loki’s API is accessed via WebSocket to stream logs to the UI in real time.
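Loki exposes live tailing at the `/loki/api/v1/tail` endpoint over WebSocket. A small sketch of building the tail URL for a given stream selector (connecting is then a matter of pointing any WebSocket client at it):

```python
from urllib.parse import urlencode

# Build the WebSocket URL for Loki's live-tail endpoint (/loki/api/v1/tail).
# The stream selector goes in the "query" parameter, URL-encoded.
def build_tail_url(loki_host, selector, limit=100):
    params = urlencode({"query": selector, "limit": limit})
    return f"ws://{loki_host}/loki/api/v1/tail?{params}"
```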

3.4 Historical Logs

Users can query historical logs by selecting application, host, and log file, then entering space‑separated keywords for filtering. On the production hardware (a single server with 48 cores, 256 GB of RAM, and 12 × 6 TB SATA disks), queries across multiple logs and keywords over 10 TB of data return results in about 5 seconds.
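Translating the user's space-separated keywords into LogQL amounts to chaining one `|=` filter per keyword onto the stream selector. A hypothetical sketch of that translation:

```python
# Turn a stream selector plus space-separated keywords into a LogQL query
# by chaining one |= (contains) filter per keyword.
def build_keyword_query(selector, keywords):
    return selector + "".join(f' |= "{kw}"' for kw in keywords.split())
```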

3.5 Technical Challenges

File‑watch based log paths allow flexible configuration without restarting clients.

Distributed storage with Cassandra provides horizontal scaling.

WebSocket enables low‑latency real‑time log display.

Promtail clients are packaged into the OS image to simplify deployment on new machines.

3.6 Configuration Example

Loki server configuration:
  auth_enabled: false  # disable authentication of client data sources
  server:
    http_listen_port: 3101  # Loki HTTP port
  ingester:
    lifecycler:
      address: 127.0.0.1   # ingester address; defaults to localhost, multiple addresses may be listed for multiple servers
      ring:
        kvstore:
          store: inmemory  # keep the ingester ring state in memory
        replication_factor: 1
    chunk_idle_period: 5m  # how long a chunk stays in memory without receiving updates
    chunk_retain_period: 30s  # how long a chunk stays in memory after being flushed
  storage_config:
    cassandra:
      addresses: x.x.x.x  # Cassandra IP
      keyspace: lokiindex  # Cassandra keyspace
      auth: false    # disable Cassandra authentication
  schema_config:
    configs:
      - from: 2020-07-01
        store: cassandra   # store data in Cassandra
        object_store: cassandra
        schema: v11
        index:
          prefix: index_  # prefix for index tables
          period: 168h   # each index table covers 7 days
        chunks:
          prefix: chunk_  # prefix for chunk tables
          period: 168h  # each chunk table covers 7 days
  limits_config:
    ingestion_rate_mb: 50  # per-tenant ingestion rate limit, MB per second
    enforce_metric_name: false
    reject_old_samples: true
    reject_old_samples_max_age: 168h
  chunk_store_config:
    max_look_back_period: 168h  # queries can look back at most 7 days
  table_manager:
    retention_deletes_enabled: true  # delete historical data older than retention_period
    retention_period: 168h

Promtail client configuration:
  server:
    http_listen_port: 0  # HTTP port; 0 means pick a random port
    grpc_listen_port: 0  # gRPC port; 0 means pick a random port
  positions:
    filename: /export/servers/promtail/tmp/positions.yaml  # records each collected file's path and read position
  clients:
    - url: http://xx.xx.xxx/loki/api/v1/push
  scrape_configs:
    - job_name: daojia  # identifies this scrape configuration
      file_sd_configs:
        - files:
            - '/export/servers/promtail/logpath.yaml'  # target file listing log paths
          refresh_interval: 10s  # re-read interval; newly added logs are discovered and collected automatically, no client restart needed
      targets:
        - localhost  # host to collect from
      labels:
        host: 1.1.1.1  # label the collected host with host:1.1.1.1
        log: gw  # label the collected host with log:gw
        __path__: /export/servers/nginx/logs/gw.o2o.jd.local/gw.o2o.jd.local_access.log  # log path to collect

4. Summary and Planning

More than 1,000 application servers now feed logs into the Loki‑based platform, consuming only 1.4 TB per day compared with the ~30 TB required by the previous Elasticsearch solution, dramatically reducing hardware costs. Automated log discovery based on front‑end naming conventions has also cut manual operational effort.

The entire platform runs on a single 48‑core, 256 GB RAM, 12 × 6 TB SATA machine. Future scaling will involve adding Loki nodes and expanding the Cassandra cluster to handle increased client volume, while the front‑end will evolve to provide richer analysis and visualization of search results.
