Design and Implementation of JD Daojia Log System Based on Loki
This document details the motivation, architecture, components, query language, and deployment of a Loki‑based log collection and analysis platform for JD Daojia, comparing it with ELK, describing ingestion, real‑time and historical log handling, technical challenges, configuration examples, and future scaling plans.
1. Background
With rapid business growth, the existing ELK‑based log system cannot meet JD Daojia's storage and query requirements. ELK relies on full‑text indexing, causing data size to balloon and consuming excessive compute resources during writes, which is inefficient for write‑heavy, read‑light log workloads. Log collection also requires manual configuration.
After evaluating popular log solutions, Loki was selected as a lightweight alternative to ELK.
2. Loki Log System
2.1 Loki Architecture
Loki is an open‑source, horizontally scalable, highly available, multi‑tenant log aggregation system from Grafana Labs. It stores logs without full‑text indexing, using label‑based indexing for efficient storage and retrieval.
Key components:
loki: the main server that stores logs and handles queries.
promtail: the agent tailored for Loki that collects logs and forwards them to the server.
Grafana: the UI for visualizing logs (a custom front-end can also be built).
Components
Distributor : receives log streams from promtail, batches and compresses them (gzip) before forwarding to appropriate ingesters based on a hash algorithm.
Ingester : builds and stores chunks of logs; when a chunk reaches a size or time threshold it is flushed to the backend storage.
Querier : processes read requests, selects matching chunks using label selectors, and merges results from ingesters and the long‑term store.
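The distributor's hash-based routing can be illustrated with a minimal sketch. This is not Loki's actual implementation (Loki uses a hash ring with tokens); it only shows the idea of hashing a stream's label set to pick a target ingester deterministically:

```python
# Sketch: route a log stream to an ingester by hashing its label set.
# Not Loki's real algorithm -- an illustration of deterministic routing.
import hashlib

def route_stream(labels: dict, ingesters: list) -> str:
    """Pick an ingester for the stream identified by this label set."""
    # Canonicalize labels so {a=1,b=2} and {b=2,a=1} hash identically.
    key = ",".join(f"{k}={v}" for k, v in sorted(labels.items()))
    digest = hashlib.sha256(key.encode()).hexdigest()
    return ingesters[int(digest, 16) % len(ingesters)]

ingesters = ["ingester-0", "ingester-1", "ingester-2"]
target = route_stream({"app": "mysql", "name": "mysql-backup"}, ingesters)
```

Because the hash depends only on the label set, all lines of one stream land on the same ingester, which is what allows that ingester to build contiguous chunks for the stream.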
2.2 Loki Read/Write Flow
Write Path
Distributor receives an HTTP request containing a log stream.
The stream is hashed to determine the target ingester.
Distributor forwards the stream to the chosen ingester (and its replicas).
Ingester creates a new chunk or appends to an existing one.
Distributor sends an HTTP response back to the client.
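The write path starts with an HTTP push from the client. The sketch below builds the JSON body that Loki's `/loki/api/v1/push` endpoint accepts (a list of streams, each with a label set and nanosecond-timestamped lines); the labels and target host here are placeholders:

```python
# Sketch: build the JSON body for Loki's /loki/api/v1/push endpoint.
import json
import time

def build_push_payload(labels: dict, lines: list) -> bytes:
    ts_ns = str(time.time_ns())  # Loki expects nanosecond timestamps as strings
    body = {
        "streams": [
            {
                "stream": labels,  # the stream's label set
                "values": [[ts_ns, line] for line in lines],
            }
        ]
    }
    return json.dumps(body).encode()

payload = build_push_payload({"host": "1.1.1.1", "log": "gw"}, ["GET /health 200"])
# This body would be POSTed with Content-Type: application/json,
# e.g. to http://<loki-server>:3101/loki/api/v1/push
```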
Read Path
Querier receives an HTTP request from Grafana or a custom front‑end.
Querier asks ingesters for in‑memory data.
If ingesters have no data, Querier reads from the long‑term store.
Querier de‑duplicates and merges results, returning them via HTTP.
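The read path can be exercised through Loki's HTTP range-query endpoint, `/loki/api/v1/query_range`. This sketch only constructs the request URL; the host and label values are placeholders:

```python
# Sketch: build a URL for Loki's /loki/api/v1/query_range endpoint.
from urllib.parse import urlencode

def build_query_range_url(base: str, logql: str, start_ns: int, end_ns: int,
                          limit: int = 1000) -> str:
    params = urlencode({
        "query": logql,     # LogQL selector plus filter expressions
        "start": start_ns,  # nanosecond Unix timestamps
        "end": end_ns,
        "limit": limit,
    })
    return f"{base}/loki/api/v1/query_range?{params}"

url = build_query_range_url("http://127.0.0.1:3101",
                            '{host="1.1.1.1",log="gw"} |= "ERROR"', 0, 1)
```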
2.3 Loki Query Language
Log streams are selected with label selectors inside {} , e.g., {app="mysql",name="mysql-backup"} . Supported operators include = , != , =~ (regex match), and !~ (regex not match).
After selecting streams, line filter expressions can be applied:
|= "text" : include lines containing the string.
!= "text" : exclude lines containing the string.
|~ "regex" : include lines matching the regex.
!~ "regex" : exclude lines matching the regex.
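Putting selectors and filters together, a few queries in the style above (label values are illustrative):

```logql
{host="1.1.1.1", log="gw"} |= "ERROR"      # lines containing "ERROR"
{app="mysql"} |= "timeout" != "retry"      # contain "timeout" but not "retry"
{app=~"gw.*"} |~ "5[0-9]{2}"               # regex stream selector, regex line filter
```

Filters chain left to right, so each stage further narrows the result set.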
3. JD Daojia Application Log System
The operations team built a log analysis platform on top of Loki.
3.1 Architecture
Frontend UI is provided by Grafana; a custom Python/Flask page integrates it into the operations management console. Backend storage uses Cassandra for horizontal scalability. Promtail automatically discovers and collects log files based on user‑defined configurations.
3.2 Log Ingestion
Users select an application and host, then specify log file paths. A SaltStack client writes these paths into Promtail’s configuration on the target host. Promtail’s file‑discovery mechanism picks up new logs and pushes them to the Loki server.
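The SaltStack step above amounts to rendering a new file_sd target entry into Promtail's discovery file. The sketch below is hypothetical (the function name and file layout are assumptions); only the target format (`targets` + `labels` + `__path__`) follows Promtail's file_sd conventions:

```python
# Hypothetical sketch of the entry the SaltStack client appends to
# promtail's discovery file for a user-submitted log path.
def render_logpath_entry(host: str, log_name: str, path: str) -> str:
    """Render one file_sd target entry as YAML text."""
    return (
        "- targets:\n"
        "    - localhost\n"
        "  labels:\n"
        f"    host: {host}\n"
        f"    log: {log_name}\n"
        f"    __path__: {path}\n"
    )

entry = render_logpath_entry("1.1.1.1", "gw",
                             "/export/servers/nginx/logs/access.log")
# Appending this to /export/servers/promtail/logpath.yaml is enough:
# promtail's refresh_interval picks it up without a restart.
```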
3.3 Real‑time Logs
Loki’s API is accessed via WebSocket to stream logs to the UI in real time.
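The WebSocket endpoint in question is `/loki/api/v1/tail`, which streams newly arriving lines for a selector. This sketch only builds the `ws://` URL the UI would connect to; the host and labels are placeholders:

```python
# Sketch: build the WebSocket URL for Loki's /loki/api/v1/tail endpoint.
from urllib.parse import urlencode

def build_tail_url(host: str, logql: str, limit: int = 100) -> str:
    params = urlencode({"query": logql, "limit": limit})
    return f"ws://{host}/loki/api/v1/tail?{params}"

url = build_tail_url("127.0.0.1:3101", '{host="1.1.1.1",log="gw"}')
```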
3.4 Historical Logs
Users can query historical logs by selecting application, host, and log file, then entering space‑separated keywords for filtering. In production (1×48‑core, 256 GB RAM, 12 × 6 TB SATA), 10 TB of data can be queried across multiple logs and keywords with results returned in about 5 seconds.
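The space-separated keywords from the UI map naturally onto chained LogQL line filters: each keyword becomes one `|=` stage, so a matching line must contain every keyword. A sketch (the selector labels are placeholders):

```python
# Sketch: turn space-separated UI keywords into chained LogQL line filters.
def keywords_to_logql(selector: str, keywords: str) -> str:
    query = selector
    for kw in keywords.split():
        query += f' |= "{kw}"'  # each keyword must appear in a matching line
    return query

q = keywords_to_logql('{host="1.1.1.1",log="gw"}', "ERROR timeout")
# → {host="1.1.1.1",log="gw"} |= "ERROR" |= "timeout"
```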
3.5 Technical Challenges
File‑watch based log paths allow flexible configuration without restarting clients.
Distributed storage with Cassandra provides horizontal scaling.
WebSocket enables low‑latency real‑time log display.
Promtail clients are packaged into the OS image to simplify deployment on new machines.
3.6 Configuration Example
Loki server configuration:

auth_enabled: false  # disable authentication of client requests
server:
  http_listen_port: 3101  # Loki HTTP port
ingester:
  lifecycler:
    address: 127.0.0.1  # ingester address; defaults to localhost, multiple servers can be listed
    ring:
      kvstore:
        store: inmemory  # use memory as the ingester ring store
      replication_factor: 1
  chunk_idle_period: 5m  # how long a chunk stays in memory without receiving updates
  chunk_retain_period: 30s  # how long to keep a chunk in memory after it is flushed
storage_config:
  cassandra:
    addresses: x.x.x.x  # Cassandra IP
    keyspace: lokiindex  # Cassandra keyspace
    auth: false  # disable Cassandra authentication
schema_config:
  configs:
    - from: 2020-07-01
      store: cassandra  # store the index in Cassandra
      object_store: cassandra
      schema: v11
      index:
        prefix: index_  # prefix for index tables
        period: 168h  # each index table covers 7 days
      chunks:
        prefix: chunk_  # prefix for chunk tables
        period: 168h  # each chunk table covers 7 days
limits_config:
  ingestion_rate_mb: 50  # per-tenant ingestion rate limit
  enforce_metric_name: false
  reject_old_samples: true
  reject_old_samples_max_age: 168h
chunk_store_config:
  max_look_back_period: 168h  # maximum queryable history: 7 days
table_manager:
  retention_deletes_enabled: true  # delete data older than retention_period
  retention_period: 168h
Promtail client configuration:

server:
  http_listen_port: 0  # HTTP port; 0 means a random port
  grpc_listen_port: 0  # gRPC port; 0 means a random port
positions:
  filename: /export/servers/promtail/tmp/positions.yaml  # records collected file paths and read offsets
clients:
  - url: http://xx.xx.xxx/loki/api/v1/push
scrape_configs:
  - job_name: daojia  # job_name identifies this scrape configuration
    file_sd_configs:
      - files:
          - '/export/servers/promtail/logpath.yaml'  # the discovery file
        refresh_interval: 10s  # re-scan interval; newly added logs are discovered and collected automatically, no client restart needed

Contents of the discovery file (logpath.yaml):

- targets:
    - localhost  # host to collect from
  labels:
    host: 1.1.1.1  # label the collecting host with host: 1.1.1.1
    log: gw  # label the collecting host with log: gw
    __path__: /export/servers/nginx/logs/gw.o2o.jd.local/gw.o2o.jd.local_access.log  # path of the log to collect

4. Summary and Planning
More than 1,000 application servers now feed logs into the Loki‑based platform, consuming only 1.4 TB per day compared with the ~30 TB required by the previous Elasticsearch solution, dramatically reducing hardware costs. Automated log discovery based on front‑end naming conventions has also cut manual operational effort.
The entire platform runs on a single 48‑core, 256 GB RAM, 12 × 6 TB SATA machine. Future scaling will involve adding Loki nodes and expanding the Cassandra cluster to handle increased client volume, while the front‑end will evolve to provide richer analysis and visualization of search results.
Dada Group Technology
Sharing insights and experiences from Dada Group's R&D department on product refinement and technology advancement, connecting with fellow geeks to exchange ideas and grow together.