Distributed System Log Printing Optimization and Performance Evaluation
The study evaluates log4j2 and logback performance, recommends asynchronous logback for high-concurrency workloads, and demonstrates latency reductions in a production service. It also introduces a TraceContext-based flag that shares logging state across microservices, cutting daily log volume by ~80% and easing distributed-system overhead.
1. Background
The iQIYI overseas backend development team supports PHONE/PCW/TV services, along with points, pop-ups, and program reservation systems. This extensive business relies on a comprehensive logging system, but logging incurs performance overhead. In distributed systems, each node independently logs the same request, causing significant redundancy and resource waste.
Example: Service 1 calls Service 3 and then, serially, Service 2, which in turn also depends on Service 3. Services 1, 2, and 3 all record the same request details, and since Service 3 is invoked twice, a single request produces four near-identical log entries.
To address this, a two-part optimization plan was built:
1. Quantify the actual performance loss caused by logging and distill a concise SOP for log printing.
2. Take a global view of log printing to enable stateful logging across distributed nodes.
The goal is to reduce the resource and performance cost of log printing across distributed call chains and improve overall system performance.
2. Single‑System Log Printing Exploration and Practice
The most popular logging frameworks—log4j, log4j2, and logback—were evaluated. log4j2 is considered an upgrade of log4j, so the focus was on log4j2 (latest) and logback 1.3.0.
2.1 Multi‑dimensional Comparison and SOP Extraction
Test environment: container deployment with 2 CPU + 4 GB RAM, independent project. The project contains an API that logs messages of varying sizes based on input.
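The article does not show the benchmark API itself; a payload builder of the kind such an endpoint would need can be sketched as follows (the class and method names are invented for illustration):

```java
/** Hypothetical helper for the benchmark API: builds a log message of a given size. */
final class LogPayload {

    /** Returns an ASCII string of exactly {@code kb} kilobytes (1 KB = 1024 bytes). */
    static String ofKilobytes(int kb) {
        int bytes = kb * 1024;
        StringBuilder sb = new StringBuilder(bytes);
        while (sb.length() < bytes) {
            sb.append("0123456789abcdef");   // 16 ASCII characters per append
        }
        return sb.substring(0, bytes);       // trim any overshoot
    }
}
```

The endpoint would then log `LogPayload.ofKilobytes(size)` through whichever framework and appender configuration is under test.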
2.1.1 log4j2 Performance Quantification
log4j2 asynchronous logging (2 KB log size) vs synchronous logging.
Results:
At low concurrency, log4j2 synchronous and asynchronous logging perform similarly; at higher concurrency, asynchronous logging pulls ahead.
Asynchronous logging still has a performance ceiling.
Synchronous logging is IO‑bound (~5.15 MB/s for 2 KB logs).
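For reference, an all-asynchronous log4j2 setup of the kind measured here can be sketched as below (file names and patterns are illustrative, not the project's actual configuration). Fully asynchronous loggers additionally require the LMAX Disruptor on the classpath and the `Log4jContextSelector` system property:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Enable all-async loggers with:
     -DLog4jContextSelector=org.apache.logging.log4j.core.async.AsyncLoggerContextSelector -->
<Configuration status="WARN">
  <Appenders>
    <RollingFile name="File" fileName="logs/app.log"
                 filePattern="logs/app-%d{yyyy-MM-dd}.log.gz">
      <PatternLayout pattern="%d %p %c{1.} [%t] %m%n"/>
      <Policies>
        <TimeBasedTriggeringPolicy/>
      </Policies>
    </RollingFile>
  </Appenders>
  <Loggers>
    <Root level="info">
      <AppenderRef ref="File"/>
    </Root>
  </Loggers>
</Configuration>
```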
2.1.2 logback Performance Quantification
logback asynchronous and synchronous logging were measured under the same conditions.
Results:
logback behaves similarly to log4j2 at low concurrency; at higher concurrency, logback outperforms log4j2.
logback asynchronous logging shows little sensitivity to log size.
Synchronous logback also hits an IO bottleneck (~5.2 MB/s).
2.1.3 logback vs log4j2 Comparison
Both synchronous and asynchronous comparisons show that beyond a certain concurrency threshold, logback consistently delivers better throughput.
2.1.4 logback Performance in Different Scenarios
Tests with 100 concurrent requests covered logback synchronous and asynchronous logging across varying log sizes.
Key observations:
Log size has negligible impact within a certain range; beyond that, performance degrades sharply.
For synchronous logging, IO becomes the bottleneck (~160 MB/s total).
Asynchronous logging can discard logs when the queue is full, reducing sensitivity to log size.
2.2 Best‑Practice Summary
Prefer logback as the logging framework to minimize performance impact.
Use asynchronous logback in high‑concurrency scenarios when business logs are non‑essential.
If logs are critical, leave neverBlock=false (the default, so the async appender blocks rather than discards when its queue fills) and keep per-request log size under 2 KB to stay within the ~5 MB/s IO ceiling (at 2 KB per log, roughly 2,500 logged requests per second).
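As an illustration of the practices above, a minimal asynchronous logback.xml might look like the following (appender names and paths are invented for the sketch; the article does not show the real configuration):

```xml
<configuration>
  <appender name="FILE" class="ch.qos.logback.core.rolling.RollingFileAppender">
    <file>logs/app.log</file>
    <rollingPolicy class="ch.qos.logback.core.rolling.TimeBasedRollingPolicy">
      <fileNamePattern>logs/app-%d{yyyy-MM-dd}.log.gz</fileNamePattern>
      <maxHistory>7</maxHistory>
    </rollingPolicy>
    <encoder>
      <pattern>%d %-5level [%thread] %logger{36} - %msg%n</pattern>
    </encoder>
  </appender>

  <!-- AsyncAppender buffers events in a queue and writes on a worker thread. -->
  <appender name="ASYNC" class="ch.qos.logback.classic.AsyncAppender">
    <queueSize>1024</queueSize>
    <!-- 0 disables the default behavior of dropping TRACE/DEBUG/INFO
         events once the queue is 80% full -->
    <discardingThreshold>0</discardingThreshold>
    <!-- true: drop events instead of blocking when the queue is full;
         false (default): the caller blocks, so no event is lost -->
    <neverBlock>false</neverBlock>
    <appender-ref ref="FILE"/>
  </appender>

  <root level="INFO">
    <appender-ref ref="ASYNC"/>
  </root>
</configuration>
```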
2.3 Project‑Level Optimization
A pilot was conducted on an iQIYI overseas TOC service (deployed in Singapore, 4 CPU + 8 GB, peak QPS ≈ 120). After converting to asynchronous logging, P99 latency dropped from 78.8 ms to 74 ms and P999 from 180 ms to 164.5 ms.
3. Distributed Variable Sharing in Log Printing
While single‑system optimization is effective, most modern services are distributed. Redundant logging across nodes leads to resource waste and performance degradation.
Solution: use a TraceContext with a logBusiness flag. When the first service logs the request, it sets the flag to true; subsequent services check the flag and skip logging if already set.
Benefits:
Reduces daily log volume from ~150 GB to ~30 GB.
Lowers Flink queue consumption for log processing.
Edge cases (e.g., 5xx errors, timeouts) are handled by ensuring that failed services still set the flag appropriately, preventing loss of critical trace information.
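A minimal sketch of the idea, under the assumption that the flag travels with the trace context (all class, method, and header names here are invented; the article does not show iQIYI's actual implementation): each service logs only if no upstream service has claimed the request, and sets the flag before propagating the context downstream.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical per-request context carrying the shared logBusiness flag. */
final class TraceContext {
    private final String traceId;
    private boolean logBusiness;   // true once any service in the chain has logged

    TraceContext(String traceId, boolean logBusiness) {
        this.traceId = traceId;
        this.logBusiness = logBusiness;
    }

    /** Returns true at most once per request chain: the first caller logs the
        request details, then the flag flips so downstream services skip their
        copy. On a 5xx or timeout, a service would log unconditionally instead. */
    synchronized boolean claimLogging() {
        if (logBusiness) {
            return false;          // an upstream service already logged this request
        }
        logBusiness = true;
        return true;
    }

    /** Serialize the context into headers for the downstream RPC/HTTP call. */
    Map<String, String> toHeaders() {
        Map<String, String> headers = new HashMap<>();
        headers.put("x-trace-id", traceId);
        headers.put("x-log-business", Boolean.toString(logBusiness));
        return headers;
    }

    /** Rebuild the context on the callee side from incoming headers. */
    static TraceContext fromHeaders(Map<String, String> headers) {
        return new TraceContext(
                headers.getOrDefault("x-trace-id", "unknown"),
                Boolean.parseBoolean(headers.getOrDefault("x-log-business", "false")));
    }
}
```

With this shape, the first service in the chain wins the claim, logs, and every downstream hop sees `x-log-business: true` and skips its redundant copy.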
4. Summary and Outlook
The article presented both single‑machine and distributed‑system log‑printing optimization strategies. It provided performance comparisons of mainstream logging frameworks, derived SOPs, and demonstrated tangible gains in a production project. The distributed‑variable‑sharing technique offers a promising direction for reducing redundant logs and improving traceability in high‑throughput micro‑service architectures.
iQIYI Technical Product Team