Distributed System Log Printing Optimization and Performance Evaluation
The study evaluates log4j2 and logback performance, recommends asynchronous logback for high-concurrency workloads, and demonstrates latency reductions in a production service. It also introduces a TraceContext-based flag that shares logging state across microservices, cutting daily log volume by ~80% and easing distributed-system overhead.
1. Background
The iQIYI overseas backend development team supports PHONE/PCW/TV services, along with points, pop-ups, and program reservation systems. This extensive business relies on a comprehensive logging system, but logging incurs performance overhead. In distributed systems, each node independently logs the same request, causing significant redundancy and resource waste.
Example: Service 1 calls Service 3 and then, serially, Service 2, which in turn also depends on Service 3. Services 1, 2, and 3 all record the same request details, and since Service 3 is invoked twice, a single request produces four near-identical log entries.
To address this, a two-part optimization plan was built:
1. Quantify the actual performance loss caused by logging and distill a concise SOP for log printing.
2. Take a global view of log printing to enable stateful logging across distributed nodes.
The goal is to reduce the resource and performance cost of log printing across distributed call chains and improve overall system performance.
2. Single‑System Log Printing Exploration and Practice
The most popular logging frameworks—log4j, log4j2, and logback—were evaluated. log4j2 is considered an upgrade of log4j, so the focus was on log4j2 (latest) and logback 1.3.0.
2.1 Multi‑dimensional Comparison and SOP Extraction
Test environment: container deployment with 2 CPU + 4 GB RAM, independent project. The project contains an API that logs messages of varying sizes based on input.
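The article does not show the benchmark API itself; a payload builder of the kind such an endpoint would need can be sketched as follows (the class and method names are invented for illustration):

```java
/** Hypothetical helper for the benchmark API: builds a log message of a given size. */
final class LogPayload {

    /** Returns an ASCII string of exactly {@code kb} kilobytes (1 KB = 1024 bytes). */
    static String ofKilobytes(int kb) {
        int bytes = kb * 1024;
        StringBuilder sb = new StringBuilder(bytes);
        while (sb.length() < bytes) {
            sb.append("0123456789abcdef");   // 16 ASCII characters per append
        }
        return sb.substring(0, bytes);       // trim any overshoot
    }
}
```

The endpoint would then log `LogPayload.ofKilobytes(size)` through whichever framework and appender configuration is under test.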
2.1.1 log4j2 Performance Quantification
log4j2 asynchronous logging (2 KB log size) vs synchronous logging.
Results:
At low concurrency, log4j2 synchronous and asynchronous logging perform similarly; at higher concurrency, asynchronous logging pulls ahead.
Asynchronous logging still has a performance ceiling.
Synchronous logging is IO‑bound (~5.15 MB/s for 2 KB logs).
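For reference, an all-asynchronous log4j2 setup of the kind measured here can be sketched as below (file names and patterns are illustrative, not the project's actual configuration). Fully asynchronous loggers additionally require the LMAX Disruptor on the classpath and the `Log4jContextSelector` system property:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Enable all-async loggers with:
     -DLog4jContextSelector=org.apache.logging.log4j.core.async.AsyncLoggerContextSelector -->
<Configuration status="WARN">
  <Appenders>
    <RollingFile name="File" fileName="logs/app.log"
                 filePattern="logs/app-%d{yyyy-MM-dd}.log.gz">
      <PatternLayout pattern="%d %p %c{1.} [%t] %m%n"/>
      <Policies>
        <TimeBasedTriggeringPolicy/>
      </Policies>
    </RollingFile>
  </Appenders>
  <Loggers>
    <Root level="info">
      <AppenderRef ref="File"/>
    </Root>
  </Loggers>
</Configuration>
```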
2.1.2 logback Performance Quantification
logback asynchronous and synchronous logging were measured under the same conditions.
Results:
logback behaves similarly to log4j2 at low concurrency; at higher concurrency, logback outperforms log4j2.
logback asynchronous logging shows little sensitivity to log size.
Synchronous logback also hits an IO bottleneck (~5.2 MB/s).
2.1.3 logback vs log4j2 Comparison
Both synchronous and asynchronous comparisons show that beyond a certain concurrency threshold, logback consistently delivers better throughput.
2.1.4 logback Performance in Different Scenarios
Tests with 100 concurrent requests covered logback synchronous and asynchronous logging across varying log sizes.
Key observations:
Log size has negligible impact within a certain range; beyond that, performance degrades sharply.
For synchronous logging, IO becomes the bottleneck (~160 MB/s total).
Asynchronous logging can discard logs when the queue is full, reducing sensitivity to log size.
2.2 Best‑Practice Summary
Prefer logback as the logging framework to minimize performance impact.
Use asynchronous logback in high‑concurrency scenarios when business logs are non‑essential.
If logs are critical, leave neverBlock=false (the default, so the async appender blocks rather than discards when its queue fills) and keep per-request log size under 2 KB to stay within the ~5 MB/s IO ceiling (at 2 KB per log, roughly 2,500 logged requests per second).
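As an illustration of the practices above, a minimal asynchronous logback.xml might look like the following (appender names and paths are invented for the sketch; the article does not show the real configuration):

```xml
<configuration>
  <appender name="FILE" class="ch.qos.logback.core.rolling.RollingFileAppender">
    <file>logs/app.log</file>
    <rollingPolicy class="ch.qos.logback.core.rolling.TimeBasedRollingPolicy">
      <fileNamePattern>logs/app-%d{yyyy-MM-dd}.log.gz</fileNamePattern>
      <maxHistory>7</maxHistory>
    </rollingPolicy>
    <encoder>
      <pattern>%d %-5level [%thread] %logger{36} - %msg%n</pattern>
    </encoder>
  </appender>

  <!-- AsyncAppender buffers events in a queue and writes on a worker thread. -->
  <appender name="ASYNC" class="ch.qos.logback.classic.AsyncAppender">
    <queueSize>1024</queueSize>
    <!-- 0 disables the default behavior of dropping TRACE/DEBUG/INFO
         events once the queue is 80% full -->
    <discardingThreshold>0</discardingThreshold>
    <!-- true: drop events instead of blocking when the queue is full;
         false (default): the caller blocks, so no event is lost -->
    <neverBlock>false</neverBlock>
    <appender-ref ref="FILE"/>
  </appender>

  <root level="INFO">
    <appender-ref ref="ASYNC"/>
  </root>
</configuration>
```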
2.3 Project‑Level Optimization
A pilot was conducted on an iQIYI overseas TOC service (deployed in Singapore, 4 CPU + 8 GB, peak QPS ≈ 120). After converting to asynchronous logging, P99 latency dropped from 78.8 ms to 74 ms and P999 from 180 ms to 164.5 ms.
3. Distributed Variable Sharing in Log Printing
While single‑system optimization is effective, most modern services are distributed. Redundant logging across nodes leads to resource waste and performance degradation.
Solution: use a TraceContext with a logBusiness flag. When the first service logs the request, it sets the flag to true; subsequent services check the flag and skip logging if already set.
Benefits:
Reduces daily log volume from ~150 GB to ~30 GB.
Lowers Flink queue consumption for log processing.
Edge cases (e.g., 5xx errors, timeouts) are handled by ensuring that failed services still set the flag appropriately, preventing loss of critical trace information.
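A minimal sketch of the idea, under the assumption that the flag travels with the trace context (all class, method, and header names here are invented; the article does not show iQIYI's actual implementation): each service logs only if no upstream service has claimed the request, and sets the flag before propagating the context downstream.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical per-request context carrying the shared logBusiness flag. */
final class TraceContext {
    private final String traceId;
    private boolean logBusiness;   // true once any service in the chain has logged

    TraceContext(String traceId, boolean logBusiness) {
        this.traceId = traceId;
        this.logBusiness = logBusiness;
    }

    /** Returns true at most once per request chain: the first caller logs the
        request details, then the flag flips so downstream services skip their
        copy. On a 5xx or timeout, a service would log unconditionally instead. */
    synchronized boolean claimLogging() {
        if (logBusiness) {
            return false;          // an upstream service already logged this request
        }
        logBusiness = true;
        return true;
    }

    /** Serialize the context into headers for the downstream RPC/HTTP call. */
    Map<String, String> toHeaders() {
        Map<String, String> headers = new HashMap<>();
        headers.put("x-trace-id", traceId);
        headers.put("x-log-business", Boolean.toString(logBusiness));
        return headers;
    }

    /** Rebuild the context on the callee side from incoming headers. */
    static TraceContext fromHeaders(Map<String, String> headers) {
        return new TraceContext(
                headers.getOrDefault("x-trace-id", "unknown"),
                Boolean.parseBoolean(headers.getOrDefault("x-log-business", "false")));
    }
}
```

With this shape, the first service in the chain wins the claim, logs, and every downstream hop sees `x-log-business: true` and skips its redundant copy.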
4. Summary and Outlook
The article presented both single‑machine and distributed‑system log‑printing optimization strategies. It provided performance comparisons of mainstream logging frameworks, derived SOPs, and demonstrated tangible gains in a production project. The distributed‑variable‑sharing technique offers a promising direction for reducing redundant logs and improving traceability in high‑throughput micro‑service architectures.
iQIYI Technical Product Team