
Automated Business Log Collection in the Zhuanzhuan Container Cloud Platform Using Log-Pilot

This article describes how Zhuanzhuan built an automated, business-transparent log-collection solution for its container cloud platform: the team evaluated several collection approaches, adopted Alibaba Cloud's open-source log-pilot, customized its deployment, and worked through practical issues such as a time-zone bug, data latency, and duplicate collection.

Zhuanzhuan Tech

Background

With the rapid adoption of microservices and Docker, companies are moving workloads to container clouds to improve resource utilization and reduce operational costs. The data team at Zhuanzhuan faced a challenge: business logs produced by application instrumentation would become inaccessible after migration, so a cloud-native log-collection solution was needed.

Evaluating Collection Methods

Four approaches were examined:

1. Pushing logs directly to Kafka or Redis from application code – highly intrusive and unreliable when the backend is unavailable.

2. Writing logs to standard output and letting the Docker Engine collect them – still requires changes on the business side.

3. Running a sidecar collector (e.g., Flume or Filebeat) inside each container – transparent to the business, but a misbehaving collector can crash the container, and logs still buffered inside the container are lost when it stops.

4. Mounting a host directory into each container for its log files – logs survive container termination, but every container needs a unique mount point, and collector configuration must be updated dynamically as containers come and go.

The fourth method was chosen as the basis for the final design.

Architecture Solution

Zhaozhuan adopted Alibaba Cloud’s open‑source log‑pilot , a lightweight Go tool that dynamically discovers container events, parses container labels, and generates log‑collector configuration files. The tool consists of three parts, with the container‑event‑management module listening to Docker events and producing configuration for downstream collectors.

The overall workflow is illustrated in the original diagrams (omitted here).

Deployment Practice

Each host runs a dedicated log‑pilot container that monitors all containers on that host. To stay compatible with the existing file‑mode collection pipeline, logs are first collected on the host and then processed by the traditional log‑collection system.

Key Code Snippets

Fluentd buffer configuration (used initially):

<buffer tag,time,docker_app,docker_service,docker_container>
  # ${VAR:=default} placeholders are filled from environment variables
  # when the configuration template is rendered at startup.
  @type ${FILE_BUFFER_TYPE:=file}
  path $FILE_PATH/.buffer
  chunk_limit_size 8MB
  chunk_limit_records 1000
  flush_thread_count 20
  flush_at_shutdown true
  timekey ${FILE_BUFFER_TIME_KEY:=1d}
  timekey_wait ${FILE_BUFFER_TIME_KEY_WAIT:=2m}
  timekey_use_utc ${FILE_BUFFER_TIME_KEY_USE_UTC:=false}
  # template placeholder expanded to the configured output section
  $(bufferd_output)
</buffer>

Go code that listens to Docker events and processes them:

// Continuously listen for Docker events. Events returns a message
// channel and an error channel; an error terminates the stream.
msgs, errs := p.client().Events(ctx, options)
for {
  select {
  case msg := <-msgs:
    if err := p.processEvent(msg); err != nil {
      // handle error (e.g. log and continue)
    }
  case err := <-errs:
    if err != nil {
      // the stream is broken; log the error and re-subscribe
      // so subsequent container events are not missed
      msgs, errs = p.client().Events(ctx, options)
    }
  }
}

// processEvent creates or removes collector configuration in response
// to container lifecycle events.
func (p *Pilot) processEvent(msg events.Message) error {
  containerId := msg.Actor.ID
  ctx := context.Background()
  switch msg.Action {
  case "start", "restart":
    // inspect the container and generate its collector config
    containerJSON, err := p.client().ContainerInspect(ctx, containerId)
    if err != nil {
      return err
    }
    // ...
    return p.newContainer(&containerJSON)
  case "destroy":
    // remove the container's collector config
    if err := p.delContainer(containerId); err != nil {
      return err
    }
    // ...
  }
  return nil
}

Flume configuration template generated by log‑pilot (simplified):

{{range .configList}}

a1.sources.{{if index $.container "k8s_pod"}}{{ index $.container "k8s_pod" }}{{else}}{{ $.containerId }}{{end}}_{{ .Name }}_source.type = TAILDIR
... (additional source, channel, and sink definitions) ...
{{end}}

Practical Issues and Optimizations

During production the team encountered:

A time-zone bug in which the timekey_use_utc setting was ignored; fixed by upgrading Fluentd.

Data latency caused by buffer back‑pressure; mitigated by increasing thread count and chunk limits.

Duplicate collection after container restarts due to unchanged host mount paths; solved by storing offsets per pod name.

Further customizations included extending dynamic configuration, handling large‑volume logs, and switching to Flume’s Taildir Source + File Channel + File Roll Sink for reliable archiving.

Conclusion

The customized log‑pilot solution integrates seamlessly with the existing log‑collection workflow, provides fully transparent log collection for business teams, eliminates manual configuration updates on container start/stop, and significantly reduces operational overhead.

Thanks are extended to Alibaba Cloud for open‑sourcing log‑pilot and to the community for reviewing the contributed PRs.
