
Why Docker Daemon Is a Minefield: Lessons from a Former Google SRE

In this talk, a former Google SRE traces his path from YouTube's explosive growth through Google Cloud operations to a Chinese startup, exposing the hidden pitfalls of the Docker daemon, image management, and container runtimes, along with practical DevOps strategies for building, packaging, and running services.


Personal Introduction: From Google to Coding

I joined Google in 2007, moved to the US headquarters in 2009, and worked on two major projects: YouTube during its explosive growth and later Google Cloud Platform, where I helped operate millions of internal servers.

Current Company and Docker Adoption

At Coding, we build productivity tools for developers such as project management, code repositories, a WebIDE, and a demo platform. In early 2015 we began Docker‑izing our services, initially using a single‑machine deployment model where each developer copied files and ran commands.

We later packaged all code into Docker images and ran them from a private registry. However, we quickly discovered numerous pitfalls.

Docker Engine (Runtime) Issues

The Docker daemon, once hailed as the essential runtime, turned out to be just one of many ways to launch containers. It lacks an init system and does not reap orphaned child processes, which leads to zombie processes and daemon instability, a problem that was still unresolved in many deployments at the time.

Every time we run a Docker program we worry about whether it handles zombie processes correctly.
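The reaping problem described above can be sketched in a few lines of Python. This is a toy illustration of the PID-1 duty, not anything Docker-specific: a container's first process must wait() on exited children, or they linger as zombies.

```python
import os

# Sketch: why a container's PID-1 process must reap children.
# Fork a child that exits immediately; until the parent calls
# waitpid(), the child stays in the process table as a zombie.
pid = os.fork()
if pid == 0:
    os._exit(0)  # child exits right away

# The parent acts as a tiny "init": collect the child's exit status,
# which removes the zombie entry from the process table.
reaped, status = os.waitpid(pid, 0)
print("reaped child", reaped, "exit code", os.waitstatus_to_exitcode(status))
```

A daemon that forgets this step slowly accumulates defunct processes until the machine needs a restart, which is exactly the instability described above.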

Docker’s storage layer also proved unreliable: AUFS left many uncleaned files, OverlayFS introduced deadlocks, and BTRFS suffered similar issues.

We even tried replacing the daemon with OpenSSH, only to realize that both are essentially remote command execution tools.

In that light the daemon felt like a trojan horse: a single command launches a program on a remote host, but with so many problems attached that we preferred plain SSH.

The daemon’s design caused two major problems:

1. Atomic operations are fragile: deleting a container involves many system calls, and any failure midway leaves garbage behind or can deadlock the system.

2. Restarting the daemon forces every container it manages to restart with it, which hurt our team’s credibility with the services that depended on us.

Docker Images: A “Chicken-Rib” Solution (Little Value, Yet Hard to Discard)

Docker images are essentially compressed packages. While they support layering and reduce transfer size, the savings are negligible in high‑bandwidth internal networks.

Saving 10 MB of disk space in a network that moves hundreds of MB per second is meaningless.

Dockerfiles combine build and packaging, forcing developers to write convoluted scripts that increase build time and image size dramatically.

Our first image was over 3 GB, which is absurd for something that just runs our code.

Practice 1: Separate Build, Package, Run

We learned to decouple building from packaging: first compile the code, then place the compiled artifact into a minimal image. A typical Dockerfile now contains three lines – a base runtime image, an ADD of the compiled package, and a CMD to run it.

Complex build logic belongs in the build step, not in the Dockerfile.
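Under these assumptions (the base image, artifact path, and entrypoint below are illustrative, not taken from the talk), such a three-line Dockerfile might look like:

```dockerfile
# Base runtime image only -- all compilation happened outside Docker.
FROM openjdk:8-jre

# ADD the artifact produced by the separate build step (path is illustrative).
ADD build/service.tar.gz /app/

# CMD just runs it; no build logic lives in the image.
CMD ["/app/bin/service"]
```

Because the image contains only a runtime and a prebuilt artifact, it stays small and the Dockerfile never needs to change when the build process does.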

Practice 2: Strip the Fluff

We use Docker only as a container, avoiding unnecessary features like SDN, custom networking, or shared storage. Most of our workloads are simple daemon jobs that benefit from static resource allocation.

We don’t need shared storage; the old model works fine.

Time‑zone and locale mismatches in containers caused log inconsistencies, so we now mount host settings directly.
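A common way to do this mounting is with read-only bind mounts at container start; the image name and flags below are a sketch of the approach, not the exact commands from the talk:

```shell
# Mount the host's time-zone settings read-only so logs inside the
# container carry the same timestamps as the host (paths are the
# conventional Linux locations; the image name is illustrative).
docker run -d \
  -v /etc/localtime:/etc/localtime:ro \
  -v /etc/timezone:/etc/timezone:ro \
  my-service:latest
```

The :ro suffix keeps the container from modifying host configuration, so the mount only flows information one way.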

Practice 3: Tooling, Code‑Driven, Semi‑Automation

We built lightweight tools for atomic operations: up to start, down to stop, and rollingupdate for phased updates. A web UI now displays logs and allows remote container interaction.
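The core of a phased update like rollingupdate can be sketched as a batch loop with a health gate. This is a minimal illustration of the idea, with hypothetical function names, not the tool's actual implementation:

```python
from typing import Callable, List

def rolling_update(instances: List[str],
                   update: Callable[[str], None],
                   healthy: Callable[[str], bool],
                   batch_size: int = 1) -> List[str]:
    """Update instances in small batches, checking health after each one.
    Halt at the first unhealthy instance so a bad release cannot take
    down the whole fleet; the rest stay on the old version."""
    done: List[str] = []
    for i in range(0, len(instances), batch_size):
        for inst in instances[i:i + batch_size]:
            update(inst)
            if not healthy(inst):
                return done  # stop the rollout early
            done.append(inst)
    return done

# Usage sketch: "updating" here just records a version in a dict.
state = {}
updated = rolling_update(
    ["web-1", "web-2", "web-3"],
    update=lambda inst: state.__setitem__(inst, "v2"),
    healthy=lambda inst: True,
    batch_size=1,
)
print(updated)  # ['web-1', 'web-2', 'web-3']
```

The early return is the important design choice: a phased update is only safer than a bulk restart if it actually stops when something goes wrong.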

Our goal is 80 % semi‑automation: reliable, fast execution without full automation, which remains hard due to edge cases.

Summary

To succeed with containerization and distributed systems, define three clear interfaces – build, package, run – and keep each service’s dependencies self‑contained. Treat production environments as code, abstract tasks into jobs with multiple replicas, and use tooling to make operations repeatable and scalable.

Tags: Docker, operations, devops, containerd, container runtime, OCI
Written by

Efficient Ops

This public account is maintained by Xiaotianguo and friends and regularly publishes widely read original technical articles. We focus on the transformation of operations work and aim to accompany you throughout your operations career.
