Why Do Readiness Probe Failures Show “OCI runtime exec failed: EOF” in Kubernetes?
A Kubernetes pod kept reporting readiness probe warnings that contained an OCI runtime exec failure. The error was traced through kubelet, dockershim, Docker, containerd, and runc to a race condition: cpu-manager updates rewrite the container state file while runc exec is reading it. It was resolved by disabling cpu-manager or upgrading runc.
Introduction
This post records how the problem was investigated. The source-code analysis was written up by a developer colleague and is published with their consent.
Problem
A customer reported many warning events like the following, although the service remained accessible:
Readiness probe failed: OCI runtime exec failed: exec failed: EOF: unknown
Environment
Note: the customer enabled cpu-manager on the k8s node running the workload.
Component    Version
k8s          1.14.x
Investigation
1. After receiving the report, check the kubelet logs on the node where the pod runs:
<code>I0507 03:43:28.310630 57003 prober.go:112] Readiness probe for "adsfadofadfabdfhaodsfa(d1aab5f0-ae8f-11eb-a151-080027049c65):c0" failed (failure): OCI runtime exec failed: exec failed: EOF: unknown
I0507 07:08:49.834093 57003 prober.go:112] Readiness probe for "adsfadofadfabdfhaodsfa(a89a158e-ae8f-11eb-a151-080027049c65):c0" failed (failure): OCI runtime exec failed: exec failed: unexpected EOF: unknown
I0507 10:06:58.307881 57003 prober.go:112] Readiness probe for "adsfadofadfabdfhaodsfa(d1aab5f0-ae8f-11eb-a151-080027049c65):c0" failed (failure): OCI runtime exec failed: exec failed: EOF: unknown</code>
The probe result type is failure, as the (failure) in each log line shows. The kubelet's exec prober reports this result when the probe command returns an exit error with a non-zero status.
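How a probe ends up as failure rather than unknown can be sketched as follows. This is a simplified model and not the kubelet source: `Result`, `exitError`, `errTransport` and `classify` are names invented for the illustration, and the exit status 126 is an arbitrary non-zero value.

```go
package main

import (
	"errors"
	"fmt"
)

// Result models the three outcomes a probe can report.
type Result string

const (
	Success Result = "success"
	Failure Result = "failure"
	Unknown Result = "unknown"
)

// exitError is a hypothetical stand-in for an error that carries the
// exit status of the probed command.
type exitError struct {
	status int
	msg    string
}

func (e *exitError) Error() string   { return e.msg }
func (e *exitError) ExitStatus() int { return e.status }

// errTransport is a hypothetical error that carries no exit status.
var errTransport = errors.New("connection reset")

// classify maps the outcome of running the probe command to a Result.
// output is whatever the command wrote; err is nil on a clean exit.
func classify(output string, err error) (Result, string, error) {
	if err == nil {
		return Success, output, nil
	}
	var ee *exitError
	if errors.As(err, &ee) {
		if ee.ExitStatus() == 0 {
			return Success, output, nil
		}
		// An exit status was reported and it is non-zero.
		return Failure, output, nil
	}
	// Any other error means the probe could not be evaluated at all.
	return Unknown, "", err
}

func main() {
	res, out, _ := classify("OCI runtime exec failed: exec failed: EOF: unknown",
		&exitError{status: 126, msg: "command terminated with exit code 126"})
	fmt.Printf("%s: %s\n", res, out)

	res, _, err := classify("", errTransport)
	fmt.Printf("%s: %v\n", res, err)
}
```

In this model the text after "(failure):" in the kubelet log corresponds to the captured output of the probe command, not to a Go error returned by the prober.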
2. Check Docker logs:
<code>time="2021-05-06T16:51:40.009989451+08:00" level=error msg="stream copy error: reading from a closed fifo"
time="2021-05-06T16:51:40.010054596+08:00" level=error msg="stream copy error: reading from a closed fifo"
time="2021-05-06T16:51:40.170676532+08:00" level=error msg="Error running exec 8e34e8b910694abe95a467b2936b37635fdabd2f7b7c464dfef952fa5732aa4e in container: OCI runtime exec failed: exec failed: EOF: unknown"</code>
Although the Docker logs also show stream copy errors, the failure originates lower down: the underlying runc returned EOF. Because the probe result is Failure, e.CombinedOutput() must have returned a non-nil error carrying a non-zero exit status. Following that call leads to ExecInContainer.
ExecInContainer implementation (excerpt):
<code>func (*NativeExecHandler) ExecInContainer(client libdocker.Interface, container *dockertypes.ContainerJSON, cmd []string, stdin io.Reader, stdout, stderr io.WriteCloser, tty bool, resize <-chan remotecommand.TerminalSize, timeout time.Duration) error {
	// Create the exec. The construction of createOpts and the error check
	// for this call are omitted from the excerpt.
	execObj, err := client.CreateExec(container.ID, createOpts)
	// Start the exec and attach its I/O streams (execStarted is defined in
	// the omitted part).
	startOpts := dockertypes.ExecStartCheck{Detach: false, Tty: tty}
	streamOpts := libdocker.StreamOptions{InputStream: stdin, OutputStream: stdout, ErrorStream: stderr, RawTerminal: tty, ExecStarted: execStarted}
	err = client.StartExec(execObj.ID, startOpts, streamOpts)
	if err != nil {
		return err
	}
	// poll for completion
	ticker := time.NewTicker(2 * time.Second)
	defer ticker.Stop()
	for {
		inspect, err2 := client.InspectExec(execObj.ID)
		if err2 != nil {
			return err2
		}
		if !inspect.Running {
			// A non-zero exit code is surfaced as an exit error.
			if inspect.ExitCode != 0 {
				err = &dockerExitError{inspect}
			}
			break
		}
		<-ticker.C
	}
	return err
}
</code>
ExecInContainer performs three main steps:
Call CreateExec to create an exec ID.
Call StartExec to run the exec and redirect I/O.
Call InspectExec to obtain the running status and exit code.
The error printed in the logs comes from dockerd's response stream; that is, dockerd's response itself carries the error text.
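One way to read this, sketched below under stated assumptions: the start call itself succeeds, the error text arrives as ordinary output on the stream, and only the later inspect call reveals a non-zero exit code. The `execClient` interface, the `failingDaemon` type, `runExec` and the exit code 126 are all hypothetical and exist only for this illustration; they are not the dockershim or Docker API, and polling is reduced to a single inspect call.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
)

// execInspect is a hypothetical, trimmed-down view of an exec's status.
type execInspect struct {
	Running  bool
	ExitCode int
}

// execClient is a hypothetical interface modelling the three calls.
type execClient interface {
	CreateExec(containerID string, cmd []string) (string, error)
	StartExec(execID string, stdout io.Writer) error
	InspectExec(execID string) (execInspect, error)
}

// failingDaemon imitates a daemon whose runtime cannot start the process.
// It reports the problem in the output stream, not as a Go error.
type failingDaemon struct{}

func (failingDaemon) CreateExec(string, []string) (string, error) {
	return "exec-1", nil
}

func (failingDaemon) StartExec(_ string, stdout io.Writer) error {
	_, err := io.WriteString(stdout, "OCI runtime exec failed: exec failed: EOF: unknown\n")
	return err
}

func (failingDaemon) InspectExec(string) (execInspect, error) {
	// 126 is an arbitrary non-zero value chosen for illustration.
	return execInspect{Running: false, ExitCode: 126}, nil
}

// runExec performs create, start and a single inspect, and returns the
// captured output together with the exit code.
func runExec(c execClient, containerID string, cmd []string) (string, int, error) {
	id, err := c.CreateExec(containerID, cmd)
	if err != nil {
		return "", 0, err
	}
	var out bytes.Buffer
	if err := c.StartExec(id, &out); err != nil {
		return "", 0, err
	}
	st, err := c.InspectExec(id)
	if err != nil {
		return "", 0, err
	}
	return out.String(), st.ExitCode, nil
}

func main() {
	out, code, err := runExec(failingDaemon{}, "c0", []string{"cat", "/tmp/healthy"})
	fmt.Printf("output=%q exit=%d err=%v\n", out, code, err)
}
```

Under these assumptions the caller sees no Go error from any of the three calls. The failure is visible only as a non-zero exit code plus the message in the captured output, which matches a probe result of failure with the runtime error as its message.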
Further tracing shows that ExecStart eventually calls into containerd, which invokes runc. The runc exec fails with exec failed: EOF: unknown.
Running runc exec repeatedly reproduces the issue sporadically. The investigation revealed that runc reads the container's state.json. When the kubelet cpu-manager updates the container (by default every 10 s), state.json is rewritten at the same time, so a reader can see the file truncated or only partially written. The JSON decoder then fails with EOF or unexpected EOF, the two variants that appear in the kubelet log.
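The effect of reading a half-written file can be shown with Go's standard JSON decoder alone. The sketch below is not runc code: the `state` struct is a minimal hypothetical stand-in for the real state document, and `decodeKind` is a helper invented for the illustration.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"io"
	"strings"
)

// state is a hypothetical, minimal stand-in for the container state document.
type state struct {
	ID  string `json:"id"`
	Pid int    `json:"init_process_pid"`
}

// decodeKind decodes raw as a state document and names the outcome.
func decodeKind(raw string) string {
	var s state
	err := json.NewDecoder(strings.NewReader(raw)).Decode(&s)
	switch {
	case err == nil:
		return "ok"
	case errors.Is(err, io.ErrUnexpectedEOF):
		return "unexpected EOF"
	case errors.Is(err, io.EOF):
		return "EOF"
	default:
		return "other: " + err.Error()
	}
}

func main() {
	full := `{"id":"c0","init_process_pid":4242}`
	fmt.Println(decodeKind(full))      // complete document
	fmt.Println(decodeKind(full[:12])) // writer interrupted mid-document
	fmt.Println(decodeKind(""))        // file truncated, nothing written yet
}
```

An empty input yields EOF and a cut-off document yields unexpected EOF, so both messages are consistent with a reader catching the file at different moments of an in-place rewrite.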
A related runc PR fixes the problem by making saveState atomic: the new state is written to a temporary file, which is then renamed over state.json.
<code>// original saveState: writes state.json in place, so a concurrent reader
// can observe a truncated or partially written file
func (c *linuxContainer) saveState(s *State) error {
	f, err := os.Create(filepath.Join(c.root, stateFilename))
	if err != nil {
		return err
	}
	defer f.Close()
	return utils.WriteJSON(f, s)
}

// fixed saveState: writes to a temporary file in the same directory, then
// renames it over state.json
func (c *linuxContainer) saveState(s *State) (retErr error) {
	tmpFile, err := ioutil.TempFile(c.root, "state-")
	if err != nil {
		return err
	}
	defer func() {
		// Clean up the temporary file if any later step fails.
		if retErr != nil {
			tmpFile.Close()
			os.Remove(tmpFile.Name())
		}
	}()
	err = utils.WriteJSON(tmpFile, s)
	if err != nil {
		return err
	}
	err = tmpFile.Close()
	if err != nil {
		return err
	}
	stateFilePath := filepath.Join(c.root, stateFilename)
	return os.Rename(tmpFile.Name(), stateFilePath)
}
</code>
Solution
Disable cpu-manager.
Upgrade runc to a version containing the above fix.
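To see why the temp-file-plus-rename approach removes the race, the pattern can be exercised outside runc. The sketch below is a standalone approximation, assuming a POSIX filesystem where rename within one directory is atomic. `writeJSONAtomic` and `countBadReads` are hypothetical helpers, and the document being written is dummy data.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// writeJSONAtomic writes v to dir/name through a temporary file in the same
// directory followed by a rename, so a reader sees the old or the new file.
func writeJSONAtomic(dir, name string, v interface{}) (retErr error) {
	tmp, err := os.CreateTemp(dir, name+"-")
	if err != nil {
		return err
	}
	defer func() {
		// Remove the temporary file if any later step fails.
		if retErr != nil {
			tmp.Close()
			os.Remove(tmp.Name())
		}
	}()
	if err := json.NewEncoder(tmp).Encode(v); err != nil {
		return err
	}
	if err := tmp.Close(); err != nil {
		return err
	}
	return os.Rename(tmp.Name(), filepath.Join(dir, name))
}

// countBadReads rewrites the file the given number of times while a reader
// goroutine decodes it in a loop, and returns how many reads failed.
func countBadReads(writes int) (int, error) {
	dir, err := os.MkdirTemp("", "state-demo")
	if err != nil {
		return 0, err
	}
	defer os.RemoveAll(dir)

	path := filepath.Join(dir, "state.json")
	doc := map[string]string{"id": "c0", "padding": strings.Repeat("x", 1<<16)}
	if err := writeJSONAtomic(dir, "state.json", doc); err != nil {
		return 0, err
	}

	done := make(chan struct{})
	bad := make(chan int)
	go func() {
		n := 0
		for {
			select {
			case <-done:
				bad <- n
				return
			default:
			}
			raw, err := os.ReadFile(path)
			var got map[string]string
			if err != nil || json.Unmarshal(raw, &got) != nil {
				n++
			}
		}
	}()

	var writeErr error
	for i := 0; i < writes && writeErr == nil; i++ {
		writeErr = writeJSONAtomic(dir, "state.json", doc)
	}
	close(done)
	n := <-bad // wait for the reader before the directory is removed
	return n, writeErr
}

func main() {
	n, err := countBadReads(200)
	fmt.Printf("failed reads: %d, err: %v\n", n, err)
}
```

Replacing `writeJSONAtomic` with an in-place `os.Create` followed by a write would let the reader observe empty or cut-off files, which is the situation described above.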
Ops Development Stories
Maintained by a like‑minded team, covering both operations and development. Topics span Linux ops, DevOps toolchain, Kubernetes containerization, monitoring, log collection, network security, and Python or Go development. Team members: Qiao Ke, wanger, Dong Ge, Su Xin, Hua Zai, Zheng Ge, Teacher Xia.