
Zero‑Downtime Upgrade of Large‑Scale Kubernetes Clusters from v1.10 to v1.17

This article details the challenges, strategies, and step‑by‑step procedures for upgrading a 1,000‑node Kubernetes cluster from version 1.10 to 1.17 without service interruption, covering compatibility checks, in‑place versus replacement upgrades, container‑restart avoidance, pod eviction handling, and TCP connection issues.

Sohu Tech Products

The rapid three‑month release cadence of Kubernetes brings new features and bug fixes, but frequent upgrades are difficult for large‑scale production environments where any misstep can cause significant economic loss; therefore, a careful balance between innovation and stability is required.

Vivo's internet team operated clusters on v1.10 for a long time, but increasing containerization and performance demands made an upgrade to v1.17 urgent. Benefits of the upgrade include performance optimizations, support for CNCF projects such as OpenKruise, better resource utilization, and reduced operational overhead from version fragmentation.

Upgrade challenges include avoiding container restarts caused by the change in kubelet's container-hash computation between v1.10 and v1.17, adhering to the community-recommended version-skew policy (upgrade no more than one minor version at a time), and handling API deprecations that may break existing manifests.

Upgrade approaches:

Replacement upgrade creates a new high‑version cluster and gradually drains nodes from the old cluster, offering strong atomicity but requiring extensive node churn and being unfriendly to stateful or single‑replica workloads. In‑place upgrade updates kubelet, controller‑manager, and other components on each node in a defined order, providing easier automation and better continuity for containers, though it demands careful ordering to avoid intermediate failure states.

For the binary‑deployed clusters used by Vivo, an in‑place upgrade was chosen due to its shorter duration and lower impact on single‑replica services.

Cross‑version upgrade considerations highlight the API compatibility policy: resources deprecated in v1.16 (e.g., extensions/v1beta1) are removed in v1.18, so upgrading across three or more minor versions can render objects unrecognizable. The team followed the recommended stepwise path (at least seven incremental upgrades from v1.10 to v1.17) after thorough ChangeLog review.
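A pre-upgrade audit of manifests can catch these removals before they bite. The sketch below is a minimal, illustrative check, not the team's actual tooling; the `removedAPIs` table lists only a few well-known `extensions/v1beta1` migrations and is not exhaustive.

```go
package main

import "fmt"

// removedAPIs maps apiVersion/kind pairs that stop being served after the
// target release to the group/version they should be migrated to.
// Illustrative subset only.
var removedAPIs = map[string]string{
	"extensions/v1beta1/Deployment": "apps/v1",
	"extensions/v1beta1/DaemonSet":  "apps/v1",
	"extensions/v1beta1/ReplicaSet": "apps/v1",
}

// checkManifest reports whether an object's apiVersion/kind must be migrated
// before the upgrade, and which group/version to migrate it to.
func checkManifest(apiVersion, kind string) (string, bool) {
	target, removed := removedAPIs[apiVersion+"/"+kind]
	return target, removed
}

func main() {
	if target, removed := checkManifest("extensions/v1beta1", "Deployment"); removed {
		fmt.Printf("migrate to %s before upgrading\n", target)
	}
}
```

In practice the same check would be run over every object retrieved from the old cluster, so nothing unrecognizable survives into the new API server.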

Avoiding container restarts required understanding the kubelet’s computePodActions logic. The relevant code snippet is shown below:

func (m *kubeGenericRuntimeManager) computePodActions(pod *v1.Pod, podStatus *kubecontainer.PodStatus) podActions {
    restart := shouldRestartOnFailure(pod)
    // ...
    for idx, container := range pod.Spec.Containers {
        containerStatus := podStatus.FindContainerStatusByName(container.Name)
        // ...
        var message string
        if _, _, changed := containerChanged(&container, containerStatus); changed {
            message = fmt.Sprintf("Container %s definition changed", container.Name)
            // If the container spec hash changed, force a restart.
            restart = true
        }
        // ...
        if restart {
            message = fmt.Sprintf("%s, will be restarted", message)
            // Add the container index to the restart list.
            changes.ContainersToStart = append(changes.ContainersToStart, idx)
        }
    }
    // ...
}

func containerChanged(container *v1.Container, containerStatus *kubecontainer.ContainerStatus) (uint64, uint64, bool) {
    // Compute container spec hash
    expectedHash := kubecontainer.HashContainer(container)
    return expectedHash, containerStatus.Hash, containerStatus.Hash != expectedHash
}

Because v1.17 computes the hash from the JSON‑serialized container struct (including new fields), the hash differs from v1.10, triggering restarts. Rather than patching kubelet code, the team introduced a cache file that records the old cluster version; kubelet skips hash verification for pods created before the upgrade, preventing unnecessary restarts.

Pod eviction handling focuses on the TaintBasedEvictions feature introduced in v1.13, which uses tolerationSeconds to control eviction timing. Without explicit tolerations, pods may be evicted within seconds after a node becomes NotReady. Adding appropriate tolerations via a label‑based script resolves the issue. Example toleration configuration:

tolerations:
- effect: NoExecute
  key: node.kubernetes.io/not-ready
  operator: Exists
  tolerationSeconds: 300
- effect: NoExecute
  key: node.kubernetes.io/unreachable
  operator: Exists
  tolerationSeconds: 300

Unexpected MatchNodeSelector failures were traced to kubelet's admission check after a restart: node labels that had gone missing caused running pods to be rejected with the MatchNodeSelector status. Restoring the required node labels allowed the pods to be rescheduled.
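The admission check at fault is, in essence, a strict key/value match between the pod's nodeSelector and the node's labels. A miniature reproduction (not kubelet's actual predicate code) shows why a single missing label is enough:

```go
package main

import "fmt"

// matchesNodeSelector reproduces, in miniature, the check behind the
// MatchNodeSelector failures: every key/value in the pod's nodeSelector must
// be present, with the same value, among the node's labels.
func matchesNodeSelector(nodeLabels, nodeSelector map[string]string) bool {
	for k, v := range nodeSelector {
		if nodeLabels[k] != v {
			return false
		}
	}
	return true
}

func main() {
	// After the restart the node has lost its "zone" label, so a pod that
	// selects on it is rejected until the label is restored.
	node := map[string]string{"kubernetes.io/hostname": "node-1"}
	selector := map[string]string{"zone": "az1"}
	fmt.Println(matchesNodeSelector(node, selector)) // false
}
```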

TCP connection explosion was observed after the upgrade: each node opened ~10 long‑lived connections to the API server instead of the single connection seen in v1.10. The root cause was a change in client‑go’s transport caching logic when TLS config contained custom Dial or Proxy fields, preventing reuse of existing connections. The problematic code is shown below:

// client‑go transport cache logic
func tlsConfigKey(c *Config) (tlsCacheKey, bool, error) {
    if c.TLS.GetCert != nil || c.Dial != nil || c.Proxy != nil {
        // cannot determine equality for functions
        return tlsCacheKey{}, false, nil
    }
    ...
}

func (c *tlsTransportCache) get(config *Config) (http.RoundTripper, error) {
    key, canCache, err := tlsConfigKey(config)
    if err != nil {
        return nil, err
    }
    if canCache {
        c.mu.Lock()
        defer c.mu.Unlock()
        if t, ok := c.transports[key]; ok {
            return t, nil
        }
    }
    ...
}

func updateDialer(clientConfig *restclient.Config) (func(), error) {
    if clientConfig.Transport != nil || clientConfig.Dial != nil {
        return nil, fmt.Errorf("there is already a transport or dialer configured")
    }
    d := connrotation.NewDialer((&net.Dialer{Timeout: 30 * time.Second, KeepAlive: 30 * time.Second}).DialContext)
    clientConfig.Dial = d
    return d.CloseAll, nil
}

The team back-ported a fix to v1.17 and upgraded Go to 1.15.15, which eliminated the extra connections.

Upgrade procedure (in‑place binary upgrade):

1. Back up the cluster (binaries, configuration, etcd).
2. Gray-scale upgrade a subset of nodes to validate the new binaries and configs.
3. Distribute the new binaries to all nodes.
4. Stop the controllers, the scheduler, and alerting.
5. Update control-plane component configs and restart the components.
6. Update compute-node component configs and restart the components.
7. Label nodes to add the required tolerations.
8. Restart the controllers and scheduler, and re-enable alerting.
9. Perform post-upgrade health checks.

During the upgrade, limit concurrent node upgrades to avoid overwhelming the API‑server load balancer, which could cause nodes to flip between Ready and NotReady states.

Conclusion: after addressing the key compatibility, container-restart, pod-eviction, and TCP-connection issues, the team upgraded the 1,000-node cluster from v1.10 to v1.17 at roughly ten minutes per batch of nodes, improving stability, scalability, and compatibility with CNCF projects, and leaving a repeatable framework for future version jumps.

Written by

Sohu Tech Products

A knowledge-sharing platform for Sohu's technology products. As a leading Chinese internet brand with media, video, search, and gaming services and over 700 million users, Sohu continuously drives tech innovation and practice. We’ll share practical insights and tech news here.
