
Improving Server Reliability by Reducing Memory Faults: Alibaba's Memory Fault Isolation Enhancements

The article explains how Alibaba's infrastructure team tackles unexpected server outages caused by memory hardware failures by enhancing memory fault isolation, using AI‑driven prediction, hardware‑level segregation, and improved diagnostics to boost overall system stability and reduce downtime.

Alibaba Cloud Infrastructure

"Stability and cost‑performance are the core competitive advantages of cloud services. Stability is the foundation; without it, a cloud skyscraper can become a dangerous building in an instant. Although cloud‑native architectures can better tolerate single‑point failures, standards must not be lowered because even a single‑point fault can damage reputation and cause system‑wide outages," said Alibaba researcher Yu Feng at the 2018 Hangzhou Cloud Expo.

To improve system stability, Alibaba's Infrastructure Server System Innovation team, together with Alibaba Cloud business units, addressed unexpected outages caused by server hardware issues.

Why start improving overall reliability from memory?

1) Memory, a core component of the von Neumann architecture, has evolved faster than CPUs. It holds the data the processor works on directly, so a memory failure can halt the CPU and crash the system.

2) From an engineering perspective, memory density has increased dramatically (from hundreds of MB to several hundred GB per DIMM), voltage has dropped, and frequency has risen. These adverse factors offset semiconductor reliability gains, leading to higher memory failure rates.

3) CPU advancements have increased supported memory capacity and channel count (e.g., from two to six channels), raising total capacity, frequency, and the number of DIMMs, which in turn raises the probability of memory‑related failures. Statistics show that memory‑induced outages constitute a large proportion of total downtime.

Therefore, mitigating hardware‑induced unexpected outages begins with reducing memory failure rates.
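As a rough illustration of point 3, suppose each DIMM fails independently with the same probability p over some period. Both the independence and the value of p are assumptions made here for illustration, not figures from the article. The chance that a server sees at least one DIMM failure is then 1 - (1 - p)^n, which grows with the DIMM count n:

```python
def prob_any_dimm_failure(p_single: float, n_dimms: int) -> float:
    """Probability that at least one of n independent DIMMs fails.

    Assumes identical, independent failure probabilities, which is a
    simplification; real DIMM failures can be correlated.
    """
    if not 0.0 <= p_single <= 1.0:
        raise ValueError("p_single must be in [0, 1]")
    if n_dimms < 0:
        raise ValueError("n_dimms must be non-negative")
    return 1.0 - (1.0 - p_single) ** n_dimms


if __name__ == "__main__":
    p = 0.01  # hypothetical per-DIMM failure probability, not measured data
    for n in (4, 12, 24):
        print(f"{n:2d} DIMMs -> {prob_any_dimm_failure(p, n):.4f}")
```

Under these assumptions the per-server risk rises almost linearly with DIMM count while p is small, which is why adding channels and DIMMs raises the share of memory-related outages even if each module is no less reliable.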

Analysis of typical product‑line hardware outage data revealed that Uncorrectable Errors (UCE) account for more than half of memory‑related outages, so the team focused on lowering memory UCE occurrences. Their current work includes:

Enhanced memory fault isolation to prevent degradation.

Graded handling of memory errors.

Server fault diagnosis system.

Future initiatives will address:

Resource‑level / hardware‑level fault isolation.

Memory fault prediction.

HDD & SSD fault prediction.

Why enhance isolation of correctable memory errors (CE)?

Modern memory modules already support ECC, which automatically corrects single‑bit errors without affecting the system. However, the presence of correctable errors (CE) increases the likelihood of subsequent uncorrectable errors (UE, the same class of error called UCE above). The Google paper "DRAM Errors in the Wild: A Large‑Scale Field Study" shows that a DIMM that sees CEs in a given month has a higher probability of a UE in the same or the following month.

When a memory module experiences CEs, the chance of both CE and UE rises. CE handling consumes OS resources, while UE on current Intel Xeon processors can cause immediate crashes. Enhanced isolation processes memory units that have shown errors, removing their risk to the system and improving reliability beyond what standard ECC provides.

Linux already offers a complete memory‑fault management mechanism based on Machine Check Exceptions (MCE) and the mcelog service.

[Figure: MCE isolation actions]
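One isolation action the Linux kernel exposes is page soft-offlining: writing a physical address to /sys/devices/system/memory/soft_offline_page asks the kernel to migrate data off the page that contains it and stop using that page. The sketch below is a minimal illustration of that interface, not the article's diagram or Alibaba's tooling. It assumes 4 KiB pages, a kernel built with memory-failure support, and root privileges; the helper names are hypothetical, and it defaults to a dry run so nothing is written unless asked.

```python
PAGE_SIZE = 4096  # assumption: 4 KiB base pages
SOFT_OFFLINE_PATH = "/sys/devices/system/memory/soft_offline_page"


def page_base(phys_addr: int, page_size: int = PAGE_SIZE) -> int:
    """Round a physical address down to the start of its page."""
    if phys_addr < 0:
        raise ValueError("physical address must be non-negative")
    return phys_addr & ~(page_size - 1)


def soft_offline(phys_addr: int,
                 path: str = SOFT_OFFLINE_PATH,
                 dry_run: bool = True) -> str:
    """Request soft-offlining of the page containing phys_addr.

    Returns the hex string that is (or, in a dry run, would be) written.
    The real write needs root and fails if the kernel lacks the interface.
    """
    payload = hex(page_base(phys_addr))
    if not dry_run:
        with open(path, "w") as f:
            f.write(payload)
    return payload


if __name__ == "__main__":
    # Dry run only: shows what would be written for a made-up address.
    print(soft_offline(0x12345678))
```

Keeping the write behind a dry_run flag is a deliberate choice here: offlining a page is not easily undone on a running system, so a caller should opt in explicitly.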

Although Linux scripts and interfaces mitigate many memory faults, they cannot handle severe issues that cause the OS to crash before detection. These are a major source of server downtime.

To overcome this, Alibaba's Infrastructure Server Innovation team created a novel mechanism that combines in‑band and out‑of‑band techniques, building on the OS to deliver enhanced memory fault isolation:

— When minor memory faults are detected, the system correlates the physical locations of erroneous cells, tracks fault frequency and rate of change, and applies a leaky‑bucket algorithm together with an AI self‑learning model to assess whether the fault may worsen, deciding if isolation is needed.

— Coordination with Alibaba’s kernel and firmware ensures isolation succeeds without impacting OS or business software.

— Early processing of faulty memory pages stabilizes services, reduces downtime, and has contributed to a 25% improvement in outage rates for Alibaba’s servers.
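The leaky-bucket part of the decision described above can be sketched as follows. This covers only the rate-tracking step, not the AI self-learning model, and it is not Alibaba's implementation: the class names, the capacity, and the 24-hour drain window are hypothetical values chosen for illustration. Each page gets a bucket that fills by one per correctable error and drains at a constant rate, so a burst of errors overflows it while the same number of errors spread over time does not.

```python
from dataclasses import dataclass


@dataclass
class LeakyBucket:
    capacity: float          # overflow threshold (hypothetical)
    leak_per_second: float   # constant drain rate
    level: float = 0.0
    last_ts: float = 0.0

    def add(self, ts: float, amount: float = 1.0) -> bool:
        """Drain for the elapsed time, add `amount`, report overflow."""
        elapsed = max(0.0, ts - self.last_ts)
        self.level = max(0.0, self.level - elapsed * self.leak_per_second)
        self.last_ts = ts
        self.level += amount
        return self.level > self.capacity


class PageErrorTracker:
    """Tracks correctable errors per page with one bucket per page."""

    def __init__(self, capacity: float = 10.0,
                 window_seconds: float = 86400.0):
        self.capacity = capacity
        # A full bucket drains completely over one window.
        self.leak = capacity / window_seconds
        self.buckets: dict[int, LeakyBucket] = {}

    def record_ce(self, page_addr: int, ts: float) -> bool:
        """Record one CE; True means the page is an isolation candidate."""
        bucket = self.buckets.get(page_addr)
        if bucket is None:
            bucket = LeakyBucket(self.capacity, self.leak, last_ts=ts)
            self.buckets[page_addr] = bucket
        return bucket.add(ts)


if __name__ == "__main__":
    tracker = PageErrorTracker(capacity=3, window_seconds=100)
    for ts in (0, 1, 2, 3):
        print(ts, tracker.record_ce(0x1000, ts))
```

In a fuller system the overflow signal would be one input among several (physical location, rate of change, model output) rather than the sole trigger for isolation.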

Summary

Addressing the shortcomings of the Linux mcelog service, the Infrastructure Server Innovation team implemented the above enhancements. By effectively isolating faulty memory pages, they prevent the escalation to uncorrectable errors, thereby lowering outage rates and improving system stability.

Future work will map fault addresses to specific rows, columns, banks, and ranks on DIMMs for finer‑grained isolation and will use predictive algorithms to proactively isolate neighboring cells.
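To make the idea of finer-grained mapping concrete, here is a minimal sketch that splits a DIMM-relative address into rank, bank, row, and column and then groups errors by row. The bit layout is entirely hypothetical: real address mappings depend on the memory controller, interleaving settings, and DIMM geometry, and the article does not describe the one Alibaba uses.

```python
from collections import Counter
from typing import Iterable, NamedTuple


class DramLocation(NamedTuple):
    rank: int
    bank: int
    row: int
    column: int


# Hypothetical layout from least to most significant bits: (field, width).
LAYOUT = (("column", 10), ("bank", 4), ("row", 16), ("rank", 1))


def decode(addr: int) -> DramLocation:
    """Split an address into DRAM coordinates using LAYOUT."""
    fields = {}
    for name, width in LAYOUT:
        fields[name] = addr & ((1 << width) - 1)
        addr >>= width
    return DramLocation(**fields)


def rows_with_repeated_errors(addrs: Iterable[int],
                              min_errors: int = 2) -> set:
    """Return (rank, bank, row) keys hit by at least min_errors errors."""
    counts = Counter(
        (loc.rank, loc.bank, loc.row) for loc in map(decode, addrs)
    )
    return {key for key, n in counts.items() if n >= min_errors}


if __name__ == "__main__":
    a = (1 << 30) | (5 << 14) | (3 << 10) | 7
    print(decode(a))
    print(rows_with_repeated_errors([a, a + 1, 0]))
```

Grouping by row rather than by page is what would let an isolation policy act on cells that share a physical structure with a failing one, which is the proactive neighbor isolation the paragraph above describes.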

The core technology has been patented.

