
LIBRA and CARE: Memory Bandwidth Management and Fault‑Tolerance Innovations Presented at HPCA 2021

This article reviews two HPCA 2021 papers from Alibaba Cloud: LIBRA, a dynamic memory-bandwidth management framework that boosts data-center utilization, and CARE, a cache-based fault-tolerance architecture that delivers near-Chipkill reliability with minimal overhead. It also highlights future research directions in ML systems, quantum computing, and cache computing.

Alibaba Cloud Infrastructure

HPCA (the IEEE International Symposium on High-Performance Computer Architecture) is one of the most prestigious conferences in computer architecture. Two papers by Alibaba Cloud infrastructure experts were presented at HPCA 2021; they focus on data-center resource utilization and reliability, respectively.

LIBRA addresses the challenge of allocating shared memory bandwidth among heterogeneous data-center workloads. The bandwidth controls built into existing server chips are inflexible and slow to respond, which leads to wasted resources. LIBRA introduces a Dynamic-Rate-Control (DRC) technique that dynamically throttles low-priority jobs, which improves the performance of high-priority workloads, increases server utilization, and reduces total cost of ownership (TCO).
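The idea of throttling low-priority jobs under bandwidth contention can be illustrated with a minimal feedback loop. The sketch below is not LIBRA's DRC algorithm, which this article does not describe in detail. It is a generic additive-increase, multiplicative-decrease (AIMD) controller, and every name and number in it (`Throttle`, `saturation_bw`, the step sizes, the GB/s figures) is a hypothetical chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Throttle:
    """Hypothetical cap on low-priority memory traffic, in GB/s."""
    limit: float             # current cap on low-priority bandwidth
    min_limit: float = 1.0   # never starve low-priority jobs completely
    max_limit: float = 100.0
    step_up: float = 2.0     # additive increase when there is headroom
    backoff: float = 0.5     # multiplicative decrease under contention

    def update(self, total_bw: float, saturation_bw: float) -> float:
        """Adjust the cap from one bandwidth sample and return the new cap."""
        if total_bw >= saturation_bw:
            # Memory is contended: cut the low-priority cap quickly.
            self.limit = max(self.min_limit, self.limit * self.backoff)
        else:
            # Headroom available: give bandwidth back gradually.
            self.limit = min(self.max_limit, self.limit + self.step_up)
        return self.limit


def simulate(samples, saturation_bw=80.0, start_limit=40.0):
    """Feed a series of measured total-bandwidth samples to the controller."""
    throttle = Throttle(limit=start_limit)
    return [throttle.update(bw, saturation_bw) for bw in samples]


if __name__ == "__main__":
    # Two contended samples followed by two samples with headroom.
    print(simulate([90.0, 90.0, 50.0, 50.0]))
```

Cutting the cap multiplicatively and restoring it additively is one common way to react fast to contention while recovering cautiously; a real controller would need thresholds tuned against measured hardware behavior.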

CARE (Coordinated Augmentation for Elastic Resilience on DRAM Errors) tackles the growing impact of memory errors as compute density and DRAM capacity rise. Traditional ECC solutions impose large performance, power, or capacity penalties, or require extensive system changes. CARE adds a cache‑like structure inside the memory controller to collect error statistics and perform proactive correction, achieving reliability close to Chipkill without sacrificing memory capacity and with negligible performance overhead.
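To make the notion of a cache-like structure that collects error statistics more concrete, here is a small software sketch. It is an assumption-laden illustration and not CARE's hardware design: the article does not specify the table organization, the replacement policy, or the trigger for proactive correction. The `ErrorTracker` class, its LRU eviction, and the `threshold` parameter are all hypothetical.

```python
from collections import OrderedDict


class ErrorTracker:
    """Hypothetical fixed-size table of per-row corrected-error counts."""

    def __init__(self, capacity=4, threshold=3):
        self.capacity = capacity      # maximum number of tracked rows
        self.threshold = threshold    # error count that triggers proactive action
        self.counts = OrderedDict()   # row address -> count, oldest entry first

    def record_error(self, row):
        """Record one corrected error on `row`.

        Returns True once the row's count has reached the threshold,
        meaning it is a candidate for proactive correction.
        """
        count = self.counts.pop(row, 0) + 1
        self.counts[row] = count              # re-insert as most recently used
        if len(self.counts) > self.capacity:
            self.counts.popitem(last=False)   # evict the least recently used row
        return count >= self.threshold


if __name__ == "__main__":
    tracker = ErrorTracker(capacity=2, threshold=2)
    for row in (0x10, 0x10, 0x20, 0x30):
        print(hex(row), tracker.record_error(row))
```

Like a cache, the table has a bounded size, so rows that stop producing errors eventually age out and lose their history. That trade-off keeps the structure small but means the counts are approximate.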

The sharing session concluded with the experts’ outlook on future directions in computer architecture: continued momentum for machine‑learning systems and accelerators, rapid advances in quantum computing and its integration into architectural research, and ongoing challenges in cache computing.
