Tag

server reliability


vivo Internet Technology
Jul 20, 2022 · Operations

Applying the EDAC Framework for Memory Error Detection and Prediction in Vivo Servers

The article explains how Vivo’s data centers use the Linux EDAC framework, together with sysfs mapping, driver selection, and APEI error‑injection testing, to monitor correctable and uncorrectable errors per DIMM, enabling early fault prediction, reducing server crashes, and supporting broader RAS strategies.

EDAC · Error Injection · Memory Error Detection
16 min read
Alibaba Cloud Infrastructure
Oct 16, 2018 · Operations

Improving Server Reliability by Reducing Memory Faults: Alibaba's Memory Fault Isolation Enhancements

The article explains how Alibaba's infrastructure team tackles unexpected server outages caused by memory hardware failures by enhancing memory fault isolation, using AI‑driven prediction, hardware‑level segregation, and improved diagnostics to boost overall system stability and reduce downtime.

AI prediction · cloud infrastructure · hardware reliability
11 min read
Efficient Ops
Jan 18, 2017 · Databases

What the Hearthstone Database Rollback Reveals About MySQL Backup Best Practices

This article examines the Hearthstone database rollback incident, covering the timeline, power‑related hardware failures, and backup shortcomings, and offers concrete MySQL backup and server‑monitoring recommendations to prevent similar data loss in game operations.

Data recovery · MySQL · database backup
5 min read
Efficient Ops
Aug 24, 2015 · Operations

Can Cloud Servers Be More Reliable Than Physical Machines? Insights and Strategies

This article examines why cloud virtual machines can achieve lower failure rates than physical servers by leveraging large‑scale operations, kernel optimization, hot‑patching, live migration, and proactive hardware quality management, while highlighting the challenges smaller organizations face.

Virtualization · cloud computing · hardware maintenance
11 min read