Why Durability and Availability Matter: Uncovering the Real Meaning Behind Storage Reliability
This article demystifies reliability by clarifying the difference between durability and availability, exposing common misconceptions about MTBF, analyzing real‑world disk failure data, and presenting a practical formula for calculating the health probability of distributed storage systems.
Fundamental concepts such as reliability are often misunderstood; terms like “durability” and “availability” are frequently confused or misused.
1. Durability vs. Availability
Reliability is a vague notion that actually encompasses two distinct layers: durability and availability. AWS S3, for example, advertises 99.999999999% (eleven nines) durability alongside 99.99% availability per year; the two figures differ by seven orders of magnitude, a distinction many readers overlook.
Durability is the guarantee that data, once written, survives and can eventually be retrieved, even if it is temporarily inaccessible; the term originates in database terminology (the D in ACID). Availability, by contrast, means the data can be accessed right now.
In practice, durability is a prerequisite for availability: data that has been lost can never be served, so durability ≥ availability.
Real‑world systems also involve clusters, disaster recovery, and other factors, but the core idea remains the same.
2. Time Boundaries and Failure Patterns
Availability percentages are usually expressed per year (e.g., 99.9% means less than 8.76 hours of downtime annually). However, hardware lifespans and failure rates vary over time.
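The arithmetic behind these figures is simple: multiply the hours in a year by the permitted unavailability fraction. A minimal sketch (the function name is mine, not from the article):

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours

def max_downtime_hours(availability_pct: float) -> float:
    """Maximum allowed yearly downtime (hours) for a given availability %."""
    return HOURS_PER_YEAR * (1 - availability_pct / 100)

print(max_downtime_hours(99.9))   # 99.9%  -> 8.76 hours per year
print(max_downtime_hours(99.99))  # 99.99% -> roughly 53 minutes per year
```

Each extra nine cuts the downtime budget by a factor of ten, which is why 99.99% is a far stronger promise than 99.9%.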
Mean Time Between Failures (MTBF) is often misleading: manufacturers claim figures in the millions of hours, yet real‑world data shows annual failure rates (AFR) of 3‑8% for both enterprise and desktop disks.
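To see how large the gap is, we can convert a datasheet MTBF into the AFR it implies. This sketch assumes the standard constant-failure-rate (exponential) model that datasheet MTBF figures are based on; the 1.2 million-hour example value is illustrative, not from the article:

```python
import math

HOURS_PER_YEAR = 8760

def afr_from_mtbf(mtbf_hours: float) -> float:
    """AFR implied by an MTBF, assuming a constant (exponential)
    failure rate -- the usual datasheet assumption."""
    return 1 - math.exp(-HOURS_PER_YEAR / mtbf_hours)

# A claimed 1.2 million-hour MTBF implies an AFR of only ~0.73%,
# well below the 3-8% actually observed in the field.
print(f"{afr_from_mtbf(1_200_000):.2%}")
```

The order-of-magnitude mismatch between the implied ~0.7% and the observed 3‑8% is the sense in which MTBF claims are exaggerated.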
Google’s study of tens of thousands of disks revealed a U‑shaped failure distribution: many failures occur within the first three months of deployment and again near the end of the warranty period, with a relatively quiet middle phase.
Environmental stress and workload significantly affect disk lifespan; heavily loaded data centers see disks aging in about two years.
3. Quantifying Reliability
Many online reliability formulas are incorrect. For a simple serial system (e.g., RAID‑0), the system health probability is the product of the health probabilities of each component. For a parallel system (multiple replicas), the system failure probability is the product of the failure probabilities, and health probability is 1 minus that.
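The serial and parallel rules above can be written down directly. A minimal sketch, assuming independent component failures (the example probabilities are illustrative):

```python
from math import prod

def serial_health(probs):
    """Serial system (e.g., RAID-0): healthy only if every component is."""
    return prod(probs)

def parallel_health(probs):
    """Parallel system (replicas): fails only if every copy fails."""
    return 1 - prod(1 - p for p in probs)

# Three disks, each 95% likely to survive the year:
disks = [0.95, 0.95, 0.95]
print(serial_health(disks))    # 0.857375 -- striping multiplies risk
print(parallel_health(disks))  # 0.999875 -- replication multiplies safety
```

The same three disks give wildly different system reliability depending on topology, which is exactly why the serial and parallel formulas must not be confused.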
Consider an n‑node distributed storage system where each node holds m disks, data is protected with k replicas across nodes, and each disk has a yearly health probability p. The overall system health probability H depends on p, n, m, k, and the downgrade window (t + τ), where t is replacement time and τ is rebuild time.
t and τ are measured in days. The formula assumes a uniform failure distribution over the year, ignoring the U‑shaped pattern for simplicity.
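One way to turn these variables into a number is the following simplified model, which is my own illustrative sketch rather than the article's exact formula. It assumes independent disk failures spread uniformly over the year, replicas placed on distinct nodes, and data loss occurring only when the k−1 disks holding the other copies fail within the downgrade window after an initial failure:

```python
def system_health(p: float, n: int, m: int, k: int, t: float, tau: float) -> float:
    """Approximate yearly health probability H of an n-node cluster with
    m disks per node, k replicas, per-disk yearly health probability p,
    replacement time t and rebuild time tau (both in days).
    Illustrative model assuming independent, uniformly distributed failures."""
    f = 1 - p                       # yearly failure probability of one disk
    window = (t + tau) / 365        # downgrade window as a fraction of a year
    q = f * window                  # chance a given disk fails inside the window
    # Loss requires the k-1 disks holding the other replicas to fail
    # before the degraded data is repaired.
    loss_per_failure = q ** (k - 1)
    expected_failures = n * m * f   # expected initial disk failures per year
    return (1 - loss_per_failure) ** expected_failures

# 10 nodes, 12 disks each, 3 replicas, 97% per-disk health,
# 1 day to replace plus 1 day to rebuild:
print(system_health(p=0.97, n=10, m=12, k=3, t=1, tau=1))
```

Even this rough model shows the levers clearly: adding a replica (larger k) helps exponentially, while a slow repair pipeline (larger t + τ) erodes reliability multiplicatively across every failure event.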
Understanding these factors allows engineers to model system reliability more accurately and to design appropriate redundancy and repair strategies.
Efficient Ops
This public account is maintained by Xiaotianguo and friends and regularly publishes widely read original technical articles. We focus on operations transformation and aim to accompany you, and grow with you, throughout your operations career.