Why Durability and Availability Matter: Uncovering the Real Meaning Behind Storage Reliability
This article demystifies reliability by clarifying the difference between durability and availability, exposing common misconceptions about MTBF, analyzing real‑world disk failure data, and presenting a practical formula for calculating the health probability of distributed storage systems.
Fundamental concepts such as reliability are often misunderstood; terms like “durability” and “availability” are frequently confused or misused.
1. Durability vs. Availability
Reliability is a vague notion that actually encompasses two distinct layers: durability and availability. AWS S3, for example, advertises 99.999999999% (eleven nines) durability alongside 99.99% availability per year; the two figures differ by seven orders of magnitude, a distinction many readers overlook.
Durability is the guarantee that data, once written, survives and can eventually be retrieved, even if it is temporarily inaccessible; the term originates in database terminology (the D in ACID). Availability, by contrast, means the data can be accessed right now.
In practice, durability is a prerequisite for availability: data that has been lost can never be served, so durability ≥ availability.
Real‑world systems also involve clusters, disaster recovery, and other factors, but the core idea remains the same.
2. Time Boundaries and Failure Patterns
Availability percentages are usually expressed per year (e.g., 99.9% means less than 8.76 hours of downtime annually). However, hardware lifespans and failure rates vary over time.
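The arithmetic behind these figures is simple: multiply the hours in a year by the permitted unavailability fraction. A minimal sketch (the function name is mine, not from the article):

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours

def max_downtime_hours(availability_pct: float) -> float:
    """Maximum allowed yearly downtime (hours) for a given availability %."""
    return HOURS_PER_YEAR * (1 - availability_pct / 100)

print(max_downtime_hours(99.9))   # 99.9%  -> 8.76 hours per year
print(max_downtime_hours(99.99))  # 99.99% -> roughly 53 minutes per year
```

Each extra nine cuts the downtime budget by a factor of ten, which is why 99.99% is a far stronger promise than 99.9%.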
Mean Time Between Failures (MTBF) is often misleading: manufacturers claim figures in the millions of hours, yet real‑world data shows annual failure rates (AFR) of 3‑8% for both enterprise and desktop disks.
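To see how large the gap is, we can convert a datasheet MTBF into the AFR it implies. This sketch assumes the standard constant-failure-rate (exponential) model that datasheet MTBF figures are based on; the 1.2 million-hour example value is illustrative, not from the article:

```python
import math

HOURS_PER_YEAR = 8760

def afr_from_mtbf(mtbf_hours: float) -> float:
    """AFR implied by an MTBF, assuming a constant (exponential)
    failure rate -- the usual datasheet assumption."""
    return 1 - math.exp(-HOURS_PER_YEAR / mtbf_hours)

# A claimed 1.2 million-hour MTBF implies an AFR of only ~0.73%,
# well below the 3-8% actually observed in the field.
print(f"{afr_from_mtbf(1_200_000):.2%}")
```

The order-of-magnitude mismatch between the implied ~0.7% and the observed 3‑8% is the sense in which MTBF claims are exaggerated.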
Google’s study of tens of thousands of disks revealed a U‑shaped failure distribution: many failures occur within the first three months of deployment and again near the end of the warranty period, with a relatively quiet middle phase.
Environmental stress and workload significantly affect disk lifespan; heavily loaded data centers see disks aging in about two years.
3. Quantifying Reliability
Many online reliability formulas are incorrect. For a simple serial system (e.g., RAID‑0), the system health probability is the product of the health probabilities of each component. For a parallel system (multiple replicas), the system failure probability is the product of the failure probabilities, and health probability is 1 minus that.
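The serial and parallel rules above can be written down directly. A minimal sketch, assuming independent component failures (the example probabilities are illustrative):

```python
from math import prod

def serial_health(probs):
    """Serial system (e.g., RAID-0): healthy only if every component is."""
    return prod(probs)

def parallel_health(probs):
    """Parallel system (replicas): fails only if every copy fails."""
    return 1 - prod(1 - p for p in probs)

# Three disks, each 95% likely to survive the year:
disks = [0.95, 0.95, 0.95]
print(serial_health(disks))    # 0.857375 -- striping multiplies risk
print(parallel_health(disks))  # 0.999875 -- replication multiplies safety
```

The same three disks give wildly different system reliability depending on topology, which is exactly why the serial and parallel formulas must not be confused.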
Consider an n‑node distributed storage system where each node holds m disks, data is protected with k replicas across nodes, and each disk has a yearly health probability p. The overall system health probability H depends on p, n, m, k, and the downgrade window (t + τ), where t is replacement time and τ is rebuild time.
t and τ are measured in days. The formula assumes a uniform failure distribution over the year, ignoring the U‑shaped pattern for simplicity.
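One way to turn these variables into a number is the following simplified model, which is my own illustrative sketch rather than the article's exact formula. It assumes independent disk failures spread uniformly over the year, replicas placed on distinct nodes, and data loss occurring only when the k−1 disks holding the other copies fail within the downgrade window after an initial failure:

```python
def system_health(p: float, n: int, m: int, k: int, t: float, tau: float) -> float:
    """Approximate yearly health probability H of an n-node cluster with
    m disks per node, k replicas, per-disk yearly health probability p,
    replacement time t and rebuild time tau (both in days).
    Illustrative model assuming independent, uniformly distributed failures."""
    f = 1 - p                       # yearly failure probability of one disk
    window = (t + tau) / 365        # downgrade window as a fraction of a year
    q = f * window                  # chance a given disk fails inside the window
    # Loss requires the k-1 disks holding the other replicas to fail
    # before the degraded data is repaired.
    loss_per_failure = q ** (k - 1)
    expected_failures = n * m * f   # expected initial disk failures per year
    return (1 - loss_per_failure) ** expected_failures

# 10 nodes, 12 disks each, 3 replicas, 97% per-disk health,
# 1 day to replace plus 1 day to rebuild:
print(system_health(p=0.97, n=10, m=12, k=3, t=1, tau=1))
```

Even this rough model shows the levers clearly: adding a replica (larger k) helps exponentially, while a slow repair pipeline (larger t + τ) erodes reliability multiplicatively across every failure event.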
Understanding these factors allows engineers to model system reliability more accurately and to design appropriate redundancy and repair strategies.
Efficient Ops
This public account is maintained by Xiaotianguo and friends and regularly publishes widely read original technical articles. We focus on operations transformation and aim to accompany you, and grow with you, throughout your operations career.