
Quantifying Internet Service Availability: Classic Metrics and the New User‑Uptime Indicator

The article reviews classic availability metrics such as Success‑Ratio, Incident‑Ratio, MTTR/MTTF, Error‑Budget, and SLA/SLO, then introduces User‑Uptime—a per‑user success time proportion that ignores long idle periods—and its windowed variant, showing how it complements existing indicators for more user‑centric reliability insight.

NetEase Yanxuan Technology Product Team

There are many ways to quantify the availability of Internet products. Traditional metrics include Success‑Ratio (count‑based), Incident‑Ratio (uptime‑minute based), MTTR/MTTF (failure‑time based), Error‑Budget (goal‑based), SLA/SLO (threshold‑based), and simple Up/Down status. A newer metric, User‑Uptime (and its derived Windowed User‑Uptime), has been proposed to address shortcomings of these existing indicators.

1. Improving vs. Quantifying Availability

Teams typically focus on two activities: improving system stability (optimising code, architecture, and infrastructure) and quantifying stability (collecting success rates, latency, etc.) to answer questions such as “Is the product down?” or “How long was it down this month?”

2. Classic Quantitative Metrics

2.1 Success‑Ratio – defined as the number of successful requests divided by total requests over a period. It is easy to compute but suffers from ambiguity (different users perceive the same ratio differently) and can be misleading when high‑frequency users dominate the denominator.
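The skew described above is easy to demonstrate. Below is a minimal sketch (function and data names are illustrative, not from the article): one abusive high‑frequency client drags the success ratio far below what typical users actually experience.

```python
def success_ratio(requests):
    """requests: list of (user_id, ok) pairs; ok is True for a success."""
    total = len(requests)
    good = sum(1 for _, ok in requests if ok)
    return good / total if total else 1.0

# Nine normal users each make one successful request, while a single
# scripted client issues 91 requests that all fail.
reqs = [(u, True) for u in range(9)] + [("bot", False)] * 91
print(success_ratio(reqs))  # 0.09 -- yet 9 of 10 users saw no error at all
```

This is the dominated-denominator problem: the ratio is weighted by request volume, not by users.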

2.2 Incident‑Ratio – the proportion of up minutes to total minutes. It is intuitive but binary (up/down) and does not capture partial failures in large distributed systems.
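As a quick sketch of the arithmetic (the helper name is mine, not the article's), the ratio is simply up minutes over total minutes, which forces every minute into a binary up/down state:

```python
def incident_ratio(up_minutes, total_minutes):
    """Fraction of the period during which the service counted as 'up'."""
    return up_minutes / total_minutes

# One 45-minute outage in a 30-day month:
total = 30 * 24 * 60                       # 43,200 minutes
print(incident_ratio(total - 45, total))   # roughly 0.9990
```

A minute in which 10% of requests fail must be counted as either fully up or fully down, which is exactly the partial‑failure blind spot noted above.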

2.3 MTTR/MTTF/MTBF – mean time to recovery, mean time to failure, and mean time between failures. These provide a macro view of system health but share the binary‑state limitation and are too coarse to characterise individual incidents.
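These three means can be derived from a simple outage log. The sketch below assumes outages are recorded as non‑overlapping (start, end) intervals; the function name and log format are illustrative:

```python
def failure_metrics(outages, period_start, period_end):
    """outages: sorted, non-overlapping (start, end) downtime intervals,
    in minutes. Returns (MTTR, MTTF, MTBF) in minutes."""
    n = len(outages)
    repair_time = sum(end - start for start, end in outages)
    up_time = (period_end - period_start) - repair_time
    mttr = repair_time / n   # mean time to recovery
    mttf = up_time / n       # mean operating time before a failure
    mtbf = mttf + mttr       # mean time between failures
    return mttr, mttf, mtbf

# Two outages in a 1,000-minute period: 10 minutes and 30 minutes long.
print(failure_metrics([(100, 110), (500, 530)], 0, 1000))  # (20.0, 480.0, 500.0)
```

Note how the result says nothing about whether each outage affected 1% or 100% of users, which is the coarseness the article points out.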

2.4 Error‑Budget – a pre‑allocated amount of allowable downtime (e.g., 50 minutes per month). It is actionable but suffers from coarse granularity.
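The budget itself is just the complement of the availability target over a period. A minimal sketch (helper name is mine):

```python
def error_budget_minutes(slo, period_minutes):
    """Allowed downtime for a given availability SLO over a period."""
    return (1 - slo) * period_minutes

month = 30 * 24 * 60  # 43,200 minutes
print(round(error_budget_minutes(0.999, month), 1))   # 43.2 min at "three nines"
print(round(error_budget_minutes(0.9999, month), 1))  # 4.3 min at "four nines"
```

The coarse granularity is visible here: the budget is a single monthly number, with no notion of how the downtime was distributed or whom it affected.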

2.5 SLA/SLO/SLI – contractual availability commitments (SLA), the internal objectives behind them (SLO), and the specific indicators (SLI) used to measure them (latency, traffic, errors, saturation). Widely accepted but complex to set and maintain.

3. Introducing User‑Uptime

The Google G Suite team sought a metric that is meaningful to users, proportional to user experience, actionable, and does not require manually set thresholds. User‑Uptime is defined as the total time each user experiences successful service divided by the total time each user is active (successful + failed). The formula aggregates per‑user uptime over all users and divides by the aggregated active time.

3.1 Definition

For each user, compute the sum of intervals where the user’s requests succeed (uptime) and where they fail (downtime). Sum these values across all users to obtain the numerator and denominator, then take the ratio.
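The aggregation can be sketched as follows, assuming each user's uptime and downtime minutes have already been computed (with idle periods excluded, as described in the next subsection); the data shape is illustrative:

```python
def user_uptime(per_user):
    """per_user: dict mapping user -> (uptime_min, downtime_min),
    with inactive periods already excluded.
    Returns sum of uptimes / sum of active time over all users."""
    total_up = sum(up for up, _ in per_user.values())
    total_active = sum(up + down for up, down in per_user.values())
    return total_up / total_active if total_active else 1.0

users = {"alice": (58, 2), "bob": (30, 0), "carol": (10, 10)}
print(user_uptime(users))  # 98 / 110, about 0.891
```

Because each user contributes active time rather than request count, a single high‑frequency client cannot dominate the denominator the way it does in Success‑Ratio.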

3.2 Practical Challenges

• Duration Determination : When a successful request is followed by a failure (or vice versa), the interval between them must be classified as uptime or downtime. The chosen approach attributes each interval to the outcome of the request that opens it: the period after a successful request counts as uptime until the next failure, and the period after a failed request counts as downtime until the next success.

• Inactive Periods : Long gaps between a user’s requests (e.g., the user is offline) should be ignored. Google introduced a “cutoff” time – the 99th percentile of inter‑arrival times (≈30 minutes for Gmail) – to exclude intervals longer than this from the calculation.
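The two rules above can be sketched together for a single user's event stream. This is a minimal illustration, not Google's implementation; the event format and function name are assumptions:

```python
def label_intervals(events, cutoff):
    """events: time-sorted (timestamp_min, ok) pairs for one user.
    The gap after each request is labelled by that request's outcome;
    gaps longer than `cutoff` minutes are treated as inactive and ignored.
    Returns (uptime_min, downtime_min)."""
    up = down = 0.0
    for (t0, ok), (t1, _) in zip(events, events[1:]):
        gap = t1 - t0
        if gap > cutoff:
            continue  # user was inactive; exclude from the metric
        if ok:
            up += gap      # interval after a success counts as uptime
        else:
            down += gap    # interval after a failure counts as downtime
    return up, down

# Success at t=0, failure at t=5, success at t=8, then the user goes
# offline; the next request at t=128 is past the 30-min cutoff.
print(label_intervals([(0, True), (5, False), (8, True), (128, True)], 30))
# (5.0, 3.0)
```

The per‑user pairs produced this way are then summed across all users to form the User‑Uptime ratio defined in section 3.1.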

3.3 Windowed User‑Uptime and Minimal Cumulative Ratio (MCR)

To capture the worst‑case availability over different time scales, the metric is windowed. For a chosen window length (e.g., 1 min, 5 min, 1 h), the past period is divided into equal windows; the user‑uptime ratio is computed for each window, and the minimum value across all windows is reported as the Windowed User‑Uptime (WUU). Plotting WUU against increasing window sizes yields the Minimal Cumulative Ratio (MCR) curve, which shows the most severe availability observed at each granularity.
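The windowing step can be sketched over pre‑aggregated per‑minute data. The function below reports, for one window size, the worst user‑uptime ratio across all windows; sweeping the window size traces the MCR curve. Data layout and names are illustrative assumptions:

```python
def windowed_user_uptime(minute_up, minute_down, window):
    """minute_up / minute_down: per-minute totals of user uptime and
    downtime (e.g. user-seconds). Returns the minimum user-uptime
    ratio over all consecutive windows of `window` minutes."""
    worst = 1.0
    for i in range(0, len(minute_up), window):
        up = sum(minute_up[i:i + window])
        down = sum(minute_down[i:i + window])
        if up + down:
            worst = min(worst, up / (up + down))
    return worst

# Six minutes of data; minute 2 suffered a partial outage.
up = [100, 100, 40, 100, 100, 100]
down = [0, 0, 60, 0, 0, 0]
for w in (1, 2, 3, 6):
    print(w, windowed_user_uptime(up, down, w))
```

As the window grows, the outage is averaged over more healthy minutes, so the reported minimum rises; plotting these minima against window size is exactly the MCR curve showing how severe the worst episode looks at each granularity.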

4. Relationship Between Metrics

Empirical data from Google G Suite (2019) shows that User‑Uptime consistently reports higher availability than Success‑Ratio, especially when a small fraction of high‑frequency or abusive users generate many failures that depress the success‑ratio. User‑Uptime is more robust to such outliers, while Success‑Ratio remains useful for detecting overall error trends.

Both metrics are complementary: Success‑Ratio highlights error spikes, whereas User‑Uptime reveals the actual impact on end‑users. Combining them (along with SLA/SLO, MTTR, etc.) gives operators a richer “weapon‑set” for reliability engineering.

5. Continuous Innovation

The article concludes that availability measurement is an evolving field. While User‑Uptime is not a universal solution, the mindset of questioning existing metrics and inventing new ones is valuable for any reliability team.

Tags: monitoring, metrics, SRE, reliability, availability, user-uptime
Written by

NetEase Yanxuan Technology Product Team

The NetEase Yanxuan Technology Product Team shares practical tech insights for the e‑commerce ecosystem. This official channel periodically publishes technical articles, team events, recruitment information, and more.
