Applying the VALET Model for SRE Transformation at Home Depot (THD)
The article explains how Home Depot (THD) adopted the VALET model—a five‑dimensional SLO language covering Volume, Availability, Latency, Error, and Ticket—to unify communication, automate data collection, and improve reliability across its massive retail and e‑commerce infrastructure.
The source material, originally from Chapter 3 of the English edition of The Site Reliability Workbook, describes how Home Depot (THD), the world's largest home-improvement retailer, used the VALET model during its Site Reliability Engineering (SRE) transformation.
VALET Definition
Volume – traffic capacity
Availability – ability to start the service on demand
Latency – response speed of the service
Error – occurrence of errors during use
Ticket – need for manual intervention to complete a request
By stating an expectation for each of these five dimensions for every service it depends on, a team gains a transparent, consistent view of reliability expectations.
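One way to picture the five dimensions is as a single record of targets per service. The sketch below is illustrative only: the class name, field names, units, and the "checkout" service are assumptions, not part of Home Depot's actual tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ValetTargets:
    """Hypothetical SLO targets for one service, one field per VALET dimension."""
    volume_rps: float          # Volume: traffic the service should sustain (requests/second)
    availability_pct: float    # Availability: percent of time the service is up when needed
    latency_p99_ms: float      # Latency: ceiling on 99th-percentile response time, in ms
    max_error_pct: float       # Error: ceiling on the percent of requests that fail
    max_tickets_per_week: int  # Ticket: ceiling on manual interventions per week


def describe(service: str, t: ValetTargets) -> str:
    """Render the targets as one line that a dependent team can read at a glance."""
    return (
        f"{service}: V>={t.volume_rps:g} rps, A>={t.availability_pct:g}%, "
        f"L<={t.latency_p99_ms:g} ms (p99), E<={t.max_error_pct:g}%, "
        f"T<={t.max_tickets_per_week}/week"
    )


# "checkout" and every number here are made-up example values.
checkout = ValetTargets(500.0, 99.9, 300.0, 0.1, 2)
print(describe("checkout", checkout))
```

Keeping all five targets in one immutable record makes it easy to publish the same statement to every team that depends on the service.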
1. Original State
Home Depot’s monitoring tools and dashboards were fragmented, making root‑cause analysis time‑consuming. Planned outages were often unknown to dependent services, and SLOs (e.g., 99.9%) were set without visibility into whether upstream services could meet stricter targets such as four‑nines.
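The mismatch between a service's SLO and what its dependencies can deliver can be shown with simple arithmetic. The sketch below assumes that failures are independent and that the service needs every dependency to be up; both are simplifying assumptions, and the function names are hypothetical.

```python
import math


def serial_availability(*availabilities: float) -> float:
    """Best-case availability when every listed component must be up.

    Assumes independent failures, so the availabilities multiply.
    """
    return math.prod(availabilities)


def allowed_downtime_minutes(availability: float, days: int = 30) -> float:
    """Minutes of downtime that an availability target permits over `days` days."""
    return (1.0 - availability) * days * 24 * 60


# A service that is itself 99.99% available but calls a 99.9% upstream
# cannot promise more than the product of the two.
combined = serial_availability(0.9999, 0.999)
print(f"combined availability: {combined:.4%}")
print(f"99.9% allows {allowed_downtime_minutes(0.999):.1f} min per 30 days")
print(f"99.99% allows {allowed_downtime_minutes(0.9999):.2f} min per 30 days")
```

Under these assumptions the combined figure falls below 99.9%, which is why a team needs to see its upstream SLOs before committing to its own.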
2. Establishing a Common Language
The company introduced VALET as a unified language to standardize metrics (traffic, latency, errors, utilization) and to use support-ticket volume as a customer-facing reliability indicator.
3. Automatic VALET Data Collection
Home Depot built a framework called the “TPS Report” that automatically captures VALET data from services deployed in the cloud. Logs are streamed to BigQuery, where they are combined with other monitoring sources (e.g., Stackdriver probes) and transformed into hourly VALET metrics. The data are stored in a Cloud SQL database and can be queried, visualized, or accessed via a chat‑bot.
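The hourly roll-up step can be sketched in plain Python. This is not the TPS Report implementation, which ran on BigQuery; it is an in-memory illustration that assumes each log record carries an ISO timestamp, a latency in milliseconds, and an HTTP status code, and that a 5xx status counts as an error. Availability (from probes) and Ticket counts come from other sources and are left out.

```python
import math
from collections import defaultdict
from datetime import datetime


def hourly_valet(records):
    """Aggregate (iso_timestamp, latency_ms, http_status) tuples per hour.

    Returns {hour_iso: {"volume", "latency_p99_ms", "error_pct"}}.
    The record layout is an assumption made for this sketch.
    """
    buckets = defaultdict(list)
    for ts, latency_ms, status in records:
        # Truncate the timestamp to the start of its hour.
        hour = datetime.fromisoformat(ts).replace(minute=0, second=0, microsecond=0)
        buckets[hour].append((latency_ms, status))

    result = {}
    for hour, rows in sorted(buckets.items()):
        latencies = sorted(latency for latency, _ in rows)
        rank = math.ceil(0.99 * len(latencies))  # nearest-rank 99th percentile
        errors = sum(1 for _, status in rows if status >= 500)
        result[hour.isoformat()] = {
            "volume": len(rows),
            "latency_p99_ms": latencies[rank - 1],
            "error_pct": 100.0 * errors / len(rows),
        }
    return result


sample = [
    ("2024-01-01T10:05:00", 120, 200),
    ("2024-01-01T10:45:00", 300, 500),
    ("2024-01-01T11:10:00", 80, 200),
]
for hour, metrics in hourly_valet(sample).items():
    print(hour, metrics)
```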
4. VALET Service
A dedicated VALET application stores and reports SLO data, aggregating alerts from various monitoring platforms for trend analysis. Although alert thresholds are not directly tied to SLOs, the service allows flexible adjustments.
VALET Dashboard
The dashboard visualizes VALET metrics and lets users register new services, set SLO targets for any of the five VALET categories, and add custom metric types (e.g., P99 latency, daily transaction volume). It supports slicing and dicing data, weekly/monthly SLO reviews, and generating operational action items.
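A periodic SLO review amounts to comparing observed values with the registered targets and flagging the misses. The sketch below is a hypothetical illustration, not the dashboard's logic: the dictionary keys and the direction of each comparison are assumptions, and Volume is treated as informational and not checked.

```python
# Direction of each check (an assumption of this sketch):
# True means a higher observed value is better, False means lower is better.
HIGHER_IS_BETTER = {
    "availability": True,
    "latency": False,
    "error": False,
    "ticket": False,
}


def review(targets, observed):
    """Return the VALET dimensions whose observed value misses its target."""
    misses = []
    for dimension, target in targets.items():
        value = observed[dimension]
        if HIGHER_IS_BETTER[dimension]:
            met = value >= target
        else:
            met = value <= target
        if not met:
            misses.append(dimension)
    return misses


# Made-up targets and one week of made-up observations.
targets = {"availability": 99.9, "latency": 300.0, "error": 0.1, "ticket": 2}
observed = {"availability": 99.95, "latency": 420.0, "error": 0.05, "ticket": 3}
print(review(targets, observed))  # ['latency', 'ticket']
```

Each flagged dimension is a natural candidate for an operational action item in the review.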
5. Applying VALET to Batch Processing
The same VALET dimensions are adapted for batch jobs: Volume (records processed), Availability (percentage of jobs completed on time), Latency (job runtime), Error (records that failed), and Ticket (manual interventions required).
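The batch mapping above can be sketched as a roll-up over a set of job runs. The `JobRun` fields and the choice to report the slowest runtime as the latency figure are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class JobRun:
    """One execution of a batch job (hypothetical record layout)."""
    records_processed: int
    records_failed: int
    runtime_minutes: float
    finished_on_time: bool
    manual_interventions: int


def batch_valet(runs):
    """Summarize a non-empty list of job runs along the five VALET dimensions."""
    total = len(runs)
    on_time = sum(1 for run in runs if run.finished_on_time)
    return {
        "volume": sum(run.records_processed for run in runs),
        "availability_pct": 100.0 * on_time / total,
        "latency_max_minutes": max(run.runtime_minutes for run in runs),
        "errors": sum(run.records_failed for run in runs),
        "tickets": sum(run.manual_interventions for run in runs),
    }


# Two made-up nightly runs: one clean, one late with a manual fix.
runs = [
    JobRun(10_000, 0, 42.0, True, 0),
    JobRun(12_000, 30, 75.5, False, 1),
]
print(batch_valet(runs))
```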
6. Communicating SLOs to Product Managers
While engineers find VALET intuitive, translating its metrics into business terms for product managers remains a challenge. Bridging this gap is essential to align expectations and reduce reliability mismatches in large organizations.
7. Next Organizational Challenges
The article concludes by highlighting the need to further integrate VALET concepts across product and engineering teams, ensuring shared visibility of SLOs and fostering a culture of reliability.
Continuous Delivery 2.0
Tech and case studies on organizational management, team management, and engineering efficiency