Tag

SLA

1 view collected around this technical thread.

Xiaokun's Architecture Exploration Notes
Jun 1, 2025 · Operations

Understanding SLA, SLO, and SLI: Key Metrics for High‑Availability Systems

This article explains the differences between SLA, SLO, and SLI, shows how to express user expectations as concrete service level agreements, and introduces essential high‑availability metrics such as availability percentages, MTBF, MTTR, RPO, RTO, WRT, and MTD for reliable system design.

High Availability · SLA · SLI
0 likes · 9 min read
Efficient Ops
May 18, 2025 · Operations

Mastering API Latency: What P90, P95, P99 and SLA Really Mean

This article explains key performance metrics such as API latency, SLA commitments, and percentile indicators P90, P95, and P99, illustrating how to calculate and interpret these values along with average and maximum latency to improve system reliability and user experience.

Performance Monitoring · SLA · api latency
0 likes · 5 min read
DaTaobao Tech
Dec 25, 2024 · Operations

Fundamentals of Service Level Agreements (SLA) for Messaging Middleware

The article explains SLA fundamentals for messaging middleware, defining contracts, SLI/SLO relationships, key metrics such as availability, latency and error‑rate, dynamic lifecycle processes, template components, error‑budget calculations, industry benchmarks, internal monitoring practices, a sample SLA draft, and best‑practice recommendations for continuous improvement.

Messaging Middleware · Reliability · SLA
0 likes · 41 min read
Efficient Ops
Dec 16, 2024 · Operations

How to Justify and Charge for IT Operations When the System Seems Too Stable

When a client claims that a highly stable system leaves no work for the operations team, this article shares practical strategies—such as creating visible workload metrics, using SLA‑based reporting, introducing controlled incidents, and adding value‑added services—to demonstrate the necessity of maintenance fees.

IT consulting · SLA · maintenance pricing
0 likes · 7 min read
DevOps
Oct 26, 2023 · Operations

Design and Implementation of SLA for Object Storage Services

This article explains how to design SLA metrics for object storage services, describes the S3 protocol, proposes availability calculations, outlines monitoring and alerting rules, and provides practical implementation examples using s3cmd, Python boto, and Java SDK to ensure reliable cloud storage operations.

DevOps · SLA · cloud
0 likes · 16 min read
Architects Research Society
Sep 7, 2023 · Fundamentals

Chapter 1: Foundations of Enterprise Architecture

This article introduces the fundamentals of enterprise architecture, defining its scope, reference models, and maturity stages, and explains how architects manage complexity, control costs, ensure SLA compliance, and apply iterative, partitioning, and simplification techniques to modernize enterprise IT systems.

Cost Optimization · IT governance · Modernization
0 likes · 14 min read
ByteDance Data Platform
Aug 30, 2023 · Big Data

How We Cut Offline Data Warehouse SLA Delay from 13 Days to Zero with DataLeap

The article details how the "Xingfu Li" real‑estate platform tackled a 13‑day offline data‑warehouse SLA delay by adopting Volcano Engine's DataLeap suite, outlining the challenges, the three‑step governance process, and the measurable improvements achieved across task coverage, alert reduction, and data stability.

Big Data · Data Warehouse · DataLeap
0 likes · 10 min read
DeWu Technology
Feb 27, 2023 · Operations

Message Push Monitoring and SLA Practices

The team implemented SLA‑based, node‑level monitoring for mobile push messages—splitting the workflow, measuring latency, blocking volume, and success rates, isolating metrics with Spring AOP, and tracking third‑party vendors—resulting in clear latency standards, doubled peak throughput, faster issue resolution, and improved overall reliability.

SLA · backend · message-push
0 likes · 11 min read
ByteDance Data Platform
Aug 24, 2022 · Big Data

How ByteDance Guarantees Real‑Time Data Point Quality with Scalable Validation

This article explains ByteDance's end‑to‑end data‑point (event tracking) validation system, covering its technical challenges (usability, accuracy, real‑time visibility, stability, and extensibility) along with SDK integration, QR‑code workflow, JSON‑Schema verification, push‑service architecture, SLA metrics, and future automation plans.

Big Data · JSON Schema · Real-time Monitoring
0 likes · 11 min read
DataFunSummit
Aug 11, 2022 · Big Data

Huya Data Platform: Cost Reduction and SLA Strategies

This article presents Huya's big data platform evolution, detailing cost‑saving measures, SLA practices, multi‑datacenter architecture, containerized resources, metadata‑driven intelligence, and future directions such as hybrid‑engine materialized views to improve efficiency and service reliability.

Big Data · Cost Optimization · SLA
0 likes · 15 min read
Architect's Guide
Aug 2, 2022 · Operations

Understanding Service Degradation and Its Practical Strategies

This article explains the concept of service degradation, defines SLA levels, and details various degradation techniques, including fallback data, rate‑limiting, timeout handling, circuit‑breaker retries, and front‑end/back‑end strategies, to maintain high availability during traffic spikes or component failures.

Fallback · Rate Limiting · SLA
0 likes · 13 min read
DeWu Technology
May 16, 2022 · Operations

NOC SLA Implementation for Consumer Trading Platform

To tackle growing production complexity and past incident delays, the consumer trading platform introduced a three‑tier NOC‑SLA with intelligent baselines powered by Facebook Prophet, streamlined alert rules, and an SOS‑linked workflow. The changes boosted detection frequency, cut critical response times to under five minutes, and improved overall system reliability. The article also emphasizes ongoing baseline and rule maintenance.

NOC · SLA · alert management
0 likes · 13 min read
ByteDance Data Platform
May 16, 2022 · Operations

How ByteDance’s SLA Assurance Platform Guarantees Data Reliability at Scale

This article explains how ByteDance’s self‑built SLA assurance platform addresses data pipeline communication costs, unclear responsibilities, and operational pressure by introducing roles, a streamlined signing workflow, checkpoint and recommendation calculations, and real‑time monitoring to achieve a 99.1% SLA compliance rate.

Big Data · SLA · Workflow Automation
0 likes · 9 min read
Architecture Digest
May 8, 2022 · Fundamentals

Building Robust Distributed Systems: Reducing Dependencies and Enhancing Resilience

The article explains how to design resilient distributed systems by minimizing inter‑component dependencies, duplicating or denormalizing data, isolating failures with SLAs, protecting callers and callees, and adding buffers such as asynchronous messaging and elastic scaling to handle random faults as systems grow.

Asynchronous Communication · SLA · distributed systems
0 likes · 8 min read
DataFunSummit
Apr 22, 2022 · Big Data

Huya Real-Time Computing SLA Practice: Platform Evolution, Core SLA Definition, Capability Building, and Future Outlook

The talk details Huya’s real‑time computing platform evolution from chaotic early stages to a unified, containerized system, defines core SLA metrics focused on latency compliance, describes capability enhancements such as demand monitoring, task analysis, dynamic scaling, and outlines future goals for usability, stability, openness, and unified stream‑batch processing.

Big Data · Flink · SLA
0 likes · 12 min read
DataFunTalk
Apr 15, 2022 · Big Data

Huya Real-Time Computing SLA Practices: Platform Evolution, Core SLA Definition, Capability Building, and Future Outlook

This article details Huya's real‑time computing platform evolution, core SLA definitions focused on latency compliance, capability enhancements such as demand management, task analysis, dynamic resource scaling, and outlines future directions emphasizing usability, stability, openness, and unified batch‑stream processing.

Big Data · Flink · SLA
0 likes · 13 min read
IT Architects Alliance
Oct 1, 2021 · Operations

Understanding Service Degradation: Definitions, Levels, and Mitigation Strategies

The article explains service degradation concepts, defines SLA levels and the meaning of six nines, and details various degradation techniques such as fallback data, rate‑limiting, timeout, fault handling, read/write strategies, frontend safeguards, and the use of switches and pre‑embedding to maintain system availability during traffic spikes or failures.

Fallback · Rate Limiting · SLA
0 likes · 12 min read
Architect
Sep 11, 2021 · Operations

Understanding Service Degradation and Its Practical Strategies

This article explains the concept of service degradation, its relationship with rate limiting and SLA, and presents various practical mitigation techniques such as fallback data, rate‑limit throttling, timeout handling, fault isolation, retry mechanisms, feature switches, read/write degradation, and front‑end strategies to maintain high availability during traffic spikes or component failures.

Fallback · High Availability · Rate Limiting
0 likes · 13 min read