Tag

service reliability

Baidu Tech Salon
Feb 20, 2025 · Backend Development

Avalanche Prevention Architecture in Baidu Netdisk: Practices and Solutions

Baidu Netdisk engineers protect their billion‑user service from cascading failures by deploying dynamic circuit‑breaker overload control, priority‑based traffic isolation, request‑validity filtering, socket‑level disconnect detection, and unified timestamp handling. Together, these measures reduce avalanche incidents and raise overall availability.

avalanche prevention · backend architecture · circuit breaker
17 min read
DevOps
Aug 22, 2024 · Operations

Synthetic Monitoring and Fault Drills: Practices for Ensuring Service Stability

This article explains why service stability is critical, outlines the importance and key factors of synthetic monitoring, provides practical guidelines for implementing it, and then describes fault‑drill concepts, benefits, processes, and common cloud‑native tools to proactively discover and mitigate failures in micro‑service environments.

Cloud Native · DevOps · Fault Injection
11 min read
Rare Earth Juejin Tech Community
Jun 18, 2024 · Backend Development

Graceful Shutdown in Go: Designing Robust Service Termination with the GS Library

This article describes a real‑world incident where rapid pod scaling caused order‑submission failures in a serverless e‑commerce platform, analyzes the root causes, and presents a Go‑based graceful‑shutdown solution—including ASyncClose, SyncClose, and ForceSyncClose modes—implemented in the open‑source GS library to help developers reliably terminate services.

Go · Graceful Shutdown · TerminateSignal
21 min read
Efficient Ops
Apr 23, 2024 · Cloud Computing

Why Do Cloud Outages Keep Happening? Governance Lessons and Strategies

The article examines the rapid growth of China's cloud market, the frequent "cloud collapse" incidents, their root causes in governance failures, and presents practical cloud governance measures along with an overview of the new industry standard for enterprise cloud governance capability maturity.

cloud computing · cloud governance · industry standards
8 min read
Wukong Talks Architecture
Apr 15, 2024 · Operations

Post‑mortem of the April 8 Tencent Cloud API Outage and Improvement Measures

On April 8, a Tencent Cloud API outage caused console login failures for nearly 2,000 customers and affected several dependent services for 87 minutes. The post‑mortem presents the detailed root‑cause analysis and the follow‑up improvements to system resilience and change management.

API · Incident Response · Tencent Cloud
8 min read
NetEase LeiHuo Testing Center
Nov 3, 2023 · Operations

Best Practices for Third‑Party Interface Collaboration: Concepts, Rate Limiting, Monitoring, and Incident Handling

The article outlines how game QA and third‑party providers can improve cooperation by aligning basic performance concepts such as TPS, QPS and concurrency, selecting appropriate rate‑limiting strategies, establishing precise monitoring and alerting, and preparing clear incident‑response and delivery standards.

Monitoring · Rate Limiting · operations
15 min read
Didi Tech
Jul 25, 2023 · Backend Development

Separating Test Traffic Trigger and Result Verification for Didi Ride‑Hailing Backend

By separating test‑traffic triggering from result verification, Didi’s ride‑hailing backend uses live‑traffic inspection and replayed offline tests with bucketed validation rules to achieve near‑zero‑cost, full‑coverage QA, catching hundreds of bugs annually and dramatically improving service reliability for drivers and passengers.

Ride-hailing · backend testing · quality assurance
18 min read
Test Development Learning Exchange
May 25, 2023 · Operations

Online Incident Severity Level Definition Rules

This document defines the online incident severity grading system, outlining fault categories, influencing factors such as business metrics, capital loss, user impact, and public opinion, and presents detailed P0‑P3 grading rules with tables for capital‑based, C‑end, and B‑end user classifications.

fault classification · incident management · operations
8 min read
DaTaobao Tech
May 12, 2023 · Backend Development

Backend Development Journey and Lessons from Alibaba Taobao

Through a five‑year backend journey—from building a solo startup site and mastering Java, to handling high‑traffic services at Sina Weibo, and now developing B2B merchant tools at Alibaba Taobao—the author shares lessons on scalable architecture, automated deployment, aligning tech with business, proactive problem‑solving, code quality, teamwork, and career health.

backend development · career growth · service reliability
9 min read
vivo Internet Technology
Jan 4, 2023 · Artificial Intelligence

Root Cause Localization Algorithm and Its Implementation for Service Fault Diagnosis

The article describes a root‑cause localization algorithm implemented in vivo’s monitoring platform. It automatically analyzes latency spikes by splitting service timelines, computing variance, clustering the results with K‑means, and recursively tracing downstream services. The algorithm achieves over 85% accuracy for dependency failures but still requires human verification; the article also outlines future AI‑driven enhancements.

AIOps · Monitoring · algorithm
13 min read
HaoDF Tech Team
Nov 8, 2021 · Operations

Service Risk Governance: Exploration, Mitigation, and Hands‑On Workshop

This talk recounts how the Good Doctor platform tackled severe online incidents by launching the DOA project, then a service risk governance initiative that identifies, quantifies, and mitigates latency‑related risks through metrics‑driven development, dependency analysis, middleware reliability, and a dedicated risk‑management platform.

SRE · latency optimization · metrics-driven development
16 min read
Efficient Ops
Jul 11, 2021 · Operations

Mastering Incident Management: Principles and Methods for Effective Fault Handling

This guide outlines essential incident management principles—prioritizing business recovery and timely escalation—and presents practical methodologies such as restart, isolation, and downgrade, while detailing user impact handling, organizational roles, and post‑mortem best practices.

escalation · fault handling · incident management
10 min read
Efficient Ops
Sep 9, 2020 · Operations

Mastering Incident Management: Core Principles and Practical Methods

This guide outlines essential incident management principles—prioritizing business restoration and timely escalation—followed by detailed methodologies such as restart, isolation, and degradation, and explains role responsibilities, user impact handling, and post‑incident summarization for continuous improvement.

fault handling · incident management · operations
10 min read
Didi Tech
Jun 3, 2020 · Backend Development

Stability Guidelines and Anti‑Patterns for Backend Services

Drawing on five years of incident reviews, the article defines a comprehensive stability framework for backend services that mandates timeout hierarchies, weak dependencies, service-discovery integration, staged gray releases, robust monitoring, capacity planning, and strict change management. It also catalogues common anti-patterns, such as over-aggressive circuit breaking, static retries, improper timeouts, tight coupling, and insufficient isolation, and urges regular rehearsal of these practices.

Monitoring · backend stability · deployment best practices
21 min read
Qunar Tech Salon
Jan 7, 2020 · Operations

Comprehensive Dependency Governance for High‑Availability Backend Systems

This article outlines a systematic approach to dependency governance in high‑traffic backend services, covering service classification, rate limiting, Dubbo, HTTP, database, and message‑queue management to enhance availability, reduce failure impact, and improve overall system stability.

Dubbo · Rate Limiting · backend architecture
10 min read
Didi Tech
Dec 2, 2019 · Operations

Capacity Estimation Methodology for Growing Services

The article presents a systematic capacity‑estimation methodology that links service traffic to order volume, uses CPU‑Idle as a primary metric, predicts traffic growth and upper‑bound limits, validates predictions with load‑testing, and provides scaling recommendations while noting limitations of the CPU‑Idle baseline.

Performance Monitoring · Scaling · capacity planning
9 min read
JD Retail Technology
Oct 15, 2019 · Operations

Traffic Replication and Replay Platform for JD APP: Design, Features, and Operational Impact

The article describes JD's traffic replication and replay platform, explaining its background, the concepts of traffic copying and replay, the platform architecture and features, the routine load‑testing workflow, dynamic regression testing, operational results, current limitations, and future improvement directions.

Automation · JD platform · load testing
11 min read
Ctrip Technology
Apr 18, 2019 · Operations

Application Monitoring Systems: Necessity, Components, Distributed Tracing, and Design for Developers, Testers, and Operations

The article explains why enterprise application monitoring systems are essential, outlines their core components such as Trace, Log, Metric, and Report, discusses distributed tracing techniques, and describes how such a system is designed to help developers, testers, and operations engineers with performance tuning and fault diagnosis.

Distributed Tracing · Observability · Performance analysis
12 min read
Didi Tech
Jan 7, 2019 · Operations

Data‑Driven Risk Quantification Platform for SRE at Didi

Didi’s data‑driven Risk Quantification Platform assigns numeric Change Credit and Monitoring Health scores to deployments, alerts and core services, turning operational best‑practice adoption into a competitive game. The approach has raised scores, cut incident rates despite higher change volume, and paved the way for broader risk management across the organization.

Monitoring · Risk Quantification · SRE
9 min read