Tag

system reliability

1 view collected around this technical thread.

Xiaokun's Architecture Exploration Notes
Jun 1, 2025 · Operations

Understanding SLA, SLO, and SLI: Key Metrics for High‑Availability Systems

This article explains the differences between SLA, SLO, and SLI, shows how to express user expectations as concrete service level agreements, and introduces essential high‑availability metrics such as availability percentages, MTBF, MTTR, RPO, RTO, WRT, and MTD for reliable system design.

High Availability · SLA · SLI
0 likes · 9 min read
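The availability metrics named in the summary above can be related with a little arithmetic. The following is a minimal sketch, assuming the common steady-state definition availability = MTBF / (MTBF + MTTR); the function names are illustrative and are not taken from the article.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability as a fraction, from MTBF and MTTR."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def allowed_downtime_minutes(availability_pct: float, period_days: float = 365.0) -> float:
    """Downtime budget, in minutes, implied by an availability target over a period."""
    return (1.0 - availability_pct / 100.0) * period_days * 24 * 60


if __name__ == "__main__":
    # A service that fails on average every 999 hours and takes 1 hour to repair.
    print(availability(999, 1))
    # Downtime budget of a "three nines" target over a 30-day month.
    print(allowed_downtime_minutes(99.9, period_days=30))
```

Expressing a target as a downtime budget in minutes tends to make an SLO easier to discuss than a bare percentage.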
Xiaokun's Architecture Exploration Notes
Apr 6, 2025 · Operations

Mastering Performance Testing: Why It Matters and How to Use wrk Effectively

This article explains what performance testing is, why it is essential for reliable systems, outlines practical steps for conducting effective tests, and introduces the wrk benchmarking tool as a lightweight solution for generating realistic load and measuring key performance metrics.

Benchmarking · load testing · operations
0 likes · 2 min read
FunTester
Mar 31, 2025 · Operations

Performance Testing and Fault Testing: Complementary Pillars for System Stability

The article explains how performance testing measures system efficiency under load while fault testing validates resilience under abnormal conditions, highlighting their shared goals, differences, overlapping toolchains, and how their combined use drives architecture optimization and improves service level agreements in modern complex software systems.

fault injection · load testing · operations
0 likes · 14 min read
Efficient Ops
Mar 24, 2025 · Operations

15 Common Ops Mistakes That Can Crash Your System – How to Avoid Them

This article outlines fifteen common operational and management mistakes—such as frequent incidents, excessive new hires, lack of automation, and missing rollback plans—that can trigger system outages, and offers guidance on how teams can strengthen testing, processes, and team capabilities to prevent downtime.

DevOps · SRE · incident management
0 likes · 6 min read
Bilibili Tech
Mar 18, 2025 · Operations

Technical Practices for Ensuring Stability of Bilibili’s 2025 Spring Festival Gala Live Stream

Bilibili’s engineering team built a scenario‑metadata and one‑click fault‑drill platform, implemented multi‑tier degradation, dynamic capacity planning, and extensive automated fault‑injection testing to guarantee zero‑severity incidents during the high‑traffic 2025 Spring Festival Gala live stream.

Live Streaming · capacity planning · fault injection
0 likes · 16 min read
iQIYI Technical Product Team
Mar 13, 2025 · Operations

Automated Load Testing and Circuit Breaker Process for System Stability

To prevent performance degradation as systems scale, the team implemented an automated load‑testing and circuit‑breaker workflow that runs in the release pipeline. It compares real‑time metrics against a baseline of CPU, QPS, memory and latency, blocks releases exceeding a 10% drop, and logs issues. The result was thousands of tests run, dozens of bugs fixed, and up to 90% faster wordlist creation.

DevOps · automation · circuit breaker
0 likes · 6 min read
360 Zhihui Cloud Developer
Dec 4, 2024 · Operations

Service Reliability Essentials: Rate Limiting, Circuit Breaking & Degradation

This article explains common service problems and presents practical solutions such as rate limiting, circuit breaking, and degradation, detailing their principles, implementation methods—including Nginx, token‑bucket, sliding‑window algorithms, and Go‑zero code examples—while highlighting key considerations for building resilient microservice systems.

Microservices · Rate Limiting · circuit breaker
0 likes · 15 min read
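Of the rate-limiting algorithms listed in the summary above, the token bucket is the easiest to show in a few lines. The sketch below is a generic illustration, not the article's Go‑zero or Nginx code; the `TokenBucket` class, its parameters, and the injectable `clock` are assumptions made for the example.

```python
import time
from typing import Callable


class TokenBucket:
    """Token-bucket limiter: tokens refill at `rate` per second, up to `capacity`."""

    def __init__(self, rate: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate
        self.capacity = capacity
        self.clock = clock          # injectable so the limiter can be tested without sleeping
        self.tokens = capacity      # start with a full bucket
        self.last = clock()

    def allow(self, cost: float = 1.0) -> bool:
        """Return True and consume `cost` tokens if enough are available."""
        now = self.clock()
        # Refill for the elapsed time, never exceeding the bucket capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


if __name__ == "__main__":
    bucket = TokenBucket(rate=5, capacity=10)
    accepted = sum(bucket.allow() for _ in range(20))
    print(f"accepted {accepted} of 20 burst requests")
```

The capacity bounds the burst a client may send at once, while the rate bounds the sustained throughput; a sliding-window limiter makes a different trade-off and is not shown here.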
Efficient Ops
Nov 14, 2024 · Operations

How SRE Standards Boost System Reliability in China’s Digital Era

Amid a surge of high‑profile outages, the CAICT introduces a comprehensive SRE framework that addresses large‑scale, high‑frequency changes, complex tech stacks, and massive traffic, outlining development and operational reliability practices, maturity levels, and actionable guidelines to enhance system stability.

Digital Governance · IT Management · SRE
0 likes · 12 min read
Efficient Ops
Oct 29, 2024 · Operations

Master the Four Golden Signals: A Practical Guide to System Monitoring

Understanding system health is essential for reliable services, and this guide explains how to use powerful monitoring tools to collect, visualize, and alert on the four golden signals—latency, traffic, errors, and saturation—across servers, applications, and external dependencies, helping teams detect and resolve issues efficiently.

Metrics · SRE · monitoring
0 likes · 17 min read
Efficient Ops
Oct 19, 2024 · Operations

How Inner Mongolia Mobile Achieved Leading SRE Maturity – Lessons from the DevOps Assessment

The article explores the growing importance of system reliability in China, the national regulations driving SRE adoption, Inner Mongolia Mobile’s successful Level‑3 SRE assessment at the 2024 GOPS conference, and insights from Deputy GM Zhang Yongtao on practices, challenges, and future plans.

AI Ops · DevOps · IT Operations
0 likes · 14 min read
Bilibili Tech
Sep 6, 2024 · Operations

Design and Implementation of a Cross‑Platform Real‑Time Troubleshooting System for Live Streaming

The team built a cross‑platform real‑time troubleshooting system for live streaming. It adds critical‑business monitoring and a unified trace_id‑based tracing framework, simplifies OpenTracing, iterates on the reporting components, handles multi‑threading, and stitches telemetry into searchable event chains. With its dashboards, diagnosis time fell from two hours to five minutes and the fault‑resolution rate reached 91%.

Distributed Tracing · Live Streaming · Observability
0 likes · 15 min read
DevOps Operations Practice
Sep 2, 2024 · Operations

How a Strong Operations Team Drives Business Success

In the digital era, a capable IT operations team ensures system stability, reduces costs, accelerates issue resolution, strengthens security, supports product development, and improves user experience, making it a critical driver of overall business value.

Cost Efficiency · DevOps · IT Operations
0 likes · 6 min read
Cognitive Technology Team
Aug 25, 2024 · Operations

Fault Isolation Techniques for High Availability in Distributed Systems

The article explains fault isolation as a key technique for improving distributed system availability, detailing multiple isolation levels—from data‑center to user‑level—and complementary strategies such as circuit breakers, timeouts, fast‑fail, load balancing, caching, and degradation switches.

Load Balancing · circuit breaker · degradation
0 likes · 10 min read
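The circuit-breaker, fast-fail, and degradation ideas named in the summary above fit together in one small state machine. This is a minimal sketch under stated assumptions: the `CircuitBreaker` class, its thresholds, and the `fallback` hook are hypothetical and are not the article's implementation.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(RuntimeError):
    """Raised when the breaker is open and no fallback was supplied."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, reset_timeout: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None  # None means the circuit is closed

    def call(self, fn: Callable[[], T], fallback: Optional[Callable[[], T]] = None) -> T:
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                # Open: fail fast without touching the unhealthy dependency.
                if fallback is not None:
                    return fallback()
                raise CircuitOpenError("circuit is open")
            # Timeout elapsed: half-open, so let this one trial call through.
        try:
            result = fn()
        except Exception:
            self.failures += 1
            # A failed trial call, or too many consecutive failures, (re)opens the circuit.
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            if fallback is not None:
                return fallback()  # degraded response instead of an error
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A caller would wrap each remote call, for example `breaker.call(fetch_profile, fallback=lambda: cached_profile)`, where both names are placeholders. The fallback plays the role of the degradation switch: callers get a reduced answer quickly rather than waiting on a failing dependency.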
Efficient Ops
Aug 21, 2024 · Operations

10 Proven Practices to Prevent System Failures in Operations

This article shares ten practical operations strategies—ranging from change‑rollback procedures and cautious handling of destructive commands to robust backup verification, alerting, and meticulous hand‑over practices—that together help teams dramatically reduce system outages and maintain high availability.

Change Management · Linux · MySQL
0 likes · 17 min read
Architect
Aug 6, 2024 · Operations

Handling Interface-Level Failures: Degradation, Circuit Breaking, Rate Limiting, and Queuing

The article explains what interface‑level failures are, why they occur due to internal bugs or external overload, and presents four practical mitigation techniques—degradation, circuit breaking, rate limiting, and queuing—detailing their principles, implementation options, and trade‑offs for reliable system operation.

Queue · Rate Limiting · circuit breaker
0 likes · 16 min read
IT Services Circle
Aug 5, 2024 · Fundamentals

Ariane 5 Rocket Explosion Caused by a Software Integer‑Overflow Bug

The 1996 Ariane 5 launch failed and exploded due to a single line of legacy code that caused a 64‑bit floating‑point to 16‑bit signed integer conversion overflow in the guidance system, highlighting the dangers of unchecked code reuse, inadequate error handling, and insufficient testing in critical software.

data type conversion · integer overflow · rocket failure
0 likes · 6 min read
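The class of bug described in the summary above, narrowing a 64‑bit float into a 16‑bit signed integer, can be illustrated in a few lines. This is a sketch only: the flight software was not written in Python, and the function names and sample value here are made up for the example.

```python
INT16_MIN, INT16_MAX = -32768, 32767


def to_int16_checked(value: float) -> int:
    """Convert a float to a 16-bit signed integer, refusing out-of-range values."""
    n = int(value)  # truncates toward zero
    if not INT16_MIN <= n <= INT16_MAX:
        raise OverflowError(f"{value!r} does not fit in a 16-bit signed integer")
    return n


def to_int16_wrapping(value: float) -> int:
    """What a silently wrapping two's-complement narrowing would produce (for contrast)."""
    return (int(value) + 32768) % 65536 - 32768


if __name__ == "__main__":
    reading = 40000.0  # hypothetical sensor value larger than INT16_MAX
    print(to_int16_wrapping(reading))  # a plausible-looking but wrong number
    try:
        to_int16_checked(reading)
    except OverflowError as exc:
        print("rejected:", exc)  # the caller can now choose a safe fallback
```

The checked version only helps if the caller handles the error; an overflow that is detected but left unhandled can still bring a system down.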
Efficient Ops
Jul 31, 2024 · Operations

How HuoLala Achieved Zero‑Fault Peaks: A Blueprint for High‑Load System Reliability

This article details HuoLala's three‑year journey of systematic business‑peak assurance, covering goal definition, project‑management practices, technical risk mitigation, cloud‑provider coordination, and post‑event reviews that together delivered zero‑fault high‑traffic periods and continuously improving system stability.

capacity planning · operations · peak load management
0 likes · 20 min read
Efficient Ops
Jun 25, 2024 · Operations

Mastering the Four Golden Signals: A Practical Guide to System Monitoring

This guide explains how to use the four golden signals—latency, traffic, errors, and saturation—to design effective monitoring across servers, services, and external dependencies, helping teams detect issues early and maintain reliable, high‑performance systems.

Metrics · SRE · monitoring
0 likes · 20 min read
Efficient Ops
May 12, 2024 · Operations

From Firefighting to Fire‑Starting: Mastering Operations for System Reliability

The article outlines a three‑stage evolution of operations—from rapid incident response to proactive fault‑injection—while offering practical guidance on improving availability, visualizing changes, and aligning technical metrics with business value to elevate the role of operations engineers.

SRE · availability · fault injection
0 likes · 7 min read
iQIYI Technical Product Team
May 10, 2024 · Operations

Full‑Link Load Testing of iQIYI Playback Service: Process, Tools, and Outcomes

iQIYI implemented full‑link load testing of its playback service, using LoadMaker for traffic generation and Rover for link control. The team mapped the topology, created weighted user scenarios, and safely applied load to production‑like environments. The tests validated capacity at several times the historical peak, uncovered bottlenecks, and enabled several performance and disaster‑recovery improvements without impacting real users.

capacity planning · iQIYI · load testing
0 likes · 10 min read