Tag

fault-analysis


vivo Internet Technology
Jul 19, 2023 · Databases

Analysis of Service Avalanche Caused by Misconfigured Jedis Parameters During Redis Cluster Master‑Slave Switch

A service‑wide avalanche occurred when a Redis 3.x master‑slave failover met Jedis’ default settings: a 2‑second connection timeout and six retry attempts, which together produced latencies of up to 60 seconds. Setting connectionTimeout and soTimeout to 100 ms and reducing maxAttempts to two limited latency to about one second and prevented cascading failures.
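
As a rough illustration of why these parameters matter, the sketch below multiplies a per-attempt timeout by the number of attempts to get a per-call retry budget. The function name `retry_budget_ms` is hypothetical, and the model assumes exactly one timeout per attempt, which the summary does not confirm. The absolute values therefore do not match the 60-second and one-second figures, but the ratio between the default and tuned budgets does.

```python
def retry_budget_ms(timeout_ms: int, max_attempts: int) -> int:
    """Time one call can block if every attempt waits out a single timeout.

    Simplifying assumption: one timeout per attempt. A real client may wait
    on more than one timeout per attempt, so this is not a full latency model.
    """
    if timeout_ms < 0 or max_attempts < 1:
        raise ValueError("timeout must be >= 0 and attempts >= 1")
    return timeout_ms * max_attempts


# Values taken from the summary: 2 s timeout with six attempts by default,
# 100 ms timeout with two attempts after tuning.
default_budget = retry_budget_ms(2000, 6)
tuned_budget = retry_budget_ms(100, 2)
reduction = default_budget / tuned_budget

print(default_budget, tuned_budget, reduction)  # 12000 200 60.0
```

The factor of 60 is consistent with the reported drop from about 60 seconds to about one second, which suggests the latency scales with the product of timeout and attempt count.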

Cluster · Connection Retry · Jedis
13 min read
Architecture Digest
May 25, 2022 · Big Data

Kafka Cluster Deployment Architecture, Fault Analysis, and Default Partitioner Behavior

This article explains the design of a multi‑tenant Kafka cluster, the business onboarding process, detailed fault symptoms and monitoring metrics, analyzes the root cause of a topic‑wide traffic drop, and examines the default partitioner’s rules to propose mitigation recommendations.

Big Data · Cluster · Kafka
11 min read
vivo Internet Technology
May 18, 2022 · Backend Development

Kafka Cluster Fault Analysis: Root Cause and Cascading Failure Mechanism

A Kafka cluster at vivo suffered a total traffic drop across a resource group when a broker’s disk failed, because the default producer partitioner still hashed keys to the failed partition, exhausting client buffers and blocking all healthy partitions, prompting recommendations to avoid keys or use custom partitioners.
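
The mechanism described above can be modeled in a few lines. This is a simplified sketch, not Kafka client code: the function names are hypothetical, and CRC32 stands in for the client's real key hash (the Java client uses murmur2) only to keep the example dependency-free. It contrasts a default-style rule, where a keyed record always maps to `hash(key) % num_partitions` regardless of partition health, with one possible availability-aware alternative.

```python
import zlib
from itertools import count


def stable_hash(key: bytes) -> int:
    # Stand-in hash (assumption): CRC32 instead of the client's murmur2.
    return zlib.crc32(key)


def rotate(available, all_partitions, counter):
    # Keyless records rotate over available partitions, or all if none are up.
    pool = available or all_partitions
    return pool[next(counter) % len(pool)]


def default_style_partition(key, num_partitions, available, counter):
    if key is None:
        return rotate(available, list(range(num_partitions)), counter)
    # Keyed records ignore availability: the hash alone picks the partition.
    return stable_hash(key) % num_partitions


def availability_aware_partition(key, num_partitions, available, counter):
    choice = default_style_partition(key, num_partitions, available, counter)
    if key is None or not available or choice in available:
        return choice
    # Hypothetical fallback: re-hash the key over healthy partitions only.
    return available[stable_hash(key) % len(available)]


NUM_PARTITIONS = 4
FAILED = 2
available = [p for p in range(NUM_PARTITIONS) if p != FAILED]

# Find some key that the default-style rule sends to the failed partition.
stuck_key = next(
    k for k in (f"user-{i}".encode() for i in range(1000))
    if stable_hash(k) % NUM_PARTITIONS == FAILED
)

print(default_style_partition(stuck_key, NUM_PARTITIONS, available, count()))
print(availability_aware_partition(stuck_key, NUM_PARTITIONS, available, count()))
```

Re-hashing a keyed record onto a different partition gives up per-key ordering while the partition is down, so a fallback like this is a trade-off rather than a drop-in fix.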

Distributed Systems · Kafka · backend development
9 min read
Baidu Intelligent Testing
Aug 5, 2021 · Operations

Baidu Search Stability Issue Analysis: Automated Fault Detection and Resolution Techniques

This article details Baidu Search's high‑availability engineering, describing eight major challenges in fault analysis and the corresponding innovations—index mirroring, streaming analysis, comprehensive label sets, feature engineering, query reconstruction, intelligent ranking, timeline analysis, and chaos engineering—that together enable near‑real‑time, 99% accurate detection and mitigation of search service failures.

Big Data · fault-analysis · observability
13 min read
Baidu Geek Talk
Jul 5, 2021 · Operations

Automated and Intelligent Analysis of Baidu Search Stability Issues

The team automated Baidu Search fault diagnosis by building a side‑index for instant log lookup, streaming incremental analysis, exhaustive rule templates, feature‑engineering pipelines, query‑scene reconstruction, entropy‑based ranking, per‑second timeline views, and chaos‑engineered fault injection, achieving near‑99% accuracy and second‑level, module‑granular stability tracing.
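
The summary does not say how the entropy‑based ranking is computed. One plausible reading, sketched here purely as an assumption, is to compute the Shannon entropy of each dimension's values across failed requests and rank low-entropy dimensions first, since failures concentrated on a single value point at a likely culprit. The field names and sample records are invented for illustration.

```python
import math
from collections import Counter


def shannon_entropy(values):
    """Shannon entropy in bits of the empirical distribution of values."""
    counts = Counter(values)
    total = len(values)
    return sum((c / total) * math.log2(total / c) for c in counts.values())


def rank_dimensions(failed_requests):
    """Sort dimensions by entropy, most concentrated (lowest entropy) first."""
    if not failed_requests:
        raise ValueError("need at least one failed request")
    dims = failed_requests[0].keys()
    scores = {d: shannon_entropy([r[d] for r in failed_requests]) for d in dims}
    return sorted(scores.items(), key=lambda item: item[1])


# Hypothetical failed-request records: every failure hits the same module,
# while the failures are spread across data centers.
failed = [
    {"module": "ranker", "datacenter": "dc1"},
    {"module": "ranker", "datacenter": "dc2"},
    {"module": "ranker", "datacenter": "dc3"},
    {"module": "ranker", "datacenter": "dc1"},
]

ranking = rank_dimensions(failed)
print(ranking)  # [('module', 0.0), ('datacenter', 1.5)]
```

In this toy data the module dimension ranks first because all failures share one module, whereas the data center values are spread out and carry less signal.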

Chaos Engineering · Stream Processing · fault-analysis
15 min read
iQIYI Technical Product Team
Nov 13, 2020 · Operations

Building and Optimizing a Consul‑Based Service Registry for iQIYI's Microservice Platform

iQIYI’s Consul‑based service registry, tightly integrated with its QAE container platform and API gateway, suffered a multi‑DC outage caused by network jitter and a metrics‑library lock‑contention bug, which was resolved by upgrading Go, go‑metrics, and Raft, adding extensive monitoring, redundant DC registration, and dedicated per‑gateway Consul clusters to ensure continued stability and scalability.

Consul · Service Registry · fault-analysis
17 min read
Tencent Cloud Developer
May 16, 2019 · Operations

TDSQL Intelligent Operation Platform – Bianque Architecture and Practice

Bianque, TDSQL’s intelligent operation platform, automatically collects and indexes database metrics, applies a knowledge‑base‑driven analysis engine to diagnose availability, performance, and reliability issues, and produces risk warnings and optimization recommendations, dramatically cutting DBA effort and support tickets across Tencent’s cloud services.

Automation · Database Operations · Intelligent Diagnosis
17 min read
Efficient Ops
May 17, 2016 · Operations

When a Single Cable Crashes a Network: Real Ops Incident Lessons

This article recounts two real‑world operations incidents: a network outage caused by an improperly configured PortFast setting on a trunk link, and an NFS failure that crippled an API service. It then distills practical lessons on pre‑incident procedures, monitoring, fault handling, recovery, and post‑mortem practices.

ITIL · NFS · Network Troubleshooting
11 min read
Efficient Ops
Dec 17, 2015 · Operations

Tackling QQ’s Legacy Ops: Automation, Capacity Management & Fault Analysis

This article shares insights from Tencent’s QQ operations team on handling legacy issues, standardizing package and configuration management, leveraging the ZhiYun automation platform, and applying capacity management and root‑cause fault analysis to boost efficiency and reduce costs.

Automation · capacity-management · devops
10 min read