Tag

Self-healing


Cognitive Technology Team
Nov 14, 2024 · Operations

Designing Self‑Healing Applications for Fault Tolerance in Distributed Systems

To ensure distributed applications can recover automatically from hardware, network, or service failures, this guide outlines three core capabilities—fault detection, graceful handling, and monitoring—plus practical strategies such as asynchronous component separation, retries, circuit breakers, isolation, load shedding, failover, compensation, checkpointing, graceful degradation, rate limiting, leader election, fault injection, chaos engineering, and use of availability zones.

Self-healing · cloud native · distributed systems
7 min read
ByteDance SYS Tech
May 9, 2024 · Operations

How Large‑Model Agents Transform AIOps: From Automation to Self‑Healing Operations

The presentation explains how large‑model agents empower AIOps by automating routine tasks, enhancing anomaly detection, fault diagnosis, and remediation, while outlining architectural components, multi‑agent collaboration, and future directions for building self‑healing, observability‑driven operations platforms.

AIOps · Observability · Self-healing
15 min read
Efficient Ops
Nov 8, 2023 · Operations

How Intelligent Operations (AIOps) Transforms IT Management and Self‑Healing

This article explains what intelligent operations (AIOps) are, outlines a four‑layer platform architecture, and showcases real‑world practices such as load‑balancing link repair, MySQL container self‑healing, composite service tracing, component‑based orchestration, and AI‑driven log analysis, concluding with future prospects.

AIOps · IT Operations · Intelligent Operations
7 min read
Baidu Geek Talk
Mar 29, 2023 · Cloud Native

Punica: A Cloud‑Native Platform for Content Understanding Inference Services

Punica provides a cloud‑native, one‑stop platform that unifies Baidu’s content‑understanding inference services, automates testing, resource provisioning, and monitoring, and enables unattended, self‑healing operations with dynamic scaling and GPU scheduling, cutting onboarding time by half and reclaiming hundreds of GPUs.

AI inference · Inference Platform · Resource Scheduling
14 min read
Tencent Cloud Developer
Dec 26, 2022 · Cloud Native

Challenges and Optimization Strategies for Containerized Deployment of Online Services on Kubernetes

Tencent’s shift from VMs to Kubernetes for massive online services faces pod‑size rigidity, heterogeneous node balancing, elastic scaling, and massive cluster‑pool mapping, prompting optimizations such as dynamic CPU compression, custom load‑aware scheduling, collaborative HPA/VPA scaling, dynamic quota migration, unified routing‑sync, and an automated decision‑tree‑driven self‑healing workflow for container‑destruction failures.

Containerization · Dynamic Scheduling · Kubernetes
12 min read
Top Architect
Nov 12, 2022 · Cloud Native

Evolution of Ant Financial Service Mesh: Link Encryption, Adaptive Rate Limiting, Fine‑Grained Traffic Steering, and Service Self‑Healing

The article reviews how Ant Financial’s Service Mesh has evolved after its double‑11 rollout, detailing the implementation of link encryption, adaptive rate limiting, fine‑grained traffic steering, and self‑healing mechanisms that improve security, performance, and reliability across large‑scale microservice deployments.

Adaptive Rate Limiting · Link Encryption · Self-healing
16 min read
DevOps
Aug 23, 2022 · Artificial Intelligence

Intelligent Automation Testing: Self‑Healing and Machine‑Learning Techniques

This article reviews the evolution of automated testing toward intelligent solutions, explaining self‑healing mechanisms, machine‑learning‑driven object recognition, computer‑vision and OCR approaches, industry tools such as Healenium and Airtest, and future prospects for zero‑code AI‑powered test automation.

AI · Automation Testing · OCR
13 min read
Baidu Intelligent Testing
Jun 30, 2022 · Operations

Intelligent Test Execution: Risk‑Based Manual Case Recommendation, Parallel‑Coverage Traffic Selection, Smart Build, Priority‑Based Task Scheduling, and UI Automation Self‑Healing

This article presents a comprehensive overview of intelligent test execution techniques, including risk‑based manual test case recommendation, parallel‑coverage traffic filtering, dynamic smart build strategies, priority‑driven task scheduling, and UI automation self‑healing, illustrating how these methods improve testing efficiency, coverage, and stability.

CI/CD · Self-healing · intelligent testing
11 min read
Efficient Ops
Mar 28, 2022 · Operations

Zhejiang Mobile’s AI‑Driven Self‑Healing: Pioneering Intelligent Network Operations

This article examines the challenges of intelligent telecom network operation, presents Zhejiang Mobile’s AI‑powered self‑healing practice—including process re‑design, system reconstruction, talent transformation, and measurable results—and outlines the AIOps maturity model and future outlook for digital network management.

AIOps · Digital Transformation · Intelligent Operations
11 min read
HomeTech
Dec 30, 2021 · Operations

Open-Falcon at Autohome: Application, Architecture, and Customizations

This article describes how the Open-Falcon monitoring system is applied and customized at Autohome, covering its architecture, component roles, a comparison with other open‑source solutions, and the enhancements made for service‑tree based dynamic monitoring, alerting, self‑healing, and high‑availability deployment.

High Availability · Open-Falcon · Self-healing
11 min read
AntTech
Feb 25, 2021 · Cloud Native

Service Mesh Capability Building: Link Encryption, Adaptive Rate Limiting, Fine‑Grained Traffic Steering, and Service Self‑Healing

The article details Ant Group's large‑scale Service Mesh rollout, explaining the design, implementation, and operational impact of four core capabilities—link encryption, adaptive rate limiting, fine‑grained traffic steering, and service self‑healing—while highlighting performance considerations, deployment challenges, and the overall value of decoupling business logic from infrastructure.

Adaptive Rate Limiting · Link Encryption · Self-healing
16 min read
360 Zhihui Cloud Developer
Nov 5, 2019 · Operations

How 360 Scaled AIOps: From Data to Self‑Healing Operations

At the 360 Internet Technology Training Camp, experts detailed how AI-driven AIOps can transform large‑scale operations, covering data collection, model‑based anomaly detection, alert correlation, self‑healing workflows, and visual dashboards, and presented a practical end‑to‑end framework that other companies can adopt quickly.

AIOps · Big Data · Self-healing
15 min read
360 Tech Engineering
Oct 31, 2019 · Operations

AIOps Implementation Practice at 360: Architecture, Models, and Automation

The article details 360's AIOps deployment, covering external speaker insights, internal architecture, data collection pipelines, AI models for resource recycling, alarm reduction, and correlation, as well as visualization dashboards, labeling platforms, and self‑healing mechanisms, illustrating a comprehensive AI‑driven operations framework.

AI monitoring · AIOps · Self-healing
14 min read
360 Tech Engineering
Sep 6, 2019 · Operations

StackStorm-Based ChatOps Solution for Automated Monitoring Alert Self‑Healing

This article introduces a StackStorm‑driven ChatOps framework that consolidates monitoring alerts, applies rule‑based root‑cause analysis, and automatically executes self‑healing actions, outlining its architecture, components, workflow definitions, and practical deployment results within an enterprise operations environment.

ChatOps · Self-healing · StackStorm
6 min read
AntTech
Aug 15, 2019 · Cloud Native

Design and Implementation of Ant Financial’s Large‑Scale Kubernetes Cluster Management System

This article explains how Ant Financial designs a highly reliable, end‑state‑driven Kubernetes management platform that handles lifecycle operations, node self‑healing, and risk‑controlled changes for clusters with tens of thousands of nodes, using operators, custom resources, and a meta‑cluster architecture.

Cluster Management · Kubernetes · Large Scale
9 min read
58 Tech
Mar 25, 2019 · Operations

Alarm Convergence, Merging, and Self‑Healing in the 58 Monitoring Platform

The article describes how the 58 monitoring platform reduces alarm storms through alarm convergence, intelligent merging using Gini‑based decision trees, and automated self‑healing, thereby improving alert quality, cutting noise by about 70%, and helping engineers resolve incidents faster.

Self-healing · alarm convergence · alert merging
9 min read
Efficient Ops
Nov 27, 2018 · Operations

How Alibaba Automates Server Fault Detection and Self‑Healing at Scale

Alibaba’s massive data‑center operations face a growing number of hardware failures, so the company built the DAM (Dammo) platform, which integrates Tianji management, predictive fault detection, automated remediation, and self‑balancing cluster reconstruction, achieving near‑complete hardware issue coverage and reducing manual intervention across hundreds of thousands of servers.

AIOps · Self-healing · cloud computing
17 min read
Efficient Ops
Jun 13, 2018 · Operations

Designing an Effective CMDB: Boost Ops Efficiency, Alert Convergence & Self‑Healing

This article explains how a well‑designed CMDB abstracts and models operational objects, categorizes business, hardware, application and custom data, and enables alert convergence and automated fault‑healing, dramatically improving DevOps efficiency and reliability.

Alert Convergence · CMDB · DevOps
7 min read
Qunar Tech Salon
Jun 16, 2017 · Operations

OpsRobot: Chatbot‑Based Operations Automation Platform Overview

OpsRobot integrates development tools into a chat‑based interface, using custom plugins and APIs to automate low‑efficiency, error‑prone operational tasks, thereby streamlining workflows, improving efficiency, and enabling future capabilities such as self‑healing and automated scaling.

API Gateway · Chatbot · Ops Automation
5 min read
Efficient Ops
Apr 19, 2016 · Operations

How Tencent’s Blue Whale Powers Unattended Ops, SaaS Automation, and DevOps Value

The talk outlines Tencent’s Blue Whale platform, describing how automated publishing tools, unattended change processes, fault‑handling strategies, alert‑driven self‑healing, low‑cost tool culture, and a thriving DevOps ecosystem together transform operations from routine maintenance to high‑value, scalable services.

Cost Optimization · DevOps · SaaS
12 min read