Tag

cloud operations


Efficient Ops
Dec 22, 2024 · Operations

What Caused OpenAI’s Massive Outage? Inside the Kubernetes Failure and Recovery

On December 11, OpenAI suffered a severe outage across ChatGPT, its API, and Sora due to a misconfigured telemetry service that overloaded Kubernetes control planes worldwide, prompting a cascade of failures and a coordinated recovery effort.

OpenAI · cloud operations · incident management
0 likes · 8 min read
Efficient Ops
Dec 26, 2023 · Operations

What Is ITU’s New AIOps Standard and How It Shapes Cloud Operations?

The article explains the ITU‑T Y.3550 AIOps standard, its AI‑driven cloud service development and operation requirements, the Chinese AIOps maturity‑model series, and the latest assessment results showing dozens of enterprises adopting these intelligent‑operations capabilities.

AI · AIOps · ITU standard
0 likes · 6 min read
Efficient Ops
Aug 16, 2023 · Operations

How to Accurately Set Service Rate‑Limiting Thresholds in Large Cloud Systems

This article examines the challenges of setting effective rate‑limiting thresholds for massive cloud‑native services, compares TPS and concurrency as limiting metrics, proposes two ways to derive thresholds (stress testing and ARMA forecasting over historical data), and presents a practical system that delivers reliable limits for both node‑wide and per‑service protection.

ARMA forecasting · Performance Testing · Rate Limiting
0 likes · 10 min read
Efficient Ops
Apr 29, 2022 · Operations

How Ctrip Scaled Its Cloud Platform to 10k Nodes: Real‑World Kubernetes Ops Lessons

This article shares Ctrip's practical experiences in scaling a hybrid private‑cloud platform to over ten thousand nodes, covering Kubernetes control‑plane stability, host monitoring, network observability, image management, and capacity planning to ensure high availability for massive online services.

Container Images · Network Observability · cloud operations
0 likes · 18 min read
DevOps Cloud Academy
Sep 9, 2021 · Operations

FinOps and DevOps Best Practices for Microsoft ERP Projects

This article explains FinOps as cloud financial operations, outlines how to plan Microsoft ERP projects, and presents eight DevOps best practices—including empowered teams, version control, deployment automation, trunk‑based development, continuous testing, test automation, shift‑left security, and monitoring—while advising on selecting appropriate DevOps tools.

DevOps · FinOps · Microsoft ERP
0 likes · 10 min read
Continuous Delivery 2.0
Mar 30, 2020 · Operations

Dynamic Runtime Configuration Management at Facebook: Use Cases and Tooling

The article explains how Facebook manages dynamic runtime configuration for millions of services—covering feature gating, experiments, traffic control, topology balancing, monitoring, machine‑learning model updates, and internal behavior—using a suite of tools such as Configerator, Gatekeeper, Package Vessel, Sitevars, and MobileConfig.

AB testing · Configuration Management · cloud operations
0 likes · 8 min read
Tencent Cloud Developer
Nov 21, 2019 · Operations

Serverless Operations: Efficient and Intelligent Cloud-native Practices

The article recaps Tencent Cloud’s Serverless operational suite—covering built‑in DevOps tools, logging, monitoring, auto‑scaling, and security—demonstrating how it replaces manual IaaS provisioning, accelerates development, and enables cloud‑native management, illustrated by a WeChat Mini‑Program album that cut build time from months to two weeks.

DevOps · Infrastructure · Tencent Cloud
0 likes · 19 min read
Tencent Cloud Developer
Nov 13, 2019 · Operations

Recap of Cloud+ Community Tech Salon – Efficient Intelligent Operations

The Cloud+ Community’s 29th technical salon on November 9 2019 in Shenzhen gathered Tencent and Jiwei experts to showcase efficient intelligent operations through AIOps practices, massive cloud migration strategies, the Blue Whale PaaS framework, Serverless DevOps best practices, and Kubernetes resource‑utilization techniques.

AIOps · DevOps · PaaS
0 likes · 6 min read
Ctrip Technology
Mar 7, 2019 · Operations

Ctrip Container Cloud Operations: Practices, Challenges, and Future Outlook

This article presents Ctrip's experience in building and operating a private container cloud platform, detailing its architectural evolution, operational challenges, tooling, monitoring, capacity management, and future directions toward hybrid and cloud‑native environments.

ChatOps · capacity-management · cloud operations
0 likes · 12 min read
JD Tech
Jan 17, 2019 · Operations

Technical Overview of JD's Archimedes Resource Scheduling System

The article presents a detailed technical analysis of JD's Archimedes project, describing its evolution from JDOS 2.0 to a large‑scale container scheduling platform that dramatically improves resource utilization, deployment speed, and cost efficiency across JD’s data centers.

AI · Container Orchestration · JD
0 likes · 6 min read
Efficient Ops
Aug 16, 2018 · Operations

How Tencent Automates Massive Storage, CDN, and Network Operations at Scale

This article introduces three Tencent TEG sessions that reveal the automated operation systems behind massive storage and CDN services, billion‑level promotional event guarantees, and intelligent DCI network management, highlighting the challenges, solutions, and speaker expertise.

automation · cdn · cloud operations
0 likes · 7 min read
Efficient Ops
Apr 18, 2018 · Operations

Huawei’s Triple‑Play Model: Advancing AIOps for Massive K8s and Serverless

At the 9th Global Operations Conference, Huawei Cloud’s chief architect Cai Xiaogang presented a three‑pronged AIOps strategy that combines large‑scale Kubernetes management, causal tracing in Serverless environments, multi‑source RCA analysis, and clustering‑based black‑box network packet inspection, showcasing how academia‑industry collaboration accelerates cloud‑native operations.

AIOps · cloud operations · clustering
0 likes · 8 min read
Efficient Ops
Jun 6, 2017 · Operations

How to Deploy Reliable Overseas IT Infrastructure: Key Strategies and Tools

This guide outlines essential questions, local network insights, IDC versus network layout choices, and practical tools for companies planning to expand their IT infrastructure across international markets, helping them manage latency, cost, and deployment speed.

IDC · IT infrastructure · cloud operations
0 likes · 12 min read
Efficient Ops
Mar 7, 2017 · Big Data

How Tencent Scaled Its TDW to 8,800 Nodes and Mastered Cross-City Data Migration

Tencent’s senior engineer explains how the TDW (Tencent Distributed Data Warehouse) grew from a few hundred to thousands of nodes, the challenges of cross‑city migration, and the modeling, relationship‑chain, dual‑write tables, and platform strategies they built to ensure seamless, low‑impact data and task migration.

Big Data · TDW · cloud operations
0 likes · 26 min read
Tencent Cloud Developer
Feb 17, 2017 · Operations

Implementing Network Isolation with Elastic Network Interfaces on QCloud

The article explains how to achieve network isolation for a QCloud SQL cluster by creating and binding additional elastic NICs via API—assigning separate production, heartbeat, and storage interfaces to each node—while noting that true physical isolation is impossible and detailing the required configuration steps and encountered challenges.

Elastic Network Interface · QCloud · VPC
0 likes · 8 min read
360 Zhihui Cloud Developer
Jan 6, 2017 · Operations

How Qcmd Revolutionizes Automated Operations for 7,000+ Servers

Qcmd, the command execution system behind 360’s private HULK cloud platform, replaces SaltStack with an asynchronous, Golang‑based architecture that ensures high‑availability, encrypted messaging, and reliable mass‑host command execution across thousands of servers, dramatically reducing task timeouts and operational overhead.

automation · cloud operations · command execution
0 likes · 10 min read
Efficient Ops
Nov 14, 2016 · Operations

How a Banking Card Organization Built a Scalable Cloud Operations Platform

This article details the evolution from manual, standardized operations to an automated, intelligent cloud operations platform for a banking card organization, describing its motivations, core features, key scenarios, technical architecture, scheduling algorithms, data visualization, and real‑world outcomes.

Operations Management · automation · cloud operations
0 likes · 13 min read
Architecture Digest
Jul 7, 2016 · Operations

Understanding Load Balancing and the Design of Alibaba's VIPServer

This article explains the fundamentals of load balancing, compares common techniques such as DNS round‑robin, hardware and software load balancers, discusses their advantages and drawbacks, and introduces Alibaba's VIPServer as a mid‑tier, seven‑layer load‑balancing solution with advanced health‑check and traffic‑routing features.

DNS · Health Check · L4/L7
0 likes · 19 min read
Alibaba Cloud Infrastructure
Jun 2, 2015 · Fundamentals

Methodology for Implementing Modular Data Centers

This article presents a methodology for modular data center implementation, emphasizing the role of standardization, distinguishing design versus prefabrication, illustrating with micro‑module and container examples, and analyzing the standardization levels of major tech companies and colocation providers.

ICT · Infrastructure · Pod
0 likes · 8 min read