Alibaba Cloud Infrastructure
Author

For uninterrupted computing services

357 articles · 0 likes · 1.1k views · 0 comments
Recent Articles

May 12, 2025 · Cloud Native

Transform a Single‑Cluster CD Pipeline into a Multi‑Cluster System with ACK One

This guide explains how to leverage Alibaba Cloud's ACK One multi‑cluster application distribution together with the Cloud Effect DevOps platform to convert an existing single‑cluster continuous delivery pipeline into a resilient, multi‑region, multi‑cluster CD solution without modifying original YAML resources.

ACK One · Cloud Effect · Continuous Delivery
0 likes · 9 min read
May 1, 2025 · Artificial Intelligence

Fine-grained Profiling of Online AI Workloads on Kubernetes Using ACK AI Profiling

This article demonstrates how to use ACK AI Profiling, built on eBPF and dynamic process injection, to perform non-intrusive, low‑overhead profiling of Kubernetes‑deployed large‑language‑model inference services, identify GPU memory growth causes, and apply optimization recommendations to prevent OOM issues.

AI profiling · GPU Memory · Kubernetes
0 likes · 10 min read
Apr 30, 2025 · Artificial Intelligence

Exploring and Practicing a Unified Compute Network for AI at Zuoyebang: Building an Innovation Engine for the AI Era

This article summarizes a presentation by Dong Xiaocong, infrastructure leader at Zuoyebang, on the challenges of matching AI inference demand with supply. It describes the design and implementation of a unified compute network, including trusted networking, multi‑region container scheduling, and traffic routing, built to serve large‑scale AI models efficiently.

AI · Infrastructure · Model Distribution
0 likes · 9 min read
Apr 28, 2025 · Cloud Native

Improving OSS Small‑File Access Performance with StrmVol Storage Volumes in Kubernetes

StrmVol storage volumes replace the FUSE‑based OSS mount with a virtual block device and kernel‑mode file system, dramatically reducing latency for massive small‑file reads in Kubernetes workloads such as AI training datasets, and the article demonstrates setup, configuration, and performance testing using Argo Workflows.

Argo Workflows · CSI · Kubernetes
0 likes · 13 min read
Apr 25, 2025 · Fundamentals

Alibaba's Network Proposal Passes OSFP MSA Unanimously, Introducing the First Liquid‑Cooled OSFP Cage Standard

A split‑type OSFP cage proposal from Alibaba Cloud's infrastructure network team was unanimously approved by the OSFP MSA committee, becoming the first standard to support liquid‑cooled OSFP cold plates. It offers a low‑cost, easy‑to‑assemble design that addresses the growing power consumption of high‑density AI switches.

AI Switches · Hardware Standard · Liquid Cooling
0 likes · 5 min read
Apr 18, 2025 · Artificial Intelligence

Alibaba Cloud Showcases Optical Interconnect Innovations at OFC 2025 50th Anniversary

At the OFC 2025 50th anniversary in San Francisco, Alibaba Cloud presented cutting‑edge optical interconnect research and solutions for AI computing and modern data‑center networks, highlighted by invited talks, breakthrough demos, and two data‑driven QoT estimation papers co‑authored with Hong Kong Polytechnic University.

AI computing · Cloud Networking · Data Center
0 likes · 6 min read
Apr 17, 2025 · Cloud Native

OpenKruise 1.8 Release Highlights: In‑Place VPA, StatefulSet Volume Expansion, AI WorkloadSpread, Serverless Probe, SidecarSet Gray‑Release, and Helm Pre‑Delete Hook

OpenKruise 1.8, the latest release of the CNCF incubating cloud‑native automation suite, introduces in‑place vertical pod autoscaling, native StatefulSet volume expansion, AI‑aware WorkloadSpread, serverless probe support, SidecarSet gray‑release capabilities, and a Helm pre‑delete safety hook. The article includes detailed YAML examples and a look at the future roadmap.

Cloud Native · InPlaceVPA · Kubernetes
0 likes · 13 min read
Apr 16, 2025 · Artificial Intelligence

Optimizing Multi‑Node Distributed LLM Inference with ACK Gateway and vLLM

This article presents a step‑by‑step guide for deploying and optimizing large‑language‑model inference across multiple GPU‑enabled nodes using ACK Gateway with Inference Extension, vLLM’s tensor‑ and pipeline‑parallel techniques, and Kubernetes resources such as LeaderWorkerSet, PVCs, and custom routing policies, followed by performance benchmarking and analysis.

ACK Gateway · Kubernetes · LLM
0 likes · 19 min read