Platform Engineering: Challenges and Best Practices in Large-Scale Implementation
Platform engineering at scale requires unified self-service abstractions, domain-specific languages such as KCL, a divide-and-conquer monorepo structure, robust modeling and automation, and a collaborative culture. Ant Group's KusionStack implementation demonstrates this: it supports more than 1,500 projects with a platform-developer-to-application-developer ratio below 1:9.
This article looks at platform engineering from several angles: domain-specific languages, divide-and-conquer strategies, modeling, automation, and collaborative culture. It shares what Ant Group learned from applying the KusionStack technology stack to its own platform engineering and automation work.
The article begins by examining the DevOps anti-patterns that many enterprises run into, such as Dev and Ops teams working in silos, or DevOps being imposed by mandate without the tooling to support it. It argues that platform engineering plays a crucial role in successful DevOps transformations, whether they follow Meta's approach of Dev fully owning Ops or Google's introduction of SRE teams as intermediaries.
Platform engineering aims to help enterprises build self-service operations systems for application developers. Its key objectives are to design abstraction layers that reduce the cognitive load of infrastructure and platform technologies, to provide a unified work interface instead of a set of fragmented platforms, to let developers work quickly through internal engineering platforms, to support self-service application lifecycle management through CI/CD/CDRA products, and to foster a culture of collaboration and sharing.
The article emphasizes that not everyone can, or should, become an expert in this domain, since platform technology teams typically specialize in their own areas. Now that cloud-native technologies are widely adopted, enterprises face hundreds of thousands of highly configurable platform settings, complex PaaS business logic, and strict stability requirements. Platform engineering is what makes it practical for application developers to take part in DevOps under those conditions.
The article then discusses domain-specific languages (DSLs) as a powerful engineering approach. KCL (Kusion Configuration Language) is presented as a statically typed language designed for application developers who already know how to program. It offers the writing experience of a modern high-level language together with domain-specific functionality. KCL is not just a way to write key-value pairs: it is a domain-specific language for platform engineering in which developers write application configurations, model abstractions, functions, and constraint rules.
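To make the distinction between plain key-value pairs and a typed model with constraint rules concrete, here is a minimal sketch in Python. It is an analogy only and not KCL syntax: the `AppConfig` model, its fields, the specific limits, and the `render` function are all hypothetical and do not come from the article.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class AppConfig:
    """Hypothetical application model; an analogy for a typed schema, not KCL."""

    name: str
    image: str
    replicas: int = 1
    labels: dict = field(default_factory=dict)

    def __post_init__(self) -> None:
        # Constraint rules (assumed for illustration): invalid values are
        # rejected when the model is built, before anything is deployed.
        if not self.name:
            raise ValueError("name must not be empty")
        if not 1 <= self.replicas <= 100:
            raise ValueError("replicas must be between 1 and 100")
        if ":" not in self.image:
            raise ValueError("image must carry an explicit tag")


def render(config: AppConfig) -> dict:
    """Derive the low-level key-value output from the high-level model."""
    return {
        "metadata": {
            "name": config.name,
            "labels": {"app": config.name, **config.labels},
        },
        "spec": {"replicas": config.replicas, "image": config.image},
    }


if __name__ == "__main__":
    print(render(AppConfig(name="demo", image="demo:1.0", replicas=3)))
```

The application developer only fills in a few typed fields; the generated key-value structure and the validation rules stay with whoever maintains the model.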
The divide-and-conquer strategy is presented as the key to solving problems of scale. Traditional monolithic platforms struggle to keep up with growing business complexity and technical evolution. The article instead advocates a client-side approach that uses the Konfig monorepo as the unified programming interface and workspace. Konfig gives each scenario, project, and application its own independent, white-box programming space. Its extensibility comes from a flexible engineering structure, automatic merging of independent configuration blocks, a static type system, project-level GitOps CI workflow definitions, and the choice of provisioning technologies offered by the Kusion engine.
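The idea of merging independent configuration blocks can be sketched as a recursive dictionary merge. The article does not describe the actual merge semantics used by KCL or Konfig, so the rule below (later blocks override earlier ones, nested mappings are merged key by key) is an assumption, and `merge_blocks` is a hypothetical helper.

```python
def merge_blocks(base: dict, override: dict) -> dict:
    """Return a new dict in which `override` is layered on top of `base`.

    Assumed rule: nested dicts are merged recursively; any other value in
    `override` replaces the value in `base`. Inputs are not modified.
    """
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_blocks(merged[key], value)
        else:
            merged[key] = value
    return merged


if __name__ == "__main__":
    # A shared base block and an environment-specific block, kept in
    # separate files in this hypothetical layout.
    base = {"app": {"image": "demo:1.0", "replicas": 1}}
    prod = {"app": {"replicas": 3}, "region": "example-region"}
    print(merge_blocks(base, prod))
```

Keeping the blocks separate lets each team edit only its own file, while the merge step produces the complete configuration.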
On modeling, the article discusses the explicit-versus-implicit debate in enterprise settings. For end users, meaning application developers, it adopts abstract models: platform developers and SREs model the typical application scenarios (such as Ant's SOFA applications), and KCL's schema and mixin mechanisms support modeling, abstraction, inheritance, composition, and reuse. For platform technology experts, it supports explicit approaches with the necessary dynamic and modular features, while type and constraint mechanisms keep the result stable.
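Inheritance and mixin-style composition can be illustrated with ordinary Python classes. This is an analogy for the modeling style described above, not KCL's schema or mixin syntax; `Workload`, `MonitoringMixin`, and `WebApp` are hypothetical names.

```python
class Workload:
    """Base model, maintained by platform developers in this sketch."""

    def __init__(self, name: str, replicas: int = 1) -> None:
        self.name = name
        self.replicas = replicas

    def render(self) -> dict:
        return {"name": self.name, "replicas": self.replicas}


class MonitoringMixin:
    """Reusable capability that can be composed into any model."""

    def render(self) -> dict:
        # Extend whatever the next class in the resolution order renders.
        out = super().render()
        out["monitoring"] = {"enabled": True}
        return out


class WebApp(MonitoringMixin, Workload):
    """What an application developer uses: base model plus mixin."""


if __name__ == "__main__":
    print(WebApp(name="demo", replicas=2).render())
```

The application developer sees only `WebApp` and its two fields, while the monitoring behavior is written once and reused across models.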
The automation section covers how infrastructure operations automation has evolved with cloud-native technologies. Engineering efficiency platforms now take a deep part in the open automation practices built around the Konfig monorepo. The article discusses the challenges of collaborating in a monorepo: diverse business needs call for independent and powerful workflow customization, many workflows must run in parallel with low latency, and KCL compilation and runtime execution must be fast.
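One common way to keep monorepo workflows independent and parallel is to map each commit's changed files to the projects that own them, then trigger only those projects' pipelines. The article does not say how Konfig does this, so the sketch below is an assumed approach; the directory layout, file names, and the `affected_projects` helper are hypothetical.

```python
from pathlib import PurePosixPath


def affected_projects(changed_files: list, project_roots: list) -> list:
    """Return the project roots that contain at least one changed file."""
    roots = [PurePosixPath(root) for root in project_roots]
    affected = set()
    for changed in changed_files:
        # `parents` lists every ancestor directory, so "apps/foo" matches
        # "apps/foo/prod/main.k" but not "apps/foobar/main.k".
        parents = PurePosixPath(changed).parents
        for root in roots:
            if root in parents:
                affected.add(str(root))
    return sorted(affected)


if __name__ == "__main__":
    changed = ["apps/foo/prod/main.k", "base/models/app.k", "README.md"]
    roots = ["apps/foo", "apps/bar", "base/models"]
    # Each affected project could then run its own workflow in parallel.
    print(affected_projects(changed, roots))
```

Scoping work to the affected projects is also one way to limit how many compilations a single commit triggers.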
Finally, the article emphasizes that beyond technology, tools, and mechanisms, what matters most is a collaborative culture and teamwork. It shares lessons from Ant Group's practice: the early difficulties with self-service mechanisms and doubts about collaboration, the importance of setting up virtual organizations with common goals, and the need to build culture alongside the technical implementation.
The article concludes with statistics from Ant Group's practice: over 400 developers have contributed code to the Konfig monorepo, which manages over 1,500 projects with a platform-developer-to-application-developer ratio of less than 1:9. The platform handles 200 to 300 commits daily, 1K pipeline tasks, and 10K KCL compilations. Future plans include improving usability, strengthening testing, building IDE-based offline workspaces, and extending the approach to CI builds and automated operations.
Ant R&D Efficiency
We are the Ant R&D Efficiency team, focused on fast development, experience-driven success, and practical technology.