Designing and Applying the Dragonfly Strategy Engine at Kuaishou to Tackle Complex Recommendation System Challenges
This article describes how Kuaishou built the Dragonfly strategy engine framework—covering problem analysis, architecture design, DSL-based workflow orchestration, process and data abstractions, ecosystem tools, and future plans—to solve the scalability, coupling, and maintenance issues of its rapidly expanding recommendation services.
Kuaishou’s rapid business growth from 2018 to 2025 increased daily active users from 100 million to 376 million and expanded recommendation scenarios from a few pages to hundreds, creating two main demands: quickly building new recommendation scenarios and rapidly replicating effective strategies.
Initially, the team copied existing architecture code for each new scenario, but as the number of scenarios grew this approach became unsustainable due to high maintenance cost, limited engineering resources, and tight coupling between algorithm and system code written in C++.
To break the cycle of frequent large‑scale refactoring, Kuaishou developed the Dragonfly framework, a general‑purpose graph engine for search, advertising, and recommendation (搜广推) that provides a unified base engine, flexible workflow composition, and a Python‑based DSL that compiles to JSON for a C++ runtime.
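To make the compile-to-JSON idea concrete, here is a minimal sketch of what such a Python DSL layer could look like. The class and field names (`Op`, `compile_workflow`, the `retrieve`/`rank` operators) are illustrative assumptions, not Dragonfly's actual API; the point is only that Python objects describing a workflow serialize to a JSON graph a C++ engine can load.

```python
import json

# Hypothetical DSL node: a named operator with dependencies and parameters.
class Op:
    def __init__(self, name, deps=None, **params):
        self.name = name
        self.deps = deps or []
        self.params = params

    def to_dict(self):
        return {
            "op": self.name,
            "deps": [d.name for d in self.deps],   # edges of the DAG
            "params": self.params,
        }

def compile_workflow(ops):
    """Emit the JSON workflow description a C++ runtime would consume."""
    return json.dumps({"nodes": [op.to_dict() for op in ops]}, indent=2)

# Describe a tiny retrieve -> rank pipeline entirely in Python.
retrieve = Op("retrieve", source="hot_items", limit=500)
rank = Op("rank", deps=[retrieve], model="ctr_v2")
config = compile_workflow([retrieve, rank])
```

Because only the JSON artifact crosses the language boundary, algorithm engineers can change the workflow without touching or recompiling the C++ side.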
Dragonfly’s core abstractions are process abstraction (splitting business logic into reusable operators organized as a DAG) and data abstraction (a high‑performance DataFrame structure that offers schema‑free, key‑value access without recompilation).
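The "schema‑free, key‑value access" property can be illustrated with a toy DataFrame; the class below is a hypothetical sketch, not Dragonfly's real C++ structure. The key idea is that columns are addressed by string key at runtime, so a new strategy can attach and read new fields without a schema change or recompilation.

```python
# Toy schema-free DataFrame: columns live in a dict keyed by name,
# so fields can be added or read dynamically at runtime.
class DataFrame:
    def __init__(self):
        self._cols = {}

    def set_col(self, key, values):
        self._cols[key] = list(values)

    def get_col(self, key, default=None):
        # Missing keys return a default instead of failing at compile time.
        return self._cols.get(key, default)

df = DataFrame()
df.set_col("item_id", [101, 102, 103])
# A strategy deployed later adds a brand-new field with no migration.
df.set_col("ctr_score", [0.12, 0.08, 0.31])
scores = df.get_col("ctr_score")
```

In a production engine the values would be typed, columnar C++ buffers for performance, but the access pattern an operator sees is this key-value lookup.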
The DSL layer offers high‑level operators, async/parallel decorators, and automatic code generation, allowing algorithm engineers to describe complex strategies in Python while the underlying C++ operators handle execution efficiently.
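A rough sketch of what an async/parallel decorator might look like from the algorithm engineer's side, using a thread pool as a stand-in for the real scheduler. The decorator name `@parallel` and the retrieval functions are assumptions for illustration; in Dragonfly the DSL would presumably record the parallelism intent and let the C++ runtime schedule the operators.

```python
from concurrent.futures import ThreadPoolExecutor

_pool = ThreadPoolExecutor(max_workers=4)

def parallel(fn):
    """Mark an operator as parallel: calling it returns a Future
    instead of blocking, so independent branches run concurrently."""
    def wrapper(*args, **kwargs):
        return _pool.submit(fn, *args, **kwargs)
    return wrapper

@parallel
def retrieve_hot(user_id):
    # Stand-in for a retrieval operator hitting one index.
    return [f"hot_{user_id}_{i}" for i in range(3)]

@parallel
def retrieve_follow(user_id):
    # A second, independent retrieval branch.
    return [f"follow_{user_id}_{i}" for i in range(2)]

# Both branches execute concurrently; the merge step joins on both.
futures = [retrieve_hot(42), retrieve_follow(42)]
merged = [item for f in futures for item in f.result()]
```

The appeal of the decorator style is that the strategy code reads as straight-line Python while the execution graph underneath fans out.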
Layered decoupling separates algorithm workspaces (DSL scripts) from engineering workspaces (C++ operators), preventing strong coupling and enabling independent evolution of each layer.
An extensive ecosystem—Playground for online DSL debugging, white‑box tracing, visualization, and code‑governance tools—supports the full lifecycle from development to monitoring and automated cleanup of unused code.
Future plans focus on performance (NUMA‑aware allocation, graph optimizations), governance (automatic feature retirement and self‑cleaning code), and productization (AI‑driven tooling and B‑to‑B solutions).
The article concludes with a Q&A session addressing custom operator expressiveness, granularity decisions, control‑flow implementation compared with TensorFlow, and the relationship between DSL operators and micro‑service partitioning.
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.