Case Study of Baidu's Z Service Architecture Refactoring Project: Challenges, Diagnosis, and Improvement Plan
This case study examines Baidu's large‑scale Z service architecture refactoring project, detailing its organizational background, stakeholder pain points, current waterfall delivery issues, a CMMI‑based maturity assessment, and a two‑phase improvement plan aimed at adopting continuous integration and agile practices to achieve more predictable and faster software delivery.
Background: Around 2010, Baidu rapidly expanded to over 15,000 employees, but its organizational structure remained unchanged. Testers belonged to a large testing department, operations staff to a large operations department, while product and development staff were distributed across business groups such as PS, HS, and BS. A Project Management Office (PMO) comprised three sub‑teams: configuration management (SVN repository and version control, supported by the iCafe platform), platform development (knowledge‑management system, requirements‑management and continuous‑integration platforms), and process improvement (an SQA team researching and piloting advanced software project‑management concepts).
The project originated when the PS team, known for pragmatic experimentation, observed successful continuous‑integration adoption in another department. After a well‑received report in early 2011, the web‑search department chose the Z service architecture refactoring as a key pilot project.
Stakeholder expectations:
Middle‑level managers: Large projects must keep schedule and quality risks under control.
"Our biggest problem is poor planning. Even though we make plans, unexpected situations constantly make the project uncontrollable, leading to shifting delivery dates and long, unpredictable cycles. For example, a typical three‑month development cycle includes two months of development and one month of testing, but actual timelines vary wildly."
"We recently halted a five‑person, six‑month project because it could not be merged into the mainline in time, which made deploying it pointless. All I want is a method that makes long‑cycle projects more predictable."
Front‑line managers: New methods are welcome as long as they meet delivery deadlines.
"Teacher, I want to deliver quickly; we will cooperate fully and follow your guidance, but the project must be completed by milestone A."
Team members expressed numerous frustrations, including inaccurate effort estimates, frequent ad‑hoc tasks, delayed hand‑offs between development and testing, repeated manual testing, long environment‑setup times, and deployment issues caused by configuration errors.
Product and team context: The product is a backend web‑crawling and analysis service consisting of seven independent process modules, written in C/C++ with roughly 100,000 lines of code, running on about 300 servers. Initially the team had four developers (two senior, two interns) and two testers; after six months the team grew to ten members with some turnover. No dedicated product manager existed; senior developers assumed product‑management responsibilities.
Baidu's project management hierarchy classifies projects into four levels (A, B, C, D) based on size. Small projects are led directly by a technical lead; larger projects have designated product and test leads. The delivery model follows a traditional waterfall approach: requirements gathering, centralized development, integration, testing, bug fixing, and finally queuing for production deployment.
Initial Diagnosis
Using a CMMI‑style maturity scale, the current delivery state is assessed as between the "Initial" and "Defined" levels:
Initial Level: Delivery relies on a few key individuals; success is often attributed to “someone being there”.
Defined Level: Milestone‑based delivery with frequent schedule slips and overtime due to unmet requirements.
Managed Level: Regular, rhythmic delivery (not yet achieved).
Quantitative Level: On‑demand releases (not yet achieved).
Optimizing Level: Cost‑free, multi‑variant releases such as gray‑scale or A/B testing (future goal).
The team is currently stuck between the Initial and Defined levels because long‑duration projects suffer from delayed feedback: most defects surface only during integration testing, which makes the overall schedule highly uncertain. Production incidents further increase the risk.
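To make the delayed‑feedback argument concrete, the following minimal sketch compares how long a defect waits before the next integration point under two schedules. It assumes the two‑months‑development cycle quoted earlier (taken here as 60 working days purely for illustration), a defect introduced on a uniformly chosen development day, and a hypothetical daily integration as the alternative. None of these figures are measurements from the project.

```python
# Illustrative only: the day counts and the daily-integration alternative
# are assumptions, not data from the case study.

def mean_feedback_delay(dev_days: int, integration_interval: int) -> float:
    """Average days between introducing a defect and the next integration point.

    A defect introduced on day d (1..dev_days) is assumed to surface at the
    first integration point on or after day d. Integration points fall every
    `integration_interval` days, and always at the end of development.
    """
    total = 0
    for day in range(1, dev_days + 1):
        # Round the day up to the next multiple of the interval (ceiling division),
        # capped at the end of development.
        next_point = -(-day // integration_interval) * integration_interval
        next_point = min(next_point, dev_days)
        total += next_point - day
    return total / dev_days


if __name__ == "__main__":
    once_at_end = mean_feedback_delay(dev_days=60, integration_interval=60)
    daily = mean_feedback_delay(dev_days=60, integration_interval=1)
    print(f"integrate once at the end: {once_at_end:.1f} days average delay")
    print(f"integrate every day:       {daily:.1f} days average delay")
```

Under these assumptions, integrating once at the end of development leaves a defect undetected for 29.5 days on average, while daily integration surfaces it the same day. The model ignores the time needed to diagnose and fix a defect, which would widen the gap further.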
Proposed Solution
To reach the Managed level, the team must change its work mode, code‑line management, development infrastructure, and team mindset.
Establish Vision and Set Goals
As the coach guiding the pilot, I aligned with managers on its objectives and communicated them to all stakeholders.
Phase 1 (Short‑term) Goals:
Ensure project delivery dates meet expectations.
Establish a new software‑development collaboration model.
Build necessary infrastructure to support continuous delivery.
Phase 2 Goals:
Shorten requirement cycle time for rapid releases.
Maintain production‑environment quality.
Reduce overall testing effort.
The improvement pilot is divided into two phases: the first focuses on a continuous‑integration‑driven "Agile 101" release to prove the concept; the second will expand to continuous delivery throughout the summer.
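A continuous‑integration loop of the kind targeted in the first phase can be sketched as a sequence of stages that stops at the first failure, so that a broken commit is reported quickly. This is a minimal sketch, not the pipeline the team built: the stage names and the `make` commands are hypothetical placeholders, and the case study does not describe the actual build tooling.

```python
import subprocess
from typing import Callable, List, Sequence, Tuple

# Hypothetical stages for a C/C++ service; the commands are placeholders,
# not the project's real build or test commands.
STAGES: List[Tuple[str, Sequence[str]]] = [
    ("compile", ["make", "all"]),
    ("unit-test", ["make", "test"]),
    ("package", ["make", "package"]),
]


def run_command(command: Sequence[str]) -> int:
    """Run one command and return its exit code (0 means success)."""
    return subprocess.run(list(command)).returncode


def run_pipeline(
    stages: Sequence[Tuple[str, Sequence[str]]],
    runner: Callable[[Sequence[str]], int] = run_command,
) -> Tuple[bool, List[str]]:
    """Run stages in order and stop at the first failing stage.

    Returns (succeeded, names of the stages that were attempted).
    """
    attempted: List[str] = []
    for name, command in stages:
        attempted.append(name)
        if runner(command) != 0:
            # Fail fast: later stages are skipped so feedback arrives sooner.
            return False, attempted
    return True, attempted


if __name__ == "__main__":
    ok, attempted = run_pipeline(STAGES)
    print("PASSED" if ok else f"FAILED at stage: {attempted[-1]}")
```

The `runner` parameter is injectable so the stop‑at‑first‑failure logic can be exercised without invoking a real build. In practice such a script would be triggered on every commit to the mainline.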
(To be continued…)
Continuous Delivery 2.0
Tech and case studies on organizational management, team management, and engineering efficiency