Microsoft Azure DevOps Testing Left‑Shift: Practices, Principles, and Metrics
This article explains how Microsoft’s Azure DevOps team transformed its testing approach by shifting tests left, introducing new quality principles, redefining test classifications, improving automation reliability, and measuring progress with DevOps metrics to achieve faster, more trustworthy continuous integration and delivery.
Background
In the previous article we discussed Microsoft's right-shifted testing practice, Testing in Production (TIP). Some colleagues argued that Windows 10's quality issues indicated problems with that practice.
However, this series focuses on the Azure DevOps (formerly VSTS) team’s approach; Windows itself is not a cloud service product.
Testing Left‑Shift Overview
We now describe Microsoft’s testing left‑shift. While some content overlaps with earlier posts, new practices and principles are introduced.
How We Used to Work
In September 2014, three years into the cloud era, we still followed pre‑cloud testing methods, trying to speed up tasks and optimize automation but constantly struggling.
Problems with Automation
Our automated test suite took too long. The Nightly Automation Run (NAR) required 22 hours, and the Full Automation Run (FAR) took two days. Tests frequently failed, producing large numbers of false failures that were too costly to triage, leading teams to ignore failures before sprint end.
We focused on keeping a small set of high‑priority (P0) tests reliable, achieving about 70% pass rate, but still faced failures from infrastructure, product issues, and test defects.
Feedback from master‑branch validation arrived 12 hours after a commit, making it hard to act on failures before the sprint closed, often delaying releases by weeks.
New Quality Vision
In February 2015 we published a new Azure DevOps quality vision, redesigning the test suite from the ground up, with a layered model (L0/L1 unit tests, L2/L3 functional tests).
Testing Principles
Write tests at the lowest possible level
Prefer tests with minimal external dependencies that run as part of the build. If a unit test (L0) can provide the needed information, avoid writing a functional test (L2/L3) for it.
Write once, run everywhere, including production
Avoid tests that depend on a custom test server (Object Model) or on internal knowledge of the product; functional tests should exercise only public APIs, never back doors.
Design for testability
Embed testability into the product design so that most tests can be written as unit tests.
Test code is production code
Test code must be reliable, reviewed, and maintained with the same rigor as product code; neglecting test code quality undermines confidence in test results.
Testing infrastructure is a shared service
Testing is integrated into the build pipeline, tests run under Visual Studio Test Explorer, and the test infrastructure must be as reliable as the product itself.
Test ownership aligns with product ownership
Developers own the tests for their own components; they should not rely on someone else to test their code.
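The "design for testability" and "lowest possible level" principles can be illustrated with a minimal sketch. The service class, its API, and the storage dependency below are hypothetical, not the actual Azure DevOps code; the point is that injecting the dependency lets the logic be verified by a fast in-memory L0 test instead of a functional test against a deployment.

```python
from unittest import mock

# Hypothetical sketch: the storage dependency is injected, so the
# validation logic can be covered by an in-memory L0 unit test rather
# than an L2 functional test against a real service deployment.
class WorkItemService:
    def __init__(self, store):
        self._store = store  # any object exposing save(item)

    def create(self, title):
        if not title.strip() or len(title) > 255:
            raise ValueError("title must be 1-255 non-blank characters")
        item = {"title": title.strip()}
        self._store.save(item)
        return item

# L0-style test: no network, no database, milliseconds to run.
def test_create_work_item():
    store = mock.Mock()
    svc = WorkItemService(store)
    item = svc.create("  Fix flaky login test  ")
    assert item["title"] == "Fix flaky login test"
    store.save.assert_called_once_with(item)
```

Because the store is a constructor argument rather than a hard-coded database client, the same class runs unchanged in production and under test, which is exactly what keeps most tests at the unit level.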
Testing Left‑Shift in Practice
Quality signals are generated earlier, often before code merges to master. Most tests run before a change reaches the main branch.
Re‑classifying Tests
We introduced a new classification based on external dependencies:
L0/L1 – Unit Tests
L0: fast, in-memory unit tests (< 60 ms). L1: unit tests that may depend on SQL or the file system (< 400 ms on average, 2 s maximum).
L2/L3 – Functional Tests
L2: functional tests that run against a testable service deployment with limited external dependencies. L3: full integration tests that run against production-like environments.
Ensuring Isolation for Functional Tests
L2 tests must be isolated, controlling their environment fully to avoid cross‑test interference. We built a fake identity provider to replace external authentication services.
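The isolation idea can be sketched as follows. This fake identity provider is a hypothetical in-process stand-in of my own construction (class and method names are illustrative), showing how an L2 test can authenticate without ever calling the real external service:

```python
import uuid

# Hypothetical sketch of test isolation: an in-process fake identity
# provider issues and validates its own tokens, so an L2 functional
# test never depends on the real authentication service.
class FakeIdentityProvider:
    def __init__(self):
        self._tokens = {}

    def issue_token(self, user):
        token = f"fake-{user}-{uuid.uuid4().hex[:8]}"
        self._tokens[token] = user
        return token

    def validate(self, token):
        # Returns the user for a known token, None for anything else.
        return self._tokens.get(token)

def test_l2_with_fake_identity():
    idp = FakeIdentityProvider()
    token = idp.issue_token("alice")
    assert idp.validate(token) == "alice"
    assert idp.validate("forged-token") is None
```

Because each test constructs its own provider, tests cannot interfere with one another through shared authentication state, which is the cross-test interference the article warns about.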
Metrics and Progress
We track a "North Star" chart each iteration: the count of legacy functional tests fell from roughly 27,000 to fewer than 14,000 by iteration 101, while the number of new L0/L1 unit tests kept growing.
Key milestones:
PR to merge: ~30 minutes, running ~60,000 unit tests.
Merge to CI build: ~22 minutes.
First quality feedback from CI: ~1 hour.
Full test cycle to self‑hosted environment: < 2 hours.
DevOps Metrics Used
We maintain a team scorecard tracking two metric families:
MTT(x): mean-time-to metrics for production issues, covering time to detect (TTD), time to mitigate (TTM), and time to ship a fix.
Project Health: the number of unresolved defects per engineer; if it exceeds 5, the team prioritizes defect remediation over new feature work.
We also monitor engineering speed by measuring CI/CD pipeline stages.
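The MTT(x) family reduces to simple timestamp arithmetic over incident records. The sketch below is my own minimal illustration (field names and sample data are invented, not from the Microsoft scorecard) of computing mean time to detect and mean time to mitigate:

```python
from datetime import datetime

# Hypothetical incident records; "started", "detected", and "mitigated"
# are illustrative field names, not the actual scorecard schema.
def mean_minutes(incidents, start_key, end_key):
    deltas = [
        (i[end_key] - i[start_key]).total_seconds() / 60
        for i in incidents
    ]
    return sum(deltas) / len(deltas)

incidents = [
    {"started": datetime(2017, 5, 1, 10, 0),
     "detected": datetime(2017, 5, 1, 10, 12),
     "mitigated": datetime(2017, 5, 1, 10, 42)},
    {"started": datetime(2017, 5, 2, 9, 0),
     "detected": datetime(2017, 5, 2, 9, 8),
     "mitigated": datetime(2017, 5, 2, 9, 38)},
]

ttd = mean_minutes(incidents, "started", "detected")    # mean TTD: 10.0
ttm = mean_minutes(incidents, "detected", "mitigated")  # mean TTM: 30.0
```

The same pattern extends to "time to ship a fix" by adding a deployment timestamp per incident and taking the mean over the appropriate interval.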
All content is derived from a Microsoft 2017 presentation.
Continuous Delivery 2.0
Tech and case studies on organizational management, team management, and engineering efficiency