Analyzing the Origins of Flaky Tests: Size, Tooling, and Instability at Google
This article examines why some tests become flaky. It shows that larger test binaries and higher RAM usage correlate strongly with instability, while the choice of testing tool has only a modest effect, and it closes with recommendations for reducing flakiness in large-scale continuous integration environments.
Flaky tests are those that sometimes pass and sometimes fail when run against the exact same code, which makes their failure signal ambiguous. When a previously passing test starts failing, the failure may indicate a newly introduced bug, or it may be nothing more than flakiness.
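To make the definition concrete, here is a minimal sketch of a flaky test. The scenario and names are invented for illustration: `compute_discount` stands in for production logic whose result depends on state the test does not control (timing, ordering, shared caches), modeled here with randomness.

```python
import random
import unittest


def compute_discount():
    # Stand-in for logic with a hidden nondeterministic dependency
    # (e.g., a race on a shared cache). Randomness models the
    # nondeterminism: the same code usually returns 10, rarely 0.
    return 10 if random.random() > 0.01 else 0


class CheckoutTest(unittest.TestCase):
    """Hypothetical test that passes on most runs and fails on a few,
    even though the code under test never changed."""

    def test_applies_discount(self):
        self.assertEqual(compute_discount(), 10)
```

Run repeatedly against identical code, this test will occasionally fail, which is exactly the ambiguity described above: a red run tells you nothing definitive.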
Google’s continuous integration system runs about 4.2 million tests, of which roughly 63,000 (≈1.5%) exhibit at least one flaky occurrence per week. Understanding and fixing flaky tests starts with analyzing their characteristics.
Test Size and Flakiness
Engineers assign each test a subjective size label: small, medium, or large. In a given week, 0.5% of small tests, 1.6% of medium tests, and 14% of large tests were flaky, a clear increase in instability with test size.
Two objective metrics, binary size and RAM usage, show a strong correlation with flakiness: larger binaries and higher RAM consumption both correspond to higher flaky rates, with linear fits reaching r² up to 0.94 for the most predictive subsets.
When tests are bucketed by these objective measures rather than by the coarse size label, the correlation improves, suggesting that binary size and RAM usage predict flakiness better than the size label alone.
Tool Influence
Tests written with certain tools (e.g., WebDriver) show higher flaky rates, but largely because those tools tend to be used for larger tests. After controlling for size, the tool's own impact is modest.
Further analysis comparing RAM usage and binary size across tools confirms that RAM usage explains more variance in flakiness than the tool itself.
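One way to "account for size" as described above is to compare flaky rates within each size class instead of overall. This sketch uses invented tool names and synthetic counts; it illustrates the confounding pattern, not Google's actual numbers.

```python
from collections import defaultdict

# (tool, size_class, was_flaky) records -- synthetic data in which
# WebDriver is used only for large tests.
records = [
    ("WebDriver", "large", 1), ("WebDriver", "large", 0),
    ("plain", "large", 1), ("plain", "large", 0),
    ("plain", "small", 0), ("plain", "small", 0),
    ("plain", "small", 0), ("plain", "small", 0),
]


def flaky_rate_by(records, key):
    """Flaky rate grouped by an arbitrary key function."""
    totals = defaultdict(lambda: [0, 0])  # key -> [flaky_count, total]
    for tool, size, flaky in records:
        k = key(tool, size)
        totals[k][0] += flaky
        totals[k][1] += 1
    return {k: flaky / total for k, (flaky, total) in totals.items()}


# Raw comparison: WebDriver looks much flakier than "plain" tests.
raw = flaky_rate_by(records, key=lambda tool, size: tool)
# Size-controlled comparison: within the "large" class, the tools
# have identical flaky rates; the raw gap came from test size.
controlled = flaky_rate_by(records, key=lambda tool, size: (tool, size))
```

In this synthetic data the raw per-tool gap disappears entirely once size is held fixed, which mirrors the article's conclusion that tool choice alone contributes little.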
Conclusions
While the size label correlates with flakiness, Google's coarse small/medium/large categories limit its practical use. Objective measures such as binary size and RAM usage are stronger indicators of test fragility.
Tests written with specific tools appear more flaky, but this is mostly due to their larger size; tool choice alone contributes little.
Before writing a large test, engineers should ask what the smallest test that covers the behavior would be, and budget the extra effort that keeping a large test stable requires.
Continuous Delivery 2.0
Tech and case studies on organizational management, team management, and engineering efficiency