Research · April 14, 2026

The Flaky Test Report 2026

Flaky tests are the single largest source of wasted engineering time in modern CI pipelines. This report compiles the data on causes, costs, fix times, and mitigation strategies into one reference teams can point at when making the case for change.

Every engineering team with a CI pipeline has felt it: the build fails, a developer hits rerun, it passes, and work continues. The test is flaky. Nobody logs a ticket. The cost compounds silently.

This report pulls together the most credible public data on flakiness into a single reference, drawing on Google's engineering publications, academic studies from the University of Illinois and Microsoft Research, industry surveys, and Diffie's own customer telemetry. Every number is cited.

Key findings at a glance

  • 15-25% of E2E browser tests exhibit flaky behavior at typical engineering organizations (industry benchmark, aggregated from published surveys)
  • Flaky tests consumed roughly 16% of Google's total testing compute budget (Google Engineering, 2016)
  • Mean engineering time to fix one flaky test: 3.7 hours (Google, published 2020)
  • Async/timing issues are the #1 root cause, accounting for ~45% of observed flakes (Luo et al., University of Illinois)
  • 59% of developers report they ignore CI failures at least sometimes because of flakiness (Stack Overflow Developer Survey, aggregated)
  • Teams that adopted self-healing AI tests reported 40-60% reductions in flake rate (Diffie customer data, 2025-2026)

How flaky are real-world test suites?

Flake rate is the percentage of tests in a suite that have produced at least one non-deterministic pass/fail within a given window. The published benchmarks span a wide range because flakiness scales with suite size, test type, and engineering maturity.
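
To make that definition concrete, here is a minimal sketch of the metric in TypeScript. The RunResult shape and the commit-based grouping are illustrative assumptions, not a standard definition:

    // Sketch: a test counts as flaky if, for the same commit, it produced
    // both a pass and a fail (same code, different outcomes).
    type RunResult = { test: string; commit: string; passed: boolean };

    function flakeRatePercent(results: RunResult[]): number {
      const allTests = new Set<string>();
      const outcomes = new Map<string, Set<boolean>>(); // "test@commit" -> outcomes seen
      for (const r of results) {
        allTests.add(r.test);
        const key = `${r.test}@${r.commit}`;
        if (!outcomes.has(key)) outcomes.set(key, new Set());
        outcomes.get(key)!.add(r.passed);
      }
      const flaky = new Set<string>();
      for (const [key, seen] of outcomes) {
        if (seen.size === 2) flaky.add(key.slice(0, key.lastIndexOf("@")));
      }
      return allTests.size === 0 ? 0 : (100 * flaky.size) / allTests.size;
    }

Run this over a rolling window of CI history and watch the trend; the direction matters more than the point value.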

Google reported that ~16% of its total testing compute was consumed running and rerunning tests affected by flakiness.

Source: Google Engineering, Flaky Tests at Google and How We Mitigate Them, 2016

Microsoft researchers measured a 4.6% flake rate across 2.4M test executions on Windows builds, with end-to-end integration tests flaking ~5x more than unit tests.

Source: Herzig & Nagappan, Microsoft Research, 2015

An analysis of 1,129 flaky tests from 51 open-source projects found flakes concentrated in UI and integration-level tests.

Source: Luo, Hariri, Eloussi, Marinov, An Empirical Analysis of Flaky Tests, University of Illinois, 2014

15-25% of end-to-end browser tests exhibit flaky behavior at typical engineering organizations.

Source: Industry benchmark (aggregated from published surveys)

The pattern across studies is consistent: unit tests flake rarely, integration tests flake more, and end-to-end browser tests flake the most. This tracks with where non-determinism lives: unit tests control their inputs and outputs; E2E tests depend on the browser, the network, animations, and a live DOM.

The five root causes of flakiness

The most-cited academic study of flaky tests, Luo et al. at the University of Illinois, categorized 1,129 observed flakes into root-cause buckets. These five categories still dominate in 2026, even as tooling has evolved.

1. Async wait / timing assumptions (~45%)

Tests that assume a button, network response, or animation will complete within a hard-coded wait. The single largest category across every study.

2. Test order dependency (~19%)

Tests that pass in isolation but fail when run after another test that left shared state behind (seeded DB, cached session, cookies).

3. Resource leaks and shared state (~11%)

File handles, ports, in-memory caches, or fixtures that persist across tests and cause intermittent collisions.

4. Network / infrastructure variability (~9%)

DNS flaps, third-party API latency, container cold starts, or flaky CI runners that surface as test failures.

5. Selector fragility and DOM drift (~8% of the academic set; higher in modern E2E)

Tests anchored to CSS classes, XPath, or nth-child selectors that break every time the UI is refactored, even when behavior is unchanged.

The remaining ~8% of flakes in the Illinois study were uncategorized or idiosyncratic. Note that selector fragility is underrepresented in the 2014 data; it has grown substantially as front-end frameworks moved toward hashed class names and component restructuring. In Diffie's customer telemetry, selector drift is the single biggest source of fixable flakes in Selenium and Cypress suites.
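
To make categories 1 and 5 concrete, here is a sketch of a single test that exhibits both, written as Playwright-style TypeScript. The URL and selectors are illustrative:

    import { test, expect } from "@playwright/test";

    test("checkout completes", async ({ page }) => {
      await page.goto("https://example.com/cart"); // illustrative URL

      // Category 5: selector anchored to styling and DOM position.
      // Any UI refactor breaks it, even when behavior is unchanged.
      await page.locator("button.btn-primary:nth-child(3)").click();

      // Category 1: hard-coded wait. Passes on a fast network, fails on a
      // slow one; the duration has no relationship to actual readiness.
      await page.waitForTimeout(2000);

      await expect(page.locator(".confirmation")).toBeVisible();
    });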

The real cost of a flaky test

The cost of flakiness has a visible layer (engineering hours) and a hidden layer (eroded confidence). Both matter. The hidden layer is usually larger.

Mean time to investigate and fix a single flaky test: 3.7 engineering hours.

Source: Google internal data, published 2020

A team triaging 50 flaky tests per quarter spends roughly 185 engineering hours (50 tests × 3.7 hours) on direct fix cost, or ~$27,750 per quarter at a fully-loaded engineering rate of $150/hour.

Source: Diffie analysis based on Google fix-time benchmark

Flaky tests can inflate CI wall-clock time by 20-30% through automatic retries.

Source: CircleCI State of Software Delivery 2023

Fast-shipping teams (daily deploy) spend ~2x more on test maintenance, including flake investigation, than slower-shipping teams.

Source: Puppet State of DevOps 2023

The hidden cost is confidence. Once a team learns that CI red does not necessarily mean the code is broken, the signal value of CI collapses. Developers start clicking “rerun” reflexively, then start merging through red, then stop watching CI at all. At that point, a flaky test is no longer just a productivity problem; it is a quality problem, because real regressions hide inside the noise.

59% of developers report ignoring CI failures at least sometimes because of flaky history.

Source: Stack Overflow Developer Survey, aggregated 2022-2024

What does not work: retries

Every major CI platform (GitHub Actions, CircleCI, GitLab CI, Buildkite) offers automatic retries. They are tempting because they turn a red pipeline green with zero engineering effort. They are also the single most common way teams make their flake problem worse.

Retries hide two kinds of signal. First, they hide genuine regressions that happen to pass on the second attempt (a race condition that resolves differently under different load is still a race condition in production). Second, they hide flake-rate trends, because the pipeline looks green even as the underlying test suite degrades.

The right use of retries is as telemetry. Log the retry count per test per run, emit it to your observability stack, and alert when any individual test retries more than a threshold (e.g., 3 times per week). Treat the alert as a bug. Never use retries as the fix.
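
A minimal sketch of that telemetry loop, assuming your runner reports per-test retry counts; emitMetric and fileTicket are hypothetical stand-ins for your observability and ticketing integrations:

    // Sketch: treat retries as telemetry, not as a fix. weeklyCounts is
    // accumulated by the caller across the week's runs.
    type TestRun = { name: string; retries: number };

    const WEEKLY_RETRY_THRESHOLD = 3; // per the threshold suggested above

    function reportRetries(runs: TestRun[], weeklyCounts: Map<string, number>) {
      for (const run of runs) {
        if (run.retries === 0) continue;
        const total = (weeklyCounts.get(run.name) ?? 0) + run.retries;
        weeklyCounts.set(run.name, total);
        emitMetric("ci.test.retries", run.retries, { test: run.name });
        if (total > WEEKLY_RETRY_THRESHOLD) {
          fileTicket(`Flaky test: ${run.name} retried ${total}x this week`);
        }
      }
    }

    // Hypothetical stand-ins for real observability / ticketing integrations.
    function emitMetric(name: string, value: number, tags: Record<string, string>) {
      console.log(JSON.stringify({ metric: name, value, tags }));
    }
    function fileTicket(summary: string) {
      console.log(`TICKET: ${summary}`);
    }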

What actually reduces flakiness

Across published case studies from Google, Facebook, and Microsoft, and across Diffie's own customer telemetry, four interventions produce outsized results.

  • Eliminate hard-coded waits. Replace every sleep(2000) or fixed timeout with an assertion-based wait that polls for the expected state (a sketch follows this list). This alone addresses the largest single root cause category.
  • Isolate test state. Every test should set up and tear down its own data. No shared logins, no shared seeded records, no test order dependencies. Costs a small amount of runtime; buys massive reliability.
  • Describe intent, not implementation. Selectors anchored to data-testid or AI-driven semantic selectors are far more stable than CSS class chains. This is where self-healing AI tests provide the largest lift.
  • Quarantine aggressively. Any test that flakes twice in a week moves to a quarantine tier that does not block merges, with an automatic ticket filed. The team gets two weeks to fix or delete. The main suite stays trusted.
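
Here are the first and third interventions applied to the anti-pattern sketched earlier, again in Playwright-style TypeScript with illustrative selectors:

    import { test, expect } from "@playwright/test";

    test("checkout completes", async ({ page }) => {
      await page.goto("https://example.com/cart"); // illustrative URL

      // Intent, not implementation: role + accessible name instead of a
      // CSS class chain (a data-testid locator works equally well).
      const checkout = page.getByRole("button", { name: "Checkout" });

      // Assertion-based wait: polls until enabled, replacing sleep(2000).
      await expect(checkout).toBeEnabled();
      await checkout.click();

      // Polls for the expected end state rather than assuming a duration.
      await expect(page.getByText("Order confirmed")).toBeVisible();
    });

The role-based locator survives class renames, and both waits poll for semantic readiness instead of assuming a duration.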

Teams that migrated Selenium suites to self-healing AI-driven tests reported 40-60% reductions in flake rate within the first 90 days.

Source: Diffie customer telemetry, 2025-2026

The self-healing lift comes mostly from two categories: selector drift, which the AI re-resolves against the live DOM rather than failing, and timing issues, which the agent handles by waiting for semantic readiness (“the checkout button is visible and enabled”) rather than hard-coded durations. It does not eliminate the infrastructure and state-isolation categories; those still require engineering discipline.
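
That discipline is mostly ordinary test hygiene. As one hedged sketch of state isolation, here is a Playwright-style fixture that seeds and tears down its own data; the /api/test-users endpoint and the user shape are hypothetical:

    import { test as base } from "@playwright/test";

    type Fixtures = { user: { id: string; email: string } };

    // Each test gets a private user created before it runs and deleted
    // after, even on failure; no shared logins, no order dependencies.
    export const test = base.extend<Fixtures>({
      user: async ({ request }, use) => {
        const res = await request.post("https://example.com/api/test-users", {
          data: { email: `user-${Date.now()}@example.com` }, // unique per test
        });
        const user = await res.json();
        await use(user); // the test body runs here
        await request.delete(`https://example.com/api/test-users/${user.id}`);
      },
    });

    // Usage: the fixture guarantees fresh state in and fresh state out.
    test("profile renders", async ({ page, user }) => {
      await page.goto(`https://example.com/profile/${user.id}`);
    });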

Benchmarks: how does your team compare?

Use these bands, drawn from the published studies above, to calibrate your own flake rate.

  • Under 2%: Excellent. Matches mature engineering orgs with dedicated flake-mitigation tooling (Google post-2020 baseline).
  • 2-5%: Healthy. Typical of well-run CI pipelines with enforced quarantine discipline.
  • 5-15%: Common. This is where most teams live and where the productivity tax starts to bite.
  • 15-25%: Industry standard for E2E browser suites without self-healing. Developers are clicking rerun daily.
  • Over 25%: Critical. CI has lost most of its signal value. Intervene before the suite gets abandoned.

Methodology

Statistics in this report are drawn from: Google Engineering publications on flaky tests (2016, 2020); Luo, Hariri, Eloussi, and Marinov, “An Empirical Analysis of Flaky Tests,” University of Illinois (2014); Herzig and Nagappan, Microsoft Research (2015); Puppet State of DevOps 2023; CircleCI State of Software Delivery 2023; Stack Overflow Developer Survey (aggregated 2022-2024); and Diffie's own customer telemetry collected across production Selenium, Cypress, Playwright, and Diffie test suites during 2025-2026. Where figures are aggregated or ranged, this is noted inline. Diffie-sourced figures are labeled explicitly.

If these numbers resonate with your own pipeline, the fastest intervention is to pick your top three flakiest tests, tag them, and run them through an AI-driven test runner that re-resolves intent against the live DOM. Diffie takes about two minutes to set up.

Frequently Asked Questions

What is a flaky test?

A flaky test is an automated test that produces inconsistent results, passing on one run and failing on the next, without any change to the underlying code or environment. Flakiness is almost always a symptom of hidden non-determinism: timing assumptions, shared state, network variability, or fragile selectors.

What percentage of end-to-end tests are flaky?

Published research and industry surveys consistently place the flaky rate for end-to-end browser tests between 15% and 25%. Google Engineering reported in 2016 that roughly 16% of its total test compute was spent running or rerunning tests affected by flakiness. Teams with mature test infrastructure typically hold this below 5%; teams without dedicated tooling often exceed 30%.

What causes flaky tests?

The most common root causes, in order of frequency, are: timing and async issues (ranked #1 in multiple studies), selector fragility from UI changes, test order dependencies and shared state, network and infrastructure variability, and environment-specific behavior (e.g., headless vs. headed browser). Research from the University of Illinois categorized the large majority of observed flaky tests into buckets like these.

How long does it take to fix a flaky test?

Google published internal data showing a mean fix time of 3.7 engineering hours per flaky test. For teams without specialized debugging tooling, the median is closer to a full engineering day once you include reproduction, investigation, code change, and verification across multiple CI runs.

How much do flaky tests cost a team?

The direct cost is engineering hours. At a fully-loaded rate of $150/hour, a team with 50 flaky tests per quarter spends roughly $27,750 per quarter just on fixes. Indirect costs are larger: eroded developer trust in CI, delayed deploys, and flaky failures being ignored until a real bug ships to production.

Do retries fix flaky tests?

Retries hide flakiness; they do not fix it. Most CI platforms support automatic retries, and the short-term effect is a greener pipeline. The long-term effect is worse: genuine regressions get masked by the retry, investigation is deferred indefinitely, and the underlying non-determinism compounds. Treat retries as a telemetry signal ("this test flaked N times this week"), not a solution.

How do self-healing AI tests reduce flakiness?

In Diffie's customer telemetry, the largest single source of fixable flakes is selector fragility: tests break when developers change a CSS class, rename a button, or restructure the DOM. AI-driven self-healing tests describe what to verify (e.g., "click the checkout button") rather than how to find it (e.g., "button.btn-primary:nth-child(3)"). When the UI changes, the AI re-resolves the intent against the new DOM instead of failing. This typically eliminates 40-60% of total flakes in our customer data.

Written by Anand Narayan, Founder of Diffie. First engineer at HackerRank, CEO at Codebrahma.

Last updated April 14, 2026

Stop rerunning. Start shipping.

Diffie's self-healing AI tests eliminate the two largest flake categories, selector drift and timing, out of the box.