893 AI-generated Playwright test records analyzed
~3.6 min median generation time, excluding runs over 1 hour
70.8% first-run pass rate among tests with recorded outcomes
What is the Diffie AI E2E Testing Lab Benchmark?
The Diffie AI E2E Testing Lab Benchmark 2026 is an internal performance report based on 893 AI-generated Playwright tests created in Diffie between September 17, 2025 and April 26, 2026. It measures four things directly from the production database: how long the agent takes to generate a test, what share of generated tests pass on the first execution, how often a failed test recovers on a later run, and which application categories the system has been exercised against. The full methodology, denominators, and exclusions are disclosed inline.
Most AI testing claims are still vague. We wanted to put real numbers behind Diffie's own AI E2E test generation system, so we exported every test record created in Diffie between September 17, 2025 and April 26, 2026, and analyzed 893 of them. The dataset includes internal lab scenarios, founder-created tests, and a smaller number of early user-created tests.
This is not a broad customer survey, and we have been careful to label it accordingly. It is our first internal lab benchmark, a transparent look at how Diffie performs across generated browser tests, where it works well, and where the data is not yet good enough to publish a number.
Key findings at a glance
- Diffie analyzed 893 AI-generated Playwright test records across thirteen application categories.
- Median generation time was about 3.6 minutes, after excluding 14 outlier records with durations over one hour.
- Among 387 tests with recorded first-run outcomes, 70.8% passed on the first run.
- 490 of the 893 records were successfully generated at the time of the export. The rest were still in flight, cancelled, or had a generation failure.
- Of 113 tests that failed on first run, 24 later reached a passing run, a 21.2% recovery rate. We do not yet attribute this to self-healing because the underlying mechanism is not separately tracked in this dataset.
Why we ran this benchmark
The AI testing category has more marketing than measurement right now. Vendors claim faster generation, fewer flakes, and less maintenance, but most of the numbers floating around are either generic industry statistics borrowed from non-AI tooling or vague case-study quotes from a single customer.
We wanted to publish a number that came directly from our own production system, with the methodology spelled out and the exclusions disclosed. If a metric was not strong enough to publish, we wanted to say so explicitly rather than fudge it. The result is shorter than we hoped, but every number in the headline section came out of a SQL query against the same database that powers app.diffie.ai.
What is in the dataset
Dataset shape
Only 555 of the 893 records carry generation timing, because some records were created through paths that did not capture timing fields. Only 387 of the 893 have a recorded first-run pass or fail, because the rest had no first-run history at the time of export. The 490 successfully generated tests are those whose generation pipeline completed cleanly (tests.processing_status = "processed"). All pass-rate calculations in this report use the executed-tests denominator (387), because that is the cleanest measure of generation quality.
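For concreteness, here is a minimal sketch of how these denominators could be computed with a Postgres-style query. Only tests.processing_status = "processed" is taken from our schema as described above; the table shape and the other column names are illustrative assumptions, not our production schema.

```sql
-- Hypothetical sketch: denominator counts behind Chart 1.
-- Only processing_status = 'processed' is known from the report; other names are assumed.
SELECT
  COUNT(*)                                                         AS total_records,          -- 893
  COUNT(*) FILTER (WHERE generation_duration_seconds IS NOT NULL)  AS with_generation_timing, -- 555
  COUNT(*) FILTER (WHERE first_run_status IN ('passed', 'failed')) AS with_first_run_outcome, -- 387
  COUNT(*) FILTER (WHERE processing_status = 'processed')          AS successfully_generated  -- 490
FROM tests
WHERE created_at BETWEEN '2025-09-17' AND '2026-04-26';  -- the benchmark window
```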
Chart 1. Dataset composition. 893 total records, 555 with generation timing, 387 with recorded first-run outcomes, 490 successfully generated.
Download SVG ↓
1. Generation speed
For the 555 records with valid generation duration data, we computed median, P75, and P90 latencies. Fourteen records had durations longer than one hour (the longest about 16 days); these are almost certainly stuck jobs or legacy timing anomalies. We report headline numbers on the cleaned set of 541 records.
Generation duration
Chart 2. Generation time distribution. 555 records with timing data, with the 14 runs above 1 hour highlighted as excluded outliers.
Download SVG ↓
Public framing: in Diffie's internal lab benchmark, the median AI E2E test generation time was about 3.6 minutes after excluding runs longer than one hour. The long tail above P90 is dominated by tests against unusually large or slow apps and a handful of generation runs that hit retry loops.
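As a rough illustration of how the headline timing numbers are derived (the duration column name is an assumption; the one-hour cutoff is the report's stated exclusion rule):

```sql
-- Hypothetical sketch: timing percentiles on the cleaned set of 541 records.
SELECT
  percentile_cont(0.50) WITHIN GROUP (ORDER BY generation_duration_seconds) AS median_seconds,  -- ≈ 216 s (about 3.6 minutes)
  percentile_cont(0.75) WITHIN GROUP (ORDER BY generation_duration_seconds) AS p75_seconds,
  percentile_cont(0.90) WITHIN GROUP (ORDER BY generation_duration_seconds) AS p90_seconds
FROM tests
WHERE generation_duration_seconds IS NOT NULL   -- 555 records with timing data
  AND generation_duration_seconds <= 3600;      -- drops the 14 outliers, leaving 541
```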
2. First-run pass rate
First-run pass rate is the share of generated tests that pass on the first execution, with no retries, no manual edits, and no self-healing. It is the cleanest measure of generation quality.
First-run outcomes
Chart 3. First-run pass rate among the 387 tests with recorded first-run outcomes.
Download SVG ↓
Public framing: among 387 generated tests with recorded first-run outcomes, 70.8% passed on the first run. We are careful with the denominator: this is not 70.8% of all 893 tests in the dataset, because 506 records had no recorded first-run outcome at the time of export. Always quote it as “among tests with recorded first-run outcomes.”
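In query terms, the rule is simply that the denominator is restricted to rows with a recorded first-run outcome. A minimal sketch, with the same assumed column names as above:

```sql
-- Hypothetical sketch: first-run pass rate over the executed-tests denominator.
SELECT
  COUNT(*) FILTER (WHERE first_run_status = 'passed') AS first_run_passes,  -- 274
  COUNT(*)                                            AS executed_tests,    -- 387
  ROUND(100.0 * COUNT(*) FILTER (WHERE first_run_status = 'passed')
        / COUNT(*), 1)                                 AS pass_rate_pct     -- 70.8
FROM tests
WHERE first_run_status IN ('passed', 'failed');  -- excludes the 506 rows with no recorded outcome
```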
3. Recovery after first failure
We do not publish a final pass-rate headline because the cleanest denominator (tests successfully generated) leaves more than a fifth of the records unexecuted at any point in time, and any headline number would either flatter the result by hiding that bucket or punish it by counting the unexecuted as failures. What we can report directly is what happened to tests after their first run failed.
Recovery after first failure
Roughly one in five tests that fail on the first run later reach a passing run. We deliberately do not attribute this to self-healing, because the underlying mechanism is not separately tracked in this dataset. The right phrasing is “recovered after follow-up execution, retry, or correction” until healing attempts are instrumented explicitly.
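A minimal sketch of how the recovery number can be derived, assuming a hypothetical test_runs table with one row per execution; our actual run tables may be shaped differently:

```sql
-- Hypothetical sketch: of tests whose first run failed, how many ever reached a passing run.
WITH first_runs AS (
  SELECT DISTINCT ON (test_id) test_id, status AS first_status
  FROM test_runs
  ORDER BY test_id, started_at                  -- earliest run per test
),
ever_passed AS (
  SELECT DISTINCT test_id FROM test_runs WHERE status = 'passed'
)
SELECT
  COUNT(*)                                        AS failed_first_run,   -- 113
  COUNT(ep.test_id)                               AS recovered_later,    -- 24
  ROUND(100.0 * COUNT(ep.test_id) / COUNT(*), 1)  AS recovery_rate_pct   -- 21.2
FROM first_runs fr
LEFT JOIN ever_passed ep USING (test_id)
WHERE fr.first_status = 'failed';
```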
4. Category coverage
Each test record was classified into one of thirteen application categories based on the test name and description. Categories were predefined and the labels were assigned by Claude Haiku 4.5. We treat this as awareness-level data, not a statistical sample, but it is useful to show the spread of application types the system has been exercised against.
Top application categories
Chart 4. App and flow category coverage across all 893 records, sorted from largest to smallest.
Download SVG ↓
The dataset spans marketing pages, multi-step auth flows, SaaS dashboards with authenticated state, AI chat interfaces, developer tools, internal admin panels, productivity apps, and ecommerce checkouts. Long-tail categories (marketplace, media, social, other) are shown but should not be read as statistically meaningful.
5. First-run pass rate, broken out by category
We restricted this view to categories with at least 28 executed tests, so each bar represents a defensible sample. Smaller categories (marketplace, media, social, fintech, “other”) are excluded from this chart because the noise would swamp the signal at low n.
Chart 5. First-run pass rate by category. Sample sizes shown after each label. Categories with fewer than 28 executed tests are excluded.
Download SVG ↓
The pattern is intuitive: marketing pages and ecommerce checkouts are mostly static and DOM-stable, so first-run pass rate is high (89.6% and 85.7%). SaaS dashboards, AI chat interfaces, and productivity apps run lower (45.2%, 57.1%, 54.5%) because they involve authenticated multi-step state, dynamic content, and animations that the generator has to reason about. We expect that gap to narrow as the agent improves at handling stateful flows.
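The per-category view is the same pass-rate calculation grouped by an assumed category column, with the minimum sample-size cutoff applied:

```sql
-- Hypothetical sketch: first-run pass rate per category, minimum 28 executed tests.
SELECT
  category,
  COUNT(*) AS executed_tests,
  ROUND(100.0 * COUNT(*) FILTER (WHERE first_run_status = 'passed')
        / COUNT(*), 1) AS pass_rate_pct
FROM tests
WHERE first_run_status IN ('passed', 'failed')
GROUP BY category
HAVING COUNT(*) >= 28           -- drops marketplace, media, social, fintech, 'other'
ORDER BY pass_rate_pct DESC;
```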
Methodology
This benchmark is based on 893 AI-generated Playwright test records created in Diffie between September 17, 2025 and April 26, 2026. The dataset includes internal founder-created tests, lab benchmark scenarios, and a smaller number of early user-created tests. All tests use the same Playwright JavaScript framework and run against Browserbase Chrome.
For generation-time calculations, we analyzed the 555 records with valid generation duration data. Runs longer than one hour (14 records) were excluded from headline timing statistics to avoid including stuck jobs, interrupted sessions, or legacy timing anomalies. Reported median, P75, and P90 are computed against the cleaned set of 541 records.
For pass-rate calculations, we analyzed only the 387 records with explicit pass/fail first-run outcomes. Rows without a recorded first-run outcome (506 records) were excluded from pass-rate calculations rather than counted as failures, on the grounds that an unexecuted test is not a failed test.
Application categories were assigned by Claude Haiku 4.5 based on test name and description, into thirteen predefined buckets. Spot checks against the source records matched the LLM labels in the cases we reviewed.
Self-healing success rate is not reported in this version because healing-attempt fields were not consistently tracked in the dataset. Failure-pattern breakdown is not reported because 95% of failed runs had empty error fields at the time of export.
| Metric | Records included | Records excluded |
|---|---|---|
| Total records analyzed | 893 | 0 |
| Generation timing (any duration) | 555 | 338 |
| Generation timing (cleaned, ≤ 1 hour) | 541 | 352 |
| Records successfully generated | 490 | 403 |
| First-run pass rate (denominator: executed tests) | 387 | 506 |
| Self-healing success rate | 0 | 893 |
| Failure pattern breakdown | 5 | 888 |
Downloads & cite this benchmark
We are publishing the summary dataset and chart files so other writers, analysts, and engineers can understand how the benchmark was calculated and embed the charts in their own work. Raw test prompts, app URLs, user identifiers, and any tenant-specific fields are excluded for privacy and security. The raw export is not published.
Summary dataset
Download diffie-ai-e2e-lab-benchmark-2026-summary.csv ↓
Aggregate metrics only: counts, percentages, sample sizes, calculation scope, and notes. No test IDs, URLs, prompts, or descriptions.
Chart files (SVG, free to embed with attribution)
Suggested citation
Source: Diffie AI E2E Testing Lab Benchmark 2026, based on 893 AI-generated Playwright test records created in Diffie between September 2025 and April 2026. Available at https://diffie.ai/blog/diffie-lab-benchmark-2026.
Limitations
- The dataset is dominated by internal and founder-created tests. It is not yet representative of customer production usage at scale.
- Pass-rate metrics are conditioned on tests that have actually been executed. Generated-but-unexecuted tests are excluded.
- Failure-category fields are too sparse to draw conclusions from. We cannot report which failure modes dominate yet.
- Self-healing is not separately instrumented in this dataset, so we report recovery as “passed after first failure” rather than as a healing rate.
- All tests share a single framework and browser (Playwright JavaScript on Browserbase Chrome). We cannot compare frameworks or browsers from this dataset.
What we are measuring next
The gaps in this benchmark are the roadmap for the next one. Specifically:
- Self-healing success rate. Track healing attempts as their own event so we can report the share of failures the agent successfully resolves without human input. A rough schema sketch for this, and for the structured error record in the next item, follows this list.
- Failure pattern breakdown. Always write structured error data on every failed run, so the next benchmark can report the distribution of timeout vs selector vs assertion vs network failures honestly.
- Manual-edit rate. Distinguish “user edited the prompt” from “system regenerated after a failure” in the spec version history, so we can report how often human intervention was required.
- Customer time saved. Once we have enough customer usage, compare AI-generated test creation time against published benchmarks for hand-written Playwright tests.
- Framework comparison. If we add support for other frameworks or browsers, report cross-framework numbers side by side.
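To make the first two items concrete, here is one possible shape for the new tables. This is a hypothetical sketch, not our current schema:

```sql
-- Hypothetical schema sketch for the next benchmark's instrumentation.
CREATE TABLE healing_attempts (
  id            BIGSERIAL PRIMARY KEY,
  test_id       BIGINT NOT NULL,
  failed_run_id BIGINT NOT NULL,          -- the run whose failure triggered the attempt
  triggered_at  TIMESTAMPTZ NOT NULL DEFAULT now(),
  resolved      BOOLEAN,                  -- did the regenerated script pass on re-run?
  human_edit    BOOLEAN DEFAULT FALSE     -- separates self-healing from manual correction
);

CREATE TABLE run_errors (
  run_id        BIGINT NOT NULL,
  error_type    TEXT NOT NULL,            -- e.g. 'timeout', 'selector', 'assertion', 'network', 'other'
  selector      TEXT,                     -- offending selector, when applicable
  message       TEXT NOT NULL,            -- raw error text from the runner
  occurred_at   TIMESTAMPTZ NOT NULL DEFAULT now()
);
```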
Related reading
- The State of End-to-End Testing in 2026, the broader industry picture this benchmark sits inside.
- The Flaky Test Report 2026, the published research on flaky tests across the industry, which is what we will compare future Diffie data against.
- AI Test Generation: How It Actually Works, the technical detail behind the system that produced these 893 tests.
- Why E2E Tests Break (And How AI Fixes It), the failure-mode background that the next benchmark will quantify.
Frequently Asked Questions
Why is this called a "lab benchmark" rather than a customer benchmark?
The dataset is dominated by tests created by Diffie founders, internal lab scenarios, and a smaller number of early user-created tests, between September 17, 2025 and April 26, 2026. We will publish a customer-production benchmark once we have enough volume from outside our own usage. Calling it a lab benchmark up front keeps the framing honest.
Why exclude tests with generation runs longer than one hour?
Fourteen records had generation durations above 3,600 seconds. The longest was about 16 days. These are almost certainly stuck jobs, interrupted background runs, or legacy timing anomalies, not real generation latencies. Including them would skew the median enough to mislead readers, so we report headline timing on the cleaned set of 541 records.
Why is first-run pass rate calculated only on 387 records out of 893?
Of the 893 generated test records, only 387 had a recorded first-run pass or fail at the time of the export. The other 506 either had no run history yet or had a first run still in flight. We did not want to penalize the pass-rate metric by counting unexecuted tests as failures, so we limit the pass-rate denominator to executed tests and disclose the exclusion explicitly.
Why no self-healing success rate?
Our schema does not yet record self-healing attempts as a distinct event. There is no field that captures "the runner detected a failure, regenerated the script, and re-ran it." Without that signal, any self-healing claim would be guessed from indirect signals like spec version bumps, which is not strong enough for a public number. We are adding explicit healing-attempt tracking before the next benchmark.
Why no failure pattern breakdown?
Of 99 final-failed tests in the export, only 5 had usable error text in the database. The other 94 had empty error fields, so any failure-category histogram would be 95% "unknown" and would mislead readers about which failure modes dominate. We are fixing the runner to write structured error data on every failure, which will make the next benchmark publishable.
How were category labels assigned?
Each test record was classified into one of thirteen application categories by Claude Haiku 4.5, based on the test name and description. Categories were predefined: marketing site, auth flow, SaaS dashboard, AI chat or assistant, developer tools, internal tool, productivity, ecommerce, fintech or payments, marketplace, media or content, social or community, and other. Spot checks against the source records matched the LLM labels in the cases we reviewed.
What framework and browser were these tests run in?
All 893 tests are AI-generated Playwright JavaScript scripts executed against Browserbase Chrome. Diffie does not currently support other frameworks or browsers, so framework and browser are constants in this dataset, not variables.
Will Diffie publish the underlying CSV?
Not in this version. The export contains tenant identifiers, user-provided test descriptions, and embedded application URLs that we cannot publish without coordinating with the customers who created them. We are open to a redacted release for researchers on request.
Written by Anand Narayan, Founder of Diffie. First engineer at HackerRank, CEO at Codebrahma.
Last updated April 28, 2026