Writing good Diffie tests
A Diffie test is a short, plain-English script that the AI compiles into Playwright code and runs in a real browser. Two things decide whether it passes reliably or flakes: how much it tries to do, and how specific each step is. The rule of thumb: one outcome per test, between 10 and 20 concrete steps, and never a vague instruction. Every step names the exact element, the exact input, and the exact thing to verify.
The golden rule
One test verifies one user-facing outcome, in 10 to 20 steps, using one user account. If you find yourself writing “and then also”, that's a second test.
- One outcome. “User can search and see results” is one test. “Search, then add to cart, then check out, then log in as admin” is four.
- 10 to 20 steps. Shorter tests get reused. Long tests fail in ways nobody wants to debug, and the AI loses track of context past about twenty steps.
- One role. Don't mix Admin and Client in one test. Sign in as one user; if you need to verify another role, write a second test.
- One environment. Pin to a single URL. Don't make the test depend on navigating between subdomains or environments mid-run.
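The step-count and one-role rules are simple enough to check mechanically before a test is saved. The sketch below is one possible check, not part of Diffie: `checkScope`, `ScopeReport`, and the sign-in regex are hypothetical names and heuristics, and the 10 and 20 defaults simply mirror the rule above.

```typescript
// Hypothetical pre-save check for the golden rule; not a Diffie API.
interface ScopeReport {
  stepCount: number;
  stepCountOk: boolean; // true when the count is inside [minSteps, maxSteps]
  signInCount: number;  // steps that mention logging or signing in
  singleRole: boolean;  // true when at most one step signs in
}

function checkScope(
  steps: string[],
  minSteps: number = 10,
  maxSteps: number = 20
): ScopeReport {
  // Heuristic: "log in", "logs in", "sign in", "signs in" count as a sign-in.
  // "logged in" (an assertion about state) deliberately does not match.
  const signInPattern = /\b(log|sign)s? in\b/i;
  const signInCount = steps.filter((s) => signInPattern.test(s)).length;

  return {
    stepCount: steps.length,
    stepCountOk: steps.length >= minSteps && steps.length <= maxSteps,
    signInCount,
    singleRole: signInCount <= 1,
  };
}
```

A report with `singleRole: false` is a hint that the test switches accounts mid-run and is probably two tests. The regex only sees steps that literally say "log in" or "sign in", so treat the result as a prompt for review rather than a verdict.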
Be specific, not vague
The single biggest cause of flaky AI-generated tests is vague instructions. The model will fill in the blanks, and it will fill them in differently each run. Vagueness shows up in three places: the element you're acting on, the input you're providing, and the result you expect.
For every step, ask: which element, what value, what visible result? If any of those three is missing, the step is too vague. Rewrite it.
Name the element exactly
Vague:
- “Click the button.”
- “Open the menu.”
- “Go to settings.”
Specific:
- “Click the button labeled 'Add to cart' on the product card.”
- “Click the user avatar in the top-right header to open the account menu.”
- “Click 'Settings' in the left sidebar (testid nav-settings).”
Quote inputs exactly
Vague:
- “Enter your email.”
- “Search for a product.”
- “Try invalid input.”
Specific:
- “Type the value of the LOGIN_EMAIL secret into the Email field.”
- “Type 'hummingbird t-shirt' into the search bar and press Enter.”
- “Type 'not-an-email' into the Email field and click Submit.”
State the expected result exactly
Vague:
- “Verify the form works.”
- “Check that the page loads.”
- “Confirm the user is logged in.”
- “Verify validation errors appear.”
Specific:
- “Verify a green banner with the text 'Profile saved' appears at the top of the page.”
- “Verify the URL changes to /dashboard and the page shows a heading containing 'Welcome back'.”
- “Verify the user avatar in the header shows the initials 'JD'.”
- “Verify the Email field shows the error text 'Please enter a valid email address'.”
Banned words. If a step uses one of these, treat it as a bug: works, properly, correctly, appropriately, as expected, etc., normally, smoothly, fine, successfully (without saying how you can see the success). They sound like instructions but they don't name anything observable.
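Because the banned list is a fixed set of phrases, a few lines of code can flag them before a test is saved. This is a minimal sketch, not a Diffie feature: `findBannedPhrases` and `stripQuotedLiterals` are hypothetical helpers, and the phrase list is copied from the paragraph above.

```typescript
// Phrases the guide treats as bugs when they appear in a step.
const BANNED_PHRASES: string[] = [
  "works", "properly", "correctly", "appropriately", "as expected",
  "etc", "normally", "smoothly", "fine", "successfully",
];

// Remove quoted literals so exact text to type or verify is never flagged.
// A straight single quote only opens a literal after whitespace or "(",
// which keeps apostrophes in words like "user's" from being misread.
function stripQuotedLiterals(step: string): string {
  const quoted = /(^|[\s(])'[^']*'(?=[\s.,;:)]|$)|"[^"]*"|‘[^’]*’|“[^”]*”/g;
  return step.replace(quoted, " ");
}

// Return every banned phrase that occurs as a whole word in the step,
// in the order the phrases appear in BANNED_PHRASES.
function findBannedPhrases(step: string): string[] {
  const text = stripQuotedLiterals(step).toLowerCase();
  return BANNED_PHRASES.filter((phrase) =>
    new RegExp(`\\b${phrase}\\b`).test(text)
  );
}
```

The check always flags "successfully", even when the step goes on to say how the success is visible, so a hit means "look at this step again", not "reject the test".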
Anatomy of a good test
Every step is either an action or an assertion. Actions tell the browser what to do (“click”, “type”, “press”). Assertions tell the browser what to check (“verify the page shows”). Alternate them: act, verify, act, verify.
Search returns matching products
1. Navigate to http://localhost:8086/
2. Verify the homepage loads with the PrestaShop logo,
a search bar in the header, and a cart icon showing 0 items
3. Click into the search bar and type 't-shirt', then press Enter
4. Verify the search results page loads with a heading
containing 'Search results'
5. Verify the product 'Hummingbird printed t-shirt' appears
in the results with price 19.12
6. Clear the search bar, type 'mug', and press Enter
7. Verify the results show three mug products:
'Mug The best is yet to come',
'Mug The adventure begins',
'Mug Today is a good day'
8. Clear the search bar, type 'nonexistentproduct123',
and press Enter
9. Verify the search results page shows a message
indicating no results were found
Nine steps. One outcome (“search works for hits, multi-hit, and miss”). Every step names the exact element, the exact input, and what to look for afterward. The AI doesn't have to guess what “works” means.
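The act-then-verify rhythm of the search test can be checked with a small script. The sketch below assumes that assertions start with the word "Verify", as every assertion in this guide does; `classifyStep` and `unverifiedActions` are hypothetical helpers, not Diffie functions.

```typescript
type StepKind = "action" | "assertion";

// Assumption: a step is an assertion when it starts with "Verify";
// everything else (navigate, click, type, press, clear) is an action.
function classifyStep(step: string): StepKind {
  return /^\s*verify\b/i.test(step) ? "assertion" : "action";
}

// Return the 1-based numbers of actions that are not immediately
// followed by an assertion (including an action in the last position).
function unverifiedActions(steps: string[]): number[] {
  const missing: number[] = [];
  steps.forEach((step, i) => {
    if (classifyStep(step) !== "action") return;
    const isLast = i === steps.length - 1;
    if (isLast || classifyStep(steps[i + 1]) !== "assertion") {
      missing.push(i + 1);
    }
  });
  return missing;
}

// The search test from above, one string per numbered step.
const searchTest: string[] = [
  "Navigate to http://localhost:8086/",
  "Verify the homepage loads with the PrestaShop logo, a search bar in the header, and a cart icon showing 0 items",
  "Click into the search bar and type 't-shirt', then press Enter",
  "Verify the search results page loads with a heading containing 'Search results'",
  "Verify the product 'Hummingbird printed t-shirt' appears in the results with price 19.12",
  "Clear the search bar, type 'mug', and press Enter",
  "Verify the results show three mug products: 'Mug The best is yet to come', 'Mug The adventure begins', 'Mug Today is a good day'",
  "Clear the search bar, type 'nonexistentproduct123', and press Enter",
  "Verify the search results page shows a message indicating no results were found",
];
```

For `searchTest` the function returns an empty list: two assertions in a row (steps 4 and 5) are fine, because the rule only requires that no action goes unchecked.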
The mega-prompt anti-pattern
The most common failure mode is asking for a full QA pass in one prompt. It looks thorough, but the AI has to make hundreds of decisions with no concrete checkpoints, so results drift and the run is unreproducible.
Don't do this
You are a Senior QA Automation Engineer + Exploratory Tester
+ Security Tester. Test https://operation.example.com/
with these four users (Admin, Manager, Engineer, Client).
You must execute ALL test cases TC-011 through TC-134:
- Authentication, session persistence, role separation
- Every button: click, try empty submission, try invalid
input, try valid input, verify persistence
- Every form: empty / partial / invalid / valid + reload
- Full workflow: Intake to Matching to Profile to Briefing
to Mock to Presentation to Interview to Debrief to Placement
- AI feature validation
- Security: cross-role access, URL manipulation, etc.
Report: execution summary, deep functional results per case,
defects, workflow validation, security report, AI behavior
report, final verdict (GO / NO-GO).
This prompt has four problems:
- Too many outcomes. Authentication, CRUD, workflow, AI, and security are five different test domains. Each needs its own test.
- Too many users. Switching accounts inside one run is brittle. A sequence of log in, act, log out, log in again will eventually hit a session race.
- No concrete steps. “Try invalid input” is not an instruction. Which field? What value? What error message do you expect?
- No stable assertions. “Verify persistence” means different things on different pages. The AI will guess differently each run.
The fix is to decompose. The same coverage written as twenty or thirty small tests in a suite runs faster, fails more usefully, and stays green when one feature changes.
Decompose by scenario
For each feature, list the distinct outcomes. Each outcome becomes one test. Take a login form as an example.
Bad: one test for “login”
“Test login with wrong password, then with empty fields, then with valid credentials, and verify the dashboard loads and the user can also log out.”
That's four outcomes in one test. If step one fails (wrong password screen changed), the whole run fails and you learn nothing about the other three.
Good: four small tests
- TC-Login-01. Wrong password shows an error and stays on the login page.
- TC-Login-02. Submitting empty fields shows required-field validation.
- TC-Login-03. Valid credentials redirect to the dashboard.
- TC-Login-04. Clicking “Log out” from the dashboard returns to the login page.
Each test is 6 to 12 steps. If TC-Login-01 breaks, the other three still run and you see the failure isolated to the one assertion that changed. Group all four into a “Login” suite and run them together in CI.
Writing each step
A step is a sentence in the imperative. Aim for one verb per step. Be specific about the element, the input, and the expected response.
Actions
- Name the element. “Click the ‘Add to cart’ button”, not “click the button”. If the element has a data-testid, reference it: “click the element with testid add-to-cart”.
- Quote inputs. “Type 't-shirt' into the search bar” is unambiguous. “Search for a t-shirt” is not.
- One verb per step. Split “type your email and click Submit” into two steps. Each becomes its own Playwright call.
- Reference secrets by name. Use LOGIN_EMAIL and LOGIN_PASSWORD instead of pasting real values. Diffie injects the encrypted secret at runtime.
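The one-verb rule can be approximated by counting action verbs in a step. This is a rough heuristic under stated assumptions: `ACTION_VERBS` is a hypothetical starter list, `countActionVerbs` is not a Diffie function, and a verb used as a noun (for example "the account type dropdown") will be over-counted.

```typescript
// Hypothetical starter list; extend it to match your team's vocabulary.
const ACTION_VERBS: string[] = [
  "navigate", "click", "type", "press", "clear", "select", "hover",
];

// Count action verbs in a step, ignoring anything inside quotes so that
// a label such as 'Click here' does not count as a second verb.
function countActionVerbs(step: string): number {
  const quoted = /(^|[\s(])'[^']*'(?=[\s.,;:)]|$)|"[^"]*"|‘[^’]*’|“[^”]*”/g;
  const text = step.replace(quoted, " ").toLowerCase();
  const words: string[] = text.match(/[a-z]+/g) ?? [];
  return words.filter((word) => ACTION_VERBS.indexOf(word) !== -1).length;
}
```

A count above 1 suggests the step should be split: `countActionVerbs("Type your email and click Submit")` is 2, while `countActionVerbs("Type 't-shirt' into the search bar")` is 1.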
Assertions
- Verify a visible result, not internal state. “Verify the heading shows ‘Search results’” is testable. “Verify the database wrote a row” is not (the browser cannot see your database).
- Assert after every meaningful action. Click, then verify. Type, then verify. If you don't check, you don't know which step actually broke when the test fails twenty steps later.
- Pin to exact text where possible. “Verify a message appears” is weak. “Verify the page shows the text ‘Invalid credentials’” is strong.
- Don't assert that ‘a button exists’. Click the button and verify what happens. Existence proves nothing about behavior.
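A quick way to spot weak assertions is to check whether the step pins anything concrete. The sketch below treats an assertion as pinned when it quotes exact text or names a URL path; `pinsExactTarget` is a hypothetical helper, and the two patterns are a simplification that will miss other concrete targets such as a bare number.

```typescript
// True when the assertion quotes exact text or names a URL path.
// A straight single quote only opens a literal after whitespace or "(",
// so apostrophes in words like "user's" are not mistaken for quotes.
function pinsExactTarget(assertion: string): boolean {
  const quotedText =
    /(^|[\s(])'[^']+'(?=[\s.,;:)]|$)|"[^"]+"|‘[^’]+’|“[^”]+”/;
  const urlPath = /(^|\s)\/[\w\-\/]+/;
  return quotedText.test(assertion) || urlPath.test(assertion);
}
```

With this check, “Verify a message appears” fails and “Verify the page shows the text 'Invalid credentials'” passes, which matches the weak and strong examples above.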
Forms: one validation case per test
Forms are where the mega-prompt pattern shows up most often. The temptation is to test empty, partial, invalid, and valid inputs in a single run. Split them.
- Empty submission shows required-field validation on every required field.
- Invalid email format shows the email-specific error.
- Valid submission succeeds, redirects, and shows the success state.
- Persistence: after submitting, reload and verify the data is still there.
Four tests, each 8 to 15 steps. If you ship a regression that changes the email validation message, exactly one test fails and tells you where.
Multi-role and multi-user workflows
Workflows that span roles (Admin creates a record, Client sees it) are real and worth testing. They're also the easiest tests to break by writing them as one long run.
Pattern: split into per-role tests, then add one short handoff test that explicitly tests the boundary.
- TC-Intake-Admin-01. Admin logs in, creates an intake record, verifies it shows in the admin list.
- TC-Intake-Client-01. Client logs in (with a known pre-seeded record), verifies the record is visible on their dashboard.
- TC-Intake-Handoff-01. Admin creates a record tagged with a unique marker, logs out, logs in as Client, verifies the same marker is visible. This is the only test that switches accounts mid-run, and it stays under twenty steps.
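The handoff test depends on a marker that no other record shares. One way to produce such a value is sketched below; `makeMarker` and `randomSuffix` are hypothetical helpers, and how the value reaches the test (typed into the step by hand, or supplied some other way) depends on your setup and is not covered by this guide.

```typescript
const MARKER_ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789";

// Build a random lowercase alphanumeric string of the given length.
function randomSuffix(length: number): string {
  let out = "";
  for (let i = 0; i < length; i++) {
    const index = Math.floor(Math.random() * MARKER_ALPHABET.length);
    out += MARKER_ALPHABET.charAt(index);
  }
  return out;
}

// Combine a readable prefix, the current time in base 36, and a random
// suffix, so markers from different runs are very unlikely to collide.
function makeMarker(prefix: string): string {
  const timePart = Date.now().toString(36);
  return `${prefix}-${timePart}-${randomSuffix(6)}`;
}
```

A marker from `makeMarker("intake")` can then be quoted exactly in both halves of the handoff: the Admin step types it into the record, and the Client step verifies that the same text is visible.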
Group related tests into a suite
Diffie suites are how you get the coverage of the mega-prompt without the fragility. Create a suite called “Auth” or “Checkout”, add the small tests that cover that area, and run the suite from CI on every PR. When one test in the suite fails, you get a recording and a diff of exactly that scenario.
See “Run a test suite in CI” for the workflow setup.
A template you can copy
Use this as a starting point. Replace the bracketed placeholders. Aim for 10 to 20 steps; if you can't fit the test in that range, the test is doing too much.
Scenario: [single outcome in one sentence]
Role: [one user, e.g. Admin signed in via LOGIN_EMAIL]
URL: [pin a single environment, e.g. https://staging.example.com]
1. Navigate to [URL]
2. Verify the page loads with [one or two anchor elements
that prove you're on the right page]
3. [Action: click / type / press, naming exact element and value]
4. Verify [the visible result of step 3]
5. [Next action]
6. Verify [result]
... (keep alternating action + verify, 10 to 20 steps total)
Final step: Verify [the outcome named in 'Scenario'] is visible.
Pre-commit checklist
Before saving a test, run it through these questions. If any answer is “no” or “multiple”, split or rewrite the test.
- Does the test verify exactly one outcome?
- Is the step count between roughly 10 and 20?
- Is every step either one action or one assertion?
- Does every action have a verification step after it, before the next action?
- For every step: is the element named (label, testid, or aria), the input quoted exactly, and the expected result a specific visible thing (exact text, URL, element state)?
- Is the test free of banned words? (works, properly, correctly, as expected, successfully without saying how.)
- Does the test stay signed in as one user the whole run?
- Could a new teammate read step 1 and run the test by hand without asking questions?