The mental model.
When you wrote code by hand, tests caught your bugs. When an agent writes code, tests do something different: they catch the model's bugs — and the model's bugs are weirder than yours.
A human writes `if (x > 0)` when they meant `if (x >= 0)`. An agent writes a beautiful, well-named function that looks correct in isolation, references a config option that doesn't exist, and confidently calls a deprecated API. Your tests need to catch the second category, not the first.
- Integration tests beat unit tests, by a lot. Agent-written code passes unit tests it wrote against itself. It fails integration tests that pin down real behavior at module boundaries.
- Real data beats fake data. The agent's mock objects are coherent inside the test file and meaningless to your real Postgres schema. Spin up a real DB in CI.
- End-to-end smoke tests are mandatory. Not for coverage. To answer the question "does the thing it built actually do the thing?" — a question unit tests cannot answer.
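To make the first bullet concrete, here is a minimal, framework-free sketch. Every name in it (`realConfig`, `maxRetries`, `retryLimit`, `retriesFor`) is hypothetical. The function reads a config key that does not exist. The check built on the agent's own mock passes, while the check wired to the real config exposes the problem.

```typescript
// Hypothetical names throughout; this illustrates the failure mode, not a real codebase.
type Config = Record<string, unknown>;

// The real config, as it might be loaded from a file or the environment.
const realConfig: Config = { maxRetries: 5, timeoutMs: 2000 };

// Agent-written code: tidy and well named, but it reads a key that does not exist.
function retriesFor(config: Config): number {
  const limit = config["retryLimit"];
  return typeof limit === "number" ? limit : 0; // silently falls back to 0
}

// "Unit test" written against the agent's own mock: internally consistent, so it passes.
function unitCheckWithMock(): boolean {
  const mockConfig: Config = { retryLimit: 3 };
  return retriesFor(mockConfig) === 3;
}

// Integration-style check: the same function, wired to the real config at the module boundary.
function integrationCheckWithRealConfig(): boolean {
  return retriesFor(realConfig) === realConfig["maxRetries"];
}
```

Here `unitCheckWithMock()` returns `true` and `integrationCheckWithRealConfig()` returns `false`: the mock and the code agree with each other, and neither agrees with the real config.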
Pyramid, not bell curve.
Classic test pyramid: many unit, fewer integration, a handful of e2e. AI-era pyramid is squashed differently — fewer unit, many integration, more e2e than you used to write:
| Layer | Old shape | AI-era shape | Why the shift |
|---|---|---|---|
| Unit | 70% | 40% | Agent writes them against its own code; low signal. |
| Integration | 20% | 45% | Module boundaries are where the agent's hallucinations show up. |
| E2E / smoke | 10% | 15% | The only test that proves the feature works as a whole. |
Vitest runs 500 integration tests in 4 seconds on a modern Mac. The 10× speed difference between unit and integration is mostly a 2010s problem. Don't shape your strategy around 2010s constraints.
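As a rough sketch, the AI-era column of the table can be turned into a per-layer test budget. The helper name `planTests` and its rounding rule (any remainder goes to integration) are assumptions for illustration; only the percentages come from the table.

```typescript
// Percentages from the "AI-era shape" column; planTests itself is a hypothetical helper.
const AI_ERA_PERCENT = { unit: 40, integration: 45, e2e: 15 } as const;

type Layer = keyof typeof AI_ERA_PERCENT;

// Split a total test budget across layers. Unit and e2e are rounded down,
// and integration receives whatever is left so the counts always sum to `total`.
function planTests(total: number): Record<Layer, number> {
  const unit = Math.floor((total * AI_ERA_PERCENT.unit) / 100);
  const e2e = Math.floor((total * AI_ERA_PERCENT.e2e) / 100);
  return { unit, integration: total - unit - e2e, e2e };
}
```

For a budget of 20 tests, `planTests(20)` gives 8 unit, 9 integration, and 3 e2e.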
Test contracts, not implementations.
A contract test answers "what does this function promise to its callers?" An implementation test answers "did I write the code I wrote?" The second one is useless when the agent might rewrite the same function tomorrow.
The simple rewrite: test the observable behavior of the public API, not the internals.
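```typescript
import { expect, test, vi } from 'vitest';
// `cache`, `fetchUser`, and `countDbCalls` come from your own modules and test helpers.

// Tests the cache library, not your code's behavior
test('fetchUser caches result', async () => {
  const spy = vi.spyOn(cache, 'set');
  await fetchUser(1);
  expect(spy).toHaveBeenCalledWith('user:1', expect.anything());
});

// Tests the actual promise: 2nd call doesn't hit the DB
test('fetchUser only hits the DB once per user', async () => {
  const dbCalls = countDbCalls(async () => {
    // Sequential on purpose: the second call must be able to see the first call's cached result.
    await fetchUser(1);
    await fetchUser(1);
  });
  await expect(dbCalls).resolves.toBe(1);
});
```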
The first test breaks the day the agent swaps to redis-cache instead of node-cache — even though nothing the user cares about changed. The second test still holds. That's the test you want around AI-written code: it survives rewrites that don't change behavior, and catches behavior changes the agent calls "refactors."
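One possible shape for the `countDbCalls` helper, sketched without a test framework. The query counter, `findUserInDb`, and the `Map`-based cache are assumptions standing in for a real database client and cache; only `fetchUser` and `countDbCalls` are names from the text.

```typescript
type User = { id: number; name: string };

// Stand-in for a database client that counts its queries (an assumption for this sketch).
let dbQueryCount = 0;
async function findUserInDb(id: number): Promise<User> {
  dbQueryCount += 1;
  return { id, name: `user-${id}` };
}

// How fetchUser caches is an implementation detail; the contract check never looks at it.
const userCache = new Map<number, User>();
async function fetchUser(id: number): Promise<User> {
  const hit = userCache.get(id);
  if (hit) return hit;
  const user = await findUserInDb(id);
  userCache.set(id, user);
  return user;
}

// Resolves to the number of DB queries issued while `work` runs.
async function countDbCalls(work: () => Promise<unknown>): Promise<number> {
  const before = dbQueryCount;
  await work();
  return dbQueryCount - before;
}
```

With this simple cache, two sequential `fetchUser(1)` calls resolve to a count of 1. Two concurrent calls would each miss the cache and count 2 unless in-flight requests are also deduplicated, which is itself a contract worth deciding on explicitly.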
Who writes which tests.
| Test type | Who writes it | Why |
|---|---|---|
| Unit tests for new code | You (or describe-then-agent-implements) | Agent-written self-tests catch ~nothing. |
| Unit tests for existing code | Agent | Behavior is already pinned; agent generates coverage well. |
| Integration tests | Agent, then you review | Module boundaries are stable; agents are decent at this. |
| E2E happy path | Agent (Playwright MCP) | Mechanical; agent + Playwright is genuinely great. |
| E2E for the critical bug you just fixed | You | If you can't articulate the regression, you didn't understand the bug. |
| Snapshot / golden files | Nobody, ideally | Agent regenerates these without thinking. Net-negative most days. |
Write the test signatures (names + the assertion you care about) before the agent writes code. Have the agent implement against them. The agent gets a clear spec; you get tests that aren't tautological; reviews are dramatically faster because you compare the diff against a fixed contract.
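A minimal, framework-free sketch of "signatures first". The `slugify` example, the `slugifyContract` table, and the `violations` helper are all hypothetical. In Vitest the same idea is often expressed by writing the test names up front (for example with `test.todo`) and filling in the assertions before any implementation exists.

```typescript
// Hypothetical function the agent has not written yet.
type Slugify = (title: string) => string;

// Written by a human first: each entry is a test name plus the one assertion that matters.
const slugifyContract: Array<{ name: string; holds: (impl: Slugify) => boolean }> = [
  { name: "lowercases the title", holds: (f) => f("Hello") === "hello" },
  { name: "replaces runs of spaces with a single hyphen", holds: (f) => f("a  b c") === "a-b-c" },
  { name: "drops leading and trailing whitespace", holds: (f) => f(" hi ") === "hi" },
];

// Returns the names of the contract entries an implementation violates.
function violations(impl: Slugify): string[] {
  return slugifyContract.filter((entry) => !entry.holds(impl)).map((entry) => entry.name);
}

// Written afterwards (by the agent) against the fixed contract.
const slugify: Slugify = (title) => title.trim().toLowerCase().split(/\s+/).join("-");
```

`violations(slugify)` returns an empty array. In review, the contract is the fixed point, and the diff only has to be read against it.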
Live: coverage planner.
Tell it the shape of your work this sprint. It returns the test-count distribution that catches the most bugs per minute spent. All math runs in your browser.
Read the pyramid, not the total. Two well-placed e2e tests beat thirty unit tests of agent-written helpers. If the breakdown looks unit-heavy and the agent-written share is high, you're testing the wrong layer.
Common pitfalls.
When the agent wrote both code and tests, the tests passing is closer to "the agent is internally consistent" than "the code is correct." Add at least one human-written contract test per PR. Treat that one as the real gate.
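A small sketch of why "the agent is internally consistent" is not the same as "the code is correct". The function and both checks are hypothetical. The agent's own check encodes the same misunderstanding as the code; the human-written contract check states the requirement independently.

```typescript
// Hypothetical agent-written code. The requirement: `percent` is a number from 0 to 100.
function applyDiscount(price: number, percent: number): number {
  return price - price * percent; // bug: treats percent as a fraction (should divide by 100)
}

// Agent-written test: shares the code's misunderstanding, so it passes.
function agentSelfCheck(): boolean {
  return applyDiscount(100, 0.5) === 50;
}

// Human-written contract test: "10 percent off 200 is 180".
function humanContractCheck(): boolean {
  return applyDiscount(200, 10) === 180;
}
```

`agentSelfCheck()` returns `true` while `humanContractCheck()` returns `false`, so only the human-written check works as a gate here.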
Agents over-mock by default — fetch, time, randomness, the database, the file system. Each mock is a place where production reality differs from test reality. Pull at least one mock out per feature and replace it with the real thing (testcontainers for Postgres, MSW for HTTP). Painful once, pays for itself fast.
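For the file system, "the real thing" can be as small as a temporary directory. The sketch below uses only Node's standard library; `readPort` and `withTempDir` are hypothetical names, and the database and HTTP cases would use the tools named above instead.

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

// Hypothetical code under test: reads a JSON settings file and returns its numeric "port".
function readPort(file: string): number {
  const parsed = JSON.parse(fs.readFileSync(file, "utf8")) as { port?: unknown };
  if (typeof parsed.port !== "number") {
    throw new Error(`no numeric port in ${file}`);
  }
  return parsed.port;
}

// Runs `check` against a real temporary directory, then removes the directory.
function withTempDir<T>(check: (dir: string) => T): T {
  const dir = fs.mkdtempSync(path.join(os.tmpdir(), "settings-test-"));
  try {
    return check(dir);
  } finally {
    fs.rmSync(dir, { recursive: true, force: true });
  }
}
```

A test then writes a real `settings.json` inside the callback and calls `readPort` on it, so encoding, parsing, and path handling are exercised for real instead of being assumed by a mock.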
An agent told "this test is flaky, fix it" will add retries, increase timeouts, or quietly skip the test. Almost never the right move. Read the flake. The flake is the bug.
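A deliberately simplified sketch of how a retry hides the bug. Both `makeLookup` and `withRetries` are hypothetical, and the failure is made deterministic here (it always hits the first call) so the effect is easy to see; real flakes are usually timing-dependent.

```typescript
// Hypothetical code under test: the first call after startup fails because of an init bug.
function makeLookup(): (key: string) => string {
  let ready = false;
  return (key) => {
    if (!ready) {
      ready = true;
      throw new Error("index not ready"); // the real bug: callers can arrive before init finishes
    }
    return key.toUpperCase();
  };
}

// The "fix" an agent tends to reach for: retry until the test goes green.
function withRetries<T>(fn: () => T, attempts: number): { value: T; tries: number } {
  let lastError: unknown;
  for (let tries = 1; tries <= attempts; tries++) {
    try {
      return { value: fn(), tries };
    } catch (err) {
      lastError = err;
    }
  }
  throw lastError;
}
```

Wrapped in `withRetries`, the lookup succeeds on the second try and the test passes. Production callers have no retry wrapper, so they still hit the first-call failure.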