Should I let the agent write its own tests?

For unit tests of code it just wrote — no. The agent will reverse-engineer tests from its own implementation, which catches roughly nothing. Better: write the test (or describe it in plain English) first, then have the agent implement against it. For integration and e2e tests of an existing surface, agent-written tests are fine; the implementation is already pinned down.

How much coverage is enough?

Coverage percentage is the wrong metric. The right one: 'if this PR broke the three behaviors users actually care about, would CI catch it?' That's usually 4–10 well-chosen tests per feature, not 90% line coverage. Aim for high coverage of public APIs and critical paths; the rest is line-noise.

Vitest, Jest, or Playwright?

Vitest for unit and integration in 2026 — faster, native ESM, works with TS out of the box. Playwright for e2e and component testing of real browser behavior. Jest is fine if you're already on it; not worth migrating an existing repo unless you're hitting performance walls.

What about snapshot tests?

Snapshot tests are negative-value most of the time when AI writes the code. The agent will happily regenerate the snapshot when its diff breaks one, and you'll merge it without noticing. Use snapshots sparingly, for things like rendered SQL or generated configs where the output should rarely change.

Testing AI-Written Code — A Working Strategy

CH 01

The mental model.

When you wrote code by hand, tests caught your bugs. When an agent writes code, tests do something different: they catch the model's bugs — and the model's bugs are weirder than yours.

A human writes if (x > 0) when they meant if (x >= 0). An agent writes a beautiful, well-named function that looks correct in isolation, references a config option that doesn't exist, and confidently calls a deprecated API. Your tests need to catch the second category, not the first.

Integration tests beat unit tests, by a lot. Agent-written code passes unit tests it wrote against itself. It fails integration tests that pin down real behavior at module boundaries.
Real data beats fake data. The agent's mock objects are coherent inside the test file and meaningless to your real Postgres schema. Spin up a real DB in CI.
End-to-end smoke tests are mandatory. Not for coverage. To answer the question "does the thing it built actually do the thing?" — a question unit tests cannot answer.
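
A minimal sketch of why the agent's own mocks prove so little, written in plain TypeScript so it runs without a test runner. Everything in it is hypothetical: `AppConfig`, `realConfig`, and `agentTimeout` are invented for illustration, and in a Vitest suite each boolean check would be an `expect` inside a `test`.

```typescript
// Hypothetical production config shape; the only timeout key is requestTimeoutMs.
type AppConfig = { requestTimeoutMs: number; retries: number };
const realConfig: AppConfig = { requestTimeoutMs: 5000, retries: 3 };

// Hypothetical agent-written helper: well named, tidy, and reading a key
// ("timeoutMs") that does not exist on AppConfig.
function agentTimeout(config: object): number {
  const value = (config as Record<string, unknown>)["timeoutMs"];
  return typeof value === "number" ? value : 0;
}

// Agent-written unit check: the mock invents the same key the code reads,
// so the check passes while proving nothing about production.
const mockConfig = { timeoutMs: 5000 };
const unitCheckPasses = agentTimeout(mockConfig) === 5000;

// Boundary check: pass the real config object through the same function.
const boundaryCheckPasses = agentTimeout(realConfig) === 5000;
```

Here `unitCheckPasses` is `true` and `boundaryCheckPasses` is `false`: the mock and the code agree with each other, and neither agrees with the real config.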

CH 02

Pyramid, not bell curve.

Classic test pyramid: many unit, fewer integration, a handful of e2e. The AI-era pyramid is squashed differently, with fewer unit tests, more integration tests, and more e2e than you used to write:

Layer	Old shape	AI-era shape	Why the shift
Unit	70%	40%	Agent writes them against its own code; low signal.
Integration	20%	45%	Module boundaries are where the agent's hallucinations show up.
E2E / smoke	10%	15%	The only test that proves the feature works as a whole.
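
As a rough worked example, the AI-era percentages can be applied to a fixed test budget. This is only a sketch: `planTests` and `AI_ERA_SHAPE` are hypothetical names, and the rounding rule (e2e takes the remainder) is an assumption, not something the table prescribes.

```typescript
// Percentages taken from the "AI-era shape" column of the table above.
const AI_ERA_SHAPE = { unit: 40, integration: 45, e2e: 15 } as const;

type TestPlan = { unit: number; integration: number; e2e: number };

// Splits a total test count across layers. Unit and integration are rounded;
// e2e takes the remainder so the three counts add up to `total`.
function planTests(total: number): TestPlan {
  const unit = Math.round((total * AI_ERA_SHAPE.unit) / 100);
  const integration = Math.round((total * AI_ERA_SHAPE.integration) / 100);
  return { unit, integration, e2e: total - unit - integration };
}

const twentyTests = planTests(20); // { unit: 8, integration: 9, e2e: 3 }
```

For a feature that gets 20 tests, that is 8 unit, 9 integration, and 3 e2e, against 14, 4, and 2 under the old 70/20/10 shape.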

"But unit tests are fast"

Vitest runs 500 integration tests in 4 seconds on a modern Mac. The 10× speed difference between unit and integration is mostly a 2010s problem. Don't shape your strategy around 2010s constraints.

CH 03

Test contracts, not implementations.

A contract test answers "what does this function promise to its callers?" An implementation test answers "did I write the code I wrote?" The second one is useless when the agent might rewrite the same function tomorrow.

The simple rewrite: test the observable behavior of the public API, not the internals.

Bad: pins implementation

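import { expect, test, vi } from 'vitest';

// Pins how fetchUser talks to the cache, not what callers observe.
// `cache` and `fetchUser` come from the code under test.
test('fetchUser caches result', async () => {
  const spy = vi.spyOn(cache, 'set');
  await fetchUser(1);
  expect(spy).toHaveBeenCalledWith('user:1', expect.anything());
});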

Good: pins the contract

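import { expect, test } from 'vitest';

// Tests the actual promise: the 2nd call doesn't hit the DB.
// `countDbCalls` is a test helper that resolves to the number of queries made.
test('fetchUser only hits the DB once per user', async () => {
  const dbCalls = await countDbCalls(async () => {
    await fetchUser(1); // first call populates the cache
    await fetchUser(1); // second call must be served from it
  });
  expect(dbCalls).toBe(1);
});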

The first test breaks the day the agent swaps to redis-cache instead of node-cache — even though nothing the user cares about changed. The second test still holds. That's the test you want around AI-written code: it survives rewrites that don't change behavior, and catches behavior changes the agent calls "refactors."
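
The `countDbCalls` helper is not shown in the text, so here is one way it might look. This is a sketch under stated assumptions: `CountingDb`, `makeFetchUser`, and `dbCallsForRepeatLookup` are hypothetical, the helper takes the DB handle as an extra argument, and the in-memory DB stands in for a real client only to keep the example runnable without a database.

```typescript
type User = { id: number; name: string };

// Hypothetical stand-in for a database client. It only counts queries;
// a real suite would wrap the real client instead.
class CountingDb {
  queries = 0;
  async findUser(id: number): Promise<User> {
    this.queries += 1;
    return { id, name: `user-${id}` };
  }
}

// Hypothetical code under test: a read-through cache in front of the DB.
function makeFetchUser(db: CountingDb): (id: number) => Promise<User> {
  const cache = new Map<number, User>();
  return async (id) => {
    const hit = cache.get(id);
    if (hit) return hit;
    const user = await db.findUser(id);
    cache.set(id, user);
    return user;
  };
}

// Runs `work` and resolves to the number of DB queries it caused.
async function countDbCalls(
  db: CountingDb,
  work: () => Promise<unknown>,
): Promise<number> {
  const before = db.queries;
  await work();
  return db.queries - before;
}

// The contract check: two sequential lookups of one user cost one query.
async function dbCallsForRepeatLookup(): Promise<number> {
  const db = new CountingDb();
  const fetchUser = makeFetchUser(db);
  return countDbCalls(db, async () => {
    await fetchUser(1);
    await fetchUser(1);
  });
}
```

Swapping the `Map` for any other cache leaves the count at 1. Deleting the cache raises it to 2, which is the behavior change the contract is there to catch.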

CH 04

Who writes which tests.

Test type	Who writes it	Why
Unit tests for new code	You (or describe-then-agent-implements)	Agent-written self-tests catch ~nothing.
Unit tests for existing code	Agent	Behavior is already pinned; agent generates coverage well.
Integration tests	Agent, then you review	Module boundaries are stable; agents are decent at this.
E2E happy path	Agent (Playwright MCP)	Mechanical; agent + Playwright is genuinely great.
E2E for the critical bug you just fixed	You	If you can't articulate the regression, you didn't understand the bug.
Snapshot / golden files	Nobody, ideally	Agent regenerates these without thinking. Net-negative most days.

The "TDD with an agent" trick

Write the test signatures (names plus the assertion you care about) before the agent writes code. Have the agent implement against them. The agent gets a clear spec; you get tests that aren't tautological; reviews are dramatically faster because you compare the diff against a fixed contract.
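
A small sketch of that flow, in plain TypeScript so it runs without a test runner. The `slugify` feature, the spec entries, and the `failures` helper are all hypothetical; in a Vitest suite each spec entry would be a `test` with one `expect`.

```typescript
type Slugify = (title: string) => string;

// Step 1, written by you before any implementation exists:
// a name plus the one assertion you care about.
const slugifySpec: Array<{ name: string; check: (impl: Slugify) => boolean }> = [
  { name: "lowercases and hyphenates words", check: (f) => f("Hello World") === "hello-world" },
  { name: "drops punctuation", check: (f) => f("Ship it!") === "ship-it" },
  { name: "has no leading or trailing hyphens", check: (f) => f("  spaced  ") === "spaced" },
];

// Step 2, written by the agent afterwards, against the fixed spec.
const slugify: Slugify = (title) =>
  title
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-") // collapse every non-alphanumeric run into one hyphen
    .replace(/^-+|-+$/g, ""); // trim hyphens at both ends

// Returns the names of the spec entries an implementation fails.
function failures(impl: Slugify): string[] {
  return slugifySpec.filter((entry) => !entry.check(impl)).map((entry) => entry.name);
}

const agentFailures = failures(slugify); // []
const identityFailures = failures((title) => title); // all three names
```

Because the spec was fixed first, review comes down to reading `agentFailures` and the diff, not judging tests the agent wrote to match its own code.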

DEMO · INTERACTIVE

Live: coverage planner.

Tell it the shape of your work this sprint. It returns the test-count distribution that catches the most bugs per minute spent. All math runs in your browser.

[Interactive widget: coverage planner. Inputs: new features this sprint, bug fixes this sprint, risk profile, share of agent-written code, and hours available for tests. Output: recommended counts of E2E, integration, and unit tests, the hours required, and how many tests you write versus the agent.]

Read the pyramid, not the total. Two well-placed e2e tests beat thirty unit tests of agent-written helpers. If the breakdown looks unit-heavy and the agent-written share is high, you're testing the wrong layer.

PITFALLS

Common pitfalls.

"All tests passing" → ship it

When the agent wrote both code and tests, the tests passing is closer to "the agent is internally consistent" than "the code is correct." Add at least one human-written contract test per PR. Treat that one as the real gate.

Mocking the world

Agents over-mock by default: fetch, time, randomness, the database, the file system. Each mock is a place where production reality differs from test reality. Pull at least one mock out per feature and replace it with the real thing (Testcontainers for Postgres, MSW for HTTP). Painful once, pays for itself fast.

Letting the agent "fix" a flaky test

An agent told "this test is flaky, fix it" will add retries, increase timeouts, or quietly skip the test. Almost never the right move. Read the flake. The flake is the bug.
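
To make that concrete, here is a deliberately tiny sketch of a retry hiding a real bug. The `parsePrice` function, the fixture pool, and the retry helper are hypothetical and chosen so the flake is reproducible; real flakes usually come from timing, ordering, or shared state, not a rotating fixture.

```typescript
// Hypothetical agent-written code under test: it breaks on a thousands separator.
function parsePrice(text: string): number {
  return Number(text.replace("$", ""));
}

// A "flaky" check: it draws its input from a rotating pool, so it only fails
// on the runs that happen to pick "$1,200.00".
const fixtures = ["$12.50", "$1,200.00", "$7.00"];
function runOnce(attempt: number): boolean {
  const input = fixtures[attempt % fixtures.length];
  return !Number.isNaN(parsePrice(input));
}

// The agent's typical "fix": retry until something passes.
function runWithRetries(firstAttempt: number, maxAttempts: number): boolean {
  for (let i = 0; i < maxAttempts; i++) {
    if (runOnce(firstAttempt + i)) return true;
  }
  return false;
}

// With retries, a run that starts on the bad input still goes green.
const retriedRunIsGreen = runWithRetries(1, 3);

// Reading the flake instead: pin the failing input as its own deterministic check.
const pinnedInputParses = !Number.isNaN(parsePrice("$1,200.00"));
```

`retriedRunIsGreen` is `true` while `pinnedInputParses` is `false`: the retry made CI green and left the parsing bug in place. The pinned check fails every time until `parsePrice` handles the separator.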