Testing AI-Written Code.

Contracts, not implementations. The right ratio of unit/integration/e2e. When to let the agent write tests — and when not to.

On this page
  1. The mental model
  2. Pyramid, not bell curve
  3. Test contracts, not implementations
  4. Who writes which tests
  5. Live: coverage planner
  6. Common pitfalls
CH 01

The mental model.

When you wrote code by hand, tests caught your bugs. When an agent writes code, tests do something different: they catch the model's bugs — and the model's bugs are weirder than yours.

A human writes if (x > 0) when they meant if (x >= 0). An agent writes a beautiful, well-named function that looks correct in isolation, references a config option that doesn't exist, and confidently calls a deprecated API. Your tests need to catch the second category, not the first.

  • Integration tests beat unit tests, by a lot. Agent-written code passes the unit tests the agent wrote against it. It fails integration tests that pin down real behavior at module boundaries.
  • Real data beats fake data. The agent's mock objects are coherent inside the test file and meaningless to your real Postgres schema. Spin up a real DB in CI.
  • End-to-end smoke tests are mandatory. Not for coverage. To answer the question "does the thing it built actually do the thing?" — a question unit tests cannot answer.
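
The first bullet can be made concrete with a small sketch. Everything named here is hypothetical (the `AppConfig` shape, the `retryLimit` key, the agent-written `retriesFor` helper). The point is only that a check built on the agent's own mock can pass while a check against the real config at the module boundary fails.

```typescript
// Hypothetical config shape, as the rest of the app defines it.
type AppConfig = {
  retryLimit: number;
  timeoutMs: number;
};

const realConfig: AppConfig = { retryLimit: 3, timeoutMs: 5000 };

// Hypothetical agent-written helper: it reads `maxRetries`,
// a key the real config has never had.
function retriesFor(config: Record<string, unknown>): number {
  const value = config["maxRetries"];
  return typeof value === "number" ? value : 0;
}

// The agent's unit check builds a mock that matches its own assumption,
// so the check passes.
const agentMock = { maxRetries: 5 };
const unitCheckPasses = retriesFor(agentMock) === 5; // true

// The integration check feeds in the real config and compares against
// the value the app actually configured. It fails, which is the signal.
const integrationCheckPasses =
  retriesFor(realConfig) === realConfig.retryLimit; // false
```

In a real suite both checks would sit inside Vitest `test(...)` blocks; only the second one tells you anything about production.
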
CH 02

Pyramid, not bell curve.

The classic test pyramid has many unit tests, fewer integration tests, and a handful of e2e tests. The AI-era pyramid is squashed differently: fewer unit tests, many more integration tests, and more e2e tests than you used to write:

| Layer | Old shape | AI-era shape | Why the shift |
| --- | --- | --- | --- |
| Unit | 70% | 40% | Agent writes them against its own code; low signal. |
| Integration | 20% | 45% | Module boundaries are where the agent's hallucinations show up. |
| E2E / smoke | 10% | 15% | The only test that proves the feature works as a whole. |

"But unit tests are fast"

Vitest runs 500 integration tests in 4 seconds on a modern Mac. The 10× speed difference between unit and integration is mostly a 2010s problem. Don't shape your strategy around 2010s constraints.
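
To turn the AI-era column into concrete numbers, the percentages can be applied to a planned test count. This is a minimal sketch, assuming the percentages are shares of the total number of tests; `splitByShape` and the 40-test sprint are illustrative and not part of any tool.

```typescript
type Layer = "unit" | "integration" | "e2e";
type Shape = Record<Layer, number>;

// Shares taken from the table above.
const oldShape: Shape = { unit: 0.7, integration: 0.2, e2e: 0.1 };
const aiEraShape: Shape = { unit: 0.4, integration: 0.45, e2e: 0.15 };

// Round unit and e2e, then give integration the remainder so the
// three counts always add up to the requested total.
function splitByShape(total: number, shape: Shape): Shape {
  const unit = Math.round(total * shape.unit);
  const e2e = Math.round(total * shape.e2e);
  return { unit, integration: total - unit - e2e, e2e };
}

const before = splitByShape(40, oldShape);  // { unit: 28, integration: 8, e2e: 4 }
const after = splitByShape(40, aiEraShape); // { unit: 16, integration: 18, e2e: 6 }
```

For the same 40 tests, the shift moves twelve tests out of the unit layer and puts ten of them at module boundaries.
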

CH 03

Test contracts, not implementations.

A contract test answers "what does this function promise to its callers?" An implementation test answers "did I write the code I wrote?" The second one is useless when the agent might rewrite the same function tomorrow.

The simple rewrite: test the observable behavior of the public API, not the internals.

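Bad: pins implementation

```typescript
import { test, expect, vi } from 'vitest';
// `cache` and `fetchUser` come from your own modules.

// Pins the cache call the code happens to make, not the behavior callers see
test('fetchUser caches result', async () => {
  const spy = vi.spyOn(cache, 'set');
  await fetchUser(1);
  expect(spy).toHaveBeenCalledWith('user:1', expect.anything());
});
```

Good: pins the contract

```typescript
import { test, expect } from 'vitest';
// `countDbCalls` is a project helper: it resolves to the number of
// DB queries made while the callback runs.

// Tests the actual promise: the 2nd call doesn't hit the DB
test('fetchUser only hits the DB once per user', async () => {
  const dbCalls = countDbCalls(() => {
    return Promise.all([fetchUser(1), fetchUser(1)]);
  });
  await expect(dbCalls).resolves.toBe(1);
});
```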

The first test breaks the day the agent swaps to redis-cache instead of node-cache — even though nothing the user cares about changed. The second test still holds. That's the test you want around AI-written code: it survives rewrites that don't change behavior, and catches behavior changes the agent calls "refactors."
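
`countDbCalls` is not defined in the example above. One possible shape for it, sketched against an in-memory stand-in for the database: the `db` object, the `User` type, and the pending-promise cache inside `fetchUser` are all assumptions made for illustration. A real helper would hook into your actual database client rather than a module-level counter.

```typescript
type User = { id: number; name: string };

// In-memory stand-in for the database, with a call counter.
let dbCalls = 0;
const db = {
  async getUser(id: number): Promise<User> {
    dbCalls += 1;
    return { id, name: `user-${id}` };
  },
};

// Caches the pending promise, so concurrent calls for the same id
// share one query. How it caches is invisible to the contract check.
const pendingUsers = new Map<number, Promise<User>>();
function fetchUser(id: number): Promise<User> {
  let pending = pendingUsers.get(id);
  if (pending === undefined) {
    pending = db.getUser(id);
    pendingUsers.set(id, pending);
  }
  return pending;
}

// Resolves to the number of DB calls made while `work` runs.
async function countDbCalls(work: () => Promise<unknown>): Promise<number> {
  const before = dbCalls;
  await work();
  return dbCalls - before;
}

// Same shape as the contract test: two fetches, one DB hit.
const callsForTwoFetches = countDbCalls(() =>
  Promise.all([fetchUser(1), fetchUser(1)]),
);
```

Replacing the `Map` with any other cache leaves the check untouched, as long as the second fetch still avoids the database.
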

CH 04

Who writes which tests.

| Test type | Who writes it | Why |
| --- | --- | --- |
| Unit tests for new code | You (or describe-then-agent-implements) | Agent-written self-tests catch ~nothing. |
| Unit tests for existing code | Agent | Behavior is already pinned; agent generates coverage well. |
| Integration tests | Agent, then you review | Module boundaries are stable; agents are decent at this. |
| E2E happy path | Agent (Playwright MCP) | Mechanical; agent + Playwright is genuinely great. |
| E2E for the critical bug you just fixed | You | If you can't articulate the regression, you didn't understand the bug. |
| Snapshot / golden files | Nobody, ideally | Agent regenerates these without thinking. Net-negative most days. |

The "TDD with an agent" trick

Write the test signatures (names + the assertion you care about) before the agent writes code. Have the agent implement against them. The agent gets a clear spec; you get tests that aren't tautological; reviews are dramatically faster because you compare the diff against a fixed contract.
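
A framework-free sketch of the idea: the human writes the named checks first, and whatever implementation the agent produces is run against them. `slugify`, the `Check` type, and `failingChecks` are hypothetical names chosen for the example; in Vitest the same list would typically be a set of `test(...)` blocks written before the implementation exists.

```typescript
// Written by a human BEFORE any implementation exists:
// a name plus the one assertion that matters.
type Slugify = (title: string) => string;
type Check = { name: string; holds: (slugify: Slugify) => boolean };

const slugifyContract: Check[] = [
  {
    name: "lowercases and hyphenates words",
    holds: (s) => s("Hello, World!") === "hello-world",
  },
  {
    name: "trims surrounding whitespace",
    holds: (s) => s("  Spaces  ") === "spaces",
  },
  {
    name: "is idempotent",
    holds: (s) => s(s("Already a Slug")) === s("Already a Slug"),
  },
];

// Returns the names of the checks an implementation fails.
function failingChecks(impl: Slugify, contract: Check[]): string[] {
  return contract
    .filter((check) => !check.holds(impl))
    .map((check) => check.name);
}

// Written afterwards (by the agent) against the fixed contract.
const slugify: Slugify = (title) =>
  title
    .trim()
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-")
    .replace(/^-+|-+$/g, "");

const failures = failingChecks(slugify, slugifyContract); // []
```

Because the contract is fixed before the code exists, a review only has to confirm that the checks were not edited and that `failures` is empty.
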

DEMO · INTERACTIVE

Live: coverage planner.

Tell it the shape of your work this sprint. It returns the test-count distribution that catches the most bugs per minute spent. All math runs in your browser.


Read the pyramid, not the total. Two well-placed e2e tests beat thirty unit tests of agent-written helpers. If the breakdown looks unit-heavy and the agent-written share is high, you're testing the wrong layer.

PITFALLS

Common pitfalls.

"All tests passing" → ship it

When the agent wrote both code and tests, the tests passing is closer to "the agent is internally consistent" than "the code is correct." Add at least one human-written contract test per PR. Treat that one as the real gate.
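
A small illustration of "internally consistent" versus "correct". The `applyDiscount` function and its bug are invented for the example: the agent-style check derives its expected value from the same formula as the code, so it passes, while the human-written check pins a number worked out independently.

```typescript
// Hypothetical agent-written function with a sign bug:
// a 20% discount should lower the price, this raises it.
function applyDiscount(price: number, rate: number): number {
  return price * (1 + rate);
}

// Agent-style check: the expected value repeats the implementation's
// own formula, so it agrees with the bug.
const agentCheckPasses = applyDiscount(100, 0.2) === 100 * (1 + 0.2); // true

// Human-written contract check: 20% off 100 is 80, worked out by hand.
const contractCheckPasses = applyDiscount(100, 0.2) === 80; // false
```
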

Mocking the world

Agents over-mock by default: fetch, time, randomness, the database, the file system. Each mock is a place where production reality differs from test reality. Pull at least one mock out per feature and replace it with the real thing (Testcontainers for Postgres, MSW for HTTP). Painful once, pays for itself fast.

Letting the agent "fix" a flaky test

An agent told "this test is flaky, fix it" will add retries, increase timeouts, or quietly skip the test. Almost never the right move. Read the flake. The flake is the bug.
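
A sketch of why the flake is usually the bug. The expiry rule here (a token counts as expired from the exact millisecond of `expiresAt`) and both function names are assumptions for the example. A test that creates a token expiring "now" and checks it immediately passes only when the clock happens to tick between the two reads. Retries hide that; pinning the clock exposes the off-by-one.

```typescript
// Buggy version: in the exact millisecond of expiry the token is
// still treated as valid. A test that reads the clock twice passes
// only when the two reads land in different milliseconds.
function isExpiredBuggy(expiresAt: number, now: number = Date.now()): boolean {
  return now > expiresAt;
}

// Fixed version: expired from `expiresAt` onward.
function isExpired(expiresAt: number, now: number = Date.now()): boolean {
  return now >= expiresAt;
}

// Pinning the clock makes the boundary case deterministic.
const boundary = 1700000000000;
const buggyAtBoundary = isExpiredBuggy(boundary, boundary); // false: the bug
const fixedAtBoundary = isExpired(boundary, boundary);      // true
const fixedBefore = isExpired(boundary, boundary - 1);      // false
```
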

What to read next.

  • Skill · Testing review SKILL.md: Install in your AI tool so it audits test quality, not just test presence.
  • Guide · 01 · Deploy Your Vibe-Coded App: Tests are item one on the readiness checklist for a reason.
  • Guide · 07 · MCP Servers for Web Devs: The Playwright MCP is what makes agent-written e2e tests actually viable.
Changelog
  • 2026-05-22: Initial publish.
© 2026 WEB DEVELOPER / ALL RIGHTS RESERVED