The test that saved the refactor

A field note on a big agent-driven refactor that should have broken everything, and the tests that didn't let it.

Stuart Leo | June 8, 2026 | 3 min read

I handed an agent a refactor big enough that I expected to spend the next day cleaning up after it. Instead it was boring — in the best possible way — and the reason was a test suite I almost didn't bother writing. This is the field note on why I now refuse to let an agent refactor without one.

The scary refactor

The job was a real one: pull a tangle of booking logic, used by a dozen call sites, out of a route handler and into a proper module. The kind of change where everything looks fine afterward and then something subtle is broken three screens away. I've been burned by exactly this before, by my own hand. Handing it to an agent felt like handing a stranger the keys to a car with bad brakes.

Why I wrote the tests first

So before letting the agent near it, I spent twenty minutes writing tests that pinned down the behaviour I wanted to keep: bookings refused at capacity, the refund window enforced, the confirmation email sent once. Not testing the new structure — testing the behaviour that must survive the restructure. A safety net under the trapeze.

This is just acceptance criteria in executable form: a runnable definition of "still correct" the agent could check itself against.
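To make that concrete, here is a minimal sketch of what behaviour-pinning tests of this kind might look like in Python. The real booking code isn't shown in this note, so everything here is hypothetical: `BookingService`, `BookingError`, the `book` and `refund` methods, the 24-hour refund window, and the injected `send_email` callable are all stand-ins invented for illustration.

```python
from datetime import datetime, timedelta


class BookingError(Exception):
    """Raised when a booking or refund is refused."""


class BookingService:
    """Hypothetical stand-in for the booking logic being refactored."""

    def __init__(self, capacity, refund_window, send_email):
        self.capacity = capacity
        self.refund_window = refund_window
        self.send_email = send_email  # injected so tests can record calls
        self.bookings = {}  # user -> time of booking

    def book(self, user, now):
        if len(self.bookings) >= self.capacity:
            raise BookingError("at capacity")
        self.bookings[user] = now
        self.send_email(user)

    def refund(self, user, now):
        if now - self.bookings[user] > self.refund_window:
            raise BookingError("refund window closed")
        del self.bookings[user]


T0 = datetime(2026, 1, 1, 9, 0)  # arbitrary fixed clock for the tests


def make_service(capacity=2):
    """Build a service whose 'emails' are appended to a list."""
    sent = []
    service = BookingService(capacity, timedelta(hours=24), sent.append)
    return service, sent


def is_refused(action):
    """Return True if calling action() raises BookingError."""
    try:
        action()
    except BookingError:
        return True
    return False


def test_refused_at_capacity():
    service, _ = make_service(capacity=1)
    service.book("ana", T0)
    assert is_refused(lambda: service.book("ben", T0))


def test_refund_window_enforced():
    service, _ = make_service()
    service.book("ana", T0)
    assert is_refused(lambda: service.refund("ana", T0 + timedelta(hours=25)))
    service.refund("ana", T0 + timedelta(hours=23))  # inside the window: allowed


def test_confirmation_email_sent_once():
    service, sent = make_service()
    service.book("ana", T0)
    assert sent == ["ana"]
```

Note that none of these tests mention where the logic lives. They only call the public surface and check outcomes, which is what lets them survive the move from route handler to module unchanged.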

Letting the agent go red-green

Then I let the agent do the refactor with one instruction: the tests stay green the whole way. It moved the logic, updated the call sites, and re-ran the suite after each step. The disciplined red-green-refactor loop that's tedious for a human is effortless for an agent — it just runs the tests, every time, without sighing about it.


The test suite turned "I hope the agent didn't break anything" into "the suite is green, so it didn't." That swap is the whole point.
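The loop described in this section can be sketched roughly as follows. This is not how any particular agent is implemented; it is a simplified illustration that assumes the refactor can be expressed as a list of named steps and that the suite runs under pytest. The names `run_pytest`, `refactor_in_steps`, and `suite_is_green` are made up for the example.

```python
import subprocess
import sys


def run_pytest():
    """Run the suite in a subprocess; True means pytest exited with code 0."""
    result = subprocess.run([sys.executable, "-m", "pytest", "-q"])
    return result.returncode == 0


def refactor_in_steps(steps, suite_is_green=run_pytest):
    """Apply each (name, callable) step, re-running the suite after it.

    Raises RuntimeError as soon as the suite is red, so a regression is
    pinned to the single step that introduced it.
    """
    if not suite_is_green():
        raise RuntimeError("suite must be green before the refactor starts")
    completed = []
    for name, apply_step in steps:
        apply_step()
        if not suite_is_green():
            raise RuntimeError(f"suite went red after step: {name}")
        completed.append(name)
    return completed
```

The check is injectable so the loop itself can be exercised without a real test run, for example `refactor_in_steps(steps, suite_is_green=lambda: True)`. In practice an agent would diagnose and fix the failing step rather than simply stop, but the invariant is the same: no step is considered done until the suite is green again.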

Catching the one real regression

It wasn't flawless — and that's exactly why the tests mattered. Partway through, the agent moved the capacity check to the wrong side of a conditional, and the "refused at capacity" test went red. The agent saw it, diagnosed it, and fixed it, all inside its loop, before I ever looked. Without that test, the bug ships and I find it in production a week later. With it, the regression lived for about ninety seconds and never left the branch.
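The exact conditional involved isn't recorded in this note, so the following is only an illustration of the class of bug, under an invented assumption: a hypothetical `is_member` branch that the capacity check ends up below instead of above. The function names and return strings are made up for the example.

```python
def book_before(count, capacity, is_member):
    """Original ordering: the capacity check guards every path."""
    if count >= capacity:
        return "refused"
    if is_member:
        return "confirmed (member)"
    return "confirmed"


def book_regressed(count, capacity, is_member):
    """Capacity check moved below the member branch: members skip it."""
    if is_member:
        return "confirmed (member)"
    if count >= capacity:
        return "refused"
    return "confirmed"


def refused_at_capacity(book):
    """The pinned behaviour: a full event refuses everyone."""
    return all(
        book(10, 10, is_member) == "refused" for is_member in (False, True)
    )
```

Both versions look plausible on a quick read, and both refuse a non-member at capacity. Only a test that checks the pinned behaviour across both branches tells them apart: `refused_at_capacity(book_before)` holds, while `refused_at_capacity(book_regressed)` does not.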

Why I don't refactor without tests now

The lesson stuck. A big refactor with no tests, handed to an agent, is a gamble: the output looks right and you have no way to know whether it is. The same refactor with tests is a chore, in the good sense: mechanical, verifiable, safe. The twenty minutes I spent writing tests first wasn't overhead. It was the thing that made the whole refactor a non-event.

A test suite turns an agent's big refactor from a gamble into a chore — in the good way.

Start here: see test-driven development with AI agents, acceptance criteria the agent can check, or read the method.

