Paper rules break first

I've been running the agent-teams starter prompt I published in February on real projects for about two months now. It works. It coordinates Builder, Tester and occasionally Reviewer across long sessions, keeps the stakeholder insulated from orchestration, and produces results I'd have needed a small team to deliver a year ago.

It also kept failing the same way, over and over, in ways the prompt's own rules should have prevented.

When I started looking for the pattern, it turned out to be depressingly simple: the rules that only lived in prose were the first to erode. The ones that were wired into something — a git hook, a CI job, a file that the system refused to skip — held up. The ones that I wrote as "we will do X" stopped happening somewhere between phase two and phase three of every project. Not because the agents forgot. Because I forgot.

This isn't a new insight in software engineering. Every checklist culture eventually discovers that checklists on paper become shelfware. What surprised me is how fast it happens with AI agents. Human teams take months to stop following a convention. Agent teams take days. They don't push back when you skip the retrospective. They don't complain that the documentation update never happened. They just keep going.

So I rewrote the starter prompt. The new version is still recognisable — same role ladder, same file ownership, same Opus-lead / Sonnet-builder model strategy — but nearly every addition is a scar. Something that failed in practice, wired back in so it can't fail the same way again.

Enforcement you never actually tested

The original v2 told the team to create scripts/docs-check.sh, scripts/docs-stamp.sh, a pre-commit hook, a PR template. Good. What it didn't say was: prove the enforcement actually fires before you trust it.

Twice now, I've spent a phase and a half under gates that were silently misconfigured. The docs-check.sh I thought would scream when src/ changed without a matching topic-doc update was happily passing everything. I found out when I reviewed the diff and realised three modules had been rewritten with zero documentation updates. The gate had never fired. I'd assumed it was working because I'd written it.

The new v2 has a short pre-flight step: before spawning any teammate, deliberately violate each enforced rule and confirm the gate blocks or warns. Edit src/ without touching the topic doc — the check should fail. Skip docs-stamp.sh on push — pre-push should refuse. Open a PR without filling the template — CI should flag it. If any gate silently passes, you fix the gate now, not in session twenty when the damage is done.

A month of work under broken enforcement is indistinguishable from a month of no enforcement. It's worse, actually, because it came with false confidence.

"As needed" means "never, under pressure"

The v1 prompt said I could spawn a Security Reviewer or dedicated Reviewer "as needed." In two months I spawned the Security Reviewer exactly twice. Both times were when I was calm, the project was in a stable place, and I had the budget in my head to afford another agent. Neither was an auth change, a secrets-handling change, or a new external boundary — which is exactly when you need security review and exactly when you're too stressed to remember to spawn one.

"As needed" collapses to "never" under pressure. That's not a character flaw. That's what all soft rules do.

The new v2 converts "as needed" into explicit mandatory triggers. Security Reviewer is required for any change touching auth/authz, secret or token handling, new external network calls, new dependencies on security-sensitive paths, or any PII boundary. Reviewer is required for refactors spanning more than about ten files, cross-module contract changes, or anything the Tester flags as "implementation unclear but tests pass." Beyond those floors, it's judgment. Below them, it's not optional.

The best test of a rule isn't "would I follow this on a good day" but "would I follow this on the day I most need to."

The lead's context window is not a landfill

This one I care about the most, because it's directly downstream of the thing I've been writing about since February: context is a scarce resource, and the system that manages it best wins.

The original prompt had clear rules about who owns which files. It had no rules about what agents can dump into the lead's context window. When Builder finished a task, it would return a full narrative of what it did — sometimes with diffs pasted in, sometimes with test output, sometimes with speculation about what might break next. When Tester ran the suite, it would return stack traces, line by line. By task ten, the lead's window was half full of teammate output that added no ongoing value.

I watched a session die from this once. Everything the team produced was individually correct. The lead just couldn't see it anymore through the fog of prior agent chatter.

The new v2 imposes return-format caps: Builder ≤150 words per task report, structured — what changed, files touched, test result, any ambiguity flagged. No pasted diffs. Tester ≤150 words, citing TESTLOG.md entry IDs for defects instead of copying stack traces. Reviewer ≤100 words plus a path to the full findings file. Researcher ≤300 words, file-path grounded. If a teammate returns more, the lead pushes back and asks for a rewrite.

This isn't red tape. It's treating the lead's context as the finite resource it is. Full detail lives in git, in TESTLOG.md, in docs/reviews/ — places you can go when you need it, and that don't consume attention when you don't.

"Message the builder" is not a protocol

The original said the Tester should message the Builder directly when a defect is found. What I saw in practice was a kind of drift into chat: "hey found something weird in the listing endpoint, can you look at it." Builder: "taking a look." Fifteen minutes of round-trip before the actual nature of the defect surfaces.

The new v2 pins the handoff shape:

SEVERITY: critical | major | minor
WHAT: one-line description
WHERE: file_path:line (or TESTLOG.md:ID)
REPRO: one line or TESTLOG ref
ASK: specific action requested

Builder ack is one line: "Taking — ETA short" or "Disagree — escalating to lead with reason." No "looking into it." Ambiguous acknowledgments are how small issues turn into three-exchange debugging conversations that should have been fifteen seconds.

Retrospectives are too late

The original v2 required a retrospective in LESSONS_LEARNED.md at the close of every phase. Good. But retrospectives only catch what already happened. By the time you notice the team misread the spec, the team has shipped against that misread for a full phase.

The new v2 requires a pre-mortem at phase kickoff. Five to ten bullets in .claude/plans/phase-N-<slug>.md: what could go wrong here, what would we see first, what are we assuming that might not hold. Ten minutes. The output is almost never surprising — but writing it forces the ambiguity to the surface before it becomes a defect.

The retrospective is still there. The pre-mortem is the one that pays for itself.

HANDOVER.md is a single point of failure

HANDOVER.md is the resumption contract — written at session end, read at session start, overwritten each time. That overwrite has always made me nervous. One session where I forget to regenerate it and the project loses its memory of what was going on.

It finally bit me. I reset mid-session to escape context rot, regenerated HANDOVER.md, but the last handover — the one with three specific edge cases the Tester had surfaced — was gone. The file had been overwritten by a much more generic version. I'd handed the fresh session a cleaner but emptier document.

The new v2 has HANDOVER-LOG.md, append-only. Before overwriting HANDOVER.md, you prepend the outgoing version to the log with its date. Thirty seconds of work. Cheap black-box recorder against the day the overwrite goes wrong.

Specialists linger

One thing I didn't expect: when you spawn a Reviewer for phase two and keep them around for phase three "because they already understand the project," they start confabulating. By phase four their context is an incoherent mess of three different feature sets and they're making review calls based on half-remembered architecture from two phases ago.

Long-lived specialists accumulate context debt silently. The new v2 is explicit: dismiss every specialist at phase close. Respawn fresh next phase, brief them again, let them read CLAUDE.md cold. The cost of onboarding is smaller than the cost of their decay.

What this is really about

Stepping back, every change in the new v2 follows the same pattern. A rule that only lived in prose got rewritten as either:

A gate that fails loudly when violated,
A format that's too rigid to drift,
Or an artifact in a durable location that doesn't depend on anyone's memory.

Pre-flight makes the tooling prove itself. Spawn triggers make specialist review non-optional where it counts. Return-format caps make context pollution visible. The messaging protocol makes handoffs too structured to wander. Pre-mortems catch ambiguity before it ships. HANDOVER-LOG.md makes overwrites reversible. Specialist dismissal makes context hygiene automatic.

None of this is new in software engineering. It's the old lesson that any process rule not enforced by tooling erodes, rediscovered at agent speed. What's new is that agent speed makes the lesson cheap to learn. I got two months of scars out of what would have been a year of damage in a human team.

The starter prompt isn't finished. It's a living document, and the next two months will probably add their own scars. But I think the shape of the update is more interesting than any individual change. If you're building systems on top of LLMs, the question isn't what rules do you want your agents to follow. It's which of your rules could fail silently, and how would you even notice.

Paper rules break first. The fix is to stop writing them on paper.

The refined v2 is available: STARTPROMPT-AGENT-TEAMS-v2.md. Same shape as before; more of the rules now fail loudly when you break them.

#Enforcement you never actually tested

#"As needed" means "never, under pressure"

#The lead's context window is not a landfill

#"Message the builder" is not a protocol

#Retrospectives are too late

#HANDOVER.md is a single point of failure

#Specialists linger

#What this is really about

Related Posts

A starter prompt that actually remembers what you told it

My .md files vs Claude's memory tool: a practitioner comparison

When one context window isn't enough