Shipping software with AI you don't trust

Why nothing an agent says about its own work counts as evidence.

2 June 2026 · 14 min read · By Kam Low, CTO

Last month, Codex reviewed a diff in our send pipeline and returned this:

F2: create_message_record! silently rescues any exception, leaving the queued reservation in place after a successful provider call. expire_stale_reservations! will then force-fail a message whose email was actually delivered.

The bug was real. We'd have shipped it. The diff had already passed local tests, a Claude self-review on the implementation phase, and a static analysis pass. The model that wrote the code had looked at its own work and called it done. That is the part that does not count. What caught the bug was a different model with no stake in the diff, and the sentence above is what blocked the change and got it rewritten.

That is the whole idea, in one bug. We run this lane on a single rule: nothing an agent says about its own work counts as evidence. Not "tests pass", not "phase complete", not "looks good to me". Every claim has to be backed by something that cannot lie about it. A durable session ledger holds the receipts. A different model reviews the diff. Humans approve the spec and the merge. The model that writes the code is a vendor we swap every few months, because once the truth lives in the ledger and not in the agent, which agent did the work stops mattering. 86 governed specs and 633 commits through this lane in the last 60 days. The pieces compose, and the boundaries between them are load-bearing.

The lane

Slack reaction / Sentry alert / GH issue
   ↓  normalised to runx.thread.v1
runx issue-intake     →  classify + route
   ↓
scafld plan           →  deterministic spec
   ↓
scafld harden         →  Codex adversarial pass on the draft
   ↓
human approval        →  draft → approved → active
   ↓
scafld build          →  Codex or Claude, TDD phases, session ledger
   ↓
scafld review         →  --provider codex (default), --provider claude
   ↓
local QA              →  second human approval
   ↓
gh pr create          →  reply on origin Slack thread, mirror to #shipped
   ↓
human merge           →  outcome story posted back to origin thread

Every transition writes a signed receipt. The spec re-renders from the session, not from anything an agent claims it did. If a build phase says it passed, the session has to back that up with the actual command output. Agent self-reports are not evidence.
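A minimal sketch of what "the session has to back that up" can mean in practice. This is Python written for illustration, not scafld's actual receipt format: the field names, the HMAC signing scheme, and the `SIGNING_KEY` constant are all assumptions.

```python
import hashlib
import hmac
import json
import subprocess
import sys

SIGNING_KEY = b"demo-key"  # assumption: a real key would come from a secret store


def run_with_receipt(phase, command):
    """Run a command and return a signed record of what actually happened."""
    result = subprocess.run(command, capture_output=True, text=True)
    body = {
        "phase": phase,
        "command": command,
        "exit_code": result.returncode,
        "output_sha256": hashlib.sha256(
            (result.stdout + result.stderr).encode()
        ).hexdigest(),
    }
    payload = json.dumps(body, sort_keys=True).encode()
    body["signature"] = hmac.new(SIGNING_KEY, payload, hashlib.sha256).hexdigest()
    return body


def verify(receipt):
    """Recompute the signature over everything except the signature itself."""
    body = {k: v for k, v in receipt.items() if k != "signature"}
    payload = json.dumps(body, sort_keys=True).encode()
    expected = hmac.new(SIGNING_KEY, payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, receipt["signature"])


def phase_passed(receipt):
    # The verdict is derived from the receipt, never from what the agent says.
    return verify(receipt) and receipt["exit_code"] == 0


ok = run_with_receipt("phase-1", [sys.executable, "-c", "print('tests pass')"])
bad = run_with_receipt("phase-1", [sys.executable, "-c", "raise SystemExit(1)"])
```

Here `phase_passed(ok)` holds and `phase_passed(bad)` does not. Editing `exit_code` on a stored receipt after the fact breaks the signature, so the edited receipt fails too.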

1. Intake

A bug report shows up in #issues. Someone reacts with 🐛. That reaction is the start of a governed lane.

The engineering value is narrow and specific. The lane ingests one schema. Intake is the input validation for everything that follows. By the time a thread reaches plan it is already a typed, classified, routed unit of work, so the stages after it assume a clean contract instead of re-parsing the world. The mess gets handled once, at the edge, and never seen again.

Slack is deliberately the only place a dev has to look. Most engineering teams have devs watching six surfaces at once: GitHub notifications, Sentry alerts in email, support tickets in Intercom, on-call pages, DMs, and Slack itself. By midday you've answered the same question in three places and missed something important in a fourth. We made Slack the floor. Every inbound surface gets piped into a thread in #issues before a dev sees it. Sentry alerts post there. Support escalations post there. GitHub issues echo into the same channel. The triage surface is one tab.

This is the part that really matters to us as a team. The conversation that started the work stays connected to the work. When the PR opens, runx replies in the original Slack thread. When it merges, it replies there too. There is no "what was the original context for this?" archaeology, no spelunking through a ticket system to find the customer who asked, no separation between the chat that surfaced the problem and the commit that fixed it. The team's working memory and the codebase's working memory point at the same place.

The cost is real. We have one channel that cannot break. If Slack is down, intake stops. If the wrong people leave a critical channel, signals get missed. We have accepted that single point of failure as the price of not living with multi-channel triage. Most days it is the trade we would make again.

Under the hood, runx accepts four inbound surfaces and normalises every one to a single schema, runx.thread.v1. The schema is what lets Slack be the only surface above it. A Sentry alert that started life as a webhook payload and a customer note that started life in a support tool both arrive at the lane as the same shape, because the adapters did the translation work at the edge.
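The adapter idea can be sketched in a few lines. Everything here is hypothetical: the `Thread` fields stand in for runx.thread.v1, whose real shape is not shown in this post, and the payload keys are simplified placeholders rather than Sentry's or Slack's actual formats.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thread:
    """Stand-in for runx.thread.v1; the real schema's fields are not shown here."""
    schema: str
    source: str
    title: str
    body: str


def from_sentry(payload):
    # Hypothetical payload keys, not Sentry's actual webhook format.
    return Thread("runx.thread.v1", "sentry", payload["title"], payload["detail"])


def from_slack_reaction(event):
    # Hypothetical event keys, not Slack's actual event format.
    first_line = (event["text"].splitlines() or [""])[0]
    return Thread("runx.thread.v1", "slack", first_line, event["text"])


ADAPTERS = {"sentry": from_sentry, "slack": from_slack_reaction}


def normalise(source, raw):
    """Translate at the edge; an unknown surface fails here, not downstream."""
    if source not in ADAPTERS:
        raise ValueError(f"no adapter for surface: {source}")
    return ADAPTERS[source](raw)
```

The design choice is that later stages only ever see `Thread`. A surface without an adapter is rejected at intake instead of leaking a new shape into the lane.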

runx issue-intake then classifies the thread:

{
  "category": "bug | feature | docs | infra | security",
  "severity": "low | medium | high | critical",
  "recommended_lane": "issue-to-pr | work-plan | reply-only | manual-review",
  "commence_decision": "approve | hold | reject",
  "target_surfaces": ["api", "app", "mcp"],
  "slack_origin": { "channel": "#issues", "ts": "1747..." }
}

Most of that object is out of scope here. The engineering lane acts on two recommended_lane values: issue-to-pr and work-plan, the ones that become governed code. It also honours manual-review, the stop signal covered below. Everything else (reply-only, the hold and reject triage, routing a support escalation to a human) lives one layer up, in how we triage the rest of the inbound surface. That is its own piece. This article starts when a thread is classified as engineering work and enters the lane.
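Read as code, the routing rule above is small. This sketch assumes the classification arrives as a plain dict with the keys shown in the object above; the return labels (`stop`, `enter-lane`, `out-of-scope`) and the check on commence_decision are illustrative guesses, not runx's real behaviour.

```python
ENGINEERING_LANES = {"issue-to-pr", "work-plan"}


def route(classification):
    """Decide what the engineering lane does with a classified thread."""
    lane = classification["recommended_lane"]
    if lane == "manual-review":
        return "stop"  # honoured as a stop signal, not retried or coerced
    approved = classification["commence_decision"] == "approve"
    if lane in ENGINEERING_LANES and approved:
        return "enter-lane"
    return "out-of-scope"  # reply-only, hold, reject: handled one layer up
```

For example, `route({"recommended_lane": "work-plan", "commence_decision": "approve"})` enters the lane, while the same thread with `"hold"` stays out of it.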

target_surfaces is what tags the right reviewer. A change touching ["mcp"] goes to the engineer who owns MCP. A change touching ["api", "billing"] goes to whoever owns billing. The mapping lives in config. We don't think reviewer routing is engineering work; it's a lookup table that needs to be honest about who knows what.
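A lookup table of that kind might look like the sketch below. The owner names are placeholders, and the precedence order (billing winning over api when both are touched) is an assumption inferred from the example above, not a copy of our config.

```python
# Hypothetical config: surfaces in precedence order, most specific first.
OWNERS = [
    ("billing", "billing-owner"),
    ("mcp", "mcp-owner"),
    ("api", "api-owner"),
    ("app", "app-owner"),
]


def reviewer_for(target_surfaces):
    """Return the owner of the highest-precedence surface the change touches."""
    for surface, owner in OWNERS:
        if surface in target_surfaces:
            return owner
    return None  # no owner on file: better to say so than to guess
```

Returning `None` for an unmapped surface keeps the table honest: a change nobody owns surfaces as a gap instead of landing on a default reviewer.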

Two details that matter more than they look:

manual-review is a real lane, not an exception path. Risky or ambiguous threads stop there. They don't get forced into a PR they shouldn't be. The most common mistake we see other teams make is treating "agent can't decide" as a problem to fix instead of a signal to honour. When the classifier hedges, we want it to hedge loudly.

Stamped comments make every update idempotent. Re-running intake on the same thread updates the existing comment instead of producing a wall of duplicates. We learned this the way you'd expect: a webhook retry storm in March posted the same triage comment 47 times on a single issue. The stamp is the cheapest possible fix that survives every retry pattern we've thrown at it since.
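The stamp trick is an upsert keyed on a marker embedded in the comment body. A minimal sketch, assuming comments are a list of strings and using a made-up marker format (`STAMP` is hypothetical, not the marker runx writes):

```python
STAMP = "<!-- intake-stamp:{key} -->"  # hypothetical marker format


def upsert_comment(comments, key, body):
    """Update the comment carrying this stamp, or append one if none exists."""
    stamp = STAMP.format(key=key)
    stamped = f"{stamp}\n{body}"
    for index, existing in enumerate(comments):
        if existing.startswith(stamp):
            comments[index] = stamped  # same stamp: overwrite in place
            return comments
    comments.append(stamped)  # first time this stamp is seen
    return comments
```

Calling `upsert_comment` 47 times with the same key leaves exactly one comment, which is the property a retry storm needs.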

Explicit overrides: /runx issue-intake to start a lane from a comment, /runx rerun to redo the classification when context changes.

2. Plan

scafld produces a deterministic Markdown spec before any code is written. Objective, scope, files impacted, risks, phases, rollback, acceptance criteria. The spec is the contract between what was requested and what gets shipped. If the spec is wrong, the change is wrong by definition.

Two ideas from scafld carry most of the load, and the order matters:

Sessions. Every criterion attempt, phase transition, and verdict lands in a durable ledger. If an agent claims a phase passed, the session has to back that up with the actual command output. Agents are bad witnesses to their own work. The session ledger is the thing that doesn't lie. It outlives scafld, runx, and whichever model is best this quarter. Build the part that forces an agent to show its work instead of asserting it, and you can swap everything else around it. This is the load-bearing idea. Everything else in this post is a way of enforcing it.

Gates. Specs move draft → approved → active → review → completed. A human approves at draft. An adversarial reviewer issues a verdict at review. There is no path around either, and no --force flag that ships without leaving a receipt of the bypass. The gates are where the rule gets teeth: a claim with no receipt does not move.
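The gate rule can be sketched as a small state machine. The state names come from the sequence above; the receipt kinds (`human_approval`, `review_verdict`), the dict layout, and the `GateError` type are assumptions made for this illustration, not scafld's internals.

```python
ORDER = ["draft", "approved", "active", "review", "completed"]
# Assumption for this sketch: which kind of receipt each gated transition needs.
REQUIRED = {
    ("draft", "approved"): "human_approval",
    ("review", "completed"): "review_verdict",
}


class GateError(Exception):
    pass


def advance(spec, receipt=None):
    """Move a spec one state forward, recording the receipt that justified it."""
    current = spec["state"]
    if current == ORDER[-1]:
        raise GateError("spec is already completed")
    target = ORDER[ORDER.index(current) + 1]
    needed = REQUIRED.get((current, target))
    if needed and (receipt is None or receipt.get("kind") != needed):
        raise GateError(f"{current} -> {target} requires a {needed} receipt")
    spec["receipts"].append({"from": current, "to": target, "receipt": receipt})
    spec["state"] = target
    return spec
```

There is deliberately no skip or force parameter: the only way to change `state` is through `advance`, and every successful call appends to `receipts`.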

The framing scafld uses for itself, "the agent is replaceable, the protocol is not", is the entire thesis of how we run this. We don't trust agents to plan. We trust them to build against a plan a human approved.

We run an adversary at both ends of the pipeline. The first one runs here, before any code exists. scafld harden hands the draft to a different agent, one that did not write the spec, with the same xhigh reasoning effort as the diff review but pointed at the plan instead of the code. It looks for unstated dependencies, plausible failure modes the Risks section missed, scope creep dressed up as acceptance criteria. The point is the fresh context. The agent that wrote the spec has already talked itself into the spec and cannot see the holes in it. An agent with no investment in the plan reads it cold and can.

But the sharper thing harden does is challenge the spec's very right to exist. This is the part most people get wrong about coding with AI. Agents are very good at building specs they shouldn't be building. Hand one a spec to "add a retry around the failing call" and it will dutifully add the retry, never asking whether the upstream timeout is wrong. Hand it "patch the merge to skip duplicates" and it will write the patch, never raising that the duplicates exist because the import is broken.

harden asks the questions the builder won't. Is this a bandaid? Is the fix at the right layer? Are we adding code where we should be deleting code? Did the reporter describe a symptom instead of the cause?

A spec from May is the cleanest example we have. We needed to stop a suspended spammer's mail from leaving the system, and the draft did the obvious thing: a silent early return in Email::Sender, the provider adapter, handing back billable_accepted: false so the send vanished before it reached SES. Small, clean, done. harden read it and asked where that return value actually goes. It goes to three orchestrators, and every one of them does the same thing with it. They raise, and the message lands as failed, which the customer can see. The silent drop was the loudest outcome the system had. The spec could not do the thing it was written to do, because it was patching the layer underneath the one that owns the send decision. The rewrite moved the gate up to the orchestrators and made the block deliberate and visible to operators, instead of a flag the layer above was always going to misread.

A real share of our drafts come back from harden like that, marked for rewrite or deletion. Almost all of them would have shipped if we'd skipped the gate. They would have passed their tests, and done the opposite of what they promised.

AI without this gate accumulates technical debt faster than humans, not slower. Agents are paid in output. They build what you ask. The question of whether you should be asking is the one they're worst at.

The spec template is opinionated about one thing in particular: it forces a Risks section before any phase is written. Here's a real one, from a send-pipeline change we shipped in May:

SES v2 raw sending has different parameter names from v1 send_raw_email; tests must assert exact params.

Cohort thresholds can be too aggressive and demote legitimate customers during small-volume noise.

Cohort thresholds can be too permissive and fail to quarantine abusive traffic quickly enough.

SES tenants do not fully isolate the AWS account reputation. Aggregate bad traffic can still affect the account, so warmup limits and suppression remain required.

That last line is the one we keep coming back to. The spec is forced to admit what the change does not fix. The discipline of writing it out front is what stops us from claiming the gate is wider than it actually is.

The other rule that matters here lives in our root AGENTS.md:

Best-option bias: choose the architecture-aligned solution, not the smallest diff.

That's not a styling preference. It's the line that prevents the lane from drifting toward "what's the cheapest thing the agent can ship to close the spec." A bounded change is still allowed to be the right change, not the small one.

3. Build

scafld build runs whichever agent the developer opened that day. Opus 4.7 in the morning, gpt-5.5 in the afternoon. Same spec, same phases, same session ledger. When Claude gets stuck on a phase, Codex can pick it up at the next phase without any state translation; the session is the handoff.

Builder is a swap, not a role. The lane does not care which model wrote the code. It cares that the spec was satisfied and the session has the receipts.
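"The session is the handoff" can be made concrete with a sketch like this one. The ledger entry shape (`phase`, `agent`, `exit_code`) is a simplification assumed for the example, not the real session format.

```python
def next_phase(phases, ledger):
    """Return the first phase with no passing entry, whoever ran the earlier ones."""
    passed = {entry["phase"] for entry in ledger if entry["exit_code"] == 0}
    for phase in phases:
        if phase not in passed:
            return phase
    return None  # every phase has a passing receipt


phases = ["phase-1", "phase-2", "phase-3"]
ledger = [
    {"phase": "phase-1", "agent": "claude", "exit_code": 0},
    {"phase": "phase-2", "agent": "claude", "exit_code": 1},  # stuck here
]
```

Note that `next_phase` never reads the `agent` field. Whichever model resumes gets the same answer, because the answer depends only on what the ledger recorded.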

Here is a test most teams never run on themselves: if your build loop can't survive swapping the agent, the loop is the agent. That's a brittle thing to ship on. We picked this shape so the answer to "which model is best for code" can change every six months without changing how we operate.

4. Review

Every change hits scafld review. It is not skippable. This is the adversary's second end. harden attacked the plan before we built it; review attacks the diff before we ship it. Same move, same reason: the agent that wrote the code has already convinced itself the code is right, so it cannot review it. A different agent, with no stake in the diff, reads it cold.

scafld review <task-id> --provider codex
scafld review <task-id> --provider claude

We default to Codex. We run both when the diff is architectural: schema changes, public API edits, anything that touches the flow engine or the MCP server.

The reason for the default is temperament. Codex is Claude's nitpicking uncle. It finds things that bother it that Claude would let slide. Pattern violations. Missed edge cases. Validation gaps. Schema drift. Silent rescues that paper over real errors. Claude is good at structural cohesion and big-picture risk; useful, but agreeable. Codex argues with the diff. Claude tends to agree with it. The argument is the point.

We run Codex on model_reasoning_effort="xhigh" and have it return three sections: CRITICAL ISSUES, IMPROVEMENTS, POSITIVE NOTES. Critical issues block the merge gate. The full output gets quoted into the PR review comment so the human merger sees the adversarial pass next to whatever Claude said about the same diff.
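A sketch of how a three-section review could drive the merge gate. The section names are the ones listed above; the assumption that each arrives as a heading line followed by one finding per line is ours for this example, and the real review output may be laid out differently.

```python
SECTIONS = ("CRITICAL ISSUES", "IMPROVEMENTS", "POSITIVE NOTES")


def parse_review(text):
    """Split review text into the three sections, one finding per non-empty line."""
    findings = {name: [] for name in SECTIONS}
    current = None
    for line in text.splitlines():
        stripped = line.strip()
        heading = stripped.rstrip(":").upper()
        if heading in SECTIONS:
            current = heading  # start collecting under this section
        elif stripped and current:
            findings[current].append(stripped)
    return findings


def merge_blocked(findings):
    """Any critical issue blocks the gate; the other sections are advisory."""
    return bool(findings["CRITICAL ISSUES"])
```

The gate reads only the parsed findings, so a review that says "looks fine overall" in prose but lists one critical issue still blocks.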

The F2 finding at the top of this post is what a Codex review actually reads like. Here's another from the same review:

F4: Flow::Executor#record_campaign_send_failure! will write a token-scoped failure row for any campaign step that raises (including Wait / Split), which over-counts failures in SendProgress.

That is the exact pedantic tone we want at the gate. Claude does not write reviews that look like this. We've tried. Reviews from the same model that wrote the diff aren't reviews; they're rationalisations. The model that wrote the line is the model least likely to flag the line. Single-agent review loops are theatre.

This isn't a slight against Claude. It's an honest read of what each model is for. We build with either. We review with the other.

5. Postback

The Slack loop closes with three stamped updates on the origin thread:

PR opened: link, scope summary, who's reviewing
Merge gate: review verdict, remaining risk, human merge instruction
Final outcome: merged, closed, or superseded

A mirror lands in #shipped so the engineering feed reads as one stream instead of a thousand thread fragments. One stamped comment per state. No event log. Other teams find the temptation to post every internal transition irresistible; we think that produces noise nobody reads and gates nobody trusts.

What this lane costs us

It's slower than going fast and bad. A spec takes 10 to 40 minutes to write and approve. A single-agent end-to-end shipping loop would be faster on the happy path. We pay the slowness because the happy path isn't where the bugs live.

We pay a Codex token bill on every PR. Two Codex passes (one on the spec, one on the diff), both at xhigh reasoning effort. It's not free; we haven't found a cheaper review that catches as much.

We've gotten specs wrong. Too small, too ambitious, wrong rollback, missing dependencies. scafld harden helps but doesn't eliminate the problem; a bad spec still runs to completion before anyone notices, and the cleanup costs more than the build did.

Adversarial review on diffs doesn't catch operational failures. Codex can't see what happens to your send pipeline at 2× peak load. The most it can do is flag the risk on paper. From the same delivery spec review:

F5: Resume path returns the existing send token without re-enqueueing lost Flow::ExecutorJob rows; recovery from a dropped queue depends entirely on the 15-minute stale-expiry sweep. If GoodJob loses rows mid-send, SendProgress shows pending = N forever for 15 minutes, then expire_stale_reservations! marks them failed, never actually retrying the send.

We have a lived version of this, not just a review finding. For about 48 hours in May, Email::NotificationHandler quietly dropped every bounce and complaint it could not match back to a send. The match depended on metadata riding along in the provider's webhook, and when that metadata did not survive the round trip, the event fell into an else branch that did nothing. No row, no log, no metric. No diff looked wrong, because no diff was wrong. The bug was an absence. nitro_get_insights reported a bounce rate of zero, which is the healthiest number there is, and on the strength of it we shipped campaign 284 into a cohort that was already heavily blacklisted. A reviewer reads the lines that changed. It cannot read the line nobody wrote, and it cannot tell that a dashboard is lying by reporting nothing at all.

That's a class of bug the lane is bad at catching, and we know it. The gate is narrower than the failure surface. Anyone telling you their AI build loop catches everything is selling something.
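The shape of the dropped-event failure is easy to reproduce in miniature. This is an illustration with hypothetical names and event fields, not the real Email::NotificationHandler; the second handler shows one way an absence can be made to leave a trace, offered as a sketch, not as the fix we shipped.

```python
def handle_silently(event, sends, bounces):
    send_id = event.get("metadata", {}).get("send_id")
    if send_id in sends:
        bounces.append(send_id)
    else:
        pass  # no row, no log, no metric: the event is simply gone


def handle_loudly(event, sends, bounces, unmatched):
    send_id = event.get("metadata", {}).get("send_id")
    if send_id in sends:
        bounces.append(send_id)
    else:
        unmatched.append(event)  # an absence now leaves a trace


def bounce_rate(sends, bounces, unmatched=()):
    """Assumes sends is non-empty; returns None when the number cannot be trusted."""
    if unmatched:
        return None  # unknown, which is not the same as zero
    return len(bounces) / len(sends)
```

With the silent handler, two bounce events that lost their metadata produce a rate of `0.0`. With the loud one, the same two events produce `None`, which a dashboard can render as "unknown" instead of "healthy".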

Boundaries

Not auto-merge. Humans approve the spec at draft and merge the PR at the end. Both gates are non-negotiable. Neither is automatable today, and we have no plans to automate either.

The builder and reviewer must be free to disagree. The whole structure assumes they will. Remove the disagreement and you remove the value. If a model could be trusted to review its own work, there would be no need for a review gate to begin with.

The unit of work is the spec, not the model. AI ships bounded changes against a spec a human approved. The interesting object is the spec; whichever model wrote the lines is incidental.

Not best practice. This is the shape we landed on after the version that didn't work. The first version was a single-agent loop. It planned, it built, and it graded its own homework. One afternoon it marked an implementation phase complete, the migration in that phase was broken, and nothing in the loop was positioned to disagree with it, so the migration went to staging. We found out when staging stopped coming up.

None of that is the model's fault. We had built a loop whose only witness to the work was the thing doing the work, then acted surprised when the witness lied. Every gate in the current shape is a scar from that afternoon. The protocol is not a best practice we adopted. It is the list of ways we have already been burned.

Close

The agent is replaceable. The lane is not. Most teams will spend the next year discovering that in reverse order: picking a model, building everything around it, and finding out the hard way that the load-bearing part of their system was the part they didn't build.

The model is a vendor. The protocol is the product.