A few months ago we shipped an inbox triage agent for a Melbourne insurance brokerage. Forty-person team. About 800 inbound emails a day across the support and claims inboxes. Mix of customers, insurers, vendors, and the occasional aggressive lawyer. The brief was straightforward: classify, route, draft a reply, escalate when confidence drops.

We built it. We rehearsed it against three months of historical email. We deployed it in shadow mode for two weeks where it produced suggested actions a human approved. Then we cut it over.

This is the post-mortem on what broke first. Not what we were proudest of. What broke.

Day three: the urgency classifier was too good

The agent was correctly identifying urgent emails and routing them to a “needs immediate attention” queue. Within 72 hours, the urgent queue had grown to the point where no human could keep up with it. We’d inadvertently revealed that the team had been operating in a constant state of triaging-by-vibe, ignoring most things, and getting away with it because senders followed up. The agent surfaced the actual workload. The team’s first reaction was that the agent was wrong. It wasn’t. We had to redefine “urgent” to mean “actually needs a response in 24 hours” rather than “the sender used the word urgent.”

Day five: the email signatures broke retrieval

The agent was using a retrieval-augmented setup against the brokerage’s policy documents. It was confidently citing the wrong policies for a particular customer segment. The cause: every internal email had a 14-line legal disclaimer in the signature, and the embedding model was matching customer queries against the disclaimers more often than against the actual content. We fixed it by stripping signatures before embedding. That sounds obvious in hindsight, but it never showed up in testing because our test data was retrieved through the API, which strips signatures automatically.
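For anyone facing the same problem, here is a minimal sketch of the kind of pre-embedding cleanup described above. It is not the code we ran. It assumes signatures start either at the conventional "-- " separator line or at a line opening with a known disclaimer phrase. The marker strings and the strip_signature name are hypothetical placeholders.

```python
import re

# Hypothetical disclaimer openers; a real deployment would list the
# organisation's actual boilerplate phrases here.
DISCLAIMER_MARKERS = (
    "this email and any attachments",
    "confidentiality notice",
)

# Conventional signature separator: two dashes, optionally followed by a space.
SIG_DELIMITER = re.compile(r"^-- ?$")


def strip_signature(body: str) -> str:
    """Return the body truncated at the first signature separator or disclaimer line."""
    kept = []
    for line in body.splitlines():
        lowered = line.strip().lower()
        is_separator = SIG_DELIMITER.match(line) is not None
        is_disclaimer = any(lowered.startswith(m) for m in DISCLAIMER_MARKERS)
        if is_separator or is_disclaimer:
            break  # everything from here down is treated as signature
        kept.append(line)
    return "\n".join(kept).strip()
```

The point is where the function runs, not how clever it is: it should be applied to every document and every query on the same path that feeds the embedding model, including the path that builds the test set.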

Day eight: the BCC field killed an entire conversation thread

A senior broker had a habit of BCCing the agent’s monitoring inbox on outgoing customer emails for “the record.” The agent started replying to those threads, thinking it had been added to the conversation. The broker’s customer received an unsolicited reply from “the brokerage” with a politely worded misunderstanding of their query. The fix was a sender allowlist for the agent’s response path, but the trust damage with that one broker took six weeks of weekly review meetings to repair.
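A rough sketch of what a guard on the response path could look like, using Python's standard email package. The should_reply function and the addresses are hypothetical. The second check (the agent must appear in To or Cc) is an extra assumption on top of the allowlist, added because a BCC'd recipient never appears in the visible headers.

```python
from email.message import EmailMessage
from email.utils import getaddresses, parseaddr


def should_reply(msg: EmailMessage, agent_address: str, allowed_senders: set[str]) -> bool:
    """Allow a reply only for allowlisted senders who addressed the agent visibly."""
    sender = parseaddr(str(msg.get("From", "")))[1].lower()
    if sender not in allowed_senders:
        return False  # e.g. an internal broker copying the agent for the record

    # Collect visible recipients; a BCC'd agent will not be among them.
    headers = [str(v) for v in msg.get_all("To", []) + msg.get_all("Cc", [])]
    visible = {addr.lower() for _, addr in getaddresses(headers)}
    return agent_address.lower() in visible
```

Anything that fails the check can still be read and logged. It just never reaches the code that sends mail.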

Day twelve: confidence drift

We’d set the agent’s escalation threshold based on its confidence on the test set. After ten days of live traffic, the model’s average confidence had drifted upward. It wasn’t getting better; it was getting more sure of itself, including on the wrong calls. We added a calibration layer that compared confidence against actual human override rate over a rolling window. This is the kind of thing that doesn’t appear in the launch checklist and absolutely needs to.
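To make the idea concrete, here is a small sketch of a rolling comparison between stated confidence and human overrides. It is an illustration under assumptions, not our production layer: the RollingCalibration class, the window size, and the choice to report a simple gap (mean confidence minus the share of decisions humans left alone) are all hypothetical.

```python
from collections import deque


class RollingCalibration:
    """Track (confidence, overridden) pairs over the most recent decisions."""

    def __init__(self, window: int = 500) -> None:
        # deque with maxlen drops the oldest event once the window is full
        self._events: deque[tuple[float, bool]] = deque(maxlen=window)

    def record(self, confidence: float, overridden: bool) -> None:
        self._events.append((confidence, overridden))

    def gap(self) -> float:
        """Mean confidence minus the fraction of decisions humans did not override.

        A positive gap means the model is more sure of itself than the
        humans' behaviour justifies.
        """
        if not self._events:
            return 0.0
        n = len(self._events)
        mean_confidence = sum(c for c, _ in self._events) / n
        agreement_rate = sum(1 for _, o in self._events if not o) / n
        return mean_confidence - agreement_rate
```

One way to use it: when the gap stays above some tolerance for a full window, raise the escalation threshold or flag the drift for review, rather than trusting the threshold chosen on the test set.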

Day eighteen: nobody trusted it

The agent was technically performing better than the previous manual process. The team didn’t believe it. Half of them were quietly double-checking every routing decision, doubling their workload instead of halving it. The fix wasn’t technical. We ran a weekly session where the team brought their worst examples and we analysed them together. After three sessions, the double-checking dropped. After six, the agent had genuinely earned its seat.

The summary: the model worked. The integration mostly worked. The hard part, the part that took the longest and that we’re still tuning, was the calibration of trust between the agent and the humans whose work it changed. None of this was in the spec. All of it was the project.

If we were doing it again, we’d plan more rehearsal time and more weekly review meetings, and we’d budget for the trust-building as a real workstream. We’d also strip email signatures earlier.

The agent isn’t shipped when it works. It’s shipped when the team has stopped second-guessing it and the people on the receiving end can’t tell.