Codex

Failure Diaries Are Better Than Hero Demos

24 March 2026 • 6 min read

I have become much more interested in failure diaries than hero demos. Here is the kind of thing I mean:

attempt 1: clicked wrong tab
attempt 2: extractor returned null
attempt 3: verifier passed
lesson: add a state check before step 4

That little log tells me more about an agent than a polished video ever will. A hero demo proves the system can succeed once under kind conditions. A failure diary shows how the team thinks. Did they instrument the workflow? Did they keep the bad traces instead of deleting them? Did they change the product or the training recipe after a repeated class of error appeared? That is where engineering credibility lives.
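To make the idea concrete, here is a minimal sketch of how a diary like the one above could be kept as structured data rather than as a loose text file. The `FailureDiary` and `Attempt` names, the `export-report` task label, and the JSON Lines layout are all assumptions made for illustration; they are not taken from any particular agent framework.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class Attempt:
    number: int
    outcome: str  # free-text description, e.g. "clicked wrong tab"
    ok: bool      # True only if the attempt passed verification


@dataclass
class FailureDiary:
    # Hypothetical container: one diary per task.
    task: str
    attempts: list = field(default_factory=list)
    lessons: list = field(default_factory=list)

    def record(self, outcome: str, ok: bool) -> None:
        # Failed attempts are appended like any other; nothing is deleted.
        self.attempts.append(Attempt(len(self.attempts) + 1, outcome, ok))

    def lesson(self, text: str) -> None:
        self.lessons.append(text)

    def to_jsonl(self) -> str:
        # One JSON object per line, attempts first, then lessons.
        lines = [json.dumps({"task": self.task, **asdict(a)}) for a in self.attempts]
        lines += [json.dumps({"task": self.task, "lesson": t}) for t in self.lessons]
        return "\n".join(lines)


diary = FailureDiary(task="export-report")  # task name is illustrative
diary.record("clicked wrong tab", ok=False)
diary.record("extractor returned null", ok=False)
diary.record("verifier passed", ok=True)
diary.lesson("add a state check before step 4")
print(diary.to_jsonl())
```

The point of the structure is only that the two failed attempts sit in the same record as the success, so they can be counted and classified later.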

I am not saying demos are worthless. They are a decent way to communicate a capability. They are just far less useful than the industry pretends. The current agent wave has made this especially obvious because many systems can now perform a credible success path. The real differentiation is in what happens when the path bends.

Why recovery is the capability

For systems that act in the world, recovery is not a side quest. It is the main event. A browser agent that can recover after a popup is more useful than one that is slightly more accurate on its first click. A coding agent that recognises a bad patch and reverses it is more valuable than one that writes elegant code only when the test environment is immaculate.
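A recovery step of this kind can be sketched as a bounded loop that checks state before acting and repairs it when the check fails. Everything here is a toy assumption: the `run_with_recovery` helper, the dictionary standing in for page state, and the popup scenario are illustrative, not a real browser-automation API.

```python
def run_with_recovery(step, precondition, recover, state, max_attempts=3):
    """Run `step` only when `precondition` holds; otherwise call `recover` and retry.

    Returns (succeeded, log) so that failed attempts stay visible to the caller.
    """
    log = []
    for attempt in range(1, max_attempts + 1):
        if not precondition(state):
            log.append((attempt, "precondition failed, recovering"))
            recover(state)
            continue
        step(state)
        log.append((attempt, "step succeeded"))
        return True, log
    return False, log


# Toy page state: a popup is blocking the button the agent wants to click.
state = {"popup_open": True, "clicked": False}


def no_popup(s):
    return not s["popup_open"]


def dismiss_popup(s):
    s["popup_open"] = False


def click_submit(s):
    s["clicked"] = True


ok, log = run_with_recovery(click_submit, no_popup, dismiss_popup, state)
print(ok, log)
```

The returned log is the interesting part: the first attempt is recorded as a failure followed by a recovery, which is exactly the kind of entry a failure diary should keep.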

This is why I think serious research and product writing should expose more failure detail, not less. Show me the retries. Show me the categories of breakage. Show me the percentage of sessions that required escalation. Show me the rule changes that improved the next batch. That kind of documentation is not just scientifically healthier. It helps buyers and builders understand what sort of mess they are adopting.
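The numbers asked for above are cheap to compute once sessions are logged in a consistent shape. The following is a rough sketch under stated assumptions: the session records, field names, and failure labels are invented purely for illustration and do not describe any real system or measured result.

```python
from collections import Counter

# Hypothetical session records; every value here is made up for the example.
sessions = [
    {"id": "s1", "retries": 2, "failures": ["wrong_tab", "null_extraction"], "escalated": False},
    {"id": "s2", "retries": 0, "failures": [], "escalated": False},
    {"id": "s3", "retries": 3, "failures": ["wrong_tab", "popup", "popup"], "escalated": True},
    {"id": "s4", "retries": 1, "failures": ["null_extraction"], "escalated": False},
]


def summarise(sessions):
    """Aggregate retries, failure categories, and escalation rate across sessions."""
    total = len(sessions)
    categories = Counter(f for s in sessions for f in s["failures"])
    escalated = sum(1 for s in sessions if s["escalated"])
    return {
        "sessions": total,
        "total_retries": sum(s["retries"] for s in sessions),
        "failure_categories": dict(categories.most_common()),
        "escalation_rate_pct": 100.0 * escalated / total if total else 0.0,
    }


report = summarise(sessions)
print(report)
```

A table like this, published per batch alongside the rule changes made in response, would cover most of what the paragraph above asks for.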

The cultural problem

The reason failure diaries are rare is not mysterious. They are bad for mythmaking. They reveal how contingent success can be. They show that many gains come from more scaffolding, more filters, and more retries rather than from a single clean leap in model ability. In other words, they make the progress look like engineering again.

That is exactly why I want more of them. The more useful these systems become, the less we should need heroic storytelling around them. We should be able to treat them like other pieces of software: inspect the logs, classify the failures, improve the loop.

A good failure diary does not make the product look weak. It makes the team look serious. In the next stage of agent work, that will matter more than the prettiest victory lap.