Much of the discussion about AI agents focuses on short, well-defined tasks. Answer a user's question. Categorize a ticket. Process a form. These are bounded problems where an agent makes a few decisions and delivers a result.
But the real promise - and the real challenge - lies elsewhere: agents that run for hours or days, tackling complex tasks autonomously. An agent that doesn't just answer a question about your codebase but implements an entire feature. An agent that doesn't just categorize a support ticket but handles the entire customer journey. An agent that doesn't just extract data from a document but manages an entire research project.
This is where the value is. And this is where the structure problem becomes even more critical.
The Compounding Problem
The math here is simple but easy to miss.
Say an agent makes decisions with 99% accuracy - quite good by most standards. Over a short task requiring 10 decisions, the probability of getting everything right is 0.99^10 ≈ 90%. Acceptable.
But scale to a complex task requiring 100 decisions: 0.99^100 ≈ 37%. Now you're failing more often than succeeding.
At 500 decisions: 0.99^500 ≈ 0.7%. Essentially guaranteed failure.
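The curve above is easy to reproduce. A minimal sketch:

```python
# Success probability of a task with n independent decisions,
# each made with the given per-decision accuracy.
def task_success_rate(accuracy: float, decisions: int) -> float:
    return accuracy ** decisions

for n in (10, 100, 500):
    print(f"{n:3d} decisions at 99% accuracy: {task_success_rate(0.99, n):.1%}")
# 10 decisions  -> ~90%
# 100 decisions -> ~37%
# 500 decisions -> ~0.7%
```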
This is about error compounding, not agent capability. Any process with a non-zero error rate will eventually fail if it runs long enough without correction. The more capable the agent, the longer it can run - but without error correction, there's always a horizon beyond which failure becomes inevitable.
And real-world tasks have error rates much higher than 1%. Context gets lost. Ambiguities get resolved incorrectly. Edge cases don't match training data. State drifts from what the agent expects. Even the best models operating on realistic tasks make mistakes far more often than 1 in 100.
This is why long-running agents can't just be "better short-running agents." The challenge isn't raw capability - it's the accumulation of errors over time.
Not All Errors Are Equal
The compounding math assumes errors are independent and equally damaging. Reality is messier—which creates both problems and opportunities.
Errors that compound rapidly:
- State corruption (wrong data propagates to every subsequent decision)
- Path-dependent mistakes (choosing the wrong approach wastes all work that follows)
- Silent failures (errors that don't surface until much later, when correction is expensive)
Errors that are more forgiving:
- Reversible mistakes (can be undone without cascading effects)
- Isolated failures (affect one branch of work, not the whole task)
- Detected failures (caught immediately, allowing retry before propagation)
This distinction matters for system design. You want to:
- Make the dangerous errors (state corruption, silent failures) structurally impossible or immediately detectable
- Ensure most errors fall into forgiving categories through checkpointing and reversibility
- Invest error-correction effort where compounding is worst
A 5% error rate with good error detection and recovery might outperform a 1% error rate where errors are silent and irreversible. The structure around errors matters as much as the raw rate.
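A toy simulation makes the comparison concrete. The retry budget and trial counts here are made-up parameters, not measurements - the point is only the shape of the result: detected errors can be retried before they propagate, while a single silent error dooms the task.

```python
import random

def run_task(decisions: int, error_rate: float, detected: bool,
             max_retries: int = 3, rng=None) -> bool:
    """Toy model: each decision fails with probability `error_rate`.
    Detected errors can be retried; a silent error dooms the task."""
    rng = rng or random.Random(0)
    for _ in range(decisions):
        for _attempt in range(max_retries + 1):
            if rng.random() >= error_rate:
                break                 # decision succeeded
            if not detected:
                return False          # silent: propagates uncorrected
        else:
            return False              # retries exhausted
    return True

rng = random.Random(42)
trials = 2000
silent = sum(run_task(100, 0.01, detected=False, rng=rng)
             for _ in range(trials)) / trials
loud = sum(run_task(100, 0.05, detected=True, rng=rng)
           for _ in range(trials)) / trials
print(f"1% silent errors:   {silent:.0%} of 100-decision tasks succeed")
print(f"5% detected errors: {loud:.0%} of 100-decision tasks succeed")
```

With these assumptions, the 5% detected-error agent succeeds almost every time, while the 1% silent-error agent fails most runs.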
Why "Smarter Agents" Don't Solve This
A natural response: just make the agents smarter. Reduce the per-decision error rate and you can run longer before hitting problems.
This helps at the margins. If you reduce errors from 1% to 0.1%, you can extend the viable task length by roughly 10x. That's real progress.
But it doesn't change the fundamental dynamic. At 0.1% error rate, a 1,000-decision task still has only a 37% success rate. You've moved the wall, not removed it.
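The location of the wall follows directly from the same math: the longest task that still succeeds with a target probability is n = ln(target) / ln(1 - error_rate). A quick sketch:

```python
import math

# Longest task (in decisions) that succeeds with probability >= target,
# given a fixed per-decision error rate: n = ln(target) / ln(1 - error_rate).
def viable_task_length(error_rate: float, target: float = 0.5) -> int:
    return int(math.log(target) / math.log(1.0 - error_rate))

for p in (0.01, 0.001, 0.0001):
    print(f"error rate {p:.2%}: wall at ~{viable_task_length(p):,} decisions")
# Each 10x reduction in error rate pushes the wall out ~10x - it never disappears.
```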
And there are limits to how low error rates can go. Agents operate on noisy, incomplete inputs. Instructions are ambiguous. Context windows are finite. The world doesn't match training data perfectly. Some baseline error rate is irreducible given the nature of the task.
The solution isn't to push error rates to zero - that's not achievable. The solution is to detect and correct errors before they compound.
The DNA Lesson
Nature solved this problem billions of years ago.
Every time a human cell divides, it copies 3 billion base pairs of DNA. The enzymes doing this copying - DNA polymerases - have an error rate of about 1 in 10,000 to 1 in 100,000. Sounds good, but at 3 billion operations, that would mean 30,000 to 300,000 errors per cell division. Catastrophic.
Yet the actual error rate is about 1 in a billion. How?
Three layers of error correction:
Proofreading. The polymerase itself checks each nucleotide immediately after adding it. If something doesn't fit, it backs up, removes the error, and tries again. This catches about 99% of initial errors.
Mismatch repair. After replication, separate enzymes scan the new strand for errors that slipped through proofreading. They can identify and excise incorrect sections for re-synthesis.
Additional repair mechanisms. Various other systems catch damage and errors that accumulate over time.
DNA replication isn't accurate because the polymerase never makes mistakes. It's accurate because mistakes get caught and fixed before they propagate.
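The arithmetic of layering is worth spelling out. Using the approximate rates above - and an assumed round figure of 99.9% for mismatch repair's effectiveness, since each layer only needs to catch most of what the previous one missed - the combined rate drops by orders of magnitude:

```python
# Rough arithmetic for layered error correction. Rates are the approximate
# figures from the text; the mismatch-repair factor is an assumed round number.
base_error = 1e-4                                    # ~1 in 10,000 raw
after_proofreading = base_error * 0.01               # proofreading catches ~99%
after_mismatch_repair = after_proofreading * 0.001   # repair catches ~99.9% (assumed)

print(f"raw polymerase:   1 in {1 / base_error:,.0f}")
print(f"+ proofreading:   1 in {1 / after_proofreading:,.0f}")
print(f"+ mismatch repair: 1 in {1 / after_mismatch_repair:,.0f}")  # ~1 in a billion
```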
Long-running agents need the same approach.
What Error Correction Looks Like for Agents
Reliable long-running agents operate within systems that provide:
Observable outcomes: The agent takes actions and receives unambiguous confirmation of success or failure. Not "that seemed to work" but actual verification that operations completed correctly.
Visible errors: Structured feedback makes errors detectable. Error messages, failed tests, validation checks—something concrete identifying what went wrong.
Recoverability: The ability to back up, undo actions, try different approaches, fix broken paths. Errors become correctable rather than catastrophic.
Checkpointing: State is saved at known-good points, enabling resumption from recent checkpoints rather than requiring restart from the beginning.
Reconciliation: Even with strong error detection, some errors escape notice initially. Structured logging makes inconsistencies detectable after the fact, so errors that slip through rarely stay silent for long.

Coding agents exemplify this pattern. Code is executed, producing structured feedback: compiler errors, test failures, runtime exceptions. Each error is specific and actionable. The agent iterates—fix, run, observe. This cycle can continue for hours because errors are caught and corrected, not accumulated. (When agents get stuck in loops, it signals that feedback wasn't clear enough—the solution is better structure, not abandonment of the approach.)
Without this structure—if the agent wrote code into a void with no feedback until the end—even the most capable model would fail on non-trivial tasks. Tests and compilers aren't compensating for model weakness. This structure is what enables strong models to be useful at scale.
The Primitives Matter
As agents take on longer-running tasks, the primitives they have access to become more important.
Consider two ways an agent might interact with a system:
Unstructured: The agent generates natural language descriptions of what it wants to do. A downstream process interprets these and tries to execute them. Errors surface as vague failures or, worse, silent corruption.
Structured: The agent calls specific operations with defined inputs and outputs. Each operation validates its inputs, executes atomically, and returns clear success/failure. State changes are explicit and observable.
For short tasks, the difference might be marginal. The agent can brute-force through a few ambiguous interactions.
For long-running tasks, the difference is everything. Structured primitives give the agent something to push against - clear feedback, defined operations, observable state. The agent can reason about what happened and what to do next. Unstructured interactions accumulate ambiguity until the agent is hopelessly confused about what state the system is actually in.
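A structured primitive might look like the following sketch. The names and the toy store are illustrative; the properties are what matter: inputs are validated before anything executes, failure is explicit and specific, and state stays observable.

```python
from dataclasses import dataclass

@dataclass
class Result:
    ok: bool
    value: object = None
    error: str = ""          # specific, actionable message on failure

# A structured primitive (names are illustrative): validate, execute
# atomically, and return an unambiguous result.
def update_record(store: dict, key: str, value) -> Result:
    if not isinstance(key, str) or not key:
        return Result(ok=False, error="key must be a non-empty string")
    if key not in store:
        return Result(ok=False, error=f"unknown key {key!r}; no state was changed")
    store[key] = value       # all-or-nothing for this toy store
    return Result(ok=True, value=value)

store = {"count": 1}
assert update_record(store, "count", 2).ok
bad = update_record(store, "missing", 5)
print(bad.error)              # unknown key 'missing'; no state was changed
assert store == {"count": 2}  # the failed call left state untouched
```

The agent calling this can reason precisely about what happened; the unstructured alternative would leave it guessing whether anything changed.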
This is why building good agent systems isn't just about model capability. It's about giving agents the right primitives to work with.
The Essential Primitives
Effective long-running systems tend to include:
Atomic operations: Complete success or clean failure, with no partial states that could corrupt the system.
Validation: Malformed requests are caught before execution rather than producing side effects.
Observable state: The agent can verify what happened and know what it has to work with moving forward.
Specific error messages: When things fail, the error explains what went wrong and why, not just "failed."
Checkpointing: Progress is preserved at defined points, enabling recovery rather than restart.
Escape hatches: When the agent gets stuck, it can escalate to humans rather than entering unproductive loops.
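Two of these primitives - checkpointing and escape hatches - can be sketched together. This is a toy, assuming steps are named and idempotent; a real system would checkpoint durable task state, not just step names.

```python
import json
import pathlib

class Escalate(Exception):
    """Raised when the agent exhausts retries and needs a human."""

def run_with_checkpoints(steps, checkpoint_path, max_retries=3):
    """Run `steps` (a list of (name, fn) pairs), checkpointing after each
    success so a crash resumes from the last known-good point, and
    escalating instead of looping when a step keeps failing."""
    path = pathlib.Path(checkpoint_path)
    done = json.loads(path.read_text()) if path.exists() else []
    for name, fn in steps:
        if name in done:
            continue                      # resume: skip completed work
        last_error = None
        for _ in range(max_retries):
            try:
                fn()
                break
            except Exception as exc:      # broad catch: this is a toy retry loop
                last_error = exc
        else:
            raise Escalate(f"step {name!r} failed {max_retries} times: {last_error}")
        done.append(name)
        path.write_text(json.dumps(done))  # known-good point on disk
    return done
```

A rerun after a crash skips everything already recorded in the checkpoint file, and a persistently failing step raises `Escalate` instead of spinning forever.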
These aren't scaffolding for weak models but infrastructure enabling strong models to operate at scale. The model provides reasoning; the structure provides error correction that makes extended reasoning reliable despite noise and ambiguity.
The Capability/Structure Interaction
There's a nuanced relationship between agent capability and structural requirements.
More capable agents can recover from some errors that would stump less capable ones. They can reason about unexpected situations, try alternative approaches, work around problems. This is real value - it extends the envelope of what's achievable.
But capability without structure hits a ceiling. No matter how good the agent's recovery abilities, if errors are invisible, there's nothing to recover from. If state is ambiguous, reasoning about it is guesswork. If operations have undefined behavior, planning is impossible.
Structure without capability also hits a ceiling. The most beautifully designed system is useless if the agent can't understand how to use it, can't reason about which operations to call, can't interpret results correctly.
The magic is in the combination. Capable agents operating within well-structured systems can accomplish things that neither could achieve alone. The structure handles error detection and correction; the agent handles reasoning and adaptation. Each complements what the other lacks.
As models get more capable, the ceiling imposed by capability rises. But the ceiling imposed by lack of structure remains fixed. The most capable model in the world can't reliably execute 1,000 operations with no feedback and no error correction. The structure isn't made obsolete by capability - it's what converts capability into reliable execution.
The Architecture of Long-Running Systems
Building reliable systems for extended tasks typically involves:
Investment in primitives: The operations available to agents matter enormously. Atomic, validated, observable operations are infrastructure that pays off across every task the agent executes.
Recoverability design: Errors become visible and recoverable rather than silent and catastrophic. Systems designed so agents detect "this didn't work" and try different approaches.
Aggressive checkpointing: Progress is saved frequently. Failures at step 847 don't require restart from step 1. Workflows are designed to be resumable.
Tight feedback loops: The shorter the interval between action and feedback, the faster errors get corrected. Agents that test changes immediately outperform those batching changes and testing at the end.
Explicit escalation paths: Some situations exceed what agents can handle. Escalation to humans is designed rather than emergent, preventing agents from entering unproductive loops.
Time-to-failure monitoring: Tracking how long agents can run before requiring intervention reveals the effectiveness of the error-correction infrastructure.
The Long-Term Picture
We're moving into an era where agents will handle increasingly complex, longer-running tasks. This is where the biggest opportunity lies—autonomous systems that can take on real work over extended timeframes.
The bottleneck for this won't be model capability. Models will get better, and their ability to reason, plan, and recover will improve. That's progress worth pursuing.
But model progress alone won't be enough. The compounding error problem is structural. You can't reason your way out of it - you need systems that detect and correct errors in real time.
The organizations that build these systems—the primitives, the feedback loops, the error correction mechanisms—will be the ones that unlock the value of long-running agents. The ones that wait for models to become "smart enough" to operate without structure will keep hitting the same wall, just a bit further out each year.
Structure is how you escape the wall.
This essay is part of a series on building reliable AI agent systems.
Overview: The Structure Problem
Previous: Why Reliable Systems Look the Way They Do
Next: Building the Interface — Building the structured interface agents need