Why Coding Agents Work and Clippy Doesn't: What the Evidence Shows

Pawel Zimoch · ~10 min read · Essay 10

The framework I've been describing isn't speculation. We're watching it play out in real time.

Some AI agent products have seen strong adoption. Others have flopped despite massive investment and distribution. The pattern isn't random - it maps directly to whether the domain has structure that makes agent operation reliable.

Looking at where agents are working and where they're not tells us something important about what's actually required for agent deployment to succeed.

The Coding Agent Success Story

Coding assistants and agents have seen the strongest adoption of any agent category. Claude Code, Cursor, GitHub Copilot, Windsurf - these tools have real users doing real work. Not just demos. Not just experiments. Daily production use by professional engineers.

This didn't happen by accident.

If you tried using an AI agent to write code 12-18 months ago, it was a frustrating experience. The agent would overwrite files you didn't ask it to touch. It would make changes unrelated to your prompt. It would delete code that was working fine. It would confidently produce broken builds while assuring you everything was great.

The models weren't that different - GPT-4 existed, Claude existed. What changed was the structure around them.

Today's coding agents don't just "write code." They operate within carefully designed systems:

Structured tool use. Instead of outputting raw text that might be code, agents use explicit tools: read_file, write_file, run_command, search_codebase. Each tool has defined inputs, outputs, and validation. The agent can't accidentally overwrite a file because file operations go through a controlled interface.
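
A minimal sketch of what such a controlled interface can look like. The names (`write_file`, `ToolResult`) and the validation rule are illustrative assumptions, not any particular product's API:

```python
from dataclasses import dataclass
from pathlib import Path

@dataclass
class ToolResult:
    ok: bool
    output: str

def write_file(path: str, content: str, project_root: str = ".") -> ToolResult:
    """All file writes go through this single interface, so they can be validated."""
    root = Path(project_root).resolve()
    target = (root / path).resolve()
    # Validation: refuse any path that escapes the project directory.
    if not target.is_relative_to(root):
        return ToolResult(ok=False, output=f"refused: {path} is outside the project")
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(content)
    return ToolResult(ok=True, output=f"wrote {len(content)} bytes to {path}")
```

Because the agent can only act through interfaces like this, "accidentally overwrote an unrelated file" becomes a refused call rather than a destroyed afternoon.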

Defined operations. Bash commands, file system operations, test runners - these are discrete, structured operations with defined inputs and outputs. The agent composes these primitives rather than generating arbitrary output.

Error correction built in. When a test fails, the agent sees structured output: which test, what error, what line. It can parse this and respond. When a command fails, there's an exit code and error message. The structure enables the feedback loop.
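
To make the point concrete, here is a sketch of parsing pytest-style failure summaries into fields an agent can act on. The exact line format is illustrative:

```python
import re

def parse_failures(test_output: str) -> list[dict]:
    """Extract file, test name, and error from lines like
    'FAILED tests/test_math.py::test_add - AssertionError: assert 2 == 3'."""
    pattern = re.compile(r"FAILED (\S+)::(\S+) - (.+)")
    failures = []
    for line in test_output.splitlines():
        m = pattern.match(line.strip())
        if m:
            failures.append({"file": m.group(1), "test": m.group(2), "error": m.group(3)})
    return failures
```

The value isn't the parsing itself - it's that the test runner emits something parseable at all. Structured output is what turns "something broke" into "fix `test_add` in `tests/test_math.py`."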

Rules and constraints. Coding agents now have explicit rules: don't modify files outside the project, ask before deleting. A runtime layer processes agent actions and enforces policies for what operations are allowed, blocking unwanted actions before they execute. These constraints are enforced, not just suggested.
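
A sketch of such a policy layer, checking each proposed action before it runs. The project path and the two rules shown are invented for illustration:

```python
from pathlib import Path

# Hypothetical project root; a real runtime would take this from config.
PROJECT_ROOT = Path("/workspace/myproject")

def check_action(action: str, target: str) -> tuple[bool, str]:
    """Return (allowed, reason) for a proposed file operation."""
    path = Path(target) if Path(target).is_absolute() else PROJECT_ROOT / target
    path = path.resolve()
    # Rule: don't modify files outside the project.
    if not (path == PROJECT_ROOT or PROJECT_ROOT in path.parents):
        return False, "blocked: target is outside the project"
    # Rule: ask before deleting.
    if action == "delete":
        return False, "paused: deletions require explicit user approval"
    return True, "allowed"
```

The key property: the check runs in the runtime, not in the prompt. The agent can propose anything; only actions that pass the policy actually execute.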

Background agents with checkpoints. Long-running agents save state, checkpoint progress, allow review before proceeding. The user maintains control at defined points.
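
A sketch of the checkpoint pattern: the agent persists its state and pauses until the user approves. The record fields and statuses here are invented for illustration:

```python
import json
from pathlib import Path

def save_checkpoint(path: str, step: int, notes: str, pending_action: str) -> None:
    """Persist progress and the agent's proposed next action, then wait."""
    record = {
        "step": step,
        "notes": notes,                    # what the agent has done so far
        "pending_action": pending_action,  # what it wants to do next
        "status": "awaiting_review",       # user must approve before it continues
    }
    Path(path).write_text(json.dumps(record))

def resume(path: str) -> dict:
    """Resume only if the user has flipped status to 'approved'."""
    record = json.loads(Path(path).read_text())
    assert record["status"] == "approved", "cannot resume without user approval"
    return record
```

The user's control point is explicit: nothing proceeds past a checkpoint until the status changes.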

Model Context Protocol and skills. Structured ways to give agents context about the codebase, the project, the conventions. Not just dumping everything in a prompt - curated, structured context.

This is the architecture I've been describing: agents operating within explicit structure, with defined operations, validation, error correction, and escape hatches. It works because the structure exists.

Why Coding Was First

It's not surprising that coding is where agentic work took off. The conditions were unusually favorable:

Domain expertise was already there. The people building coding agents are engineers. They understand the domain deeply. They know what operations make sense, what errors look like, what structure is needed. There was no domain transfer problem - the builders were the users.

The iteration loop is tight. You can deploy a coding agent improvement and see results in hours. Real usage data accumulates fast. Patterns emerge quickly. This enables rapid learning about what structure is missing and what's working.

The domain is already structured. Code is inherently formal. There's syntax that can be validated. There are tests that pass or fail. There are build systems with clear success/failure signals. The structure wasn't invented for agents - it existed because software development requires it.

Error signals are clear. When an agent makes a mistake in code, you usually know quickly. The build fails. Tests break. The program crashes. Compare this to domains where errors might not surface for weeks or months.

Users are technically sophisticated. Engineers can work around agent limitations, provide better prompts, understand failure modes. They're forgiving of rough edges that would frustrate non-technical users.

This combination - deep domain expertise among builders, fast iteration loops, inherently structured domain, clear error signals, sophisticated users - created ideal conditions for developing agent infrastructure. The structure that makes coding agents work was built through rapid iteration by people who understood what was needed.

The Clippy Problem

On the other end of the spectrum: the broad, general-purpose "assistants" being embedded in platforms like Microsoft Office and Google Workspace.

Adoption has been... underwhelming. Many users find these assistants more annoying than helpful. They try them a few times, get frustrated, and ignore them.

Why? These agents lack the structure to actually do work.

A general-purpose assistant in a document editor faces an almost impossible task. What should it do? Write something? Edit something? Format something? The space of possible actions is enormous and unconstrained. The user's intent is ambiguous. There's no clear domain with defined operations.

When the user asks "help me with this document," the assistant could write new content, edit what's there, restructure it, summarize it, or reformat it - and it has no reliable way to know which one is wanted.

Without structure, every interaction becomes a negotiation. The user has to figure out what to ask for, how to constrain the request, how to evaluate whether the result is good. The agent can't reliably know what's wanted because the space of possibilities hasn't been narrowed.

And when the agent gets it wrong - which it often will, given the ambiguity - the user has to fix it. If correcting the agent's work and re-specifying what you wanted takes longer than doing the task yourself, you stop using the agent. Without structure, even explaining what you want becomes laborious - the specification itself lacks the shared vocabulary that makes concise instructions possible.

This is the structure problem manifested as product failure. The underlying models are highly capable - the same ones that power successful coding agents. But capability without structure produces unreliable behavior that users don't trust.

Where General Assistants Will Work First

For these broad-platform agents to become useful, they need to find structure within the chaos.

My prediction: adoption will start where some semblance of structure already exists.

Email triage. Categorizing, labeling, and routing emails. This is a classification task with definable categories. "These are the senders I always want to see immediately. These are newsletters. These are likely spam. These need a response today." The structure can be defined, the operations are clear, errors are visible.
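
A sketch of what triage against a defined category set can look like. The senders, keywords, and rules are invented for illustration - a real system would learn them from the user:

```python
# Assumed user preferences; in practice these would be configured or learned.
VIP_SENDERS = {"boss@example.com", "cofounder@example.com"}

def triage(sender: str, subject: str, body: str) -> str:
    """Return one of a fixed set of categories - never free-form text."""
    if sender in VIP_SENDERS:
        return "see_immediately"
    if "unsubscribe" in body.lower():
        return "newsletter"
    if "you have won" in subject.lower():
        return "likely_spam"
    if subject.rstrip().endswith("?"):
        return "needs_response"
    return "inbox"
```

The constraint does the work: because the output is one of five labels rather than open-ended prose, every result is checkable and every mistake is visible.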

Form filling. Extracting information from emails or documents and populating structured forms. The form provides the structure - defined fields with expected formats. The agent's job is translation from unstructured to structured. This is exactly what agents are good at when the target structure exists.
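
A sketch of that translation: a form schema defines the fields and expected formats, and extraction fills them. The field names and patterns are invented for illustration:

```python
import re

# Hypothetical form schema: each field declares the format it expects.
FORM_SCHEMA = {
    "invoice_number": r"invoice\s*#?\s*(\w+)",
    "amount": r"\$([\d,]+\.\d{2})",
    "due_date": r"due (?:by |on )?(\d{4}-\d{2}-\d{2})",
}

def fill_form(text: str) -> dict:
    """Populate each field; missing fields stay None so a human can review."""
    form = {}
    for field, pattern in FORM_SCHEMA.items():
        m = re.search(pattern, text, re.IGNORECASE)
        form[field] = m.group(1) if m else None
    return form
```

The target structure defines success: a field is either filled in the expected format or visibly empty. There's no ambiguous middle ground for the agent to hide in.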

Calendar scheduling. Finding times that work, sending invites, handling responses. Calendars are already structured - time slots, attendees, conflicts. The agent navigates existing structure rather than inventing it.

Structured document generation. Not "write me something" but "fill in this template" or "generate a status update in this format." The template provides structure. The agent fills it in.
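
A sketch of template-driven generation with Python's `string.Template`; the status-update fields are invented for illustration:

```python
from string import Template

# The template supplies the structure; the agent only fills in the blanks.
STATUS_TEMPLATE = Template(
    "Status update - $date\n"
    "Done: $done\n"
    "In progress: $in_progress\n"
    "Blocked: $blocked"
)

def render_status(fields: dict) -> str:
    # substitute() raises KeyError on any missing field - a loud, immediate
    # error rather than a silently malformed document.
    return STATUS_TEMPLATE.substitute(fields)
```

The template guarantees the output's shape, so the agent's only job - and its only way to fail - is the content of each blank.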

These tasks share a pattern: they involve translation between unstructured human input and pre-existing structured systems. The structure doesn't need to be invented - it exists. The agent just needs to operate within it.

As these narrower use cases succeed, they'll expand. The calendar agent learns your scheduling preferences and starts making more autonomous decisions. The email agent starts drafting responses, not just categorizing. The form agent handles more complex documents.

The expansion will follow the boundary model: start where structure exists, prove reliability, expand incrementally into adjacent territory.

The Middle Ground: Vertical AI Companies

Between coding agents (highly structured domain, sophisticated users) and Clippy (unstructured domain, general users), there's a middle ground: vertical AI companies building for specific domains.

Harvey for legal work. Sierra for customer service. Various startups tackling specific industries with AI-powered products.

These companies have a theory, implicit or explicit: the way to make agents work is to build domain-specific structure. They're not trying to make a general assistant. They're picking a domain, learning it deeply, building the structure that makes agent operation reliable and usable. Without that structure, you spend more time explaining what you want than doing the thing yourself.

This is the playbook from The Boundary Model: learn the domain, identify implicit structure, make it explicit, build agents that operate within it, iterate based on what breaks.

The advantage over general assistants: focus. They can invest in understanding one domain deeply rather than spreading thin across everything. They can build the specific categories, operations, and validation that make their domain tractable.

The advantage over coding agents: they're tackling domains that don't already have structure, which is harder but also creates more defensible value. The structure they build becomes a moat.

Watch this space. The vertical AI companies that succeed will be the ones that crack the structure problem for their domains. The ones that fail will be the ones that assumed model capability was enough.

What This Means for the Framework

The real-world evidence supports the framework:

Structure enables reliability. Coding agents work because they operate within structure. General assistants struggle because they don't have structure to work within.

Model capability isn't the bottleneck. The same models power successful coding agents and failing general assistants. The difference is what's around the model, not what's in it.

Domain expertise matters. Coding agents were built by engineers for engineers. The domain transfer problem was already solved. Domains where builders don't understand users will struggle.

Iteration speed matters. Tight feedback loops let you discover what structure is missing and fix it. Slow feedback loops leave you guessing.

Start where structure exists. The successful adoption patterns all involve finding or building structure first. Trying to deploy agents into unstructured domains produces the frustrating experiences that kill adoption.

This isn't a prediction about the future. It's a description of what's already happening. The pattern is visible to anyone paying attention.

Looking Forward

Where does this go?

The honest answer: I don't know with confidence. But some guesses:

Coding agents will expand. The structure being built for coding - tools, rules, protocols, checkpoints - will enable more autonomous operation. Background agents that work for hours without supervision. Agents that handle entire features rather than individual tasks. The boundary will keep expanding because the foundation is solid.

Vertical AI will differentiate. Companies that build real domain structure will separate from those that are just "AI for X" wrappers around prompts. The former will achieve reliability that users trust. The latter will plateau at impressive demos that disappoint in production.

General assistants will specialize. The Clippy-style broad assistants will either find niches where structure exists (email, calendar, forms) or fade into irrelevance. "Help me with anything" is too unconstrained to be reliable. "Help me with this specific structured task" is tractable.

Structure will become product. The companies that figure out structure for important domains will have built something valuable - not just AI capability, but the domain-specific language that makes AI capability usable. That's the moat.

What I'm confident about: the pattern will continue. Agents will succeed where structure exists and fail where it doesn't. Model improvements will help at the margins, but they won't solve the fundamental problem that reliable operation requires explicit structure.

The work isn't waiting for better models. The work is building the structure that makes current models useful.

That's what we're watching happen, in real time, right now.


This essay is part of a series on building reliable AI agent systems.

Overview: The Structure Problem

Previous: The Irreducible Human