Building the Interface

Pawel Zimoch · ~19 min read · Essay 04

When a domain has well-defined operations—clear inputs, outputs, preconditions—an agent can be precise about what it wants to happen. It can say "ship this order" instead of fumbling with field updates. It can check whether the operation applies before trying it. It can see exactly what went wrong when it fails.

Precision plus feedback is programming. This is why coding assistants work: code has syntax that validates, types that constrain, tests that give feedback. The agent composes operations, sees errors, adjusts. Structure plus feedback.

If you want agents to work in your domain the same way, you need to give them the same thing: a language they can program in. Which means someone has to design that language.

This isn't optional. If you don't design a language intentionally, you get one of two outcomes: either the agent writes in a general-purpose language like Python (and can do anything, including break things), or the agent uses an ad-hoc collection of tools that don't compose coherently (and you get unpredictable behavior at the seams).

A well-designed domain-specific language gives you the best of both worlds: the agent has real expressive power - it can compose operations, handle conditionals, build complex workflows - but only within boundaries you define. The language itself is the sandbox.

This piece is about how to design that language. Not theory - practice. What operations to include, how to define them, how to handle composition and state, what to do when the language needs to grow.

What You're Actually Building

A domain-specific language for agents has three components:

Vocabulary: The operations the agent can perform. In a support ticket system: classify, assign, respond, escalate, resolve. In an expense system: create_expense, attach_receipt, set_category, submit. Each operation is a verb in the language.

Type system: The entities and values the operations work with. Tickets, customers, categories, priorities. What types exist, what values are valid, what can be passed to what. The type system constrains what can be said - you can't assign a ticket to a category, because the types don't match.

Runtime: The validation and enforcement layer. Preconditions checked before operations execute. Invariants enforced after. State transitions validated. Hooks triggered at specified points in execution. The runtime is what makes the language safe - it catches invalid programs before they cause harm.

If you've built APIs before, this should feel familiar. An API is a kind of DSL. The difference is intentionality: instead of exposing whatever operations your backend happens to support, you're designing a coherent language for expressing workflows in your domain.

This matters because not all structure is created equal. You can build a DSL that is tangled and incoherent—operations that bypass validation, entities with circular dependencies, rules that contradict each other. Bad structure creates a false sense of safety: the agent respects the boundaries until it doesn't, and then things fail in ways that are hard to understand or recover from.

What distinguishes a good DSL from a bad one is coherence. Operations that combine predictably. Entities with clear responsibilities. Rules that actually hold when enforced. States that can't contradict each other. When you read the DSL, you should be able to predict what will and won't happen.

Building a coherent DSL is harder than building just any structure. But it's worth it. A bad DSL compounds problems at machine speed. A good one enables agents to operate reliably at scale.

Start With Entities

Before defining operations, you need to understand what they operate on.

Domain entities are the nouns in a system. In a support ticket system, these might include tickets, customers, agents, categories, and responses.

Each entity has properties worth making explicit:

Identity distinguishes instances. A Ticket has an ID; a Customer has an ID and perhaps an email. Some entities are first-class objects with their own identity. Others are values, defined entirely by their contents.

State describes where an entity is in its lifecycle. A ticket might progress through states: new, triaged, in_progress, waiting_on_customer, escalated, resolved, closed. Not every entity has meaningful states—a Category might just be a value—but entities that do undergo state changes need explicit enumeration.

Attributes are the fields that describe an entity. A ticket carries: id, customer, category, priority, status, created_at, assigned_agent, responses. Clarity about which are required, which optional, which mutable, and which immutable shapes how operations interact with the entity.

This is domain modeling: making explicit what exists in the system and what constraints apply. It resembles database schema design but extends further—it defines not just data shape but which state combinations are valid and which attributes matter for operations.
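
As a sketch of this modeling step (type and field names are illustrative, not a prescribed schema), the ticket entity could be expressed in Python with enums for lifecycle states and a dataclass for attributes:

```python
from dataclasses import dataclass, field
from datetime import datetime
from enum import Enum
from typing import Optional

class TicketStatus(Enum):
    NEW = "new"
    TRIAGED = "triaged"
    IN_PROGRESS = "in_progress"
    WAITING_ON_CUSTOMER = "waiting_on_customer"
    ESCALATED = "escalated"
    RESOLVED = "resolved"
    CLOSED = "closed"

class Category(Enum):
    BILLING = "billing"
    TECHNICAL = "technical"
    ACCOUNT = "account"
    OTHER = "other"

@dataclass
class Ticket:
    id: str                                   # identity: distinguishes instances
    customer_id: str                          # required, immutable after creation
    status: TicketStatus = TicketStatus.NEW   # explicit lifecycle state
    category: Optional[Category] = None       # optional until triage
    assigned_agent: Optional[str] = None      # mutable via operations only
    created_at: datetime = field(default_factory=datetime.utcnow)

t = Ticket(id="T-1", customer_id="C-9")
```

Making required/optional and mutable/immutable explicit in the type, rather than in documentation, is what later lets operations and the runtime enforce them.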

Define Operations

Operations are the verbs in the language—the actions that can be performed on entities.

Each operation can be described across several dimensions:

Name should be domain-appropriate. categorize_ticket is clearer than update_ticket_field_3. Names establish the vocabulary agents use to express intent.

Inputs specify what the operation receives. Type matters. assign_agent(ticket: Ticket, agent: Agent) is more constrained than assign_agent(ticket_id, agent_id) because types are explicit.

Outputs describe what the operation returns. Some operations return the modified entity. Others return success/failure indicators or newly created entities.

Preconditions define what must be true before the operation can execute. resolve_ticket requires the ticket to be in a resolvable state. assign_agent requires the agent to have available capacity. Preconditions allow operations to fail fast with clear error messages rather than silently succeeding or partially completing.

Effects enumerate what changes. assign_agent modifies the ticket's assigned_agent field and updates the agent's workload. Explicit effects prevent hidden side effects and make system behavior predictable.

Errors are the distinct failure cases. The agent doesn't exist. The ticket is already resolved. The category isn't valid. Named error cases are more useful than generic failures because they let the agent respond intelligently to different problems.

Here's what this looks like for several operations:

categorize_ticket(ticket: Ticket, category: Category) -> Ticket
  preconditions:
    - ticket.status in [new, triaged]
    - category in valid_categories
  effects:
    - ticket.category = category
    - if ticket.status == new: ticket.status = triaged
  errors:
    - INVALID_STATE: ticket is not in a categorizable state
    - INVALID_CATEGORY: category is not in the valid set

assign_agent(ticket: Ticket, agent: Agent) -> Ticket
  preconditions:
    - ticket.status in [triaged, in_progress]
    - agent.current_load < agent.capacity
  effects:
    - ticket.assigned_agent = agent
    - ticket.status = in_progress
    - agent.current_load += 1
  errors:
    - INVALID_STATE: ticket cannot be assigned in current state
    - AGENT_AT_CAPACITY: agent has no capacity

escalate(ticket: Ticket, reason: string, target: EscalationTarget) -> Ticket
  preconditions:
    - ticket.status == in_progress
    - reason.length >= 20
    - target in valid_escalation_targets_for(ticket.category)
  effects:
    - ticket.status = escalated
    - ticket.escalation_reason = reason
    - ticket.escalation_target = target
  errors:
    - INVALID_STATE: ticket is not in progress
    - REASON_TOO_SHORT: escalation reason must be at least 20 characters
    - INVALID_TARGET: target is not valid for this ticket category

Notice how much is explicit here. The preconditions tell you when an operation is valid. The effects tell you exactly what changes. The errors are enumerated, not generic exceptions.

This explicitness is the point. When the agent tries to execute an operation that violates a precondition, you know exactly what went wrong. When you need to debug why something failed, you have specific error cases to examine.
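
A minimal Python sketch of how the categorize_ticket spec above could be enforced (the error codes come from the spec; the Ticket and Category types here are trimmed-down stand-ins):

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class Status(Enum):
    NEW = "new"
    TRIAGED = "triaged"
    IN_PROGRESS = "in_progress"

class Category(Enum):
    BILLING = "billing"
    TECHNICAL = "technical"

@dataclass
class Ticket:
    id: str
    status: Status = Status.NEW
    category: Optional[Category] = None

class OperationError(Exception):
    def __init__(self, code: str, message: str):
        super().__init__(f"{code}: {message}")
        self.code = code   # named error case, not a generic failure

def categorize_ticket(ticket: Ticket, category: Category) -> Ticket:
    # Preconditions: fail fast with the named error codes from the spec.
    if ticket.status not in (Status.NEW, Status.TRIAGED):
        raise OperationError("INVALID_STATE", "ticket is not in a categorizable state")
    if not isinstance(category, Category):
        raise OperationError("INVALID_CATEGORY", "category is not in the valid set")
    # Effects: exactly what the spec declares, nothing more.
    ticket.category = category
    if ticket.status == Status.NEW:
        ticket.status = Status.TRIAGED
    return ticket
```

The operation body reads almost line-for-line like the spec, which is the property worth preserving as the language grows.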

Design for Composition

Agents rarely call single operations. They compose sequences to accomplish workflows. This composition creates design constraints.

Temporal dependencies affect what sequences make sense. If categorize_ticket advances a ticket to the triaged state and assign_agent requires triaged or in_progress, the natural sequence works. If states don't align—if one operation produces state A but the next requires state C—operations become incomposable. State transitions shape what sequences are possible.

Partial failure is inevitable. The agent calls categorize_ticket successfully, then assign_agent fails because the agent has no capacity. The ticket is now categorized but unassigned. How should the system respond?

Different approaches handle this differently: transactional semantics roll the whole sequence back; compensating operations undo the steps that completed; or the system keeps the partial state and reports exactly which steps succeeded.

Each approach has tradeoffs. What matters is that the choice is explicit, not emergent.
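
One way to make the choice explicit is a small runner that records completed steps and applies compensating operations on failure. This is a sketch of the compensation approach only; the step and undo functions are hypothetical stand-ins for real operations:

```python
def run_with_compensation(steps):
    """Run (do, undo) pairs in order; on failure, undo completed steps in reverse."""
    done = []
    for do, undo in steps:
        try:
            do()
            done.append(undo)
        except Exception:
            for compensate in reversed(done):
                compensate()
            raise   # surface the original, specific error to the agent

log = []

def categorize():   # stands in for categorize_ticket; succeeds
    log.append("categorized")

def assign():       # stands in for assign_agent; fails
    raise RuntimeError("AGENT_AT_CAPACITY")

try:
    run_with_compensation([
        (categorize, lambda: log.append("uncategorized")),
        (assign, lambda: None),
    ])
except RuntimeError:
    pass   # the agent sees AGENT_AT_CAPACITY; state was rolled back
```

After the failed run, the log shows the categorization was undone, so the system never sits in the half-done state.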

Conditionals and control flow determine the language's expressiveness. Agents need to express "if this, then that"—making decisions based on intermediate results. A language that only supports linear sequences limits what agents can accomplish.

ticket = get_ticket(ticket_id)
category = classify(ticket.content, options=[billing, technical, account, other])

if category == unknown:
    escalate(ticket, reason="Unable to classify", target=human_review)
else:
    categorize_ticket(ticket, category)

    if ticket.customer.lifetime_value > 50000:
        priority = high
    else:
        priority = assess_priority(ticket)

    set_priority(ticket, priority)

This is real control flow: the agent makes decisions based on intermediate results. A language supporting this enables agents to handle variations in workflows rather than forcing linear execution.

Build the Runtime

The runtime is the validation layer that sits between agent code and the actual system. This layer makes the language safe.

Schema validation checks inputs before execution. A ticket ID should match the ID format. A category should be in the valid set. Malformed inputs are rejected immediately.

Precondition checking verifies that operation preconditions hold. Is the ticket in the right state? Does the agent have capacity? These checks run before any state changes happen.

Effect application executes the operation if checks pass. Entities are updated. State changes. This is where the actual system is modified.

Invariant enforcement happens after effects are applied. The system verifies that global constraints still hold: every ticket has a customer, resolved tickets have resolution notes, and so on. Invariants catch problems that individual operation preconditions might miss.

Error handling structures failures for agent interpretation. Rather than "operation failed," errors are specific: INVALID_STATE, AGENT_AT_CAPACITY, INVALID_CATEGORY. This specificity lets agents handle different failures appropriately.

The runtime should be simple and auditable. This is the component requiring trust. Keeping it small enough to inspect completely and testing every precondition and invariant makes that trust justified.
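
The layers above can be sketched as a single pipeline function. This is a deliberately tiny illustration, with the per-operation checks passed in as callables and a dict standing in for real entities:

```python
class ExecutionError(Exception):
    def __init__(self, code: str, message: str):
        super().__init__(f"{code}: {message}")
        self.code = code

def execute(inputs, *, validate, preconditions, apply, invariants):
    """Run one operation through each runtime layer, in order."""
    # 1. Schema validation: reject malformed inputs immediately.
    for code, msg in validate(inputs):
        raise ExecutionError(code, msg)
    # 2. Precondition checking: before any state changes.
    for code, msg in preconditions(inputs):
        raise ExecutionError(code, msg)
    # 3. Effect application: the only place state is modified.
    result = apply(inputs)
    # 4. Invariant enforcement: global constraints, after effects.
    for code, msg in invariants(result):
        raise ExecutionError(code, msg)
    # 5. Errors above carry specific codes; success returns the result.
    return result

result = execute(
    {"ticket": {"status": "new", "category": None}, "category": "billing"},
    validate=lambda i: [] if i["category"] in {"billing", "technical"}
        else [("INVALID_CATEGORY", "not in valid set")],
    preconditions=lambda i: [] if i["ticket"]["status"] in {"new", "triaged"}
        else [("INVALID_STATE", "not categorizable")],
    apply=lambda i: {**i["ticket"], "category": i["category"], "status": "triaged"},
    invariants=lambda t: [] if t["category"] is not None
        else [("MISSING_CATEGORY", "categorized ticket must have a category")],
)
```

The whole runtime fits on one screen, which is exactly the auditability property the section argues for.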

Implement Judgment Operations

Some operations require the agent to make a judgment rather than execute mechanical steps.

classify(content, options) translates unstructured content into a category. assess_priority(ticket) evaluates urgency. These operations produce value from fuzzy reasoning but need structure around them.

Judgment operations have distinctive properties that shape how they're implemented:

Constrained outputs keep judgment fuzzy but structure discrete. classify returns one of specified options (or unknown), not arbitrary values. The structure constrains outputs even though the decision-making process is probabilistic.

Confidence signals uncertainty. Including confidence scores—classify(content, options) -> (option, confidence)—lets the system respond appropriately to uncertainty. Low confidence can trigger escalation or additional review.

Logging enables improvement. Recorded judgments capture input, decision, and confidence. This becomes the data for evaluation and iterative refinement.

Ground truth evaluation measures accuracy. Sampling judgments and having humans evaluate them (Was this categorization correct? Was this priority appropriate?) works only because outputs are discrete and logged.

Judgment operations are where agents add value—translating fuzzy inputs into structured decisions. They're also where errors originate. The structure around them—constrained outputs, confidence signals, logging, and evaluation mechanisms—transforms errors from catastrophic to manageable.
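
These four properties can be sketched as a thin wrapper around the model call. The model itself is stubbed out here; in a real system that stub would be an LLM invocation:

```python
from dataclasses import dataclass

@dataclass
class Judgment:
    value: str          # one of the allowed options, or "unknown"
    confidence: float   # 0.0-1.0: the uncertainty signal

judgment_log = []       # recorded judgments: the data for ground-truth evaluation

def classify(content, options, model_call):
    """Constrain a fuzzy model judgment to a discrete, logged output."""
    raw_value, confidence = model_call(content, options)
    # Constrained outputs: anything outside the option set collapses to "unknown".
    value = raw_value if raw_value in options else "unknown"
    # Logging: input, decision, confidence.
    judgment_log.append({"content": content, "value": value, "confidence": confidence})
    return Judgment(value, confidence)

# Stub standing in for an LLM call (hypothetical behavior).
def stub_model(content, options):
    if "invoice" in content:
        return ("billing", 0.85)
    return ("some_free_text_label", 0.30)

d1 = classify("question about my invoice", ["billing", "technical"], stub_model)
d2 = classify("something strange", ["billing", "technical"], stub_model)
```

Note that the second call's free-text label never escapes the wrapper: it is collapsed to "unknown" with low confidence, which downstream code can route to review.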

Handle the Unknown

Categories are always incomplete. State machines encounter cases they weren't designed for. Reality resists neat categorization.

Effective languages accommodate this uncertainty structurally.

UNKNOWN as a legitimate value allows agents to indicate uncertainty without failing. classify can return unknown. assess_priority can return needs_review. These aren't failures but accurate assessments of uncertainty.

Escalation as a primitive gives agents a safe option. escalate(entity, reason) should be available for any entity type. The agent has a way to say "I can't confidently handle this; human review is needed."

Explicit routing connects unknowns to human processes. When something hits unknown or escalates, it should have a defined destination, reviewer, and SLA. Otherwise, unknowns pile up unprocessed.

Unknown as signal repurposes uncertainty for improvement. Tracking what hits unknown reveals gaps: clusters of similar cases indicate categories that don't fit reality. Patterns in escalations show where the language breaks. This data drives evolution.

The goal isn't full automation. It's automating what the system can handle reliably while routing everything else clearly to humans.
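
Treating unknown as a signal can start as simply as tallying what falls through. A sketch, with an illustrative threshold:

```python
from collections import Counter

unknown_samples = Counter()

def record_unknown(content_summary: str):
    """Tally unclassifiable cases so clusters reveal missing categories."""
    unknown_samples[content_summary] += 1

def gaps(min_count: int = 3):
    """Recurring unknowns suggest a category the language lacks."""
    return [s for s, n in unknown_samples.most_common() if n >= min_count]

for _ in range(4):
    record_unknown("refund for duplicate charge")
record_unknown("password reset")
```

Here the cluster of duplicate-charge refunds surfaces as a candidate new category, while the one-off stays below the threshold.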

Evolve the Language

First languages are rarely right. The goal is designing for evolution rather than perfection. Schema migration and versioning become first-class design considerations.

Incremental expansion learns from usage. Start with the minimum needed for one workflow. Once working, add operations for the next workflow. Each addition teaches something about actual patterns and problems.

Careful enum expansion prevents breaking changes. Adding a new category or status affects everything handling those values. Existing code should handle new values sensibly—often by routing to unknown until explicitly updated—rather than failing.
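
The route-to-unknown rule can be enforced at the parsing boundary. A sketch, assuming an UNKNOWN member exists in the enum:

```python
from enum import Enum

class Status(Enum):
    NEW = "new"
    TRIAGED = "triaged"
    UNKNOWN = "unknown"   # safe landing spot for values added later

def parse_status(raw: str) -> Status:
    """Tolerate enum values this code predates: route them to UNKNOWN."""
    try:
        return Status(raw)
    except ValueError:
        return Status.UNKNOWN

s = parse_status("waiting_on_vendor")  # a value from a later language version
```

Code written against the old enum keeps working when new statuses appear; it simply routes them to unknown until it is explicitly updated.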

Graceful deprecation manages transitions. When operations or values need to go away, mark them deprecated before removal. Logging usage shows what's still depending on them and informs deprecation timeline.

Language versioning enables compatibility. Changes to operation signatures or validation logic should be versioned. Scripts can be tagged with language version, making behavior explicit rather than implicit.

Backward compatibility prevents silent failures. Scripts written against older language versions should either continue working or fail clearly. Silent behavior changes generate mysterious production failures.

Language design is product design, with the agent as user. Like any product, evolution follows actual usage patterns and discovered problems.

Anti-Patterns to Avoid

Generic CRUD operations bypass structure. update_ticket(ticket, fields) with arbitrary field dictionaries bypasses validation logic. Agents can set any field to any value. The type system becomes moot.

String typing defers error detection. Everything is a string parsed at runtime rather than a constrained enum. Categories might or might not be valid. Errors move from design time to runtime where they're harder to catch and debug.

Implicit state transitions hide logic. Operations change state as side effects without explicit documentation. The ticket silently moves from new to in_progress somewhere in the implementation. State transitions should be visible and declared.

Missing preconditions delegate validation downstream. Operations callable in any state, with downstream code handling invalid cases. Preconditions checked upfront catch invalid operations immediately.

Catch-all error handling obscures failure. Each operation returns generic success or error with no detail. When things fail, the cause is opaque. Specific, enumerated error cases enable appropriate agent responses.

Unbounded judgment outputs eliminate structure. Judgment operations returning arbitrary strings instead of constrained categories lose the ability to validate, measure, and route based on outputs.

No escape hatch assumes complete automation. Languages expressing every possible workflow without escalation or unknown paths are fantasies. Every system encounters unknowns. Escalation must be built in from the start.

Implementation Approaches

DSL runtimes can be built several ways, each with tradeoffs:

JSON Schema + Validation Library defines operations as JSON schemas with validation libraries (Zod, JSON Schema validators) checking inputs at runtime. Operations execute as function calls with validated inputs. This approach is straightforward and suits simpler domains.

State Machine Libraries (XState, state_machine) fit domains where state transitions are central. Valid states and transitions are declared. The library enforces that only valid transitions can occur. This adds specificity at the cost of more framework constraints.

Code Generation starts with a schema (YAML, JSON, custom syntax) and generates validation code, type definitions, and client libraries. Upfront complexity is higher but consistency as the language evolves is more assured.

Embedded DSL leverages the host language's type system. In TypeScript, typed functions with discriminated unions let the compiler enforce many constraints at build time. This is powerful but requires matching the team's existing technology stack.

The right approach depends on domain complexity and team expertise. Starting simple—JSON Schema + validation library—is often practical. Sophistication can be added later if needed.

Discovering the Domain

Domain modeling assumes domain understanding. But perfect understanding comes through operation, not before.

Domain discovery is inherently iterative. Full design before operation is impossible, but some initial structure is necessary.

Expert engagement surfaces domain knowledge. Domain experts walking through specific cases—"When this happens, what are you checking? What could go wrong?"—reveal the nouns (entities) and verbs (operations) that matter. Multiple conversations reveal patterns.

Happy path modeling provides a starting point. Designing the most common, straightforward case first allows edge cases to emerge through operation. The structure evolves with each discovery.

UNKNOWN as exploratory tool identifies gaps. When unsure if a category is correct, routing to UNKNOWN is cheap. High UNKNOWN rates reveal where the model doesn't fit reality.

Expectation of revision normalizes iteration. First domain models contain errors: too-broad categories, missing states, operations needing different preconditions. This is normal, not a design failure. The goal is building a structure that evolves as understanding deepens.

For more on iterative learning through deployment, see The Boundary Model.

A Complete Worked Example

Let's build a complete DSL for a simple expense approval system.

Entities:

Expense:
  id: ExpenseId
  employee: EmployeeId
  amount: Money
  category: ExpenseCategory
  status: ExpenseStatus
  description: string
  receipt_url: string | null
  submitted_at: DateTime
  decided_at: DateTime | null
  decided_by: UserId | null
  decision_reason: string | null

ExpenseCategory = travel | meals | supplies | software | other
ExpenseStatus = draft | submitted | approved | rejected | paid
Money = { amount: number, currency: Currency }
Currency = USD | EUR | GBP

Operations:

submit_expense(expense: Expense) -> Expense
  preconditions:
    - expense.status == draft
    - expense.amount.amount > 0
    - expense.description.length >= 10
    - expense.category != null
  effects:
    - expense.status = submitted
    - expense.submitted_at = now()
  errors:
    - INVALID_STATUS: expense is not in draft status
    - INVALID_AMOUNT: amount must be positive
    - MISSING_DESCRIPTION: description must be at least 10 characters
    - MISSING_CATEGORY: category is required

auto_approve(expense: Expense, approver: UserId) -> Expense
  preconditions:
    - expense.status == submitted
    - expense.amount.amount <= 100
    - expense.category in [meals, supplies]
  effects:
    - expense.status = approved
    - expense.decided_at = now()
    - expense.decided_by = approver
    - expense.decision_reason = "Auto-approved: under $100 threshold"
  errors:
    - INVALID_STATUS: expense is not submitted
    - AMOUNT_TOO_HIGH: amount exceeds auto-approval threshold
    - CATEGORY_NOT_ELIGIBLE: category requires manual review

request_review(expense: Expense, reason: string) -> Expense
  preconditions:
    - expense.status == submitted
    - reason.length >= 20
  effects:
    - expense.status = submitted  # stays submitted, but flagged
    - expense.decision_reason = reason
  errors:
    - INVALID_STATUS: expense is not submitted
    - REASON_TOO_SHORT: reason must be at least 20 characters

approve(expense: Expense, approver: UserId, reason: string) -> Expense
  preconditions:
    - expense.status == submitted
    - reason.length >= 10
  effects:
    - expense.status = approved
    - expense.decided_at = now()
    - expense.decided_by = approver
    - expense.decision_reason = reason
  errors:
    - INVALID_STATUS: expense is not submitted
    - REASON_TOO_SHORT: reason must be at least 10 characters

reject(expense: Expense, approver: UserId, reason: string) -> Expense
  preconditions:
    - expense.status == submitted
    - reason.length >= 20
  effects:
    - expense.status = rejected
    - expense.decided_at = now()
    - expense.decided_by = approver
    - expense.decision_reason = reason
  errors:
    - INVALID_STATUS: expense is not submitted
    - REASON_TOO_SHORT: rejection reason must be at least 20 characters

escalate(expense: Expense, reason: string) -> void
  preconditions:
    - expense.status == submitted
    - reason.length >= 20
  effects:
    - routes to human review queue with reason
  errors:
    - INVALID_STATUS: expense is not submitted
    - REASON_TOO_SHORT: escalation reason must be at least 20 characters

Judgment operation:

assess_approval(expense: Expense) -> ApprovalDecision
  inputs:
    - expense.amount
    - expense.category
    - expense.description
    - expense.receipt_url (if present)
    - employee's expense history (last 90 days)
  outputs:
    ApprovalDecision = approve | reject | escalate | unknown
  constraints:
    - must return one of the four options
    - confidence score included
    - explanation required

Example workflow:

expense = get_expense(expense_id)

if expense.status != submitted:
    return error("Expense not ready for review")

# Try auto-approval first
if expense.amount.amount <= 100 and expense.category in [meals, supplies]:
    return auto_approve(expense, system_user_id)

# Agent makes judgment
decision = assess_approval(expense)

match decision.value:
    case "approve":
        return approve(expense, agent_user_id, decision.explanation)
    case "reject":
        return reject(expense, agent_user_id, decision.explanation)
    case "escalate":
        return escalate(expense, decision.explanation)
    case "unknown":
        return escalate(expense, "Agent uncertain: " + decision.explanation)

This is a complete, working DSL. Every operation is explicit. Every error is enumerated. The judgment operation has constrained outputs. There's an escape hatch. The workflow composes operations in a clear sequence.

A Minimal Starting Point

If this all seems like a lot, here's the smallest useful language:

classify(content: string, options: list[Category]) -> Category | unknown
escalate(entity_id: string, reason: string) -> void

Two operations. The agent can classify things into categories you specify, or escalate to humans with an explanation.

This is enough to automate triage. Tickets come in, the agent classifies them, unclassifiable ones go to humans. You can measure accuracy. You can route based on category. You can evolve from here. Yes, classification alone could be handled by existing NLU packages - that's the point. You're starting simple so you can focus on building the machinery around it: the evaluation loops, the escalation handling, the operational processes. The capability will grow; the foundation needs to be solid first.
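
The two-operation language, sketched in Python with the model call stubbed out (function and queue names are illustrative):

```python
from typing import Callable

escalation_queue = []   # at this stage, a list is the whole human-review "runtime"

def classify(content: str, options: list, model: Callable) -> str:
    """Return one of `options`, or "unknown" if the model can't commit to one."""
    guess = model(content, options)
    return guess if guess in options else "unknown"

def escalate(entity_id: str, reason: str) -> None:
    """Route to humans with an explanation."""
    escalation_queue.append({"entity_id": entity_id, "reason": reason})

def triage(ticket_id: str, content: str, model) -> str:
    """The minimal workflow: classify, and escalate what can't be classified."""
    category = classify(content, ["billing", "technical"], model)
    if category == "unknown":
        escalate(ticket_id, "Unable to classify")
    return category

# Stub model that fails to produce a valid option.
cat = triage("T-7", "weird edge case", lambda content, options: "no_idea")
```

Everything else in this essay is machinery you can grow around these two calls: logging the judgments, measuring accuracy, expanding the option set.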

Add operations as you need them:

assign(ticket: Ticket, agent: Agent) -> Ticket
respond(ticket: Ticket, message: string) -> Ticket
resolve(ticket: Ticket, resolution_note: string) -> Ticket

Each addition is a deliberate expansion of what the agent can do. Each comes with its validation logic, its preconditions, its error cases.

The language grows incrementally. It stays safe because you're defining each piece explicitly.

Conclusion

Designing a domain-specific language for agents is designing the space of what's possible.

A good language makes the common workflows easy to express, makes invalid operations impossible, makes errors visible and specific, and makes the whole thing observable and measurable.

This is real work. It requires understanding your domain deeply. It requires making decisions you might have avoided. It requires thinking about edge cases and failure modes.

But it's the work that makes reliable agent automation possible. Without it, you're asking agents to operate in a space with no guardrails - and you'll get exactly the unpredictable behavior that implies.

The models are capable enough to write programs in your language. The question is whether you've defined a language worth programming in.


This essay is part of a series on building reliable AI agent systems.

Overview: The Structure Problem

Previous: Long-Running Agents

Next: The Boundary Model — Incremental deployment playbook