Why Debugging AI Agents Is a Nightmare
by Ben Letson & Arnon Santos
Why This Matters for Your Business
Agents are here to stay and are a key part of making work more efficient. We use a host of agents to help us get things done, but they don't always work as expected:
- Debugging agents is a new kind of problem because LLMs are stochastic
- Making agents reliable for your coworkers and users can take longer than building them
- Using some old-school statistical modeling is helping us fix our agents faster
The Debugging Nightmare
Let me describe what debugging an agent actually looks like.
Your agent is supposed to process a meeting transcript and create tasks in your project management system. It works 80% of the time. The other 20%, something goes wrong:
- Sometimes it misses action items
- Sometimes it creates duplicate tasks
- Sometimes it assigns tasks to the wrong people
- Sometimes it just... stops
You look at the logs. The agent made a series of decisions. Each decision seemed locally reasonable. But somewhere in the chain, things went sideways.
Which decision was wrong? Hard to say. The decision that looks wrong might have been a reasonable response to an earlier mistake. Or the earlier decision was fine, but the tool it called returned unexpected data. Or the user gave ambiguous input that the agent interpreted differently than you expected.
Can you reproduce it? Maybe. LLMs are stochastic. Run the same input twice, get different outputs. The bug you're hunting might not manifest on the next run.
Where do you even start? The agent touched six different systems, made four decisions, retried two tool calls, and asked the user one clarifying question. The failure could have originated at any of those points.
This is the reality of agentic development, and it gets worse as agents get more complex.
The Hidden Structure: Agents Are Markov Processes
My background is in mathematical modeling, and I've seen similar dynamics before in my travels. Unlike normal software, with its deterministic, if-then logic, agents are inherently stochastic and need to be understood as Markov Processes.
A Markov Process is a mathematical tool for modeling a system as it moves between states, where the probability of the next state depends only on the current state. Imagine being on a ladder and flipping a coin: if it lands Heads, you go up a rung; Tails, down a rung. That's a (very simple) Markov Process, and each rung of the ladder is a state.
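The ladder analogy fits in a few lines of Python. This is just a sketch of the toy example above; the 50/50 coin and the starting rung are the only parameters:

```python
import random

def ladder_walk(flips, start_rung=0, seed=42):
    """Simulate the coin-flip ladder. Each flip moves the walker up one
    rung (Heads) or down one rung (Tails). The next rung depends only on
    the current rung, never on the path taken -- the Markov property."""
    rng = random.Random(seed)
    rung = start_rung
    for _ in range(flips):
        rung += 1 if rng.random() < 0.5 else -1
    return rung
```

Run it twice with the same seed and you get the same walk; change the seed and the path changes. That's the agent reproducibility problem in miniature.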
Agent executions have this structure:
- The agent is always in some state: waiting for input, making a decision, calling a tool, waiting for a user response
- From each state, there are a finite number of possible next states
- The transition between states is probabilistic: the LLM might choose different branches, tools might succeed or fail, users might respond helpfully or unhelpfully
Once you see agents this way, a whole toolkit of analysis becomes available.
A Framework for Agent Analysis
We've built a framework that converts agent architectures into parameterized Markov chains. Here's how it works.
Three Types of States
Every state in an agent execution falls into one of three categories:
Decision Nodes: The LLM chooses between options. Should I ask for clarification or proceed? Should I use tool A or tool B? Is the extraction quality good enough?
Execution Nodes: The agent interacts with an external system. Call an API. Run a database query. Invoke a tool. These can succeed or fail.
Transmission Nodes: The agent interacts with a human (or another agent). Ask for clarification. Request approval. Deliver results. The human might respond helpfully, or not.
We call it the DET model, and it's helping us keep track of the agents we're building in-house and with our clients.
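As a rough sketch (the enum and the state names below are illustrative, not our production code), the three node types might look like this:

```python
from enum import Enum

class NodeType(Enum):
    DECISION = "decision"          # LLM chooses between options
    EXECUTION = "execution"        # call to an external system; may fail
    TRANSMISSION = "transmission"  # interaction with a human or another agent

# Hypothetical states for the transcript-processing agent described earlier.
STATES = {
    "extract_action_items": NodeType.DECISION,
    "create_task":          NodeType.EXECUTION,
    "ask_clarification":    NodeType.TRANSMISSION,
}
```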
From Architecture to Analysis
Given an agent's process diagram and LLM selection, we can build a complete model of how the agent behaves probabilistically. Each state has defined transitions to other states, each with an associated probability.
For a real agent, say one that processes meeting transcripts into structured plans, you might have 50+ states. The model captures every possible path through the system, including retries, fallbacks, and failure modes.
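A toy version of such a parameterized model, shrunk from 50+ states down to five, might look like the sketch below. All state names and probabilities here are hypothetical placeholders you would estimate from your own logs:

```python
def build_chain(p_tool_ok=0.95, p_ask=0.2, p_user_helpful=0.8):
    """Return a tiny parameterized Markov chain: each state maps to its
    outgoing transition probabilities. 'success' and 'failure' are
    absorbing states (no outgoing transitions)."""
    p_retry = (1 - p_tool_ok) / 2  # assume failed tool calls: half retried, half fatal
    return {
        "decide":    {"ask_user": p_ask, "call_tool": 1 - p_ask},
        "ask_user":  {"decide": p_user_helpful, "failure": 1 - p_user_helpful},
        "call_tool": {"success": p_tool_ok, "decide": p_retry, "failure": p_retry},
        "success":   {},
        "failure":   {},
    }

# Sanity check: every transient state's outgoing probabilities must sum to 1.
for state, out in build_chain().items():
    assert not out or abs(sum(out.values()) - 1) < 1e-9
```

The retries and fallbacks mentioned above show up as loops back into earlier states (here, `call_tool` looping back to `decide`).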
This model is now something we can analyze systematically.
What the Analysis Tells You
Once you have the model, you can answer questions that are otherwise unanswerable:
Probability of Success vs. Failure
What's the probability that an agent run completes successfully vs. fails catastrophically? This isn't a guess. It's a calculation. Given your architecture and parameter estimates, the math tells you your expected success rate.
More importantly, it tells you how that rate changes as parameters shift. What happens if your tool reliability drops from 99% to 95%? What if user response quality degrades? You can compute the sensitivity.
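A minimal sketch of that calculation on a toy three-state agent (state names, the retry split, and the probabilities are invented for illustration): the probability of ending in the `success` absorbing state can be computed by fixed-point iteration.

```python
def success_probability(chain, start="decide"):
    """P(eventually reaching 'success') from each state, via fixed-point
    iteration on p(s) = sum_t P(s -> t) * p(t). The absorbing 'success'
    state has p = 1; every other absorbing state has p = 0."""
    p = {s: 1.0 if s == "success" else 0.0 for s in chain}
    for _ in range(5000):  # plenty of sweeps for a small chain to converge
        for s, out in chain.items():
            if out:  # update transient states only
                p[s] = sum(prob * p[t] for t, prob in out.items())
    return p[start]

def toy_chain(p_tool_ok):
    """Illustrative agent: decide -> call_tool -> success, where a failed
    tool call is assumed to be half retried, half fatal."""
    p_retry = (1 - p_tool_ok) / 2
    return {
        "decide":    {"call_tool": 1.0},
        "call_tool": {"success": p_tool_ok, "decide": p_retry, "failure": p_retry},
        "success":   {},
        "failure":   {},
    }

# Sensitivity: how much does end-to-end reliability move with tool reliability?
print(success_probability(toy_chain(0.99)))  # ~0.995
print(success_probability(toy_chain(0.95)))  # ~0.974
```

Note how a 4-point drop in tool reliability costs about 2 points of end-to-end success in this particular toy chain; the retry loop absorbs part of the hit.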
Expected Steps and Cost
How many steps does a typical successful run take? How many tool calls? How many user interactions? These affect latency and cost. The analysis gives you expected values and variances.
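Expected step counts fall out of the same machinery. A sketch, again on an invented two-state chain, using fixed-point iteration on the standard recurrence for time to absorption:

```python
def expected_steps(chain, start):
    """Expected number of transitions before absorption, from each state,
    via fixed-point iteration on E[s] = 1 + sum_t P(s -> t) * E[t]."""
    e = {s: 0.0 for s in chain}
    for _ in range(5000):
        for s, out in chain.items():
            if out:  # absorbing states stay at 0
                e[s] = 1 + sum(prob * e[t] for t, prob in out.items())
    return e[start]

# Illustrative loop: 'work' retries itself half the time before finishing.
chain = {
    "work": {"work": 0.5, "done": 0.5},
    "done": {},
}
print(expected_steps(chain, "work"))  # ~2.0 (geometric: mean 1/p tries, p = 0.5)
```

Multiply expected visit counts by per-state latency or token cost and you get expected cost per run; variances take a bit more algebra but follow the same pattern.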
Bottleneck Identification
Which states contribute most to failure probability? The analysis identifies architectural bottlenecks: decision points where wrong choices cascade, tool calls where failures are catastrophic, user interactions that derail the process.
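One way to sketch bottleneck identification is finite-difference sensitivity: nudge each parameter by the same amount and see which nudge moves end-to-end success the most. The chain, parameter names, and probabilities below are illustrative, and the helpers are redefined from scratch so the snippet stands alone:

```python
def success_probability(chain, start="decide"):
    """P(reaching 'success'), by fixed-point iteration over the chain."""
    p = {s: 1.0 if s == "success" else 0.0 for s in chain}
    for _ in range(5000):
        for s, out in chain.items():
            if out:
                p[s] = sum(prob * p[t] for t, prob in out.items())
    return p[start]

def build_chain(p_tool_ok=0.95, p_user_helpful=0.8):
    """Toy agent chain with two tunable parameters (hypothetical names)."""
    p_retry = (1 - p_tool_ok) / 2
    return {
        "decide":    {"ask_user": 0.2, "call_tool": 0.8},
        "ask_user":  {"decide": p_user_helpful, "failure": 1 - p_user_helpful},
        "call_tool": {"success": p_tool_ok, "decide": p_retry, "failure": p_retry},
        "success":   {},
        "failure":   {},
    }

# Nudge each parameter by one point; the biggest delta is your bottleneck.
base = success_probability(build_chain())
for name, kwargs in [("tool reliability", {"p_tool_ok": 0.96}),
                     ("user helpfulness", {"p_user_helpful": 0.81})]:
    delta = success_probability(build_chain(**kwargs)) - base
    print(f"{name}: +{delta:.4f} success per +0.01 parameter")
```

The same loop, run over every parameter of a real 50-state model, ranks the states by how much a marginal improvement at each one buys you.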
Architecture Comparison
Considering two different agent designs? Model both. Compare their reliability profiles under the same parameter assumptions. Make architectural decisions based on analysis, not intuition.
What This Doesn't Do
Let me be clear about scope. This framework is a systems debugging and design tool, not a theory of intelligence.
It doesn't model:
- What the agent is "thinking"
- How the agent learns or improves
- Optimal decision-making
It does model:
- How failures propagate through agent architectures
- Where architectural fragility lives
- How parameter changes affect reliability
The Markov assumption is explicitly approximate. We're saying: "Given a sufficiently compressed state representation, the next step is conditionally independent of full history." When that assumption is violated, it indicates missing state or architectural fragility, which is itself useful diagnostic information.
Why We Built This
We got tired of debugging agents by staring at logs.
Every agent failure investigation was the same: trace through the execution, make guesses about what went wrong, try to reproduce, fail to reproduce, make changes based on intuition, hope it helps.
This framework gives us something better: a principled way to reason about agent reliability before deployment, identify architectural weaknesses quantitatively, and make targeted improvements based on analysis rather than hunches.
Building agents and want to debug them systematically? Get in touch →