AgentOps Forensics: How to Debug the Decisions You Can't See
Agent chains fail in ways that unit tests can't catch. Learn how to trace tool choice, memory, and reasoning failures — and attribute every token dollar to the call that spent it.
June 2026 · nokhiz.github.io
TL;DR — 6 Central Insights ⚡
| # | Insight |
|---|---|
| 1 | Agent failures are rarely in the code — they live in the reasoning trace: the sequence of tool calls, memory reads, and inferences that produced a wrong answer |
| 2 | Three root causes account for most agent bugs: wrong tool selection, stale or missing memory, and unconstrained reasoning loops |
| 3 | A failed agent run that costs $50 in tokens looks identical to a successful one from the outside — cost attribution requires trace-level instrumentation |
| 4 | “Works on my machine” for agents means: passed on the eval dataset, failed on a live user’s context window with a different memory state |
| 5 | OpenTelemetry-compatible tracing (LangSmith, Langfuse, Phoenix) is the current standard — ship spans, not print statements |
| 6 | Every agent span should carry: run ID, tool called, input/output tokens, latency, cost, and the reasoning step that triggered it |
1. The Problem Nobody Warned You About 🔍
💡 Key Message: A traditional application either works or throws an exception. An agent can do neither — it can spend 20 API calls reasoning its way to a plausible-sounding wrong answer, and your logs will show nothing except a 200 OK.
Software debugging has a fundamental assumption baked in: when something goes wrong, the system produces a visible artifact. An error. A stack trace. A failed assertion. Somewhere, something signals that reality diverged from expectation.
Agents break that contract.
An LLM agent can call the wrong tool, read from a stale memory store, spin through a reasoning loop that reaches a confident but incorrect conclusion — and produce output that looks entirely reasonable to a downstream system. The response returns 200. The cost invoice arrives at the end of the month. Nobody knows what happened in between.
The old developer joke is “works on my machine.” With agents, the new version is: the agent spent $500 learning nothing, and you only found out when the downstream workflow silently produced wrong answers for two weeks.
1. Thought Experiment 🤔 Same Prompt, Different Failures
Imagine an agent tasked with: “Summarize the last three support tickets for customer Acme Corp and draft a follow-up email.”
On Tuesday, it works. On Wednesday, same prompt, same code:
- It calls the wrong tool (search_tickets instead of get_tickets_by_customer) and retrieves unrelated tickets from a different customer
- It uses a cached memory entry that refers to a conversation from two weeks ago
- It reasons through five steps, determines the summary is complete, and sends a draft email referencing support issues Acme Corp never had
The code didn’t change. The model didn’t change. What changed: the memory state, the tool routing, and a slightly different context window composition. This is the class of bug that tracing was built to catch.
2. The Three Suspects 🕵️
💡 Key Message: Before you can debug an agent, you need a taxonomy. Most failures trace back to one of three root causes — and each requires a different investigative tool.
%%{init: {'theme': 'base', 'themeVariables': {'primaryColor': '#1e2433', 'primaryTextColor': '#f0f4ff', 'primaryBorderColor': '#3a4460', 'lineColor': '#6b7fa3', 'secondaryColor': '#252d3d', 'tertiaryColor': '#1a2030', 'background': '#161c2d', 'mainBkg': '#1e2433', 'nodeBorder': '#3a4460', 'clusterBkg': '#252d3d', 'titleColor': '#f0f4ff', 'edgeLabelBackground': '#252d3d', 'fontFamily': 'monospace'}}}%%
flowchart TD
F["⚠️ Agent Failure\n200 OK · wrong output · no error"]
F --> A["🛠️ Suspect A\nTool Choice"]
F --> B["🧠 Suspect B\nMemory"]
F --> C["💭 Suspect C\nReasoning"]
A --> A1["wrong tool selected\nor wrong parameters"]
A --> A2["tool called in a loop\nno termination signal"]
B --> B1["stale vector store\ncache not refreshed"]
B --> B2["retrieval miss\nno exception thrown"]
C --> C1["chain-of-thought drift\nwrong subgoal adopted"]
C --> C2["correct tool calls\nwrong conclusion"]
style F fill:#3b1f1f,stroke:#7a3a3a,color:#f0f4ff
style A fill:#1a2d1a,stroke:#3a6a3a,color:#f0f4ff
style B fill:#1e2433,stroke:#3a4460,color:#f0f4ff
style C fill:#2a1f3d,stroke:#5a3a7a,color:#f0f4ff
style A1 fill:#1e2433,stroke:#3a4460,color:#f0f4ff
style A2 fill:#1e2433,stroke:#3a4460,color:#f0f4ff
style B1 fill:#1e2433,stroke:#3a4460,color:#f0f4ff
style B2 fill:#1e2433,stroke:#3a4460,color:#f0f4ff
style C1 fill:#1e2433,stroke:#3a4460,color:#f0f4ff
style C2 fill:#1e2433,stroke:#3a4460,color:#f0f4ff
1. Suspect A 🛠️ Tool Choice
The agent chose the wrong tool, used the right tool with the wrong parameters, or invoked a tool at the wrong point in its reasoning sequence.
This is the most common failure mode. Tool selection happens inside the model’s reasoning — it is not deterministic, and it is sensitive to:
- How tools are described in the system prompt (name, description, parameter schema)
- What the model has “seen” earlier in the context window
- Whether the model has been given too many tools and is guessing
| Symptom | Likely Cause |
|---|---|
| Correct answer, wrong data source | Tool name collision or ambiguous descriptions |
| Right tool, wrong parameters | Missing examples in tool schema |
| Tool called 8 times in a loop | No termination condition, missing is_done signal |
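The loop symptom in the last row can be caught with a guard placed in front of tool dispatch. The following is a minimal sketch, not a framework feature: the `LoopGuard` class, its `max_steps` and `max_repeats` limits, and the `LoopGuardError` exception are all hypothetical names chosen for illustration.

```python
import json
from collections import Counter


class LoopGuardError(RuntimeError):
    """Raised when an agent run looks stuck."""


class LoopGuard:
    """Hypothetical guard: caps total steps and identical repeated tool calls."""

    def __init__(self, max_steps=20, max_repeats=3):
        self.max_steps = max_steps
        self.max_repeats = max_repeats
        self.steps = 0
        self.seen = Counter()

    def check(self, tool_name, arguments):
        """Call before each tool dispatch; raises if the run should stop."""
        self.steps += 1
        if self.steps > self.max_steps:
            raise LoopGuardError(f"step budget exceeded ({self.max_steps})")
        # The same tool with the same arguments is the signature of a stuck decision.
        key = f"{tool_name}:{json.dumps(arguments, sort_keys=True)}"
        self.seen[key] += 1
        if self.seen[key] > self.max_repeats:
            raise LoopGuardError(
                f"{tool_name} called {self.seen[key]} times with identical arguments"
            )
```

Raising an exception here is a deliberate choice: it turns a silent loop into the kind of visible artifact that ordinary logging and alerting already know how to handle.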
2. Suspect B 🧠 Memory
The agent read from a stale cache, failed to retrieve relevant episodic context, or had its context window polluted by irrelevant memory entries.
Memory failures are subtle because the agent doesn’t know what it doesn’t know. A retrieval miss doesn’t throw an exception — the model simply reasons without the missing context and produces an answer based on what it has.
📌 Example: A customer success agent retrieves the top-5 most-similar past conversations by embedding similarity. If the vector store hasn’t been updated since yesterday’s batch job, the agent operates on stale state — confidently, and without warning.
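One way to make that staleness visible is to split retrieval results by age before they reach the prompt. This is a sketch under assumptions: `MemoryHit` and `split_by_freshness` are hypothetical names, and it presumes each memory entry stores a timezone-aware `indexed_at` timestamp at write time, which the vector store in the example may or may not do.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class MemoryHit:
    """Hypothetical retrieval result; indexed_at must be recorded at write time."""
    text: str
    indexed_at: datetime


def split_by_freshness(hits, max_age, now=None):
    """Return (fresh, stale) lists so staleness is reported instead of silent."""
    now = now or datetime.now(timezone.utc)
    fresh, stale = [], []
    for hit in hits:
        if now - hit.indexed_at > max_age:
            stale.append(hit)
        else:
            fresh.append(hit)
    return fresh, stale
```

The caller can then decide what to do with the stale list: drop it, refresh it, or pass it to the model with an explicit age label. Any of those is better than treating it as current.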
3. Suspect C 💭 Reasoning
The agent’s chain-of-thought drifted. It developed a plausible but incorrect subgoal, then optimized for that subgoal instead of the original task. Or it was given a goal with enough ambiguity that multiple reasonable interpretations existed — and it picked the wrong one.
Reasoning failures are the hardest to catch because they happen inside the model, not in the tool calls. The trace will show correct tool invocations; the problem is in what the model decided to do with the results.
✏️ Key Rule: If the tool calls look right but the output is wrong, you are looking at a reasoning failure — not a tooling problem. Switch from trace inspection to prompt and context analysis.
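The rule can be read as a triage order: rule out tooling, then memory, and only then suspect reasoning. A rough sketch of that order follows. It assumes spans are plain dicts carrying two hypothetical fields, `validation_errors` (filled in by whatever argument checks you run) and `memory_age_seconds`, and the one-day default threshold is arbitrary.

```python
def triage(spans, max_memory_age_seconds=86_400):
    """Return the first suspect to investigate for a run with wrong output."""
    # Suspect A: any tool call that failed an argument or routing check.
    for span in spans:
        if span.get("validation_errors"):
            return "tool_choice"
    # Suspect B: any memory read older than the allowed age.
    for span in spans:
        age = span.get("memory_age_seconds")
        if age is not None and age > max_memory_age_seconds:
            return "memory"
    # Tool calls and memory look clean, so inspect prompt and context instead.
    return "reasoning"
```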
3. Anatomy of a Failed Run: A Real Trace 🔬
🔬 Research: The following trace is composite but representative — built from real failure patterns seen in production agent deployments using LangSmith and Langfuse trace exports.
1. Scene 📋 The Setup
Agent: Support triage agent. Goal: classify incoming ticket, retrieve similar past tickets, draft initial response.
Tools available: classify_ticket, search_tickets, get_customer_history, draft_response
Memory: Customer conversation history via vector store (updated nightly at 02:00 UTC)
Invocation time: 06:47 UTC (4+ hours after memory was last refreshed)
2. Failure 💥 What the Trace Revealed
%%{init: {'theme': 'base', 'themeVariables': {'primaryColor': '#1e2433', 'primaryTextColor': '#f0f4ff', 'primaryBorderColor': '#3a4460', 'lineColor': '#6b7fa3', 'secondaryColor': '#252d3d', 'tertiaryColor': '#1a2030', 'background': '#161c2d', 'mainBkg': '#1e2433', 'nodeBorder': '#3a4460', 'clusterBkg': '#252d3d', 'titleColor': '#f0f4ff', 'edgeLabelBackground': '#252d3d', 'fontFamily': 'monospace'}}}%%
sequenceDiagram
participant U as 👤 User
participant A as 🤖 Agent
participant T as 🛠️ Tools
participant M as 🧠 Memory
U->>A: Billing discrepancy, Acme Corp
Note over A,T: Step 1: classify
A->>T: classify_ticket(text)
T-->>A: billing / medium
Note over A,T: Step 2: search tickets
A->>T: search_tickets(query)
T-->>A: 5 tickets — wrong customer ❌
Note over A,M: Step 3: customer history
A->>M: get_customer_history(acme)
M-->>A: stale — 18 days old ❌
Note over A,M: Step 4: reasoning
A->>A: matches #4421 — recurrence
Note over A,T: Step 5: draft response
A->>T: draft_response(ref=#4421)
T-->>A: draft with wrong reference ❌
A-->>U: Response sent ❌
Three failures in sequence:
- search_tickets returned tickets without filtering by customer: wrong tool parameters, no customer ID passed
- get_customer_history returned an 18-day-old cache entry: memory staleness not surfaced in the response
- The model’s reasoning connected these mismatched results into a coherent (but wrong) narrative
Total tokens: 5,880 (4,502 input, 1,378 output). Total cost: $0.174. Time elapsed: 3.2 seconds. Zero errors in logs.
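The first failure in this run, a search with no customer scope, is the kind of thing a pre-dispatch check can flag. The sketch below assumes a hypothetical `REQUIRED_ARGS` policy table. The tool names come from the setup above, but the argument names `query` and `customer_id` are illustrative guesses at their schemas.

```python
REQUIRED_ARGS = {
    # Hypothetical policy: customer-facing lookups must be scoped to a customer.
    "search_tickets": {"query", "customer_id"},
    "get_customer_history": {"customer_id"},
}


def missing_arguments(tool_name: str, arguments: dict) -> list[str]:
    """Return required argument names that are absent or empty for this tool."""
    required = REQUIRED_ARGS.get(tool_name, set())
    return sorted(name for name in required if not arguments.get(name))
```

Recording the result on the span, rather than only rejecting the call, keeps the evidence in the trace where the rest of the investigation happens.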
4. Cost Attribution: Who Paid for What? 💸
💡 Key Message: Without span-level cost tagging, your monthly LLM invoice is a black box. You know you spent $3,200 — you don’t know that $800 of it came from one agent’s retry loop on a broken tool.
1. Token-Level Attribution 📊
Every span in an agent trace should carry token counts — not just total tokens per run, but input vs. output tokens per step. The ratio tells you where cost is accumulating:
- High input tokens in the reasoning step — context window is being loaded with too much irrelevant memory
- High output tokens in tool-calling steps — the model is generating verbose tool call arguments; schema needs tightening
- Repeated high-cost steps — a loop or retry pattern consuming tokens on a stuck decision
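These three signals are simple to compute once token counts are on each step. A minimal sketch, assuming steps are dicts with `input_tokens` and `output_tokens`. The function names and both thresholds (`max_input`, `max_output_ratio`) are placeholders to tune against your own traffic, not recommended values.

```python
from collections import Counter


def token_flags(step, max_input=1_000, max_output_ratio=0.3):
    """Return warning labels for one step's token usage."""
    flags = []
    if step["input_tokens"] > max_input:
        flags.append("high_input")
    # Guard against division by zero on steps that report no input tokens.
    ratio = step["output_tokens"] / max(step["input_tokens"], 1)
    if ratio > max_output_ratio:
        flags.append("high_output_ratio")
    return flags


def repeated_steps(step_names, min_count=2):
    """Return step names that occur at least min_count times in one run."""
    counts = Counter(step_names)
    return sorted(name for name, n in counts.items() if n >= min_count)
```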
2. Call-Level Cost Breakdown 💰
| Agent Step | Model | Input Tokens | Output Tokens | Cost |
|---|---|---|---|---|
| classify_ticket | claude-haiku-4-5 | 312 | 48 | $0.002 |
| search_tickets | claude-haiku-4-5 | 580 | 210 | $0.009 |
| get_customer_history | claude-sonnet-4-6 | 1,240 | 180 | $0.054 |
| Reasoning step | claude-sonnet-4-6 | 1,480 | 520 | $0.068 |
| draft_response | claude-sonnet-4-6 | 890 | 420 | $0.041 |
| Total | — | 4,502 | 1,378 | $0.174 |
The reasoning step and customer history retrieval together account for 70% of the cost — and both consumed inflated input tokens because the context window carried stale, irrelevant memory entries. Fix the memory staleness, and you likely cut this run’s cost in half.
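The 70% figure follows directly from the table. A short worked example, using only the per-step costs listed above (the `cost_shares` helper is a hypothetical name):

```python
# (step, input_tokens, output_tokens, cost_usd) copied from the table above.
STEPS = [
    ("classify_ticket", 312, 48, 0.002),
    ("search_tickets", 580, 210, 0.009),
    ("get_customer_history", 1240, 180, 0.054),
    ("Reasoning step", 1480, 520, 0.068),
    ("draft_response", 890, 420, 0.041),
]


def cost_shares(steps):
    """Map each step name to its fraction of the run's total cost."""
    total = sum(cost for *_, cost in steps)
    return {name: cost / total for name, _, _, cost in steps}


shares = cost_shares(STEPS)
top_two = shares["get_customer_history"] + shares["Reasoning step"]
print(f"{top_two:.0%}")  # prints 70%
```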
5. Building a Forensics Stack 🏗️
💡 Key Message: You don’t need a custom observability platform — you need consistent span instrumentation and a place to query it. The tooling already exists.
1. Tracing 🔭 Capturing the Decision Trail
The current standard is OpenTelemetry-compatible spans, with LLM-specific semantic conventions emerging under the OpenTelemetry GenAI working group. Three platforms support this today without significant vendor lock-in:
- Langfuse — open-source, self-hostable, strong cost tracking per trace. Best choice for teams that want data sovereignty
- LangSmith — tight LangChain integration, good dataset/eval tooling. Best for teams already in the LangChain ecosystem
- Arize Phoenix — strong on eval and drift detection, open-source. Best when you’re running evals alongside production traces
Instrument at the agent framework level, not per-tool. One decorator wraps the entire run; individual tool calls become child spans automatically.
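To show what "child spans" means mechanically, here is a stdlib-only sketch of parent and child span tracking. It is a stand-in, not the API of OpenTelemetry or any of the platforms above: the `traced` decorator, the `FINISHED_SPANS` list, and the two decorated functions are hypothetical. Tools are decorated explicitly here to mimic what a framework integration would do for you.

```python
import contextvars
import functools
import time
import uuid

_current_span = contextvars.ContextVar("current_span", default=None)
FINISHED_SPANS = []  # stand-in for an exporter


def traced(name):
    """Decorator: record a span; nested calls become children of the active span."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            parent = _current_span.get()
            span = {
                "span_id": uuid.uuid4().hex[:8],
                "parent_id": parent["span_id"] if parent else None,
                "name": name,
            }
            token = _current_span.set(span)
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            finally:
                span["latency_ms"] = (time.perf_counter() - start) * 1000
                _current_span.reset(token)
                FINISHED_SPANS.append(span)
        return wrapper
    return decorator


@traced("classify_ticket")
def classify_ticket(text):
    return "billing/medium"  # placeholder result, no model call


@traced("support-triage-run")
def run_agent(text):
    return classify_ticket(text)
```

Calling `run_agent(...)` leaves two entries in `FINISHED_SPANS`, with the tool span's `parent_id` pointing at the run span. A real SDK maintains the same relationship and additionally handles export, sampling, and async context.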
2. Structured Logging 📝 Making Traces Queryable
Raw spans are not enough — they tell you what happened, not why. Add structured metadata to every span:
{
"run_id": "run_9f2a1c",
"agent_step": "tool_call",
"tool_name": "search_tickets",
"reasoning_trigger": "find similar billing issues",
"input_tokens": 580,
"output_tokens": 210,
"latency_ms": 840,
"cost_usd": 0.009,
"memory_age_seconds": 65340,
"customer_id": "acme_corp"
}
memory_age_seconds is the field that would have immediately surfaced the 18-day-old cache entry in the failed run above: on that get_customer_history span it would have read roughly 1,555,200 (18 days × 86,400 seconds). Add it wherever memory is retrieved.
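A small helper can build this record and derive the memory age at the point of retrieval. This is a sketch: `span_record` is a hypothetical function, and it assumes the memory layer can report a timezone-aware write timestamp for the entry it returned.

```python
import json
from datetime import datetime, timezone


def span_record(run_id, tool_name, input_tokens, output_tokens,
                latency_ms, cost_usd, memory_written_at=None, now=None, **tags):
    """Build one JSON-serialisable span record; extra keyword tags pass through."""
    now = now or datetime.now(timezone.utc)
    record = {
        "run_id": run_id,
        "agent_step": "tool_call",
        "tool_name": tool_name,
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        "latency_ms": latency_ms,
        "cost_usd": cost_usd,
    }
    if memory_written_at is not None:
        # Age is computed at read time so a stale entry shows up in the span itself.
        age = now - memory_written_at
        record["memory_age_seconds"] = int(age.total_seconds())
    record.update(tags)
    return record
```

Emitting the result with `json.dumps(record)` gives one log line per span that any log store can filter on, for example by `memory_age_seconds` above a threshold.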
3. Cost Tagging 🏷️ Attribution at the Call Level
Tag every LLM call with a cost center before it reaches your billing dashboard:
- Agent name — which agent type initiated this run
- Trigger source — user-initiated, scheduled, webhook, internal
- Customer tier — if serving B2B, attribute cost to the customer being served
- Task category — billing, support, summarization, etc.
Without these tags, cost spikes cannot be debugged. With them, you can answer: which agent, for which customer category, triggered by which event type, is responsible for the $800 spike on Tuesday?
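Once spans carry these tags, answering that question is a group-by. A minimal sketch, assuming spans are dicts with a `cost_usd` field. The `cost_by` helper and the tag keys in the usage note are hypothetical.

```python
from collections import defaultdict


def cost_by(spans, *tag_names):
    """Sum cost_usd grouped by the given tags; missing tags become 'untagged'."""
    totals = defaultdict(float)
    for span in spans:
        key = tuple(span.get(tag, "untagged") for tag in tag_names)
        totals[key] += span["cost_usd"]
    return dict(totals)
```

For example, `cost_by(spans, "agent", "trigger", "customer_tier")` returns one total per combination. The explicit `untagged` bucket is deliberate: it shows how much spend is still escaping attribution.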
6. What a Good Trace Looks Like 🎯
%%{init: {'theme': 'base', 'themeVariables': {'primaryColor': '#1e2433', 'primaryTextColor': '#f0f4ff', 'primaryBorderColor': '#3a4460', 'lineColor': '#6b7fa3', 'secondaryColor': '#252d3d', 'tertiaryColor': '#1a2030', 'background': '#161c2d', 'mainBkg': '#1e2433', 'nodeBorder': '#3a4460', 'clusterBkg': '#252d3d', 'titleColor': '#f0f4ff', 'edgeLabelBackground': '#252d3d', 'fontFamily': 'monospace'}}}%%
graph TD
RUN["🏃 Run: run_9f2a1c
agent=support-triage
trigger=webhook
customer_tier=enterprise"]
S1["📋 Span: classify_ticket
tokens_in=312 tokens_out=48
cost=$0.002 latency=210ms
result=billing/medium"]
S2["🔍 Span: search_tickets
tokens_in=580 tokens_out=210
cost=$0.009 latency=840ms
customer_id=acme_corp ✅"]
S3["🧠 Span: get_customer_history
tokens_in=1240 tokens_out=180
cost=$0.054 latency=620ms
memory_age_seconds=43 ✅"]
S4["💭 Span: reasoning
tokens_in=1480 tokens_out=520
cost=$0.068 latency=1240ms
decision=draft_from_ticket_#4380"]
S5["✉️ Span: draft_response
tokens_in=890 tokens_out=420
cost=$0.041 latency=980ms
reference_ticket=#4380 ✅"]
RUN --> S1
S1 --> S2
S2 --> S3
S3 --> S4
S4 --> S5
style RUN fill:#2a1f3d,stroke:#5a3a7a,color:#f0f4ff
style S1 fill:#1a2d1a,stroke:#3a6a3a,color:#f0f4ff
style S2 fill:#1a2d1a,stroke:#3a6a3a,color:#f0f4ff
style S3 fill:#1a2d1a,stroke:#3a6a3a,color:#f0f4ff
style S4 fill:#1e2433,stroke:#3a4460,color:#f0f4ff
style S5 fill:#1a2d1a,stroke:#3a6a3a,color:#f0f4ff
Every span carries cost, tokens, and latency. Customer ID is propagated through tool calls. Memory age is explicit. The reasoning step shows which ticket it decided to reference — making the output auditable without re-running the agent.
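A completeness check makes this enforceable in CI or at ingestion time. The sketch below assumes spans arrive as dicts using the field names from the structured-logging example earlier. `REQUIRED_FIELDS` and `missing_fields` are hypothetical names.

```python
REQUIRED_FIELDS = (
    "run_id",
    "tool_name",
    "input_tokens",
    "output_tokens",
    "latency_ms",
    "cost_usd",
    "reasoning_trigger",
)


def missing_fields(span):
    """Return required field names that are absent or None on this span."""
    # Zero is a valid value (for example 0 output tokens), so only None counts as missing.
    return [name for name in REQUIRED_FIELDS if span.get(name) is None]
```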
7. The Takeaway 🚀
💡 Key Message: Agent observability is not optional once you have agents in production. It is the difference between a system you can improve and a system you can only restart and hope for the best.
The three questions every agent deployment needs to answer:
- What did the agent decide, and why? — answered by reasoning traces
- Where did it go wrong? — answered by span-level inputs and outputs compared against expected behavior
- What did it cost, and who should it be attributed to? — answered by cost-tagged spans
The tooling exists. Langfuse, LangSmith, and Phoenix all provide what you need without significant overhead. The gap is instrumentation discipline — making it a first-class concern at build time, not a retrofit after the first $500 mystery invoice.
You can’t debug what you can’t see. Agents make invisible decisions by default. Make them visible.
Tags: ai · agents · observability · agentops