AgentOps Forensics: How to Debug the Decisions You Can’t See

June 2026 · nokhiz.github.io

TL;DR — 6 Central Insights ⚡

#	Insight
1	Agent failures are rarely in the code — they live in the reasoning trace: the sequence of tool calls, memory reads, and inferences that produced a wrong answer
2	Three root causes account for most agent bugs: wrong tool selection, stale or missing memory, and unconstrained reasoning loops
3	A failed agent run that costs $50 in tokens looks identical to a successful one from the outside — cost attribution requires trace-level instrumentation
4	“Works on my machine” for agents means: passed on the eval dataset, failed on a live user’s context window with a different memory state
5	OpenTelemetry-compatible tracing (LangSmith, Langfuse, Phoenix) is the current standard — ship spans, not print statements
6	Every agent span should carry: run ID, tool called, input/output tokens, latency, cost, and the reasoning step that triggered it

1. The Problem Nobody Warned You About 🔍

💡 Key Message: A traditional application either works or throws an exception. An agent can do neither — it can spend 20 API calls reasoning its way to a plausible-sounding wrong answer, and your logs will show nothing except a 200 OK.

Software debugging has a fundamental assumption baked in: when something goes wrong, the system produces a visible artifact. An error. A stack trace. A failed assertion. Somewhere, something signals that reality diverged from expectation.

Agents break that contract.

An LLM agent can call the wrong tool, read from a stale memory store, spin through a reasoning loop that reaches a confident but incorrect conclusion — and produce output that looks entirely reasonable to a downstream system. The response returns 200. The cost invoice arrives at the end of the month. Nobody knows what happened in between.

The old developer joke is “works on my machine.” With agents, the new version is: the agent spent $500 learning nothing, and you only found out when the downstream workflow silently produced wrong answers for two weeks.

1. Thought Experiment 🤔 Same Prompt, Different Failures

Imagine an agent tasked with: “Summarize the last three support tickets for customer Acme Corp and draft a follow-up email.”

On Tuesday, it works. On Wednesday, same prompt, same code:

It calls the wrong tool — search_tickets instead of get_tickets_by_customer — and retrieves unrelated tickets from a different customer
It uses a cached memory entry that refers to a conversation from two weeks ago
It reasons through five steps, determines the summary is complete, and sends a draft email referencing support issues Acme Corp never had

The code didn’t change. The model didn’t change. What changed: the memory state, the tool routing, and a slightly different context window composition. This is the class of bug that tracing was built to catch.

2. The Three Suspects 🕵️

💡 Key Message: Before you can debug an agent, you need a taxonomy. Most failures trace back to one of three root causes — and each requires a different investigative tool.

%%{init: {'theme': 'base', 'themeVariables': {'primaryColor': '#1e2433', 'primaryTextColor': '#f0f4ff', 'primaryBorderColor': '#3a4460', 'lineColor': '#6b7fa3', 'secondaryColor': '#252d3d', 'tertiaryColor': '#1a2030', 'background': '#161c2d', 'mainBkg': '#1e2433', 'nodeBorder': '#3a4460', 'clusterBkg': '#252d3d', 'titleColor': '#f0f4ff', 'edgeLabelBackground': '#252d3d', 'fontFamily': 'monospace'}}}%%
flowchart TD
    F["⚠️ Agent Failure\n200 OK · wrong output · no error"]

    F --> A["🛠️ Suspect A\nTool Choice"]
    F --> B["🧠 Suspect B\nMemory"]
    F --> C["💭 Suspect C\nReasoning"]

    A --> A1["wrong tool selected\nor wrong parameters"]
    A --> A2["tool called in a loop\nno termination signal"]

    B --> B1["stale vector store\ncache not refreshed"]
    B --> B2["retrieval miss\nno exception thrown"]

    C --> C1["chain-of-thought drift\nwrong subgoal adopted"]
    C --> C2["correct tool calls\nwrong conclusion"]

    style F fill:#3b1f1f,stroke:#7a3a3a,color:#f0f4ff
    style A fill:#1a2d1a,stroke:#3a6a3a,color:#f0f4ff
    style B fill:#1e2433,stroke:#3a4460,color:#f0f4ff
    style C fill:#2a1f3d,stroke:#5a3a7a,color:#f0f4ff
    style A1 fill:#1e2433,stroke:#3a4460,color:#f0f4ff
    style A2 fill:#1e2433,stroke:#3a4460,color:#f0f4ff
    style B1 fill:#1e2433,stroke:#3a4460,color:#f0f4ff
    style B2 fill:#1e2433,stroke:#3a4460,color:#f0f4ff
    style C1 fill:#1e2433,stroke:#3a4460,color:#f0f4ff
    style C2 fill:#1e2433,stroke:#3a4460,color:#f0f4ff

1. Suspect A 🛠️ Tool Choice

The agent chose the wrong tool, used the right tool with the wrong parameters, or invoked a tool at the wrong point in its reasoning sequence.

This is the most common failure mode. Tool selection happens inside the model’s reasoning — it is not deterministic, and it is sensitive to:

How tools are described in the system prompt (name, description, parameter schema)
What the model has “seen” earlier in the context window
Whether the model has been given too many tools and is guessing

Symptom	Likely Cause
Correct answer, wrong data source	Tool name collision or ambiguous descriptions
Right tool, wrong parameters	Missing examples in tool schema
Tool called 8 times in a loop	No termination condition, missing `is_done` signal
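
The loop symptom in the last row can be bounded outside the model. What follows is a minimal sketch, assuming a hypothetical `ToolCallGuard` that the agent loop consults before every tool call. The class name, the exception, and the limit of 3 are illustrative and not taken from any framework.

```python
import json
from collections import Counter


class ToolLoopError(RuntimeError):
    """Raised when the same tool call repeats past the allowed limit."""


class ToolCallGuard:
    """Counts identical (tool, arguments) calls within one agent run."""

    def __init__(self, max_repeats: int = 3) -> None:
        self.max_repeats = max_repeats
        self._seen: Counter = Counter()

    def check(self, tool_name: str, arguments: dict) -> None:
        # Serialize with sorted keys so key order does not make two
        # identical calls look different.
        key = (tool_name, json.dumps(arguments, sort_keys=True))
        self._seen[key] += 1
        if self._seen[key] > self.max_repeats:
            raise ToolLoopError(
                f"{tool_name} called {self._seen[key]} times with identical "
                "arguments; stopping the run instead of looping"
            )


guard = ToolCallGuard(max_repeats=3)
for _ in range(3):
    guard.check("search_tickets", {"query": "billing"})  # within the limit

try:
    guard.check("search_tickets", {"query": "billing"})  # fourth identical call
    loop_detected = False
except ToolLoopError:
    loop_detected = True
```

Raising an exception turns a silent loop into the visible artifact that ordinary debugging relies on. Calls with different arguments are counted separately, so legitimate pagination or refinement is not blocked.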

2. Suspect B 🧠 Memory

The agent read from a stale cache, failed to retrieve relevant episodic context, or had its context window polluted by irrelevant memory entries.

Memory failures are subtle because the agent doesn’t know what it doesn’t know. A retrieval miss doesn’t throw an exception — the model simply reasons without the missing context and produces an answer based on what it has.

📌 Example: A customer success agent retrieves the top-5 most-similar past conversations by embedding similarity. If the vector store hasn’t been updated since yesterday’s batch job, the agent operates on stale state — confidently, and without warning.
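
One way to make staleness and misses visible is to annotate every retrieval result before it reaches the model. The sketch below assumes each memory entry carries a `written_at` timestamp; `MemoryHit`, `RetrievalResult`, `annotate_retrieval`, and the 24-hour budget are hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import List, Optional


@dataclass
class MemoryHit:
    text: str
    written_at: datetime  # when the entry was last refreshed (UTC)


@dataclass
class RetrievalResult:
    hits: List[MemoryHit]
    memory_age_seconds: Optional[float]  # age of the oldest hit; None on a miss
    is_stale: bool
    is_miss: bool


def annotate_retrieval(hits, now, max_age=timedelta(hours=24)):
    """Attach age and staleness flags to hits returned by a memory store.

    This function only inspects timestamps; it never queries the store.
    """
    if not hits:
        # Report the miss explicitly instead of handing the model an empty
        # list that it would silently reason around.
        return RetrievalResult([], None, is_stale=False, is_miss=True)
    oldest = min(hit.written_at for hit in hits)
    age = (now - oldest).total_seconds()
    return RetrievalResult(
        hits, age, is_stale=age > max_age.total_seconds(), is_miss=False
    )


now = datetime(2026, 6, 1, 6, 47, tzinfo=timezone.utc)
fresh = annotate_retrieval([MemoryHit("renewal call", now - timedelta(hours=4))], now)
stale = annotate_retrieval([MemoryHit("old thread", now - timedelta(days=18))], now)
miss = annotate_retrieval([], now)
```

The agent loop can then decide what to do with `is_stale` or `is_miss`: refresh, warn in the prompt, or abort. Either way the condition ends up in the trace.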

3. Suspect C 💭 Reasoning

The agent’s chain-of-thought drifted. It developed a plausible but incorrect subgoal, then optimized for that subgoal instead of the original task. Or it was given a goal with enough ambiguity that multiple reasonable interpretations existed — and it picked the wrong one.

Reasoning failures are the hardest to catch because they happen inside the model, not in the tool calls. The trace will show correct tool invocations; the problem is in what the model decided to do with the results.

✏️ Key Rule: If the tool calls look right but the output is wrong, you are looking at a reasoning failure — not a tooling problem. Switch from trace inspection to prompt and context analysis.

3. Anatomy of a Failed Run: A Real Trace 🔬

🔬 Research: The following trace is composite but representative — built from real failure patterns seen in production agent deployments using LangSmith and Langfuse trace exports.

1. Scene 📋 The Setup

Agent: Support triage agent
Goal: classify incoming ticket, retrieve similar past tickets, draft initial response
Tools available: classify_ticket, search_tickets, get_customer_history, draft_response
Memory: Customer conversation history via vector store (updated nightly at 02:00 UTC)
Invocation time: 06:47 UTC (4+ hours after memory was last refreshed)

2. Failure 💥 What the Trace Revealed

%%{init: {'theme': 'base', 'themeVariables': {'primaryColor': '#1e2433', 'primaryTextColor': '#f0f4ff', 'primaryBorderColor': '#3a4460', 'lineColor': '#6b7fa3', 'secondaryColor': '#252d3d', 'tertiaryColor': '#1a2030', 'background': '#161c2d', 'mainBkg': '#1e2433', 'nodeBorder': '#3a4460', 'clusterBkg': '#252d3d', 'titleColor': '#f0f4ff', 'edgeLabelBackground': '#252d3d', 'fontFamily': 'monospace'}}}%%
sequenceDiagram
    participant U as 👤 User
    participant A as 🤖 Agent
    participant T as 🛠️ Tools
    participant M as 🧠 Memory

    U->>A: Billing discrepancy, Acme Corp
    Note over A,T: Step 1: classify
    A->>T: classify_ticket(text)
    T-->>A: billing / medium

    Note over A,T: Step 2: search tickets
    A->>T: search_tickets(query)
    T-->>A: 5 tickets — wrong customer ❌

    Note over A,M: Step 3: customer history
    A->>M: get_customer_history(acme)
    M-->>A: stale — 18 days old ❌

    Note over A,M: Step 4: reasoning
    A->>A: matches #4421 — recurrence

    Note over A,T: Step 5: draft response
    A->>T: draft_response(ref=#4421)
    T-->>A: draft with wrong reference ❌

    A-->>U: Response sent ❌

Three failures in sequence:

search_tickets returned tickets without filtering by customer — wrong tool parameters, no customer ID passed
get_customer_history returned an 18-day-old cache entry — memory staleness not surfaced in the response
The model’s reasoning connected these mismatched results into a coherent (but wrong) narrative

Total tokens: 5,880 (4,502 input, 1,378 output). Total cost: $0.174. Time elapsed: 3.2 seconds. Zero errors in logs.
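
The first two failures are mechanical enough to detect from span data alone. Here is a rough sketch of such an audit, assuming spans are exported as dictionaries with `tool_name`, `arguments`, and an optional `memory_age_seconds`. That shape, the one-day freshness budget, and the set of customer-scoped tools are assumptions made for this example.

```python
MAX_MEMORY_AGE_SECONDS = 24 * 60 * 60  # assumed freshness budget: one day

# Tools that should always be scoped to a customer (assumption for this sketch).
CUSTOMER_SCOPED_TOOLS = {"search_tickets", "get_customer_history"}


def audit_trace(spans):
    """Return human-readable findings for a list of span dicts."""
    findings = []
    for span in spans:
        tool = span.get("tool_name")
        args = span.get("arguments", {})
        if tool in CUSTOMER_SCOPED_TOOLS and not args.get("customer_id"):
            findings.append(f"{tool}: no customer_id passed")
        age = span.get("memory_age_seconds")
        if age is not None and age > MAX_MEMORY_AGE_SECONDS:
            findings.append(f"{tool}: memory is {age / 86400:.0f} days old")
    return findings


# The failed run from the diagram, reduced to the fields the audit needs.
failed_run = [
    {"tool_name": "classify_ticket", "arguments": {"text": "Billing discrepancy"}},
    {"tool_name": "search_tickets", "arguments": {"query": "billing discrepancy"}},
    {
        "tool_name": "get_customer_history",
        "arguments": {"customer_id": "acme_corp"},
        "memory_age_seconds": 18 * 86400,
    },
    {"tool_name": "draft_response", "arguments": {"ref": "#4421"}},
]

for finding in audit_trace(failed_run):
    print(finding)
```

The third failure, the model stitching mismatched results into a coherent story, does not show up in checks like these. It only becomes visible once the reasoning step records which ticket it chose and why.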

4. Cost Attribution: Who Paid for What? 💸

💡 Key Message: Without span-level cost tagging, your monthly LLM invoice is a black box. You know you spent $3,200 — you don’t know that $800 of it came from one agent’s retry loop on a broken tool.

1. Token-Level Attribution 📊

Every span in an agent trace should carry token counts — not just total tokens per run, but input vs. output tokens per step. The ratio tells you where cost is accumulating:

High input tokens in the reasoning step — context window is being loaded with too much irrelevant memory
High output tokens in tool-calling steps — the model is generating verbose tool call arguments; schema needs tightening
Repeated high-cost steps — a loop or retry pattern consuming tokens on a stuck decision
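
The first and third signals can be computed directly from per-span token counts. The sketch below uses the token numbers from the call-level table in this section plus a hypothetical retry loop. `token_signals` and both thresholds are assumptions to be tuned per agent, not established defaults.

```python
from collections import Counter


def token_signals(spans, ratio_threshold=5.0, repeat_threshold=3):
    """Flag input-heavy spans and steps that repeat within one run."""
    signals = []
    for span in spans:
        # max(..., 1) avoids division by zero for steps with no output tokens.
        ratio = span["input_tokens"] / max(span["output_tokens"], 1)
        if ratio >= ratio_threshold:
            signals.append((span["step"], "input_heavy", round(ratio, 1)))
    counts = Counter(span["step"] for span in spans)
    for step, count in counts.items():
        if count >= repeat_threshold:
            signals.append((step, "repeated", count))
    return signals


spans = [
    {"step": "classify_ticket", "input_tokens": 312, "output_tokens": 48},
    {"step": "search_tickets", "input_tokens": 580, "output_tokens": 210},
    {"step": "get_customer_history", "input_tokens": 1240, "output_tokens": 180},
    {"step": "reasoning", "input_tokens": 1480, "output_tokens": 520},
    {"step": "draft_response", "input_tokens": 890, "output_tokens": 420},
]

# Hypothetical retry loop: search_tickets is called two more times.
retry_run = spans + [spans[1], spans[1]]
signals = token_signals(retry_run)
```

A ratio is only a pointer to where to look. A classification step is naturally input-heavy, so the useful comparison is a step against its own history rather than against a global threshold.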

2. Call-Level Cost Breakdown 💰

Agent Step	Model	Input Tokens	Output Tokens	Cost
classify_ticket	claude-haiku-4-5	312	48	$0.002
search_tickets	claude-haiku-4-5	580	210	$0.009
get_customer_history	claude-sonnet-4-6	1,240	180	$0.054
Reasoning step	claude-sonnet-4-6	1,480	520	$0.068
draft_response	claude-sonnet-4-6	890	420	$0.041
Total	—	4,502	1,378	$0.174

The reasoning step and customer history retrieval together account for 70% of the cost — and both consumed inflated input tokens because the context window carried stale, irrelevant memory entries. Fix the memory staleness, and you likely cut this run’s cost in half.
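
The 70% figure follows from the per-step costs in the table. As a small worked check, with `cost_share` being a helper name invented for this example:

```python
from decimal import Decimal

# Per-step costs in USD, copied from the table above.
step_costs = {
    "classify_ticket": Decimal("0.002"),
    "search_tickets": Decimal("0.009"),
    "get_customer_history": Decimal("0.054"),
    "reasoning": Decimal("0.068"),
    "draft_response": Decimal("0.041"),
}

total = sum(step_costs.values())


def cost_share(steps):
    """Fraction of the run's total cost spent in the given steps."""
    return float(sum(step_costs[step] for step in steps) / total)


share = cost_share(["get_customer_history", "reasoning"])
print(f"total=${total} memory+reasoning share={share:.0%}")
```

`Decimal` keeps the sum exact, which matters once thousands of sub-cent spans are added up for a monthly report.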

5. Building a Forensics Stack 🏗️

💡 Key Message: You don’t need a custom observability platform — you need consistent span instrumentation and a place to query it. The tooling already exists.

1. Tracing 🔭 Capturing the Decision Trail

The current standard is OpenTelemetry-compatible spans, with LLM-specific semantic conventions emerging under the OpenTelemetry GenAI special interest group (SIG). Three platforms support this today without significant vendor lock-in:

Langfuse — open-source, self-hostable, strong cost tracking per trace. Best choice for teams that want data sovereignty
LangSmith — tight LangChain integration, good dataset/eval tooling. Best for teams already in the LangChain ecosystem
Arize Phoenix — strong on eval and drift detection, open-source. Best when you’re running evals alongside production traces

Instrument at the agent framework level, not per-tool. One decorator or framework integration wraps the entire run, and tool calls routed through the instrumented framework are then recorded as child spans of that run.
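
To show how a run span and its child spans relate, here is a standard-library sketch of the mechanism. It is not the API of Langfuse, LangSmith, or Phoenix; `traced`, `finished_spans`, and the span dictionary layout are hypothetical. In this sketch the tool is decorated explicitly, whereas a framework integration would usually do that wrapping for you.

```python
import contextvars
import functools
import time
import uuid

# Holds the span that is currently open in this execution context.
_current_span = contextvars.ContextVar("current_span", default=None)
finished_spans = []  # stand-in for an exporter; a real setup ships these out


def traced(name):
    """Decorator that records a span and links it to the enclosing span."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            parent = _current_span.get()
            span = {
                "span_id": uuid.uuid4().hex[:8],
                "parent_id": parent["span_id"] if parent else None,
                # Children inherit the run ID created by the root span.
                "run_id": parent["run_id"] if parent else uuid.uuid4().hex[:8],
                "name": name,
            }
            token = _current_span.set(span)
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            finally:
                span["latency_ms"] = (time.perf_counter() - start) * 1000
                _current_span.reset(token)
                finished_spans.append(span)
        return wrapper
    return decorator


@traced("classify_ticket")
def classify_ticket(text):
    return "billing/medium"  # placeholder for the real tool


@traced("support-triage")
def run_agent(text):
    return classify_ticket(text)


run_agent("Billing discrepancy, Acme Corp")
child, root = finished_spans  # the child closes first, then the run span
```

Because the parent is read from a context variable, nesting follows the call stack without passing a trace object through every function signature.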

2. Structured Logging 📝 Making Traces Queryable

Raw spans are not enough — they tell you what happened, not why. Add structured metadata to every span:

{
  "run_id": "run_9f2a1c",
  "agent_step": "tool_call",
  "tool_name": "search_tickets",
  "reasoning_trigger": "find similar billing issues",
  "input_tokens": 580,
  "output_tokens": 210,
  "latency_ms": 840,
  "cost_usd": 0.009,
  "memory_age_seconds": 1555200,
  "customer_id": "acme_corp"
}

memory_age_seconds is the field that would have immediately surfaced the 18-day-old cache entry in the failed run above. Add it wherever memory is retrieved.
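
A typed record is one way to keep that metadata consistent across spans. The sketch below mirrors the field names from the JSON above; the `SpanRecord` class and its `to_log_line` method are hypothetical, and treating `memory_age_seconds` as optional is an assumption for steps that never touch memory.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class SpanRecord:
    run_id: str
    agent_step: str
    tool_name: str
    reasoning_trigger: str
    input_tokens: int
    output_tokens: int
    latency_ms: int
    cost_usd: float
    customer_id: str
    memory_age_seconds: Optional[int] = None  # set only when memory is read

    def to_log_line(self) -> str:
        """Serialize to one JSON line, dropping fields that are unset."""
        payload = {k: v for k, v in asdict(self).items() if v is not None}
        return json.dumps(payload, sort_keys=True)


record = SpanRecord(
    run_id="run_9f2a1c",
    agent_step="tool_call",
    tool_name="search_tickets",
    reasoning_trigger="find similar billing issues",
    input_tokens=580,
    output_tokens=210,
    latency_ms=840,
    cost_usd=0.009,
    customer_id="acme_corp",
)
print(record.to_log_line())
```

Omitting a required field raises a `TypeError` at construction time, so an incomplete span fails where it is created rather than surfacing later as a gap in a dashboard.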

3. Cost Tagging 🏷️ Attribution at the Call Level

Tag every LLM call with a cost center before it reaches your billing dashboard:

Agent name — which agent type initiated this run
Trigger source — user-initiated, scheduled, webhook, internal
Customer tier — if serving B2B, attribute cost to the customer being served
Task category — billing, support, summarization, etc.

Without these tags, cost spikes are un-debuggable. With them, you can answer: which agent, for which customer category, triggered by which event type, is responsible for the $800 spike on Tuesday.
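
Once calls carry these tags, answering that question is a group-by. A minimal sketch over hypothetical call records follows; the `cost_by` helper, the record shape, and the dollar amounts are invented for illustration.

```python
from collections import defaultdict


def cost_by(calls, *tag_names):
    """Sum cost_usd over calls, grouped by the given tag names."""
    totals = defaultdict(float)
    for call in calls:
        key = tuple(call["tags"][name] for name in tag_names)
        totals[key] += call["cost_usd"]
    # Largest spender first.
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


calls = [
    {"cost_usd": 0.5, "tags": {"agent": "support-triage", "trigger": "webhook",
                               "customer_tier": "enterprise", "task": "billing"}},
    {"cost_usd": 0.25, "tags": {"agent": "support-triage", "trigger": "webhook",
                                "customer_tier": "enterprise", "task": "support"}},
    {"cost_usd": 0.125, "tags": {"agent": "summarizer", "trigger": "scheduled",
                                 "customer_tier": "free", "task": "summarization"}},
]

by_agent_and_trigger = cost_by(calls, "agent", "trigger")
by_tier = cost_by(calls, "customer_tier")
```

The same function answers each attribution question by changing the tag names, which is why tagging at call time is cheaper than reconstructing attribution from an invoice.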

6. What a Good Trace Looks Like 🎯

%%{init: {'theme': 'base', 'themeVariables': {'primaryColor': '#1e2433', 'primaryTextColor': '#f0f4ff', 'primaryBorderColor': '#3a4460', 'lineColor': '#6b7fa3', 'secondaryColor': '#252d3d', 'tertiaryColor': '#1a2030', 'background': '#161c2d', 'mainBkg': '#1e2433', 'nodeBorder': '#3a4460', 'clusterBkg': '#252d3d', 'titleColor': '#f0f4ff', 'edgeLabelBackground': '#252d3d', 'fontFamily': 'monospace'}}}%%
graph TD
    RUN["🏃 Run: run_9f2a1c
    agent=support-triage
    trigger=webhook
    customer_tier=enterprise"]

    S1["📋 Span: classify_ticket
    tokens_in=312 tokens_out=48
    cost=$0.002 latency=210ms
    result=billing/medium"]

    S2["🔍 Span: search_tickets
    tokens_in=580 tokens_out=210
    cost=$0.009 latency=840ms
    customer_id=acme_corp ✅"]

    S3["🧠 Span: get_customer_history
    tokens_in=1240 tokens_out=180
    cost=$0.054 latency=620ms
    memory_age_seconds=43 ✅"]

    S4["💭 Span: reasoning
    tokens_in=1480 tokens_out=520
    cost=$0.068 latency=1240ms
    decision=draft_from_ticket_#4380"]

    S5["✉️ Span: draft_response
    tokens_in=890 tokens_out=420
    cost=$0.041 latency=980ms
    reference_ticket=#4380 ✅"]

    RUN --> S1
    S1 --> S2
    S2 --> S3
    S3 --> S4
    S4 --> S5

    style RUN fill:#2a1f3d,stroke:#5a3a7a,color:#f0f4ff
    style S1 fill:#1a2d1a,stroke:#3a6a3a,color:#f0f4ff
    style S2 fill:#1a2d1a,stroke:#3a6a3a,color:#f0f4ff
    style S3 fill:#1a2d1a,stroke:#3a6a3a,color:#f0f4ff
    style S4 fill:#1e2433,stroke:#3a4460,color:#f0f4ff
    style S5 fill:#1a2d1a,stroke:#3a6a3a,color:#f0f4ff

Every span carries cost, tokens, and latency. Customer ID is propagated through tool calls. Memory age is explicit. The reasoning step shows which ticket it decided to reference — making the output auditable without re-running the agent.

7. The Takeaway 🚀

💡 Key Message: Agent observability is not optional once you have agents in production. It is the difference between a system you can improve and a system you can only restart and hope for the best.

The three questions every agent deployment needs to answer:

What did the agent decide, and why? — answered by reasoning traces
Where did it go wrong? — answered by span-level inputs and outputs compared against expected behavior
What did it cost, and who should it be attributed to? — answered by cost-tagged spans

The tooling exists. Langfuse, LangSmith, and Phoenix all provide what you need without significant overhead. The gap is instrumentation discipline — making it a first-class concern at build time, not a retrofit after the first $500 mystery invoice.

You can’t debug what you can’t see. Agents make invisible decisions by default. Make them visible.

Tags: ai · agents · observability · agentops