agenttrace
A non-invasive debugging framework for multi-agent LLM systems that reconstructs causal fault graphs from execution logs and ranks root causes in 0.12 seconds — 69x faster than LLM-based diagnosis — with 94.9% Hit@1 across 550 real-world scenarios. Accepted at ICLR 2026 AIWILD Workshop (arXiv:2603.14688).
// DESCRIPTION
The Problem: Multi-Agent Systems Fail Silently in Production
Modern AI applications increasingly rely on multi-agent systems — orchestrated networks of autonomous AI components that collaborate to complete complex tasks. Frameworks like AutoGen, MetaGPT, and CrewAI power everything from automated code review pipelines to AI research assistants. But when these systems fail — and they do fail — diagnosing the root cause is a nightmare. The failure might occur five agents downstream from the actual bug, log files span thousands of lines of interleaved messages, and the standard fix is either to add more LLM calls asking 'what went wrong?' (expensive, slow, circular) or to have a human engineer manually trace execution paths (slow, error-prone, does not scale). There was no systematic, principled way to say: 'Agent X failed because Agent Y sent it a malformed tool response at step 42.' This gap motivated AgentTrace.
Situation: The Debugging Tax on Multi-Agent Development
To quantify the problem, consider a CrewAI pipeline with six agents collaborating on a financial analysis report. Agent A retrieves market data, Agent B validates it, Agent C calls an external API, Agent D aggregates results, Agent E drafts commentary, and Agent F formats the output. If Agent F produces garbage, the naive approach is to replay the entire pipeline with verbose logging, burning 8–12 LLM inference calls just to pinpoint whether the root cause was Agent A's retrieval or Agent C's API call. At ~$0.05 per inference call and 8.3 seconds per LLM-based diagnosis, debugging a 50-scenario test suite becomes a roughly $25, 7-minute exercise — every time a developer changes a prompt or adds a new agent. The productivity cost is enormous.
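The back-of-envelope arithmetic behind those figures can be made explicit. The constants below are the assumed averages quoted above, not measured values:

```python
# Debugging cost for LLM-based diagnosis of a multi-agent pipeline,
# using the approximate figures quoted above.
COST_PER_CALL = 0.05          # USD per LLM inference call (assumed average)
CALLS_PER_SCENARIO = 10       # midpoint of the 8-12 calls per replay
LATENCY_PER_DIAGNOSIS = 8.3   # seconds per LLM-based diagnosis
SCENARIOS = 50                # size of the test suite

dollars = SCENARIOS * CALLS_PER_SCENARIO * COST_PER_CALL
minutes = SCENARIOS * LATENCY_PER_DIAGNOSIS / 60

print(f"${dollars:.0f} and {minutes:.0f} minutes per debugging pass")
# -> $25 and 7 minutes per debugging pass
```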
Task: Build an LLM-Free Root-Cause Localization Engine
The AgentTrace project set out to answer a deceptively simple question: Can we identify which agent caused a multi-agent failure, with high accuracy, without spending a single token of LLM inference at debug time? The answer required solving three sub-problems: (1) capturing execution traces without modifying existing agent code, (2) reconstructing the causal dependency graph from those traces, and (3) ranking the most probable root-cause agents from the graph in milliseconds.
Action: Decorator Instrumentation + Causal Graph Reconstruction
AgentTrace's architecture rests on three pillars.
Pillar 1 — Non-Invasive Tracing via Python Decorators. Rather than requiring developers to instrument their agent code manually, AgentTrace wraps agent entry points and inter-agent communication channels with lightweight Python decorators at import time. Every function call, tool invocation, message send/receive, and exception is recorded with a microsecond-resolution timestamp and a unique span ID that propagates across agent boundaries via thread-local storage and async context variables. This approach is fully transparent: existing AutoGen, MetaGPT, and CrewAI code requires zero modifications.
Instrumentation Layer
|
+-- @trace_agent --> Agent A (Retriever)             [span_id=001]
    +-- @trace_tool  --> Tool Call (API)             [span_id=002, parent=001]
    +-- @trace_agent --> Agent B (Validator)         [span_id=003, parent=001]
        +-- @trace_agent --> Agent C (Formatter)     [span_id=004, parent=003]
            !! EXCEPTION raised here
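A minimal sketch of what such a tracing decorator could look like. The names `trace_agent` and `SPANS` are illustrative, not AgentTrace's actual API; this version handles synchronous calls and stores the current span in a `contextvars` variable, which is the mechanism that lets parentage survive both thread and `await` boundaries:

```python
import contextvars
import functools
import itertools
import time

_span_counter = itertools.count(1)
_current_span = contextvars.ContextVar("current_span", default=None)
SPANS = []  # append-only span log, flushed to disk in a real system

def trace_agent(fn):
    """Wrap an agent entry point so every call emits a span record."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        span_id = next(_span_counter)
        parent = _current_span.get()          # inherit caller's span, if any
        token = _current_span.set(span_id)    # children see us as parent
        record = {"span_id": span_id, "parent": parent,
                  "name": fn.__name__, "start": time.monotonic(),
                  "exception": None}
        try:
            return fn(*args, **kwargs)
        except Exception as exc:
            record["exception"] = type(exc).__name__  # annotate, then re-raise
            raise
        finally:
            record["end"] = time.monotonic()
            SPANS.append(record)
            _current_span.reset(token)
    return wrapper
```

Because the decorator only reads `fn.__name__` and the ambient context variable, it can be applied at import time by patching framework entry points, which is what makes the approach non-invasive.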
Pillar 2 — Causal Graph Reconstruction. After execution, AgentTrace replays the span log and constructs a directed acyclic graph (DAG) where nodes represent agent invocations and edges represent causal dependencies (message flows, shared memory reads, tool call chains). The graph captures temporal ordering, data lineage (which agent's output became which agent's input), and exception propagation paths. A failure node is annotated with its exception type and the full stack trace, while upstream nodes are annotated with output payload hashes — enabling silent corruption detection even without raised exceptions.
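The reconstruction step can be sketched as a single pass over the time-ordered span log, emitting call-chain edges from parent links and data-lineage edges by matching input payload hashes to earlier output hashes. The field names (`parent`, `input_hash`, `output_hash`, `exception`) are assumptions about the span schema, for illustration only:

```python
from collections import defaultdict

def build_causal_graph(spans):
    """Rebuild a causal DAG from a span log.

    Returns (edges, failed): edges[u] is the set of spans causally
    downstream of u; failed lists spans that raised an exception.
    """
    edges = defaultdict(set)
    failed = []
    producer = {}  # payload hash -> span_id that produced it (data lineage)
    for s in sorted(spans, key=lambda s: s["start"]):  # temporal ordering
        if s["parent"] is not None:                    # call-chain edge
            edges[s["parent"]].add(s["span_id"])
        h = s.get("input_hash")
        if h is not None and h in producer:            # data-lineage edge
            edges[producer[h]].add(s["span_id"])
        if s.get("output_hash") is not None:
            producer[s["output_hash"]] = s["span_id"]
        if s.get("exception"):
            failed.append(s["span_id"])
    return dict(edges), failed
```

Because lineage edges only point from earlier producers to later consumers, the time-ordered pass cannot introduce cycles, so the result is a DAG by construction.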
Pillar 3 — Feature-Based Root Cause Ranking. With the causal graph built, AgentTrace computes a ranking score for each node from 14 structural and behavioral features, including depth in the causal chain, out-degree (how many agents depend on this node's output), payload anomaly score (hash deviation from baseline runs), latency z-score, retry count, and exception inheritance depth. These features are combined by a gradient-boosted ranker trained offline on 3,000 labeled failure scenarios spanning the AutoGen, MetaGPT, and CrewAI frameworks. The entire ranking computation runs in-process with no LLM calls, completing in 0.12 seconds.
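To make the feature idea concrete, here is a toy stand-in that computes three of the features named above (out-degree, causal depth, latency z-score) and combines them with hand-picked weights. The real system uses 14 features and a trained gradient-boosted ranker, so both the feature set and the weights here are illustrative:

```python
import statistics

def score_nodes(edges, spans):
    """Rank span ids by a toy suspicion score, highest first."""
    latencies = [s["end"] - s["start"] for s in spans]
    mu = statistics.mean(latencies)
    sd = statistics.pstdev(latencies) or 1.0   # guard against zero variance
    parents = {s["span_id"]: s["parent"] for s in spans}

    def depth(n):                              # distance from a root span
        d = 0
        while parents.get(n) is not None:
            n = parents[n]
            d += 1
        return d

    scores = {}
    for s in spans:
        n = s["span_id"]
        z = (s["end"] - s["start"] - mu) / sd  # latency anomaly
        scores[n] = (0.5 * len(edges.get(n, ()))   # out-degree: blast radius
                     - 0.25 * depth(n)             # shallow nodes more suspect
                     + 1.5 * z)                    # slow spans more suspect
    return sorted(scores, key=scores.get, reverse=True)
```

A trained ranker replaces the hand-picked weights with ones learned from labeled failures, but the input signature is the same: per-node features in, a ranked candidate list out.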
Results: 94.9% Hit@1 at 69x the Speed of LLM Diagnosis
AgentTrace was evaluated on 550 failure scenarios covering 11 distinct failure modes (tool timeout, payload schema mismatch, circular dependency, silent corruption, etc.) across real AutoGen, MetaGPT, and CrewAI traces. The results are compelling:
- Hit@1: 94.9% — the true root-cause agent is ranked #1 in 94.9% of cases
- Hit@3: 98.4% — the true root cause appears in the top-3 in 98.4% of cases
- Latency: 0.12 seconds vs. 8.3 seconds for LLM-based diagnosis (69x speedup)
- Zero LLM calls at debug time — all inference cost eliminated
- Cross-framework generalization — a single model trained on mixed traces works on all three frameworks without framework-specific tuning
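For readers unfamiliar with the metric, Hit@k as used above is simply the fraction of scenarios whose true root-cause agent appears in the top k of the ranked list:

```python
def hit_at_k(predictions, truths, k):
    """predictions: one ranked list of agent ids per scenario;
    truths: the labeled root-cause agent id per scenario."""
    hits = sum(1 for ranked, true in zip(predictions, truths)
               if true in ranked[:k])
    return hits / len(truths)

# Two toy scenarios: the first is a top-1 hit, the second only a top-3 hit.
preds = [["B", "A", "C"], ["A", "C", "B"]]
truths = ["B", "C"]
print(hit_at_k(preds, truths, 1))  # -> 0.5
print(hit_at_k(preds, truths, 3))  # -> 1.0
```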
The D3.js-based interactive visualization layer allows engineers to inspect the causal graph in a browser, drill into any agent's span timeline, and compare the ranked root-cause list against the actual failure annotation — making AgentTrace not just a diagnostic engine but a learning tool for understanding agent behavior. AgentTrace was accepted at the ICLR 2026 AIWILD Workshop (arXiv:2603.14688), representing the first published framework to achieve sub-second, LLM-free root cause localization in production multi-agent systems.
Publication
Accepted at AIWILD Workshop @ ICLR 2026
Read the paper on arXiv: 2603.14688 →
// HIGHLIGHTS
- Accepted at ICLR 2026 AIWILD Workshop (arXiv:2603.14688)
- 94.9% Hit@1 root-cause localization across 550 multi-agent failure scenarios
- 98.4% Hit@3 — true root cause in top-3 predictions in nearly all cases
- 0.12s diagnosis latency vs. 8.3s LLM baseline — 69x speedup, zero LLM calls at debug time
- Non-invasive Python decorator instrumentation — zero code changes to AutoGen, MetaGPT, or CrewAI
- Causal DAG reconstruction from execution logs + 14-feature gradient-boosted ranker
- Validated on real traces across 3 major multi-agent frameworks and 11 failure mode categories
- Interactive D3.js causal graph visualization for human-in-the-loop fault inspection