ROA-LLM
A systematic security study formalizing Persistent Semantic Entities (PSEs) as a new attack surface in tool-augmented LLM agents. All 15 tested models (from 7 vendors) were affected, with attack success rates of 20-96%; name binding dominates with Cohen's d = 3.26; and self-verification reduces but cannot eliminate the threat. Submitted to ICICS 2026.
// DESCRIPTION
The Problem: Tool-Augmented LLM Agents Have Invisible Attack Surfaces
When you give an LLM agent access to tools — web search, file systems, databases, APIs — you fundamentally change its threat model. The classic attack against LLM systems is prompt injection: a malicious string in the user's input that redirects the model's behavior. But prompt injection is well-studied and increasingly mitigated. ROA-LLM investigates a different, less-understood class of vulnerabilities: attacks that exploit the stateful semantic context that accumulates across a tool-augmented agent's execution trace.
Consider an LLM agent tasked with managing a research database. Over the course of a session, it reads papers, assigns tags, tracks author names, and builds an internal semantic model of the domain. This accumulated context — what we call Persistent Semantic Entities (PSEs) — is not just an artifact of the conversation; it is an attack surface. If an adversary can plant a carefully crafted entity (e.g., a paper with a manipulated author name that shadows a known researcher, or a tag that conflicts with a preference the model has been conditioned on) into the agent's semantic context, they can subtly and persistently redirect the agent's future reasoning without ever touching the system prompt or the user's input. This is a category of vulnerability that the community had not systematically studied.
Situation: No Formal Model for Stateful Semantic Attacks
Prior work on LLM security focused on single-turn adversarial prompts, jailbreaks, and prompt injection. Multi-turn and tool-use scenarios were studied anecdotally but lacked a formal framework. There was no systematic taxonomy of how semantic state persists across tool calls, no model for the attack surface this creates, and no empirical data on which models are most vulnerable or which semantic binding mechanisms are most exploitable. ROA-LLM set out to build all three.
Task: Formalize PSEs, Build the Infrastructure, Test 15 Models
The formal contribution is a tuple definition: PSE = (N, T, P), where N is the entity's name binding (how the agent refers to the entity in future reasoning), T is the entity's type schema (what kind of object the agent believes it is), and P is the entity's preference vector (implicit weights the agent applies to this entity in decision-making). Attacks are defined as adversarial modifications to N, T, or P — or combinations thereof — injected through tool outputs.
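The tuple can be sketched as a small Python structure. This is a minimal illustration of the formalism, not code from the project; the class and function names are hypothetical:

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class PSE:
    """Persistent Semantic Entity: the (N, T, P) tuple defined above."""
    name: str                      # N: name binding used in future reasoning
    type_schema: str               # T: the type the agent believes the entity has
    preference: tuple[float, ...]  # P: implicit decision weights

def shadow_name(entity: PSE, spoofed: str) -> PSE:
    """An N-only attack: swap the name binding, leaving T and P untouched."""
    return replace(entity, name=spoofed)

author = PSE(name="A. Researcher", type_schema="author", preference=(0.0,))
attacked = shadow_name(author, "A. Researcher\u00a0")  # non-breaking-space spoof
```

An N-only attack like the one above leaves T and P bitwise identical, which is what makes it hard to spot downstream: the entity still looks like the same kind of object with the same weights.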
To test this framework empirically, a substantial infrastructure was built: approximately 2,500 lines of Rust code (using PyO3 for Python interoperability) implementing a sandboxed tool execution environment with deterministic replay, plus approximately 6,500 lines of Python test harness code covering attack generation, evaluation metrics, and statistical analysis. The infrastructure was designed to test 15 distinct LLM models from 7 vendors under identical conditions.
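The deterministic-replay idea can be sketched in a few lines of Python (the real sandbox is ~2,500 LOC of Rust behind PyO3; `ReplaySandbox` and its interface here are hypothetical). The first execution of a tool call records its output; any later identical call is served from the record, so a full evaluation run can be replayed bit-for-bit:

```python
import hashlib
import json

class ReplaySandbox:
    """Record-and-replay wrapper for tool calls (illustrative sketch)."""

    def __init__(self):
        self._log: dict[str, str] = {}  # call fingerprint -> recorded output

    def call(self, tool: str, args: dict, live_fn) -> str:
        # Canonical fingerprint of (tool, args) so identical calls collide.
        key = hashlib.sha256(
            json.dumps([tool, args], sort_keys=True).encode()
        ).hexdigest()
        if key not in self._log:
            self._log[key] = live_fn(tool, args)  # record on first execution
        return self._log[key]                     # replay thereafter

sandbox = ReplaySandbox()
out1 = sandbox.call("search", {"q": "PSE"}, lambda t, a: "result-1")
out2 = sandbox.call("search", {"q": "PSE"}, lambda t, a: "changed")  # replayed
```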
任务:形式化PSE,构建基础设施,测试15个模型
形式贡献:PSE = (N, T, P)元组——N为名称绑定(智能体在未来推理中如何指称该实体),T为类型模式,P为偏好向量。攻击定义为通过工具输出注入的对N、T或P的对抗性修改。实证基础设施:约2500行Rust代码(PyO3交互)实现沙盒工具执行环境,约6500行Python测试套件覆盖攻击生成、评估指标和统计分析。
Action: 2^3 Factorial Ablation Across All PSE Attack Combinations
The experiment design used a 2^3 factorial ablation: each of N, T, and P was either attacked or not attacked, yielding 8 treatment combinations (including the clean baseline). For each of the 8 combinations, each of the 15 models was evaluated on 200 scenarios, totaling 24,000 evaluation runs (8 × 15 × 200). Scenarios covered research paper management, code repository analysis, financial data processing, and customer service simulation.
PSE Attack Taxonomy (2^3 Factorial)

+----------------+---+---+---+------------------------------+
| Condition      | N | T | P | Attack Description           |
+----------------+---+---+---+------------------------------+
| Clean          | - | - | - | No attack (baseline)         |
| Name only      | Y | - | - | Shadow/spoof entity name     |
| Type only      | - | Y | - | Misclassify entity type      |
| Pref only      | - | - | Y | Inject preference weight     |
| Name+Type      | Y | Y | - | Full identity substitution   |
| Name+Pref      | Y | - | Y | Named preference injection   |
| Type+Pref      | - | Y | Y | Typed preference injection   |
| Full PSE       | Y | Y | Y | Complete entity replacement  |
+----------------+---+---+---+------------------------------+
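The eight conditions fall out of a simple product over the three attack dimensions. A sketch of how a harness might enumerate them (illustrative, not the project's actual code):

```python
from itertools import product

DIMENSIONS = ("N", "T", "P")

# All 2^3 treatment combinations; the all-False mask is the clean baseline.
conditions = [
    {dim: attacked for dim, attacked in zip(DIMENSIONS, mask)}
    for mask in product([False, True], repeat=3)
]

# 8 conditions x 15 models x 200 scenarios = 24,000 evaluation runs
total_runs = len(conditions) * 15 * 200
```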
Results: All 15 Models Affected; Name Binding Dominates With Cohen's d=3.26
The results were sobering:
- All 15 models from all 7 vendors were affected, with attack success rates ranging from 20% (best-defended model on isolated name attacks) to 96% (worst-case full PSE replacement on preference-sensitive tasks)
- Name binding (N) is the dominant attack vector: isolated name attacks produced a Cohen's d = 3.26 effect size — an extremely large effect indicating that entity naming is the primary mechanism of semantic control
- No correlation with model scale: Spearman rho = 0.15 between parameter count and attack resistance — larger models are not meaningfully safer
- Self-verification reduces but cannot eliminate the threat: prompting models to verify their own tool outputs before acting reduced attack success by 79% on average, but failed completely on preference injection attacks where the model accepted the injected preference as a legitimate user signal
- External validation achieves 100% elimination: a separate validation agent checking all tool outputs before they enter the primary agent's context fully eliminated successful attacks across all 15 models — at the cost of approximately 2x latency and 2x token cost
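Cohen's d, the effect size reported for name-binding attacks, is the difference in mean attack-success rate between treated and control conditions divided by their pooled standard deviation. A minimal sketch with synthetic per-model rates (the numbers below are illustrative only, not the paper's data):

```python
import statistics

def cohens_d(treated: list[float], control: list[float]) -> float:
    """Effect size using the pooled (sample) standard deviation."""
    n1, n2 = len(treated), len(control)
    v1, v2 = statistics.variance(treated), statistics.variance(control)
    pooled_sd = (((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)) ** 0.5
    return (statistics.mean(treated) - statistics.mean(control)) / pooled_sd

# Synthetic per-model attack-success rates under name attack vs. clean baseline
name_attacked = [0.90, 0.80, 0.85]
clean_baseline = [0.20, 0.25, 0.30]
d = cohens_d(name_attacked, clean_baseline)
```

For context, d = 0.8 is conventionally considered a "large" effect; the reported d = 3.26 is several times that threshold.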
This work has been submitted to ICICS 2026 (International Conference on Information and Communications Security), contributing the first formal PSE attack taxonomy, the largest multi-vendor empirical evaluation of tool-augmented LLM agent security, and a practical defense comparison.
// HIGHLIGHTS
- Submitted to ICICS 2026 (International Conference on Information and Communications Security)
- Formalized Persistent Semantic Entities PSE = (N, T, P) as a new LLM agent attack surface
- All 15 models from 7 vendors affected — attack success rates 20%–96% across scenarios
- Name binding dominates: Cohen's d = 3.26 — largest effect among all PSE attack dimensions
- No scale correlation: Spearman rho=0.15 — larger models are not meaningfully safer
- Self-verify reduces attack success by 79% but fails completely on preference injection
- External validation agent achieves 100% elimination at ~2x latency/token cost
- Built on ~2,500 LOC Rust (PyO3) sandbox + ~6,500 LOC Python harness — 24,000 evaluation runs total