AgentMesh-STM
The largest empirical study of MARL coordination strategies applied to LLM multi-agent systems: 3,400+ episodes across 7 models, 5 strategies, 4 benchmarks, 10 seeds. Key findings: VDN scales linearly on dense tasks (r=0.99) but hurts MoE architectures (r=-0.80); task coupling dominates all other factors (partial eta^2=0.37).
// DESCRIPTION
The Problem: Which Coordination Strategy Should You Use for LLM Agents?
The field of multi-agent reinforcement learning (MARL) has produced a rich taxonomy of coordination algorithms over the past decade: Value Decomposition Networks (VDN), CommNet, QMIX, COMA, and many others. Each was designed with specific assumptions about agent communication bandwidth, task structure, and reward sharing. Meanwhile, the LLM multi-agent community has been building systems like AutoGen and CrewAI with ad-hoc coordination designs, often borrowing concepts from MARL without systematic validation that those concepts transfer. The core question — which MARL coordination strategy actually works for LLM agents, and under what conditions? — had never been answered empirically at scale. Practitioners were making architectural choices blindly, with no principled guidance.
Situation: A Large-Scale Controlled Factorial Experiment
AgentMesh-STM was designed as the largest controlled factorial experiment in the MARL-to-LLM transfer literature. The experimental matrix was:
- 7 LLM backbones: models ranging from 8B to 1T parameters, spanning both proprietary APIs (OpenAI) and local models via Ollama
- 5 coordination strategies: VDN, CommNet, QMIX, COMA, and a No-Coordination baseline (independent agents)
- 4 benchmark task types: Dense Reward (continuous feedback), Sparse Reward (terminal feedback only), Temporal (sequential dependency), and Deliberative (collaborative reasoning requiring multi-turn discussion)
- 10 random seeds per condition for statistical rigor
- Total: 3,400+ episodes of multi-agent execution, each comprising up to 20 agent-agent interactions
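The factorial grid above multiplies out as follows; a minimal sketch in Python, with placeholder backbone names standing in for the seven actual models (each seeded run can comprise multiple episodes, which is how the runs sum to 3,400+ episodes):

```python
from itertools import product

# Placeholder labels standing in for the study's actual factor levels;
# the backbone names here are hypothetical.
STRATEGIES = ["VDN", "CommNet", "QMIX", "COMA", "None"]   # 5 strategies
MODELS = [f"backbone_{i}" for i in range(7)]              # 7 LLM backbones (8B-1T)
TASKS = ["dense", "sparse", "temporal", "deliberative"]   # 4 task types
SEEDS = range(10)                                         # 10 seeds per condition

# Full factorial grid: every (strategy, model, task, seed) combination.
grid = list(product(STRATEGIES, MODELS, TASKS, SEEDS))

cells = len(STRATEGIES) * len(MODELS) * len(TASKS)
print(cells, len(grid))  # 140 experimental cells, 1400 seeded runs
```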
Task: Establish the First Empirical Atlas of MARL Strategy Transfer
The task was not merely to measure performance, but to build a causal understanding: Which factors — model scale, strategy choice, or task type — dominate performance outcomes? Are there interaction effects (e.g., does a strategy that helps small models hurt large ones)? Can we make actionable recommendations for practitioners choosing coordination designs?
Action: Three-Factor ANOVA + Permutation Tests + Ablation Studies
The statistical analysis employed a three-factor ANOVA with factors: Strategy (5 levels), Model Scale (7 levels), and Task Type (4 levels). All pairwise interactions were modeled. Permutation tests with 10,000 resamples were used to validate significance without distributional assumptions. Effect sizes were reported as partial eta-squared (eta^2) for interpretability. Ablation studies isolated individual coordination components (message aggregation, reward shaping, credit assignment) to identify which mechanisms drive observed effects.
Experimental Matrix (3,400+ episodes)

+-------------+-------+--------+----------+-------+
| Strategy    | Dense | Sparse | Temporal | Delib |
+-------------+-------+--------+----------+-------+
| VDN         | +++   | +      | +        | -     |
| CommNet     | +     | ++     | ++       | +     |
| QMIX        | +     | +      | +++      | --    |
| COMA        | +     | +++    | +        | ++    |
| None        | baseline across all conditions    |
+-------------+-------+--------+----------+-------+
+/- = effect relative to the no-coordination baseline
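The statistical machinery described above can be sketched in miniature. The following is an illustrative one-factor version (the study models three factors plus their pairwise interactions): for a single factor, partial eta-squared reduces to plain eta-squared, and the permutation test shuffles group labels to obtain a p-value with no distributional assumptions. All names here are hypothetical, not the project's API:

```python
import numpy as np

rng = np.random.default_rng(0)

def eta_squared(groups):
    """SS_between / (SS_between + SS_within); for a single factor this
    coincides with partial eta^2, the effect size the study reports."""
    grand = np.mean(np.concatenate(groups))
    ss_between = sum(len(g) * (np.mean(g) - grand) ** 2 for g in groups)
    ss_within = sum(((g - np.mean(g)) ** 2).sum() for g in groups)
    return ss_between / (ss_between + ss_within)

def permutation_p(groups, n_resamples=10_000):
    """Shuffle group labels and count how often the shuffled effect size
    reaches the observed one -- no distributional assumptions needed."""
    observed = eta_squared(groups)
    pooled = np.concatenate(groups)
    sizes = [len(g) for g in groups]
    hits = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)
        shuffled = np.split(pooled, np.cumsum(sizes)[:-1])
        hits += eta_squared(shuffled) >= observed
    return (hits + 1) / (n_resamples + 1)

# Toy stand-in for one factor (e.g. task type) with a genuine mean shift.
groups = [rng.normal(mu, 1.0, 50) for mu in (0.0, 0.5, 1.2)]
print(f"eta^2={eta_squared(groups):.2f}, p={permutation_p(groups, 2000):.4f}")
```

The study used 10,000 resamples; the demo uses 2,000 only to keep the example fast.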
Results: Task Coupling Dominates; VDN Helps Dense But Hurts MoE
The results resolved several longstanding debates and revealed unexpected interactions:
- VDN scales linearly on Dense tasks: Pearson r = 0.99 between model scale and performance improvement under VDN on dense reward tasks
- VDN harms Mixture-of-Experts architectures: r = -0.80, p = 0.017 — statistically significant negative effect, counter to the prevailing assumption
- QMIX delivers +77% on Temporal tasks but -41% on Deliberative tasks — the strategy's temporal credit assignment mechanism backfires in open-ended multi-turn reasoning
- Task coupling is the dominant factor: partial eta^2 = 0.37, the largest effect size in the analysis, exceeding strategy choice (0.21) and model scale (0.14) combined
- No coordination often wins on Deliberative tasks: over-constraining LLM agents with MARL reward structures suppresses the free-form reasoning that makes deliberative tasks tractable
These findings provide the first empirically grounded guidance for practitioners: match coordination strategy to task coupling structure, not model scale or intuition. The full dataset, experiment logs, and analysis code are publicly available on GitHub.
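The guidance above can be condensed into a decision rule. This is an illustrative distillation of the headline results, not code from the repository; the MoE fallback in particular is an assumption about how one might act on the reported negative effect:

```python
# Hypothetical helper distilled from the study's headline results;
# the mapping below is an illustration, not a shipped API.
RECOMMENDED = {
    "dense": "VDN",          # scales linearly with model size (r=0.99)
    "sparse": "COMA",        # strongest gain over baseline on sparse rewards
    "temporal": "QMIX",      # +77% from temporal credit assignment
    "deliberative": "None",  # coordination suppresses free-form reasoning
}

def pick_strategy(task_type: str, is_moe: bool = False) -> str:
    """Match coordination strategy to task coupling, with the MoE caveat."""
    choice = RECOMMENDED[task_type]
    # VDN showed a significant negative effect on MoE backbones
    # (r=-0.80, p=0.017), so fall back to independent agents there.
    if choice == "VDN" and is_moe:
        return "None"
    return choice

print(pick_strategy("temporal"))            # prints "QMIX"
print(pick_strategy("dense", is_moe=True))  # prints "None"
```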
// HIGHLIGHTS
- Largest empirical MARL-to-LLM coordination study: 3,400+ episodes, 7 models (8B–1T), 5 strategies, 4 benchmarks, 10 seeds/condition
- VDN scales linearly on Dense tasks (r=0.99) but harms MoE architectures (r=−0.80, p=0.017)
- QMIX +77% on Temporal tasks, −41% on Deliberative — temporal credit assignment backfires on open-ended reasoning
- Task coupling dominates all factors: partial eta^2=0.37 (vs. strategy 0.21, scale 0.14)
- Three-factor ANOVA with 10,000-resample permutation tests — no distributional assumptions
- No-coordination baseline wins on Deliberative tasks — MARL reward structures suppress LLM free-form reasoning
- First principled empirical atlas for practitioners choosing coordination designs in LLM multi-agent systems