ACTIVE Research

Belief-Dynamics

Proving that standard multi-agent LLM consensus is mathematically broken under adversarial conditions, and proposing Oracle-Injection Trust (OIT) as the fix -- lifting accuracy from 0% to 64-70%.

status ACTIVE
type Research
started 2024-11-01
stack Python Ollama Cloud API TruthfulQA GSM8K kimi-k2:1T deepseek-v3.1:671B

// DESCRIPTION

The Problem: DeGroot Averaging Guarantees False Consensus

Multi-agent LLM systems are widely promoted as a path to more reliable AI reasoning: let several models debate, then take the consensus. The intuition seems sound. The mathematics, however, is damning. When even a small number of adversarial agents are present in a DeGroot-style averaging network, the mean-preservation property of the update rule mathematically guarantees that the aggregate belief can never cross the decision threshold needed to reach a correct answer. No amount of additional debate rounds helps. The group is trapped in a false-consensus attractor from which it cannot escape. This research begins with that proof and then asks: what does it take to break the attractor?
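The attractor can be illustrated with a minimal NumPy sketch. The agent counts, trust weights, and 0.5 decision threshold below are illustrative assumptions, not the paper's actual setup; the adversaries here are modeled as stubborn agents (identity trust rows), one simple way to realize the trap:

```python
import numpy as np

def degroot_step(beliefs, W):
    """One DeGroot round: each agent adopts a trust-weighted average."""
    return W @ beliefs

# 5 honest agents start above the decision threshold (belief > 0.5 = correct);
# 2 adversarial agents are pinned at 0 and never update.
n_honest, n_adv = 5, 2
n = n_honest + n_adv
beliefs = np.array([0.6] * n_honest + [0.0] * n_adv)

# Honest rows average uniformly over everyone; adversarial rows are
# identity rows, so adversaries keep their belief forever.
W = np.full((n, n), 1.0 / n)
for i in range(n_honest, n):
    W[i] = 0.0
    W[i, i] = 1.0

for _ in range(100):
    beliefs = degroot_step(beliefs, W)

# Honest beliefs decay geometrically toward the adversarial anchor:
# no number of extra rounds lets the group cross 0.5 again.
print(beliefs.round(3))
```

Extra debate rounds only tighten the convergence toward the adversarial value, which is the point of the proof: the escape has to come from outside the update rule.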

Approach: Oracle-Injection Trust (OIT)

The proposed solution is conceptually clean: inject a verified oracle belief into the network at strategic intervals. Rather than relying on internal debate to self-correct, OIT sources ground-truth signals from a trusted external oracle and blends them into the multi-agent belief state, breaking the mean-preservation constraint that enables the false-consensus trap. The injection strategy, timing, and oracle selection are all ablated systematically.
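The injection step can be sketched as a convex blend of each agent's belief with the oracle signal. The blend weight `alpha`, the periodic trigger, and the starting beliefs below are illustrative placeholders, not the paper's tuned values:

```python
import numpy as np

def oit_inject(beliefs, oracle_belief, alpha=0.5, mask=None):
    """Blend a verified oracle belief into selected agents' states.

    alpha controls the oracle's weight; mask selects which agents
    receive the injection (all agents by default). The blend is not
    mean-preserving, which is what breaks the consensus trap.
    """
    beliefs = np.asarray(beliefs, dtype=float)
    if mask is None:
        mask = np.ones_like(beliefs, dtype=bool)
    out = beliefs.copy()
    out[mask] = (1 - alpha) * beliefs[mask] + alpha * oracle_belief
    return out

# Periodic injection: every 3 rounds, pull the group back toward truth.
beliefs = np.array([0.2, 0.1, 0.3, 0.0, 0.0])        # trapped below 0.5
for t in range(10):
    beliefs = np.full_like(beliefs, beliefs.mean())  # DeGroot-style round
    if t % 3 == 0:                                   # periodic trigger
        beliefs = oit_inject(beliefs, oracle_belief=1.0, alpha=0.5)
print(beliefs.round(3))
```

Each injection shifts the group mean toward the oracle, so repeated injections ratchet the consensus across the threshold even from a fully trapped starting state.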

Experiments use a 6-model suite running via Ollama Cloud API, spanning the frontier of open-weight models: kimi-k2:1T, deepseek-v3.1:671B, glm-4-0520, glm-4.7 (air), qwen3:235B-A22B, and deepseek-r1:671B. Each model operates as an independent agent with its own belief state. The evaluation harness is TruthfulQA (100 questions selected for adversarial difficulty) combined with GSM8K for arithmetic grounding. A full 6x6 cross-model verification matrix records every pairwise agreement and disagreement, enabling fine-grained analysis of which model families trust each other.
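The verification matrix reduces to pairwise agreement rates over the question set. A toy version with 3 questions (the real harness records all 100 TruthfulQA items; the answer format here is an illustrative assumption):

```python
import numpy as np

MODELS = ["kimi-k2:1T", "deepseek-v3.1:671B", "glm-4-0520",
          "glm-4.7-air", "qwen3:235B-A22B", "deepseek-r1:671B"]

def agreement_matrix(answers):
    """answers: dict mapping model name -> list of answers, one per
    question. Returns a 6x6 matrix of pairwise agreement rates."""
    n = len(MODELS)
    M = np.zeros((n, n))
    for i, a in enumerate(MODELS):
        for j, b in enumerate(MODELS):
            M[i, j] = np.mean([x == y for x, y in
                               zip(answers[a], answers[b])])
    return M

# Toy run: five models agree everywhere, one diverges on question 3.
answers = {m: ["A", "B", "A"] for m in MODELS}
answers["deepseek-r1:671B"] = ["A", "B", "C"]
M = agreement_matrix(answers)
print(M.round(2))
```

Off-diagonal structure in this matrix is what exposes which model families cluster together, feeding the phylogenetic-distance analysis below.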

Ablations systematically vary: (1) injection strategy — cold-start, periodic, threshold-triggered; (2) corruption rate — fraction of adversarial agents, swept from 0% to 80%; (3) agent scaling — from 2 to 55 agents. A key structural hypothesis is also tested: does phylogenetic distance between model families modulate trust and consensus quality?
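A toy version of the corruption-rate sweep can be sketched as follows. All dynamics and parameters here are illustrative (the real ablation runs the LLM agents); in this toy the 50% transition falls out of majority voting by construction, since adversaries never vote correctly:

```python
import numpy as np

def run_trial(n_agents=20, corruption=0.0, rounds=20, alpha=0.5):
    """One simulated debate: adversaries pinned at 0, oracle injects 1.0."""
    n_adv = int(n_agents * corruption)
    beliefs = np.concatenate(
        [np.full(n_agents - n_adv, 0.6), np.zeros(n_adv)]
    )
    honest = slice(0, n_agents - n_adv)
    for _ in range(rounds):
        beliefs[honest] = beliefs.mean()                 # DeGroot averaging
        beliefs[honest] = (1 - alpha) * beliefs[honest] + alpha * 1.0  # OIT
    # Final answer by majority vote: an agent is correct if belief > 0.5.
    return bool(np.sum(beliefs > 0.5) > n_agents / 2)

for c in (0.0, 0.2, 0.4, 0.6, 0.8):
    print(f"corruption={c:.0%}  recovered={run_trial(corruption=c)}")
```

OIT keeps every honest agent above threshold at any corruption level in this toy, so recovery flips exactly when adversaries become the voting majority — a crude analogue of the empirical phase transition reported below.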

Results: From 0% to 64-70% Accuracy, Statistically Significant

Naive multi-agent consensus without OIT achieves 0% accuracy on adversarial TruthfulQA scenarios — the mathematical prediction holds empirically. OIT lifts this to 64-70% (p<0.001, confirmed across all oracle selection strategies). The best-performing oracle is glm-4.7, which alone achieves 100% individual accuracy on the test set. The system exhibits a sharp phase transition at 50% adversarial corruption: below this threshold OIT consistently recovers correct consensus; above it performance degrades steeply. Scaling analysis shows OIT remains effective up to 55 agents. Oracle agents must have a family distance ≥2 from corrupted agents in the model-family taxonomy to provide genuine diversity.

Target venues: UAI 2026 and NeurIPS 2026.

// HIGHLIGHTS

  • Targeting UAI 2026 & NeurIPS 2026 — mathematically proves DeGroot averaging fails under adversarial agents
  • Oracle-Injection Trust (OIT) lifts naive consensus accuracy from 0% to 64-70% (p<0.001)
  • Best oracle (glm-4.7) achieves 100% individual accuracy on TruthfulQA adversarial subset
  • 6-model frontier suite: kimi-k2:1T, deepseek-v3.1:671B, qwen3:235B-A22B and 3 more via Ollama Cloud
  • Full 6x6 cross-model verification matrix across 100 TruthfulQA questions
  • Sharp phase transition at 50% adversarial corruption, scales to 55 agents
  • Family distance ≥2 identified as structural prerequisite for effective oracle diversity
  • Ablations on injection strategy, corruption rate, and agent scaling confirm robustness of OIT