GoGameExplanation
CoNLL 2026 submission: Evidence-Based Dominance theorem proves evidence verification is theoretically optimal for eliminating LLM hallucinations in game commentary, achieving 97.9% fabrication reduction in Go and 97x better precision than SelfCheckGPT.
// DESCRIPTION
The Problem: LLMs Fabricate Facts in Game Commentary
When large language models (LLMs) write commentary for board games like Go, Chess, or Shogi, they routinely invent statistics that never happened. In a comprehensive audit of 500 LLM-generated Go game commentaries, 97% contained at least one fabricated winrate claim — numbers like "White's winning probability jumped to 73%" that no engine ever computed. This is not a minor nuisance: game commentary is used for training materials, broadcast analysis, and player education. Fabricated statistics corrupt all three.
The existing defence — making the LLM check its own outputs for internal consistency (SelfCheckGPT) — fails in this domain. A model that hallucinates consistently will pass its own consistency test. What is needed is external evidence verification: cross-checking every claim against an authoritative oracle (KataGo, Stockfish) that actually computed the number.
Situation & Task: Building a Verifiable Commentary Pipeline
This project was submitted to CoNLL 2026 (Conference on Computational Natural Language Learning), one of the top venues for NLP research. The task was to build a pipeline that (a) identifies verifiable claims inside free-form text, (b) checks each claim against an authoritative oracle, and (c) rewrites failed claims with correct, verified facts — all while preserving the fluency and analytical depth that makes commentary valuable.
The scope spans three game domains (Go, Chess, Shogi) and two table-grounded NLG benchmarks (WikiTableText, LogicNLG), giving 13 experiments in total and enabling cross-domain generalization analysis.
Key Innovation: Evidence-Based Dominance Theorem
The central theoretical contribution is the Evidence-Based Dominance (EBD) theorem, which formally proves that evidence verification dominates self-consistency methods in any "δ-verifiable domain": a domain where an external oracle can confirm or refute claims with error rate at most δ. The theorem provides closed-form bounds showing that as the oracle error δ approaches zero, the precision advantage of evidence verification over consistency checking grows without bound. This is not an empirical observation; it is a provable mathematical result that establishes the theoretical foundation for the entire approach.
The intuition: consistency methods measure how often the model agrees with itself, but a model that is consistently wrong passes perfectly. Evidence verification measures agreement with reality, which is what actually matters for factual commentary. The theorem formalises this intuition and quantifies the gap.
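The shape of the bound can be sketched informally as follows (this is an illustrative formulation under assumed notation, not the paper's exact statement):

```latex
% Illustrative sketch. FP_ev / FP_sc denote the false-acceptance rates
% of evidence verification and self-consistency checking; delta is the
% oracle's error rate; eps_sys is the rate of systematic (self-consistent)
% hallucinations, which consistency checking cannot detect.
\[
  \mathrm{FP}_{\mathrm{ev}} \le \delta,
  \qquad
  \mathrm{FP}_{\mathrm{sc}} \ge \varepsilon_{\mathrm{sys}} > 0 .
\]
% The relative precision advantage therefore scales as
\[
  \frac{\mathrm{FP}_{\mathrm{sc}}}{\mathrm{FP}_{\mathrm{ev}}}
  \;\ge\; \frac{\varepsilon_{\mathrm{sys}}}{\delta}
  \;\xrightarrow[\;\delta \to 0\;]{}\; \infty .
\]
```

In game domains the oracle is an engine that computed the ground-truth value, so δ is effectively negligible, which is what drives the large empirical gap.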
Approach: Four-Stage Verify-and-Rewrite Pipeline
Stage 1 — Claim Extraction: A fine-tuned DeBERTa-v3 model (91.1% F1 on the held-out test set) locates verifiable spans in generated commentary. It distinguishes between verifiable numerical claims (winrates, move evaluations, point differences) and unverifiable analytical statements, avoiding false positives that would over-censor fluent prose.
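The production extractor is the fine-tuned DeBERTa-v3 model described above; as a rough, hypothetical illustration of its interface only, a pattern-based stand-in might look like this (the regex catches only explicit "winning probability ... NN%" phrasings, whereas the real model classifies arbitrary verifiable spans):

```python
import re
from dataclasses import dataclass

@dataclass
class Claim:
    text: str     # the verifiable span as it appears in the commentary
    value: float  # the claimed winrate, as a fraction in [0, 1]
    start: int    # character offset of the span

# Hypothetical stand-in for the fine-tuned DeBERTa-v3 extractor.
WINRATE_PATTERN = re.compile(
    r"(?:winrate|winning probability)[^.%]*?(\d{1,2}(?:\.\d+)?)\s*%",
    re.IGNORECASE,
)

def extract_claims(commentary: str) -> list[Claim]:
    """Locate verifiable winrate claims in free-form commentary."""
    return [
        Claim(text=m.group(0), value=float(m.group(1)) / 100.0, start=m.start())
        for m in WINRATE_PATTERN.finditer(commentary)
    ]
```

The learned model's advantage over any such pattern is precisely the false-positive avoidance mentioned above: it leaves unverifiable analytical statements alone.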
Stage 2 — Oracle Verification: Each extracted claim is sent to the appropriate game engine. KataGo (among the strongest available Go engines, rated above 6000 Elo) computes board evaluations for Go positions; Stockfish handles Chess; a custom Shogi engine handles Shogi. The verification module returns a verdict (verified / refuted / uncertain) and, for refuted claims, the actual engine value.
Stage 3 — Targeted Rewriting: Refuted claims are replaced by prompting Qwen2.5-3B/7B or GPT-4o with a fact-injection template that embeds the verified value while preserving grammatical context. The LLM is constrained to use the provided verified number, sharply reducing the risk of re-hallucination during rewriting.
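A fact-injection prompt of the kind described might be assembled as follows (the exact template wording is an assumption; only the constrained-value idea comes from the pipeline description):

```python
# Assumed template: the verified value is embedded directly in the prompt
# so the rewriting model has no freedom to invent a different number.
FACT_INJECTION_TEMPLATE = """\
Rewrite the sentence below so that it states the VERIFIED value exactly.
Do not introduce any other numbers or claims.

Sentence: {sentence}
Incorrect value: {claimed:.0%}
VERIFIED value (must appear in the rewrite): {verified:.0%}

Rewritten sentence:"""

def build_rewrite_prompt(sentence: str, claimed: float, verified: float) -> str:
    """Embed the engine-verified value in a constrained rewriting prompt."""
    return FACT_INJECTION_TEMPLATE.format(
        sentence=sentence, claimed=claimed, verified=verified
    )
```

The resulting string is what gets sent to Qwen2.5 or GPT-4o; a post-check that the verified number actually appears in the model's output closes the loop.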
Stage 4 — Quality Preservation: BLEU, METEOR, and BERTScore metrics confirm that rewritten commentary maintains linguistic quality. The pipeline preserves more than 95% of the original BLEU score even after fixing errors, showing that factual correction and fluency need not be in conflict.
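The >95% figure is a retention ratio: the rewritten text's score divided by the original text's score against the same reference. As a toy illustration of that check, using clipped unigram precision as a crude stand-in for full BLEU:

```python
from collections import Counter

def unigram_precision(candidate: str, reference: str) -> float:
    """Clipped unigram precision: a crude stand-in for full n-gram BLEU."""
    cand = candidate.lower().split()
    if not cand:
        return 0.0
    ref_counts = Counter(reference.lower().split())
    matched = sum(min(c, ref_counts[w]) for w, c in Counter(cand).items())
    return matched / len(cand)

def bleu_retention(original: str, rewritten: str, reference: str) -> float:
    """Ratio of the rewritten text's score to the original text's score."""
    base = unigram_precision(original, reference)
    return unigram_precision(rewritten, reference) / base if base else 0.0
```

Because targeted rewriting changes only the refuted number and its immediate context, the retention ratio stays near (or, as here, above) 1.0 rather than collapsing the way a full regeneration would.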
Results: Domain-Spanning Fabrication Elimination
Across the 13 experiments, the pipeline achieves consistent, massive reductions in factual error rate:
- Go: 97.9% reduction in fabricated winrate claims (from 97% contamination to 2.1%)
- Shogi: 99.1% reduction, the best result across all game domains
- Chess: 94.3% reduction
- LogicNLG: 95.0% reduction
- WikiTableText: 91.2% reduction
Against the strongest baseline, SelfCheckGPT, the pipeline achieves 97x better precision at 1/6 the compute cost. SelfCheckGPT requires 5–6 model samples per claim for consistency estimation; the EBD pipeline requires a single engine call per claim, which is far cheaper in game domains.
Key negative finding: Iterative rewriting (applying the pipeline multiple rounds) degrades LogicNLG BLEU by 8–12 points per iteration despite improving factual accuracy. This reveals a fundamental tension between repeated targeted rewriting and coherence for table-grounded NLG, and is reported as a cautionary result for practitioners.
Impact and Future Work
The EBD theorem generalizes beyond board games to any domain where an external verifier exists: financial report checking (against SEC filings), medical claim verification (against clinical databases), scientific fact-checking (against experimental records). The pipeline is designed as a modular framework so that domain-specific verifiers can be swapped in with minimal engineering effort. The CoNLL 2026 submission represents the first theoretically grounded proof that evidence-based verification is the correct approach for factual NLG in verifiable domains.
// HIGHLIGHTS
- CoNLL 2026 submission — Evidence-Based Dominance theorem mathematically proves evidence verification dominates consistency methods in δ-verifiable domains
- 97.9% fabrication reduction in Go commentary; 99.1% in Shogi; 95.0% in LogicNLG across 13 experiments
- 97x better precision than SelfCheckGPT at 1/6 compute cost
- DeBERTa-v3 claim extractor achieves 91.1% F1 on held-out game commentary test set
- Cross-domain validation across Go, Chess, Shogi, WikiTableText, and LogicNLG
- Critical negative result: iterative rewriting degrades LogicNLG BLEU 8–12 pts/round
- KataGo (rated above 6000 Elo) and Stockfish integration for authoritative game-state verification
- Qwen2.5-3B/7B and GPT-4o constrained fact-injection rewrites maintain >95% original BLEU