ProofCraft
Systematic study of LLM-assisted Lean 4 theorem proving: 28.3% pass@1 vs SOTA 99.6%, with the honest finding that model scale gives zero benefit -- but agentic frameworks add +41%.
// DESCRIPTION
The Problem: Can General-Purpose LLMs Help With Lean 4 Theorem Proving?
Formal mathematics is having a moment. Lean 4, an interactive theorem prover and programming language, has demonstrated that machine-checkable mathematical proofs are practical tools, as exemplified by the Lean community's verification of landmark results. The natural question for the LLM era: can off-the-shelf language models — GPT-4, Claude, and their contemporaries — meaningfully accelerate Lean 4 proof search without any domain-specific fine-tuning? Existing literature on neural theorem proving relies heavily on models fine-tuned on proof corpora (Hypertree Proof Search, Lean Copilot, Seed-Prover). ProofCraft asks what happens when you take a general-purpose LLM and combine it with systematic search strategies and a multi-agent framework — and gives honest, reproducible answers, including a critical negative finding.
Approach: 17 Systematic Experiments, 5 Search Strategies, Multi-Agent Framework
The evaluation benchmark is miniF2F, a curated set of 244 mathematical competition problems formalized in Lean 4, spanning AMC, AIME, and IMO difficulty levels.
Five search strategies are implemented and compared: BFS, DFS, Best-First Search (LLM log-probability scoring), Monte Carlo Tree Search (MCTS), and Beam Search. All strategies interface with LeanDojo to execute Lean 4 tactics and receive success/failure feedback.
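The best-first strategy can be sketched as a priority queue ordered by cumulative LLM log-probability. A minimal sketch, assuming hypothetical `propose` (LLM tactic suggestions with log-probs) and `run_tactic` (a stand-in for the LeanDojo tactic-execution call) interfaces; this is not the project's actual implementation:

```python
import heapq
from typing import Callable, Iterable, Optional

def best_first_search(
    init_state: str,
    propose: Callable[[str], Iterable[tuple[str, float]]],  # state -> (tactic, logprob) pairs
    run_tactic: Callable[[str, str], Optional[str]],        # (state, tactic) -> next state, "done", or None on failure
    max_expansions: int = 100,
) -> Optional[list[str]]:
    """Expand the highest-scoring proof state first; score = cumulative tactic log-probability."""
    # Python's heapq is a min-heap, so scores are negated to pop the best node first.
    frontier = [(0.0, init_state, [])]
    seen = {init_state}
    for _ in range(max_expansions):
        if not frontier:
            return None
        neg_score, state, path = heapq.heappop(frontier)
        for tactic, logprob in propose(state):
            nxt = run_tactic(state, tactic)
            if nxt is None:          # tactic failed to apply; discard this branch
                continue
            if nxt == "done":        # proof complete: return the tactic sequence
                return path + [tactic]
            if nxt not in seen:      # avoid re-expanding previously visited proof states
                seen.add(nxt)
                heapq.heappush(frontier, (neg_score - logprob, nxt, path + [tactic]))
    return None
```

BFS, DFS, and beam search differ only in how the frontier is ordered and truncated, so the same interface covers all five strategies.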
A multi-agent framework deploys specialized agents: a Proposer generates candidate tactics, a Verifier checks them against Lean, and a Critic diagnoses why failed attempts went wrong and suggests corrections. LLM response caching achieves a 70-90% API cost reduction across the 17 experiments. A Gradio web UI makes the system accessible for interactive theorem-proving sessions.
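The caching layer can be sketched as a wrapper keyed on a hash of the full request; `call_llm` below is a hypothetical stand-in for the actual API call, and the in-memory dict is a simplification (a persistent cache would write to disk):

```python
import hashlib
import json
from typing import Callable

def make_cached_llm(call_llm: Callable[[str, str], str]) -> Callable[[str, str], str]:
    """Wrap an LLM call with a cache keyed on a hash of (model, prompt).

    Identical requests are answered from the cache, so repeated search
    expansions and re-run experiments avoid duplicate API calls.
    """
    cache: dict[str, str] = {}

    def cached(model: str, prompt: str) -> str:
        key = hashlib.sha256(json.dumps([model, prompt]).encode()).hexdigest()
        if key not in cache:
            cache[key] = call_llm(model, prompt)
        return cache[key]

    cached.cache = cache  # exposed for persistence / hit-rate accounting
    return cached
```

Proof search re-visits many identical (model, prompt) pairs across strategies and retries, which is why a cache of this shape can eliminate most API traffic.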
Results: Honest Findings, Including a Critical Negative Result
Overall performance: 28.3% pass@1 on miniF2F. For context, the current SOTA is Seed-Prover at 99.6% — a domain-specifically fine-tuned system. The gap is large and documented honestly.
Critical finding #1 — Tactic distribution: 97% of all successful proofs use only three tactics: rfl, linarith, and omega. General-purpose LLMs overwhelmingly default to these basic tactics and rarely produce complex tactic sequences needed for harder problems — a fundamental capability limitation, not a prompting problem.
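For context, each of these three tactics closes only a narrow class of goals, as in the illustrative Mathlib-based Lean 4 examples below (standard tactic behavior, not examples from the paper):

```lean
import Mathlib

example : 2 + 2 = 4 := by rfl                            -- definitional equality
example (x : ℤ) (h : x > 2) : x + 1 > 3 := by linarith   -- linear arithmetic over ordered rings
example (n : ℕ) : n % 2 = 0 ∨ n % 2 = 1 := by omega      -- linear integer/natural arithmetic
```

Harder competition problems typically need multi-step tactic sequences (induction, case analysis, rewriting chains), which is exactly what the models rarely produce.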
Critical finding #2 — Model scale provides zero benefit: GPT-4 does not outperform GPT-3.5 to a statistically significant degree on this benchmark. Theorem proving appears to require specialized training, not merely more parameters.
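A claim of this kind can be checked with a standard two-proportion z-test on pass@1 counts; the counts in the example call are hypothetical placeholders, not the paper's data:

```python
from math import sqrt, erf

def two_proportion_z(success_a: int, n_a: int, success_b: int, n_b: int) -> tuple[float, float]:
    """Two-sided two-proportion z-test: are two pass rates significantly different?"""
    p_a, p_b = success_a / n_a, success_b / n_b
    p_pool = (success_a + success_b) / (n_a + n_b)          # pooled rate under H0 (no difference)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))  # standard error of the difference
    z = (p_a - p_b) / se
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))   # two-sided normal tail
    return z, p_value

# Hypothetical counts on 244 miniF2F problems (NOT the paper's numbers):
z, p = two_proportion_z(69, 244, 66, 244)
```

With rates this close on 244 problems, the test cannot reject equality, which is the shape of the negative result reported here.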
Positive finding — Agentic framework works: The multi-agent Proposer/Verifier/Critic loop achieves a +41% relative improvement over single-agent prompting.
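The loop behind this result can be sketched as follows, with `propose`, `verify`, and `critique` as hypothetical stand-ins for the three agents (the Verifier ultimately delegates to Lean); a sketch of the control flow, not the project's implementation:

```python
from typing import Callable, Optional

def agent_loop(
    goal: str,
    propose: Callable[[str, str], str],           # (goal, feedback) -> candidate tactic
    verify: Callable[[str, str], Optional[str]],  # (goal, tactic) -> error message, or None on success
    critique: Callable[[str, str, str], str],     # (goal, tactic, error) -> feedback for next round
    max_rounds: int = 5,
) -> Optional[str]:
    """Proposer/Verifier/Critic loop: each failure is explained and fed back to the Proposer."""
    feedback = ""
    for _ in range(max_rounds):
        tactic = propose(goal, feedback)
        error = verify(goal, tactic)
        if error is None:                             # Lean accepted the tactic
            return tactic
        feedback = critique(goal, tactic, error)      # Critic turns the error into guidance
    return None
```

The gain over single-agent prompting comes from this feedback edge: the next proposal is conditioned on a diagnosis of the previous failure rather than being an independent resample.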
Target venue: ICML 2026.
// HIGHLIGHTS
- Targeting ICML 2026 — 17 systematic experiments on LLM-assisted Lean 4 theorem proving
- 28.3% pass@1 on miniF2F (244 problems) vs Seed-Prover SOTA 99.6% — documented honestly
- Honest negative: model scale (GPT-3.5 vs GPT-4) provides zero benefit for theorem proving
- 97% of successful proofs use only rfl/linarith/omega — fundamental LLM capability gap
- Multi-agent Proposer/Verifier/Critic achieves +41% relative improvement over single-agent
- 5 search strategies: BFS, DFS, Best-First, MCTS, Beam Search — all via LeanDojo
- LLM response caching achieves 70-90% API cost reduction across 17 experiments
- Gradio web UI for interactive sessions; evaluated on AMC/AIME/IMO-level problems