ACTIVE Tools

ProofCraft

Systematic study of LLM-assisted Lean 4 theorem proving: 28.3% pass@1 on miniF2F vs the 99.6% SOTA, with the honest finding that model scale gives zero benefit -- while an agentic framework adds a +41% relative improvement.

status ACTIVE
type Tools
started 2024-06-01
stack Python, Lean 4, LeanDojo, GPT-4, GPT-3.5, Claude, Gradio

// DESCRIPTION

The Problem: Can General-Purpose LLMs Help With Lean 4 Theorem Proving?

Formal mathematics is having a moment. Lean 4, an interactive theorem prover and programming language, has demonstrated that machine-checkable mathematical proofs are practical tools, as exemplified by the Lean community's verification of landmark results. The natural question for the LLM era: can off-the-shelf language models — GPT-4, Claude, and their contemporaries — meaningfully accelerate Lean 4 proof search without any domain-specific fine-tuning? Existing literature on neural theorem proving relies heavily on models fine-tuned on proof corpora (Hypertree Proof Search, Lean Copilot, Seed-Prover). ProofCraft asks what happens when you take a general-purpose LLM and combine it with systematic search strategies and a multi-agent framework — and gives honest, reproducible answers, including a critical negative finding.

Approach: 17 Systematic Experiments, 5 Search Strategies, Multi-Agent Framework

The evaluation benchmark is miniF2F, a curated set of 244 mathematical competition problems formalized in Lean 4, spanning AMC, AIME, and IMO difficulty levels.

Five search strategies are implemented and compared: BFS, DFS, Best-First Search (LLM log-probability scoring), Monte Carlo Tree Search (MCTS), and Beam Search. All strategies interface with LeanDojo to execute Lean 4 tactics and receive success/failure feedback.
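As a concrete illustration of one of these strategies, the following is a minimal best-first search sketch in the spirit described above. The function names (`propose`, `apply_tactic`, `is_solved`) are hypothetical stand-ins: in the actual system, proposals come from the LLM with log-probability scores and tactic application goes through LeanDojo, but the priority-queue skeleton is the same.

```python
import heapq

def best_first_search(init_state, propose, apply_tactic, is_solved, max_steps=100):
    """Best-first proof search over tactic states (sketch).

    propose(state) -> iterable of (tactic, log_prob) candidates
                      (in the real system, scored by the LLM)
    apply_tactic(state, tactic) -> next state, or None if Lean rejects it
    is_solved(state) -> True when no goals remain

    States with higher cumulative log-probability are expanded first.
    """
    counter = 0  # tie-breaker so heapq never compares states directly
    frontier = [(0.0, counter, init_state, [])]
    seen = set()
    for _ in range(max_steps):
        if not frontier:
            break
        neg_score, _, state, proof = heapq.heappop(frontier)
        if is_solved(state):
            return proof  # list of tactics forming the proof
        if state in seen:
            continue
        seen.add(state)
        for tactic, log_prob in propose(state):
            nxt = apply_tactic(state, tactic)
            if nxt is None:  # tactic failed in Lean; prune this branch
                continue
            counter += 1
            heapq.heappush(
                frontier,
                (neg_score - log_prob, counter, nxt, proof + [tactic]),
            )
    return None  # search budget exhausted
```

Swapping the priority function turns the same skeleton into BFS (depth), DFS (negated depth), or a beam by truncating the frontier; MCTS replaces the queue with rollout statistics.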

A multi-agent framework deploys specialized agents: a Proposer generates candidate tactics, a Verifier checks them against Lean, and a Critic identifies why failed attempts went wrong. LLM response caching achieves 70-90% API cost reduction across the 17 experiments. A Gradio web UI makes the system accessible for interactive theorem proving sessions.
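The caching mechanism behind those savings can be sketched as a content-addressed store keyed on the full request, so identical (model, prompt, temperature) triples across the 17 experiments hit the cache instead of the API. This is a minimal in-memory sketch with hypothetical names, not the project's actual implementation, which may persist to disk.

```python
import hashlib
import json

class ResponseCache:
    """Content-addressed cache for LLM completions (illustrative sketch)."""

    def __init__(self):
        self._store = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(model, prompt, temperature):
        # Canonical JSON so semantically identical requests hash identically.
        payload = json.dumps(
            {"model": model, "prompt": prompt, "temperature": temperature},
            sort_keys=True,
        )
        return hashlib.sha256(payload.encode()).hexdigest()

    def complete(self, call_api, model, prompt, temperature=0.0):
        """Return a cached response, or call the API and store the result."""
        key = self._key(model, prompt, temperature)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        response = call_api(model, prompt, temperature)
        self._store[key] = response
        return response
```

Because proof search revisits the same goal states constantly, and experiments are re-run with varied search parameters, the hit rate is high enough to account for the reported 70-90% cost reduction.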

Results: Honest Findings, Including a Critical Negative Result

Overall performance: 28.3% pass@1 on miniF2F. For context, the current SOTA is Seed-Prover at 99.6% — a domain-specifically fine-tuned system. The gap is large and documented honestly.

Critical finding #1 — Tactic distribution: 97% of all successful proofs use only three tactics: rfl, linarith, and omega. General-purpose LLMs overwhelmingly default to these basic tactics and rarely produce complex tactic sequences needed for harder problems — a fundamental capability limitation, not a prompting problem.
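To make the finding concrete, these three tactics each close a goal in a single step, which is exactly the shape of problem the models can solve (the `linarith` example assumes Mathlib is imported; the exact goals below are illustrative, not drawn from miniF2F):

```lean
-- `rfl`: goals true by definitional computation
example : 2 + 2 = 4 := by rfl

-- `linarith` (Mathlib): linear arithmetic over ordered fields
example (x : ℝ) (h : x > 3) : x > 1 := by linarith

-- `omega`: linear integer/natural-number arithmetic
example (n : ℕ) : n % 2 < 2 := by omega
```

Anything requiring induction, case analysis, or lemma application needs a multi-step tactic sequence, which is precisely what the models rarely produce.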

Critical finding #2 — Model scale provides zero benefit: GPT-4 does not outperform GPT-3.5 in a statistically significant way. Theorem proving appears to require specialized training, not just more parameters.

Positive finding — Agentic framework works: The multi-agent Proposer/Verifier/Critic loop achieves a +41% relative improvement over single-agent prompting.
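The core of that loop can be sketched in a few lines. The agent interfaces here (`proposer`, `verifier`, `critic`) are hypothetical stand-ins for the system's prompted LLM roles and its LeanDojo-backed checker; the point is that each round is conditioned on a diagnosis of why the previous attempt failed.

```python
def agent_loop(goal, proposer, verifier, critic, max_rounds=5):
    """Proposer/Verifier/Critic round-trip (illustrative sketch).

    proposer(goal, feedback)    -> candidate tactic script
    verifier(goal, script)      -> (success, error_message_or_None)
    critic(goal, script, error) -> natural-language diagnosis of the failure
    """
    feedback = None  # first proposal is unconditioned
    for _ in range(max_rounds):
        script = proposer(goal, feedback)
        ok, error = verifier(goal, script)
        if ok:
            return script  # verified proof script
        # Feed the critic's diagnosis into the next proposal round.
        feedback = critic(goal, script, error)
    return None  # no verified proof within the round budget
```

The +41% gain suggests the Critic's error analysis recovers attempts that single-shot prompting simply discards.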

Target venue: ICML 2026.

// HIGHLIGHTS

  • Targeting ICML 2026 — 17 systematic experiments on LLM-assisted Lean 4 theorem proving
  • 28.3% pass@1 on miniF2F (244 problems) vs Seed-Prover SOTA 99.6% — documented honestly
  • Honest negative: model scale (GPT-3.5 vs GPT-4) provides zero benefit for theorem proving
  • 97% of successful proofs use only rfl/linarith/omega — fundamental LLM capability gap
  • Multi-agent Proposer/Verifier/Critic achieves +41% relative improvement over single-agent
  • 5 search strategies: BFS, DFS, Best-First, MCTS, Beam Search — all via LeanDojo
  • LLM response caching achieves 70-90% API cost reduction across 17 experiments
  • Gradio web UI for interactive sessions; evaluated on AMC/AIME/IMO-level problems