Large language models (LLMs) exhibit impressive mathematical reasoning capabilities, but their solutions often contain errors that cannot be automatically verified. Formal theorem proving systems such as Lean 4 offer automated verification with complete accuracy, motivating recent efforts to build specialized prover LLMs that generate verifiable proofs in formal languages. However, a significant gap remains: current prover LLMs solve substantially fewer problems than general-purpose LLMs that operate in natural language. We present Hilbert, an agentic framework that bridges this gap by combining the complementary strengths of informal reasoning and formal verification. Our system coordinates four components: an informal LLM that excels at mathematical reasoning, a specialized prover LLM optimized for Lean 4 tactics, a formal verifier, and a semantic theorem retriever. Given a problem that the prover cannot solve, Hilbert employs recursive decomposition to split the problem into subgoals, which it solves with the prover or the reasoner LLM. It uses verifier feedback to refine incorrect proofs when necessary. Experimental results show that Hilbert substantially outperforms existing approaches on key benchmarks, achieving 99.2% on miniF2F, 6.6% above the best published method. Hilbert also achieves the best known result on PutnamBench: it solves 462/660 problems (70.0%), surpassing proprietary approaches such as SeedProver (50.4%) and achieving a 422% improvement over the best publicly available baseline. Hilbert therefore effectively narrows the gap between informal reasoning and formal proof generation.
- †UC San Diego
- ** Work done at Apple
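
The control flow summarized in the abstract (attempt a proof, fall back to recursive decomposition into subgoals, and use verifier feedback to refine failed attempts) can be illustrated with a minimal Python sketch. This is not the Hilbert implementation: every name here (`solve`, `attempt_with_feedback`, the `toy_*` functions, `max_depth`, `max_rounds`) is a hypothetical stand-in, the models and the verifier are replaced by toy string functions, the semantic theorem retriever is omitted, and the order in which the prover and the reasoner are tried is an assumption.

```python
from typing import Callable, List, Optional

# Hypothetical interfaces (assumptions, not the paper's API):
#   Model:     (goal, feedback) -> candidate proof, or None if it gives up
#   Verify:    (goal, proof)    -> None if accepted, else an error message
#   Decompose: goal             -> list of subgoals (empty if it cannot split)
Model = Callable[[str, Optional[str]], Optional[str]]
Verify = Callable[[str, str], Optional[str]]
Decompose = Callable[[str], List[str]]


def attempt_with_feedback(goal: str, model: Model, verify: Verify,
                          max_rounds: int = 2) -> Optional[str]:
    """Ask one model for a proof, feeding verifier errors back to it."""
    feedback: Optional[str] = None
    for _ in range(max_rounds):
        proof = model(goal, feedback)
        if proof is None:
            return None
        feedback = verify(goal, proof)
        if feedback is None:  # the verifier accepted the proof
            return proof
    return None


def solve(goal: str, prover: Model, reasoner: Model, verify: Verify,
          decompose: Decompose, depth: int = 0,
          max_depth: int = 3) -> Optional[str]:
    """Try a direct proof first, then recurse on subgoals."""
    for model in (prover, reasoner):
        proof = attempt_with_feedback(goal, model, verify)
        if proof is not None:
            return proof
    if depth >= max_depth:
        return None
    subgoals = decompose(goal)
    if not subgoals:
        return None
    parts = []
    for sub in subgoals:
        sub_proof = solve(sub, prover, reasoner, verify, decompose,
                          depth + 1, max_depth)
        if sub_proof is None:
            return None  # one unsolved subgoal fails the whole goal
        parts.append(sub_proof)
    # Simplification: sub-proofs are concatenated without re-verification.
    return "\n".join(parts)


# Toy stand-ins so the sketch runs end to end.
def toy_prover(goal: str, feedback: Optional[str]) -> Optional[str]:
    return None if " and " in goal else f"proof({goal})"


def toy_reasoner(goal: str, feedback: Optional[str]) -> Optional[str]:
    return None


def toy_verify(goal: str, proof: str) -> Optional[str]:
    return None if proof == f"proof({goal})" else "proof does not match goal"


def toy_decompose(goal: str) -> List[str]:
    return goal.split(" and ") if " and " in goal else []


if __name__ == "__main__":
    print(solve("a and b", toy_prover, toy_reasoner, toy_verify, toy_decompose))
```

In this toy setting the conjunction `a and b` cannot be proved directly, so it is split into `a` and `b`, each of which the toy prover closes on its own. In a real pipeline the verifier would be Lean 4 and the assembled proof would itself need to be checked.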
