A model that proves itself right

Machine Learning


How can you trust the accuracy of a trained model on a particular input of interest? Model accuracy is typically measured as an average over a distribution of inputs, so it offers no guarantee for any fixed input. This paper proposes a theoretically grounded solution: train a self-proving model that proves the correctness of its output to a verification algorithm V via an interactive proof. A self-proving model is one that, on inputs sampled from a given distribution, produces a correct output with high probability and successfully proves its correctness to V. The soundness property of V ensures that, for every input, no model can convince V to accept an erroneous output. A self-proving model therefore proves most of its outputs correct, while any incorrect output (of any model) is detected by V. The authors devise and analyze two general methods for learning self-proving models. One is Transcript Learning (TL), which relies on access to transcripts of accepted interactions; the other is Reinforcement Learning from Verifier Feedback (RLVF), which trains the model by emulating interactions with the verifier.
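
To make the idea concrete, here is a minimal illustrative sketch (not the paper's implementation) using a toy task: the model must output gcd(a, b) and prove it by supplying Bezout coefficients (u, v) with u*a + v*b = g. The verifier V accepts only if g divides both inputs and the identity holds, which makes V sound: no prover can get an incorrect gcd accepted. The `rlvf_reward` function shows the kind of signal RLVF could use (the verifier's accept bit); TL would instead fine-tune on (input, output, proof) transcripts that V accepts. All names here are hypothetical.

```python
# Illustrative sketch only: a toy "self-proving" setup for gcd.
# The proof is a pair of Bezout coefficients (u, v) with u*a + v*b == g.
import random


def verifier(a, b, g, proof):
    """V: accept the claimed gcd g iff it divides both inputs and
    the Bezout identity u*a + v*b == g holds (sound for this task)."""
    u, v = proof
    divides_both = g > 0 and a % g == 0 and b % g == 0
    bezout_holds = u * a + v * b == g
    return divides_both and bezout_holds  # accept / reject


def honest_prover(a, b):
    """A model that is self-proving on this task: extended Euclid
    returns the gcd together with a certificate that V will accept."""
    old_r, r = a, b
    old_u, u = 1, 0
    old_v, v = 0, 1
    while r != 0:
        q = old_r // r
        old_r, r = r, old_r - q * r
        old_u, u = u, old_u - q * u
        old_v, v = v, old_v - q * v
    return old_r, (old_u, old_v)  # output, proof


def rlvf_reward(model, a, b):
    """Reward in the spirit of RLVF: 1 if V accepts the model's
    output-and-proof interaction, 0 otherwise (hypothetical helper)."""
    g, proof = model(a, b)
    return 1.0 if verifier(a, b, g, proof) else 0.0


if __name__ == "__main__":
    a, b = random.randint(1, 10**6), random.randint(1, 10**6)
    g, proof = honest_prover(a, b)
    print(a, b, g, verifier(a, b, g, proof))  # correct output: accepted
    print(verifier(a, b, g + 1, proof))       # wrong output: rejected
```

In this toy protocol soundness follows from arithmetic: any g that divides both a and b is at most gcd(a, b), and any g expressible as u*a + v*b is a multiple of gcd(a, b), so V only ever accepts the true gcd, mirroring the guarantee the paper asks of its verifier.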


