New system to verify machine learning without data disclosure

Machine Learning


Researchers are increasingly focused on ensuring the integrity of machine learning models, especially as their use expands into sensitive applications. Nikolas Melissaris from CNRS, together with Jiayi Xu, Antigoni Polychroniadou, Akira Takahashi, and Chenkai Weng from JPMorgan AI Research, will present ZKBoost, a new zero-knowledge proof of training (zkPoT) protocol designed specifically for XGBoost models. This work represents a major advance: it provides the first cryptographic guarantee of correct XGBoost training on a committed dataset without disclosing either the data itself or the model parameters. The authors demonstrate practical zkPoT on a real-world dataset through a fixed-point XGBoost implementation, a generic zkPoT template, and a Vector Oblivious Linear Evaluation (VOLE)-based instantiation, while maintaining accuracy within 1% of standard XGBoost.

Validating XGBoost model training with zero-knowledge proofs preserves data privacy and enables secure, trustworthy machine learning applications

Researchers have developed ZKBoost, the first zero-knowledge proof-of-training protocol for XGBoost, addressing the critical need for cryptographic guarantees of model integrity in sensitive applications. As machine learning models become increasingly widespread, ensuring their trustworthiness and accountability is paramount, and this work provides a way to verify correct training without revealing private data or model parameters.
The protocol proves that a model was genuinely obtained by training on a committed dataset with specified hyperparameters, ruling out malicious shortcuts such as handcrafted models and illicit data manipulation. ZKBoost allows a model provider to convince a verifier of this claim without disclosing anything beyond the fact that training was carried out correctly.

ZKBoost’s key innovation is a fixed-point implementation of XGBoost designed to be compatible with the arithmetic circuits required by zero-knowledge proof systems. Standard XGBoost relies on floating-point arithmetic, which poses challenges for cryptographic verification; the new implementation instead uses fixed-point arithmetic, which is deterministic and operates at a bounded precision.

Empirical results show that this fixed-point version maintains accuracy within 1% of standard floating-point XGBoost, an important achievement for practical applications. This compatibility with arithmetic circuits enables efficient instantiation of zero-knowledge proofs, paving the way for reliable machine learning services and distributed models.
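To make the idea concrete, here is a minimal sketch of fixed-point arithmetic with a fixed number of fractional bits. The 16-bit precision and all function names are illustrative assumptions, not the paper's actual parameters or API.

```python
# Fixed-point sketch: reals are represented as integers scaled by 2**F.
# Every operation is deterministic integer arithmetic, which is what makes
# it expressible as an arithmetic circuit for a zero-knowledge proof.
F = 16            # fractional bits (illustrative choice)
SCALE = 1 << F

def encode(x: float) -> int:
    """Map a real value to its fixed-point integer representation."""
    return round(x * SCALE)

def decode(x: int) -> float:
    """Map a fixed-point integer back to a real value."""
    return x / SCALE

def fxp_mul(a: int, b: int) -> int:
    """Multiply two fixed-point values, truncating back to F fractional bits."""
    return (a * b) >> F

def fxp_div(a: int, b: int) -> int:
    """Divide with a pre-scaled numerator so the quotient keeps F fractional bits."""
    return (a << F) // b
```

For example, `decode(fxp_mul(encode(1.5), encode(2.0)))` recovers `3.0` exactly; the bounded precision is what the within-1% accuracy result refers to.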

Additionally, the researchers developed CertXGB, a certification algorithm, abstracted as an arithmetic circuit, that efficiently verifies a model’s provenance. The algorithm validates each tree in the XGBoost ensemble independently and in parallel, which is significantly more efficient than sequentially rerunning the training steps.
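The per-tree structure of the verification can be sketched as follows. The `verify_tree` function and the certificate format are placeholders for illustration only, not CertXGB's actual interface; the point is that each tree's check is independent, so the ensemble can be verified in parallel.

```python
# Hedged sketch: verify each tree's certificate independently, in parallel.
from concurrent.futures import ThreadPoolExecutor

def verify_tree(cert: dict) -> bool:
    """Placeholder check of a single tree's certificate.

    In a real protocol this would re-check the committed split decisions
    and leaf values for one tree against the proof transcript.
    """
    return cert.get("valid", False)

def verify_ensemble(certs: list) -> bool:
    """Accept the model only if every tree's certificate verifies."""
    with ThreadPoolExecutor() as pool:
        return all(pool.map(verify_tree, certs))
```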

CertXGB’s generic design allows it to be integrated with general-purpose zero-knowledge proof backends, providing flexibility and adaptability. This breakthrough has implications for a variety of applications, including trusted machine learning as a service, distributed machine learning, and compliance with restrictions on data use.

This work demonstrates the feasibility of applying zero-knowledge proofs to gradient-boosted decision trees, especially XGBoost, a widely used technique for structured data. By enabling cryptographic verification while achieving nearly the same accuracy as standard XGBoost, ZKBoost is an important step toward ensuring the integrity and provenance of machine learning models in real-world deployments.

Circuits for fixed-point arithmetic and a proof-of-training template are essential building blocks for efficient, verifiable machine learning.

The fixed-point XGBoost implementation underpins ZKBoost, the zero-knowledge proof-of-training protocol. The implementation is designed to be compatible with arithmetic circuits and is an important step toward efficient zkPoT for gradient-boosted decision trees. The researchers achieved this by expressing the XGBoost computations in fixed-point arithmetic, allowing them to be translated into circuits suitable for cryptographic verification.
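As a concrete instance of such a computation, XGBoost's regularized split gain can be evaluated entirely in fixed point. The sketch below is illustrative, assuming 16 fractional bits and hypothetical helper names; it is not the paper's actual circuit.

```python
# XGBoost split gain in fixed point: gain = G_L^2/(H_L+lam)
#   + G_R^2/(H_R+lam) - (G_L+G_R)^2/(H_L+H_R+lam), all integer arithmetic.
F = 16
SCALE = 1 << F

def encode(x: float) -> int:
    return round(x * SCALE)

def decode(x: int) -> float:
    return x / SCALE

def fxp_div(a: int, b: int) -> int:
    # pre-scale the numerator so the quotient keeps F fractional bits
    return (a << F) // b

def score(g: int, h: int, lam: int) -> int:
    """g^2 / (h + lambda), with fixed-point operands."""
    return fxp_div((g * g) >> F, h + lam)

def split_gain(gl: int, hl: int, gr: int, hr: int, lam: int) -> int:
    """Regularized gain of splitting a node into left/right children."""
    return score(gl, hl, lam) + score(gr, hr, lam) - score(gl + gr, hl + hr, lam)
```

Because every step is integer arithmetic with deterministic truncation, the same sequence of operations can be replayed inside an arithmetic circuit.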

Next, the study builds a generic zkPoT template tailored to XGBoost training. The template can be instantiated with any general-purpose zero-knowledge proof backend, providing flexibility in the choice of underlying proof system. Central to the template is the ability to prove the correctness of each step of the XGBoost training process without revealing the underlying data or model parameters.

This was achieved by decomposing the training procedure into a series of verifiable computations. To address the challenge of proving nonlinear fixed-point operations, the study introduces a vector oblivious linear evaluation (VOLE)-based instantiation. VOLE enables secure evaluation of linear functions on masked data, which is essential for handling the nonlinearities inherent in fixed-point arithmetic within zero-knowledge proofs.
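A toy sketch of the underlying VOLE correlation may help. The prover holds a pair (u, v) and the verifier holds (delta, w) with w = u·delta + v over a finite field; any later claimed opening of u can then be checked as a linear relation without revealing u ahead of time. The field choice and function names below are assumptions for illustration, not the paper's construction.

```python
# Toy VOLE correlation over a prime field (illustrative; real protocols
# generate these correlations obliviously, without a trusted dealer).
import secrets

P = (1 << 61) - 1   # the Mersenne prime 2^61 - 1 (illustrative field)

def vole_correlate(u: int, delta: int):
    """Deal one correlation for value u under the verifier's global delta.

    Returns (v, w): the prover keeps v, the verifier keeps w = u*delta + v.
    """
    v = secrets.randbelow(P)
    w = (u * delta + v) % P
    return v, w

def check_opening(u: int, v: int, delta: int, w: int) -> bool:
    """Verifier checks a claimed opening (u, v) against its share w."""
    return (u * delta + v) % P == w
```

Changing u after the fact shifts the relation by a nonzero multiple of delta, so a cheating opening fails the check.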

This technique allows the verifier to check the correctness of these operations without learning anything about the inputs. Experiments demonstrate that the fixed-point implementation maintains accuracy within 1% of standard XGBoost while enabling practical zkPoT on real-world datasets.

Maintaining this level of accuracy is important for the practicality of the verified model: it shows that cryptographic verification does not significantly compromise performance. The methodology thus charts a path toward deploying robust, trustworthy machine learning models in sensitive applications.

Fixed-point arithmetic enables zero-knowledge proof of XGBoost model training without revealing sensitive data

ZKBoost achieves high accuracy, staying within 1% of the standard XGBoost implementation. Its fixed-point XGBoost implementation is compatible with arithmetic circuits, an important step toward efficient zero-knowledge proof of training (zkPoT). This work introduces the first zkPoT protocol for XGBoost, allowing model owners to demonstrate correct training on committed datasets without exposing sensitive data or model parameters.

A key component of this work is CertXGB, a generic template for zkPoT of XGBoost that can be integrated with general-purpose zero-knowledge proof (ZKP) backends. This certification algorithm efficiently verifies that a model was generated correctly by running the fixed-point XGBoost algorithm on a given dataset.

Validation of each tree in the XGBoost ensemble can be performed independently and in parallel, greatly increasing efficiency. The study also details the Vector Oblivious Linear Evaluation (VOLE)-based instantiation and demonstrates practical zkPoT performance on real-world datasets. The instantiation includes specialized ZKP subprotocols to securely prove nonlinear fixed-point operations such as comparison, division, and truncation. The work additionally addresses potential security vulnerabilities arising from arithmetic overflows, strengthening the robustness of the zkPoT process.
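The overflow concern can be illustrated with a simple guard: constrain every intermediate fixed-point value to a committed bit width so that products cannot silently wrap. The 40-bit bound and function names below are assumptions for illustration, not the paper's parameters.

```python
# Sketch of an arithmetic-overflow guard for fixed-point multiplication.
F = 16             # fractional bits (illustrative)
BOUND = 1 << 40    # illustrative magnitude bound on intermediate values

def range_check(x: int) -> int:
    """Reject any intermediate value outside the committed range."""
    if not -BOUND < x < BOUND:
        raise OverflowError("fixed-point value out of committed range")
    return x

def checked_mul(a: int, b: int) -> int:
    """Fixed-point multiply with range checks before and after truncation."""
    range_check(a)
    range_check(b)
    return range_check((a * b) >> F)
```

In a proof system the analogous step is a range proof on each committed intermediate value, which is what rules out overflow-based attacks.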

Zero-knowledge proofs use fixed-point arithmetic to maintain the accuracy of XGBoost models while ensuring privacy.

Gradient-boosted decision trees are a powerful technique for analyzing tabular data. A new protocol, ZKBoost, facilitates zero-knowledge proof training of XGBoost models and addresses the growing need for cryptographic guarantees of model integrity in sensitive applications. This system allows model owners to demonstrate correct training on committed datasets without exposing either the data itself or the model’s parameters.

ZKBoost accomplishes this through a fixed-point XGBoost implementation that is compatible with arithmetic circuits, along with a generic proof-of-training template that can be instantiated with a variety of zero-knowledge proof backends. A key innovation is the use of vector oblivious linear evaluation to overcome the challenges of proving nonlinear fixed-point operations.

Importantly, the fixed-point implementation maintains nearly the same accuracy as standard XGBoost, staying within 1% of the floating-point baseline while enabling practical cryptographic verification on real-world datasets. The authors acknowledge that the current implementation relies on certain cryptographic assumptions and optimizations.

Future research may focus on exploring alternative cryptographic primitives to further enhance the protocol’s efficiency and extend its applicability. Nevertheless, ZKBoost represents an important step towards reliable machine learning, providing a means to verify model integrity and paving the way for applications that require training provenance and data privacy guarantees.


