ABodyBuilder3: A scalable and accurate model for antibody structure prediction

Paper: https://arxiv.org/abs/2405.20863

Accurately predicting antibody structure is essential for the development of monoclonal antibodies, which are crucial in immune responses and therapeutic applications. Antibodies have two heavy and two light chains, and each variable region contains six complementarity-determining region (CDR) loops, three per chain, that are essential for binding to antigens. The CDRH3 loop is the most challenging to predict because of its diversity. Experimental methods for determining antibody structure are often time-consuming and costly. As a result, computational methods such as IgFold, DeepAb, ABlooper, and ABodyBuilder, as well as newer models such as xTrimoPGLMAb, have emerged as effective tools for antibody structure prediction.

Researchers from Exscientia and the University of Oxford have developed ABodyBuilder3, an advanced model for predicting antibody structures. The new model builds on ABodyBuilder2 and integrates language model embeddings to improve the accuracy of CDR loop predictions. ABodyBuilder3 also improves structure predictions through more careful relaxation techniques and introduces predicted local distance difference test (pLDDT) scores for more accurate uncertainty estimation. Key improvements include updates to data curation, sequence representation, and the structure refinement process. These advances make ABodyBuilder3 a scalable solution for accurately evaluating large numbers of therapeutic antibody candidates.

To make antibody structure modeling more scalable, the researchers developed a more efficient implementation of ABodyBuilder2 that incorporates vectorization and optimizations from OpenFold. They trained with bfloat16 mixed precision, which made training more than three times faster and reduced memory usage. They trained on the Structural Antibody Database (SAbDab) and refined the dataset by filtering out outliers, structures with very long CDRH3 loops, and low-resolution structures. To make the evaluation more robust, they used large validation and test sets focused on human antibodies. Their refinement strategy, which uses OpenMM and YASARA, enabled significant improvements over ABodyBuilder2 in structure accuracy, especially in the antibody framework.
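The dataset filtering described above can be sketched in a few lines of Python. This is only an illustration: the cutoff values, the `AntibodyEntry` record, and the example entries are hypothetical, since the article does not state the thresholds or data format the researchers used.

```python
from dataclasses import dataclass

# Hypothetical cutoffs for illustration only; the article does not give the real values.
MAX_RESOLUTION_ANGSTROM = 3.5
MAX_CDRH3_LENGTH = 30


@dataclass
class AntibodyEntry:
    """Hypothetical record for one structure in the training set."""
    entry_id: str
    resolution: float  # in angstroms; lower values mean a sharper structure
    cdrh3: str         # amino-acid sequence of the CDRH3 loop


def keep_entry(entry):
    """Return True if the structure passes both quality filters."""
    good_resolution = entry.resolution <= MAX_RESOLUTION_ANGSTROM
    reasonable_loop = len(entry.cdrh3) <= MAX_CDRH3_LENGTH
    return good_resolution and reasonable_loop


def filter_dataset(entries):
    """Keep only the entries that pass every filter."""
    return [entry for entry in entries if keep_entry(entry)]


# Made-up entries, one per case.
entries = [
    AntibodyEntry("ex01", 2.1, "ARDYYGSSYFDY"),  # passes both filters
    AntibodyEntry("ex02", 4.2, "ARGGYDV"),       # dropped: low resolution
    AntibodyEntry("ex03", 1.9, "A" * 35),        # dropped: very long CDRH3
]
kept = filter_dataset(entries)
print([entry.entry_id for entry in kept])
```

Keeping each filter as a named boolean makes it easy to add further criteria, such as an outlier check, without changing the calling code.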

To improve the sequence representation, the researchers replaced ABodyBuilder2's one-hot encoding with embeddings from the ProtT5 language model, which was pre-trained on billions of protein sequences. They generated separate embeddings for the heavy and light chains and combined them to represent the full variable region. Although the researchers also tested antibody-specific models such as IgT5 and IgBert, the general protein language model performed better, likely because it avoids issues such as dataset contamination and overfitting. With ProtT5 embeddings, the researchers set a low initial learning rate and tuned the learning rate scheduler to keep training stable. The resulting model, ABodyBuilder3-LM, showed reduced RMSD, especially for the CDRH3 and CDRL3 loops.
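As a rough sketch of the per-chain embedding step, the snippet below embeds the heavy and light chains separately and then joins them. It assumes that "combined" means concatenation along the residue axis, which the article does not spell out. The `toy_embed` function and the example sequences are hypothetical stand-ins; a real pipeline would call a protein language model such as ProtT5 instead.

```python
def toy_embed(sequence):
    """Hypothetical stand-in for a protein language model.

    Returns one small vector per residue: [residue code, position in chain].
    A real model would return a high-dimensional learned vector instead.
    """
    return [[float(ord(aa) - ord("A")), float(pos)] for pos, aa in enumerate(sequence)]


def embed_variable_region(heavy, light):
    """Embed each chain on its own, then concatenate along the residue axis."""
    heavy_emb = toy_embed(heavy)  # the model sees only the heavy chain here
    light_emb = toy_embed(light)  # the model sees only the light chain here
    return heavy_emb + light_emb  # one row per residue of the variable region


# Made-up, shortened sequences for illustration.
heavy = "EVQLVESGG"
light = "DIQMTQ"
embedding = embed_variable_region(heavy, light)

print(len(embedding))         # number of residues across both chains
print(embedding[len(heavy)])  # first light-chain residue; its position restarts at 0
```

Because each chain is embedded independently, the position feature restarts at zero for the light chain, which shows that neither chain influences the other's embedding in this sketch.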

To improve uncertainty estimation, ABodyBuilder3 replaces ABodyBuilder2's ensemble-based confidence approach with predicted per-residue lDDT-Cα (pLDDT) scores, as used in AlphaFold2. Predicting accuracy directly from a single model significantly reduces the computational cost. A small neural network projects each residue's representation onto a set of lDDT bins, and it is trained against the lDDT-Cα obtained by comparing the predicted structure with the actual one. This approach improves the correlation between the predicted uncertainty and RMSD, especially for the model that uses ProtT5 embeddings. The pLDDT score is an effective predictor of accuracy in the CDR regions, with higher scores indicating lower RMSD in critical regions such as CDRH3.
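To make the two quantities concrete, here is a minimal sketch of a per-residue lDDT-Cα calculation and of turning binned probabilities into a single pLDDT value. The 15 Å inclusion radius and the 0.5, 1, 2, and 4 Å tolerances are the values commonly used for lDDT; they, the bin layout, and the toy coordinates are assumptions rather than details taken from the article.

```python
import math


def lddt_ca(pred, true, cutoff=15.0, thresholds=(0.5, 1.0, 2.0, 4.0)):
    """Per-residue lDDT on C-alpha coordinates, each score between 0 and 1.

    For residue i, look at every other residue j that lies within `cutoff`
    in the true structure and count how many tolerance thresholds the
    predicted i-j distance satisfies.
    """
    n = len(true)
    scores = []
    for i in range(n):
        preserved = 0
        total = 0
        for j in range(n):
            if i == j:
                continue
            d_true = math.dist(true[i], true[j])
            if d_true >= cutoff:
                continue  # pair is too far apart to count as local
            error = abs(d_true - math.dist(pred[i], pred[j]))
            preserved += sum(error < t for t in thresholds)
            total += len(thresholds)
        scores.append(preserved / total if total else 0.0)
    return scores


def expected_plddt(bin_probs):
    """Expected lDDT from a probability distribution over equal-width bins."""
    width = 1.0 / len(bin_probs)
    centers = [(k + 0.5) * width for k in range(len(bin_probs))]
    return sum(p * c for p, c in zip(bin_probs, centers))


# Toy example: four residues on a line, with the last one misplaced by 3 angstroms.
true_coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
pred_coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (14.4, 0.0, 0.0)]

print(lddt_ca(pred_coords, true_coords))  # the misplaced residue gets the lowest score
print(expected_plddt([0.0, 1.0]))         # all mass in the upper of two bins
```

During training, a confidence head would be fitted so that its binned output matches the true lDDT-Cα of each residue; at inference time only the expected value is needed, so no ensemble of models is required.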

In conclusion, ABodyBuilder3 is an advanced antibody structure prediction model that builds on ABodyBuilder2 with enhancements that improve scalability and accuracy. The model achieves better performance by optimizing hardware usage and improving data processing and structure prediction methods. Incorporating language model embeddings improves accuracy, especially for the CDRH3 region, and using pLDDT scores for uncertainty estimation removes the need for computationally intensive ensemble models. In the future, self-distillation techniques and pre-training on synthetic datasets could be explored to increase prediction accuracy. Furthermore, combining pLDDT with ensemble approaches may improve results, albeit at a higher computational cost.

Check out the paper. All credit for this work goes to the researchers of this project.

Sana Hassan, a Consulting Intern at Marktechpost and a dual degree student at the Indian Institute of Technology Madras, is passionate about applying technology and AI to address real-world challenges. With a keen interest in solving practical problems, she brings a fresh perspective to the intersection of AI and real-world solutions.
