Introducing ProtST: A framework that enhances protein sequence pre-training and understanding with biomedical texts



https://arxiv.org/abs/2301.12040

Large language models can be applied to almost any domain, from natural language processing and understanding to computer vision, providing solutions across artificial intelligence. Advances in AI and machine learning have shown that these models can also be turned to biology: protein language models (PLMs) pre-trained on large protein sequence datasets have demonstrated the ability to improve predictions of protein structure and function.

Proteins are essential for biological growth, cell repair, and regeneration, and have important applications in drug discovery and healthcare. Existing PLMs, however, learn protein representations from sequences alone, capturing co-evolutionary information but not other important properties such as protein function and subcellular localization. As a result, these models cannot explicitly capture protein function.

For many proteins, textual descriptions are available that provide insight into their key functions and properties. To exploit this, the research team introduced ProtST, a framework that uses biomedical texts to improve protein sequence pre-training and understanding. The team also built a dataset, called ProtDescribe, that pairs protein sequences with text describing their functions and other properties. Built on ProtDescribe, the ProtST framework aims to inject this property information during pre-training while preserving the expressive power that conventional PLMs gain from co-evolutionary information.
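For intuition, a ProtDescribe-style entry can be pictured as a sequence paired with free-text annotations. The record below is purely illustrative; its field names and text are assumptions, not the dataset's actual schema.

```python
# A hypothetical ProtDescribe-style record pairing a protein sequence
# with free-text property descriptions (field names are illustrative,
# not the dataset's actual schema).
record = {
    "sequence": "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ",  # amino-acid sequence
    "description": (
        "FUNCTION: Catalyzes a hypothetical reaction. "
        "SUBCELLULAR LOCATION: Cytoplasm."
    ),
}
```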


Three pre-training tasks were designed to inject protein property information of varying granularity into the PLM while preserving its original expressiveness. The first is unimodal masked prediction, which preserves the PLM's ability to capture co-evolutionary information through masked protein modeling: certain regions of the protein sequence are masked out, and the model is trained to predict the masked residues from the surrounding context. This ensures the PLM retains its representation ability as property data is added.
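As a rough illustration, masked protein modeling can be sketched in PyTorch as follows. The encoder, head, and mask token id are placeholders for a generic protein LM, not ProtST's actual code.

```python
import torch
import torch.nn.functional as F

MASK_ID = 0        # placeholder id for the [MASK] token
MASK_PROB = 0.15   # fraction of residues to mask, BERT-style

def masked_protein_loss(encoder, lm_head, tokens):
    """Mask random residues and train the model to recover them.

    `encoder` and `lm_head` are stand-ins for a generic protein LM;
    this is an illustrative sketch, not ProtST's actual code.
    """
    # Choose random positions to mask.
    mask = torch.rand(tokens.shape) < MASK_PROB
    corrupted = tokens.masked_fill(mask, MASK_ID)

    # Predict vocabulary logits at every position.
    hidden = encoder(corrupted)   # (batch, length, d_model)
    logits = lm_head(hidden)      # (batch, length, vocab)

    # Cross-entropy only on the masked positions.
    return F.cross_entropy(logits[mask], tokens[mask])
```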

The second is multimodal representation alignment, which aligns protein sequences with their associated textual descriptions. Representations of the protein property descriptions are extracted with a biomedical language model, and by aligning protein sequence representations to these textual representations, the PLM learns to capture semantic relationships between sequences and their descriptions.
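One common way to implement such alignment is a symmetric contrastive (InfoNCE-style) objective over matched sequence-description pairs. The sketch below assumes both embeddings have already been projected into a shared space; it illustrates the general technique, not ProtST's exact formulation.

```python
import torch
import torch.nn.functional as F

def alignment_loss(seq_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over matched (sequence, description) pairs.

    `seq_emb` and `text_emb` are (batch, d) embeddings already projected
    into a shared space; an illustrative sketch of contrastive alignment.
    """
    seq_emb = F.normalize(seq_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Cosine-similarity logits between every sequence and every description.
    logits = seq_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Matched pairs sit on the diagonal; pull them together, push others apart.
    return (F.cross_entropy(logits, targets)
            + F.cross_entropy(logits.t(), targets)) / 2
```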

The third task, multimodal masked prediction, models fine-grained dependencies between residues in the protein sequence and words in the protein property description. A fusion module builds a joint multimodal representation of residues and words, from which masked residues and masked words are predicted. This lets the PLM capture the intricate connections between protein sequences and the textual descriptions of their properties.
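Under the assumption that the fusion module produces cross-attended states for both modalities, the task can be sketched as follows. Every name here (`fusion`, `res_head`, `word_head`) is hypothetical, not ProtST's actual API.

```python
import torch.nn.functional as F

def multimodal_mask_loss(fusion, res_head, word_head,
                         masked_seq_hidden, masked_text_hidden,
                         seq_tokens, text_tokens, seq_mask, text_mask):
    """Predict masked residues and masked words from fused states.

    `fusion` stands in for a cross-modal fusion module (e.g. cross-
    attention); all names are assumptions, not ProtST's actual code.
    """
    # Fuse residue states with word states (and vice versa).
    fused_seq, fused_text = fusion(masked_seq_hidden, masked_text_hidden)

    # Each modality's masked positions are recovered using both modalities.
    res_loss = F.cross_entropy(res_head(fused_seq)[seq_mask],
                               seq_tokens[seq_mask])
    word_loss = F.cross_entropy(word_head(fused_text)[text_mask],
                                text_tokens[text_mask])
    return res_loss + word_loss
```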

In the evaluation, the team found that ProtST's enhanced protein representations yield better performance on a range of representation learning benchmarks; on many of these tasks, PLMs trained with ProtST outperform previous models. ProtST also excelled at zero-shot protein classification: the trained model could assign proteins to functional categories, including classes never seen during training. It can likewise retrieve functional proteins from large databases without any functional annotation.
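Conceptually, such zero-shot classification works by embedding the protein and each class's textual description into the shared space and picking the most similar description, CLIP-style. The sketch below assumes generic encoder interfaces and is not ProtST's actual inference code.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def zero_shot_classify(protein_encoder, text_encoder, sequence, label_texts):
    """Score a protein against textual class descriptions it never saw
    during training; encoders and shapes here are assumptions.
    """
    p = F.normalize(protein_encoder(sequence), dim=-1)    # (1, d)
    t = F.normalize(text_encoder(label_texts), dim=-1)    # (num_classes, d)

    # The class whose description is most similar in the shared space wins.
    return (p @ t.t()).argmax(dim=-1)
```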

In conclusion, this framework for enhancing protein sequence pre-training and understanding with biomedical texts looks promising and is a welcome addition to the progress of AI.


Check out the paper and GitHub for more details.


Tanya Malhotra is a final-year student at the University of Petroleum and Energy Studies, Dehradun, pursuing a Bachelor of Science in Computer Science Engineering with a specialization in Artificial Intelligence and Machine Learning.
A data science enthusiast with strong analytical and critical thinking skills, she has a keen interest in learning new skills, leading groups, and managing work in an organized manner.



