Tevogen Bio’s journey to streamlining life-saving treatments



Accelerating the 10-year drug discovery process

Developing a drug costs more than $3 billion and takes 10 to 12 years of investment to bring a product to market. That cost and timeline directly affect how accessible and affordable the resulting product is.

To address these issues, Tevogen Bio created the patented ExacTcell platform, which targets specific viruses, tumors, or neurological diseases using a single HLA restriction. Initial target selection for a proof-of-concept study against a single virus candidate, SARS-CoV-2, was performed manually. Although a single HLA-restricted product could cover a large portion of the population, the manual approach required a significant investment of time and resources, taking 18 to 24 months to test and confirm in the wet lab.

To meet Tevogen’s mission statement of delivering faster, cheaper, and more accessible care, Tevogen.AI is partnering with Microsoft and Databricks to optimize the scientific understanding of its core platform, as well as streamline and accelerate its pipeline to additional indications.

The challenge was to capture and build a library of protein sequences across a variety of diseases, so that scientists and researchers could shorten processes that once took months to days, and now to hours.

This dataset is also used to train Tevogen.AI's patented underlying algorithm, which is backed by Tevogen Bio's proprietary science. Tevogen management additionally set the challenge of curating a dataset of known gene proteins to train a machine learning model that predicts immunologically active peptides.

Bottlenecks: Wrangling multi-terabyte datasets

To curate this dataset, the team had to source and organize a multi-terabyte dataset with the features needed for algorithm training. This raised two major problems:

  1. Creating data pipelines that quickly retrieve and organize relevant information using multi-level cleansing and filtering.
  2. Converting processes designed to run serially so that they run in parallel.
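
As a rough illustration of the second problem, the sketch below turns a serial cleansing loop into a parallel map. It is not Tevogen's pipeline: the function names, the validity rule (only the 20 standard amino acid letters are accepted), and the use of a local thread pool in place of a distributed cluster are all assumptions made to keep the example self-contained.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Iterable, List, Optional

# The 20 standard amino acid one-letter codes (hypothetical validity rule).
VALID_RESIDUES = set("ACDEFGHIKLMNPQRSTVWY")


def clean_record(raw: str) -> Optional[str]:
    """Cleanse one record: normalize it, then filter out invalid sequences."""
    seq = raw.strip().upper()
    if not seq or not set(seq) <= VALID_RESIDUES:
        return None
    return seq


def run_serial(records: Iterable[str]) -> List[str]:
    """Original shape: one record at a time, in a single loop."""
    return [s for s in (clean_record(r) for r in records) if s is not None]


def run_parallel(records: Iterable[str], workers: int = 4) -> List[str]:
    """Same logic, but records are handed to a pool of workers.

    A thread pool stands in for a cluster here; on a distributed engine
    the same per-record function would be applied to data partitions.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        cleaned = list(pool.map(clean_record, records))  # keeps input order
    return [s for s in cleaned if s is not None]
```

The conversion works because `clean_record` depends only on its own input, so records can be processed in any order and on any worker without changing the result.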

Databricks proved to be an important partner here.

Building a modern data lakehouse with Databricks

We chose the Databricks platform as the foundation for our modernization efforts. Using the medallion architecture and Unity Catalog, we built a number of pipelines that store data in bronze, silver, and gold layers while maintaining strict governance and fine-grained access controls.
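
To make the layering concrete, here is a minimal sketch of the bronze, silver, and gold idea in plain Python. It only illustrates the pattern: the record format (`protein_id,sequence`), the function names, and the chosen gold feature are hypothetical, and on Databricks each layer would typically be a governed table rather than an in-memory list.

```python
def to_bronze(raw_lines):
    """Bronze: keep the raw input untouched, adding only ingestion metadata."""
    return [{"line_no": i, "raw": line} for i, line in enumerate(raw_lines)]


def to_silver(bronze):
    """Silver: parse, validate, and deduplicate the bronze rows."""
    seen, silver = set(), []
    for row in bronze:
        parts = row["raw"].strip().split(",")
        if len(parts) != 2:
            continue  # malformed row: it stays in bronze only
        protein_id, seq = parts[0].strip(), parts[1].strip().upper()
        if not protein_id or not seq or protein_id in seen:
            continue  # empty field or duplicate identifier
        seen.add(protein_id)
        silver.append({"protein_id": protein_id, "sequence": seq})
    return silver


def to_gold(silver):
    """Gold: analysis-ready features derived from the silver rows."""
    return [
        {"protein_id": r["protein_id"], "length": len(r["sequence"])}
        for r in silver
    ]
```

Keeping the raw rows in bronze means that a cleansing rule can be changed later and the silver and gold layers rebuilt without fetching the source data again.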

With distributed computing and a clean data structure, we reduced processing time from 50 days to 24 hours. The medallion architecture also served as the foundation for developing various machine learning (ML) models.

Thanks to the experts of the Professional Services team, with personal thanks to Vibhor Nigam and Mohamad Abafoul, Tevogen.AI was able to perform large-scale processing and accumulate a dataset of 24 million proteins. We then cleaned and classified this data to derive 16 billion data points and approximately 700 million unique peptides as it moved from the bronze to the silver layer of the medallion architecture. Additionally, we were able to curate approximately 37 million mutually matching expert articles.
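
One common way to derive candidate peptides from protein sequences is a sliding window, sketched below. This is an assumption for illustration, not a description of how Tevogen.AI built its peptide set: the window lengths (8 to 11 residues, a range often used for HLA class I peptides) and the function names are hypothetical.

```python
def peptides(sequence, min_len=8, max_len=11):
    """Yield every contiguous window of min_len..max_len residues."""
    for k in range(min_len, max_len + 1):
        # range() is empty when the sequence is shorter than k
        for start in range(len(sequence) - k + 1):
            yield sequence[start:start + k]


def unique_peptides(proteins, min_len=8, max_len=11):
    """Collect the distinct peptides found across many protein sequences."""
    unique = set()
    for seq in proteins:
        unique.update(peptides(seq, min_len, max_len))
    return unique
```

Because overlapping windows multiply quickly and many proteins share subsequences, deduplicating into a set is what keeps the peptide count far below the raw number of windows.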

From data to AI: Training the PredicTcell model

Anyone who has worked in bioinformatics understands that this is no easy feat to accomplish within a few months. Once this process was in place, the team was able to work in parallel and create an MLOps framework that enables automated training, inference, monitoring, and retention. After the initial stages of the effort were completed, the team delivered an alpha version of the PredicTcell model, trained with traditional XGBoost methods and an ESM model, which achieved 93-97% recall and 38-43% precision.
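
For readers less familiar with these two metrics, the short sketch below shows how they are computed from prediction counts. The counts are hypothetical, chosen only so that the results fall inside the reported ranges; they are not Tevogen.AI's evaluation data.

```python
def precision_recall(tp, fp, fn):
    """Return (precision, recall) from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical example: 100 truly active peptides, of which the model
# finds 95 (tp) and misses 5 (fn), while also flagging 140 inactive ones (fp).
precision, recall = precision_recall(tp=95, fp=140, fn=5)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Read this way, high recall with moderate precision describes a model that misses few active peptides but also flags a number of inactive ones, which leaves a smaller candidate set for later confirmation.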

Expanding the dataset has also given Tevogen's scientific team new insights into the model training cycle, allowing them to refine the training method with each iteration. We continue to add features to our training set, for example by using Agent Bricks in combination with biochemical properties to quickly evaluate expert papers with RAG integration.

Looking to the future: Unlocking the holy grail of medicine

As beta training of the PredicTcell model gets underway and work starts on the alpha version of the AdapTcell model, Tevogen.AI is uniquely positioned to create cutting-edge predictive models that increase the accuracy of predicted peptide-protein binding affinities, the key to unlocking the holy grail of medicine.

Tevogen.AI believes that by using our proprietary model, we can achieve our ultimate goal of predicting binding peptides for novel or other proteins with very high accuracy.

“Adding determinism to stochastic workflows is the key to success. Balancing the trial-and-error process in vivo/in silico is something every biotech company should focus on in drug development,” said Mittul Mehta, CIO of Tevogen and Head of Tevogen.AI.

“We are very pleased with our relationships with Databricks and Microsoft. Each offers best-in-class capabilities that will enable continued innovation and help us achieve Tevogen’s goal of providing affordable and accessible treatments to large patient populations. We look forward to continuing to work with both of these great partners to revolutionize AI for drug development.”


