Microsoft researchers introduce Syntheseus: a machine learning benchmark Python library for end-to-end retrosynthesis planning

Machine Learning


https://arxiv.org/abs/2310.19796

Thanks to advances in machine learning, particularly generative models, interest in the computer automation of molecular design has surged over the past five years. Although these methods can find compounds with suitable properties more quickly, they do not account for synthetic accessibility and often produce molecules that are difficult to synthesize in a wet lab. This motivates efficient computer-aided synthesis planning (CASP) algorithms that verify the synthesizability of candidate molecules via retrosynthesis and, in particular, generate concrete synthesis routes.

The intersection of chemistry and machine learning has received considerable attention in recent years. However, putting state-of-the-art reaction models to practical use poses significant challenges. These models are notoriously difficult to deploy because of their differing assumptions and input/output formats. Moreover, codebases designed primarily to reproduce benchmark results often lack easy-to-call entry points, complicating the process further.

To address this, researchers from Microsoft, the University of Cambridge, Jagiellonian University, and Johannes Kepler University investigate the metrics commonly used for both single-step and multi-step retrosynthesis. It is unclear how measurements of an end-to-end retrosynthesis pipeline relate to the single-step and multi-step benchmarks evaluated in isolation, and previous research has shown that model comparisons and metric usage are inconsistent. This study aims to define best practices for evaluating retrosynthesis algorithms by thoroughly reevaluating and analyzing prior work. To support consistent evaluation, the team introduced the Python library Syntheseus.

Evaluation in retrosynthesis faces two main limitations. First, while experimental validation is essential, academics working on algorithm development rarely perform synthesis in the lab, because synthesis is expensive, time-consuming, and requires considerable expertise. Second, owing to the separation between single-step and multi-step research, most studies consider only a single step rather than the entire retrosynthesis pipeline. Yet whether a method gets adopted in the real world depends on how well it performs end to end.

The team integrated eight free and open-source reaction models behind one consistent interface, seven of which share the same conda environment. Because the complexity of the underlying codebases is hidden, comparing different types of models is as easy as a for loop.
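This "one interface, many models" idea can be sketched as follows. The class and method names below are hypothetical illustrations of the pattern, not the actual Syntheseus API:

```python
from abc import ABC, abstractmethod

class SingleStepModel(ABC):
    """Shared interface: map a product SMILES to ranked reactant sets."""

    @abstractmethod
    def predict(self, product_smiles: str, num_results: int) -> list[list[str]]:
        ...

class ToyTemplateModel(SingleStepModel):
    """Illustrative stand-in for a wrapped single-step model."""

    def __init__(self, name: str, rules: dict[str, list[list[str]]]):
        self.name = name
        self.rules = rules  # product SMILES -> ranked reactant sets

    def predict(self, product_smiles: str, num_results: int) -> list[list[str]]:
        return self.rules.get(product_smiles, [])[:num_results]

models = [
    ToyTemplateModel("model_a", {"CCO": [["C=C", "O"], ["CC=O"]]}),
    ToyTemplateModel("model_b", {"CCO": [["CC=O"]]}),
]

# With a shared interface, benchmarking different models is a simple loop.
predictions = {m.name: m.predict("CCO", num_results=5) for m in models}
print(predictions)
```

Once every model hides its preprocessing and decoding behind the same `predict` call, swapping one model for another requires no changes to the evaluation loop.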

To compare published numbers with those produced by this assessment, the team used the USPTO-50K dataset, since all the models they investigated report results on it. Due to its modest size, however, USPTO-50K may not accurately capture the full data distribution. The team therefore also evaluated on the larger Pistachio dataset, which contains over 15.6 million raw reactions (3.4 million samples after preprocessing), to measure the out-of-distribution generalization of model checkpoints trained on USPTO-50K. For new users, Syntheseus automatically downloads and caches default weights trained on USPTO-50K, so there is no need to hunt for model weights to get started; models can later be retrained on larger or internal datasets.

Chemformer, GLN, Graph2Edits, LocalRetro, MEGAN, MHNreact, and RootAligned are among the established single-step models reevaluated in this work; for RetroKNN, the researchers obtained code directly from the developers. Where a checkpoint with a suitable data split was available, the team used it; otherwise, they trained a new model using the original training code.

They computed the mean reciprocal rank (MRR) and top-k accuracy (for k up to 50) while evaluating all models with n = 100 outputs. All models were run with a batch size of 1. Although the models can handle larger batches, the batch size used during search is effectively fixed at 1, because search algorithms are typically not parallelized and cannot be freely reconfigured. The maximum number of model calls performed during a search within a given time budget is therefore directly tied to a model's speed at batch size 1.
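Both metrics are straightforward to compute from ranked predictions. A minimal sketch (the helper name and the list-based encoding are our own, not from the paper):

```python
def evaluate_single_step(ranked_preds, ground_truth, ks=(1, 3, 5, 10, 50)):
    """Compute MRR and top-k accuracy for ranked single-step predictions.

    ranked_preds: one ranked, deduplicated list of candidate answers per
                  test molecule (best first).
    ground_truth: the recorded answer for each molecule.
    """
    n = len(ground_truth)
    mrr = 0.0
    topk_hits = {k: 0 for k in ks}
    for preds, truth in zip(ranked_preds, ground_truth):
        if truth not in preds:
            continue  # not recovered: contributes 0 to both metrics
        rank = preds.index(truth) + 1  # 1-based rank of the true answer
        mrr += 1.0 / rank
        for k in ks:
            if rank <= k:
                topk_hits[k] += 1
    return mrr / n, {k: hits / n for k, hits in topk_hits.items()}

# Two test molecules: the true answer sits at rank 2 and rank 1.
mrr, topk = evaluate_single_step(
    [["a", "b", "c"], ["b", "a"]], ["b", "b"], ks=(1, 3)
)
print(mrr, topk)  # 0.75 {1: 0.5, 3: 1.0}
```

Top-k accuracy asks only whether the true answer appears in the first k candidates, while MRR also rewards placing it higher in that list.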

Note that two of the models (RootAligned and Chemformer) use a Transformer decoder to predict reactant SMILES from scratch, while the others predict a graph edit to apply to the product. The former type performs well in top-1 accuracy across datasets and metrics, but the graph-edit-based models perform better at large k. The findings suggest that edit-based models are more explicitly grounded in the set of transformations present in the training data and can therefore cover the data distribution more comprehensively.

Additionally, many of the USPTO-50K values reported here are higher than those found in the literature, since top-k accuracy for k > 1 is affected by the deduplication of predictions. This also changes the ranking of some models: for example, GLN's top-1 accuracy turns out to be lower than LocalRetro's, contrary to previous claims. Although all results are significantly worse on Pistachio, the model ranking remains surprisingly consistent with USPTO-50K. In top-50 accuracy, for instance, no model exceeds 55% on Pistachio, while on USPTO-50K the models reach almost 100%. This is partly due to the poor coverage of template-based models, but some of the template-free models evaluated here were also observed to generalize less well than their template-based counterparts.

In conclusion, RetroKNN ranks first or near first on all metrics across both datasets and is among the fastest models reevaluated. Still, current single-step metrics, while useful, are insufficient to fully characterize single-step model performance, so the researchers caution readers not to treat these rankings as definitive.
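The deduplication mentioned above can move the ground-truth answer to a better rank and thereby raise top-k accuracy for k > 1. A minimal order-preserving version (in practice each SMILES would first be canonicalized, e.g. with RDKit, so that different strings for the same molecule compare equal; the strings below are illustrative):

```python
def deduplicate_ranked(preds):
    """Drop duplicate predictions while preserving rank order."""
    seen, unique = set(), []
    for p in preds:
        if p not in seen:
            seen.add(p)
            unique.append(p)
    return unique

# Raw output with repeats: suppose the correct answer "C=C.O" sits at rank 3.
raw = ["CC=O", "CC=O", "C=C.O", "CC=O", "CCBr"]
deduped = deduplicate_ranked(raw)
print(deduped)  # after deduplication it moves up to rank 2
```

If evaluation counts raw (duplicated) outputs, the same k covers fewer distinct answers, which is why deduplicated top-k numbers come out higher than some literature values.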

The researchers also conducted search experiments combining several single-step models and search algorithms. Since their main focus is on revisiting existing results, surveying best practices, and introducing Syntheseus, these multi-step results are preliminary. Nevertheless, the framework developed in this study paves the way for determining optimal end-to-end pipelines, an encouraging prospect for future work.

The reported results track the discovery of the first solution and the maximum number of distinct routes recovered from the search graph. With the exception of Chemformer, GLN, and MHNreact, most models, under any of the search algorithms, find multiple distinct routes to most targets. RootAligned achieves promising results despite averaging fewer than 30 model calls (a consequence of its high per-call cost).
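The interplay between a single-step model and a search algorithm can be illustrated with a toy breadth-first search over molecule placeholders. Everything here (the function, the rule table, and the purchasable set) is a hypothetical sketch, not Syntheseus code:

```python
from collections import deque

def retro_search(target, expand, purchasable, max_calls=50):
    """Toy breadth-first retrosynthesis search.

    expand(mol) returns ranked reactant sets proposed by a single-step model;
    a route is solved when every leaf molecule is purchasable.
    Returns (route, model_calls); route is None if the budget runs out.
    """
    queue = deque([([target], [])])  # (unsolved molecules, steps so far)
    calls = 0
    while queue:
        frontier, steps = queue.popleft()
        if not frontier:
            return steps, calls  # all leaves purchasable: route found
        mol, rest = frontier[0], frontier[1:]
        if calls >= max_calls:
            break
        calls += 1  # one single-step model invocation
        for reactants in expand(mol):
            unsolved = rest + [r for r in reactants if r not in purchasable]
            queue.append((unsolved, steps + [(mol, reactants)]))
    return None, calls

# Toy reaction "model": D <- B + C, B <- A; building blocks A and C.
rules = {"D": [["B", "C"]], "B": [["A"]]}
route, calls = retro_search("D", lambda m: rules.get(m, []), {"A", "C"})
print(route, calls)
```

This makes concrete why batch-size-1 model speed governs search throughput: each expansion of an unsolved molecule costs one model call, and the call budget bounds how much of the search graph can be explored.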


Check out the Paper and GitHub. All credit for this research goes to the researchers of this project.


Dhanshree Shenwai is a computer science engineer with extensive experience in FinTech companies covering the fields of finance, cards and payments, and banking, with a keen interest in applications of AI. She is passionate about exploring new technologies and advancements in today's evolving world to make life easier for everyone.
