Growing strings in chemical reaction space to explore retrosynthetic routes

Machine Learning


Retrosynthesis is the process of designing a synthetic route for a desired target molecule; it involves identifying the best strategy for combining simpler molecules into the target product.1 A route usually requires a series of reaction steps, each of which builds an intermediate from simpler precursor molecules. One of the main challenges is exploring the large retrosynthetic hypergraph that represents all possible synthetic routes for a particular target molecule.2-30 Pathways in this tree connect the target product (i.e., the root) with commercially available compounds (i.e., the leaves) and are identified algorithmically through single-step disconnections.

A retrosynthesis tree grows exponentially because each retrosynthetic step can branch into multiple choices, so the number of possible routes increases exponentially with the depth of the tree. Therefore, even for relatively small molecules, the sheer number of potential synthetic routes can be overwhelming to explore with classical computational approaches.
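The growth described above can be made concrete with a short calculation. This is only an illustrative sketch: it assumes a uniform branching factor at every step, which real retrosynthesis trees do not have, and the function names are hypothetical.

```python
def count_routes(branching_factor: int, depth: int) -> int:
    """Number of distinct linear paths of length `depth` when every
    retrosynthetic step offers `branching_factor` alternatives
    (assumption: the same number of choices at every step)."""
    return branching_factor ** depth


def count_nodes(branching_factor: int, depth: int) -> int:
    """Total number of nodes in the tree, from the root (level 0)
    down to level `depth`, under the same uniform assumption."""
    return sum(branching_factor ** level for level in range(depth + 1))


if __name__ == "__main__":
    # Print how quickly the search space grows for a few depths.
    for depth in (2, 5, 10):
        print(depth, count_routes(10, depth), count_nodes(10, depth))
```

With ten alternatives per step, a five-step route already corresponds to 100,000 candidate paths, which is why filtering and prioritization are needed.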

Exploring such a hypergraph requires criteria that effectively filter the wide range of possible disconnections. One strategy relies on evaluating and scoring paths based on the confidence of each single-step retrosynthesis prediction, which can be assessed independently. The idea is to treat low confidence as an indicator of synthetic routes that are risky and most likely to fail. Unreliable steps are therefore not expanded further, and more reliable steps are given priority.14 Apart from filtering on confidence values, single-step predictions that fail a round-trip check can also be excluded from further expansion. This check consists of applying a forward prediction model to the output of a single retrosynthetic step and verifying that it returns the desired product.14 Other schemes score and rank the options according to, for example, the availability of the molecule, its cost, or indicators related to green chemistry.31
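The two filters described above, a confidence cutoff and a round-trip check, can be sketched as follows. Everything named here is an assumption made for illustration: the `RetroPrediction` container, the `forward_model` callable (a stand-in for a trained forward reaction predictor), and the 0.6 threshold do not come from the cited work.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass(frozen=True)
class RetroPrediction:
    """Hypothetical container for one single-step retrosynthesis prediction."""
    product: str                 # molecule being disconnected
    precursors: Tuple[str, ...]  # proposed precursor molecules
    confidence: float            # model confidence in [0, 1]


def filter_predictions(
    predictions: List[RetroPrediction],
    forward_model: Callable[[Tuple[str, ...]], Optional[str]],
    min_confidence: float = 0.6,  # assumed threshold, not from the source
) -> List[RetroPrediction]:
    """Keep predictions that are confident enough and pass the round-trip
    check, sorted so the most reliable step is expanded first."""
    kept = []
    for prediction in predictions:
        # Filter 1: drop low-confidence (unreliable) steps.
        if prediction.confidence < min_confidence:
            continue
        # Filter 2: round-trip check. The forward model applied to the
        # proposed precursors must return the original product.
        if forward_model(prediction.precursors) != prediction.product:
            continue
        kept.append(prediction)
    return sorted(kept, key=lambda p: p.confidence, reverse=True)
```

In practice the string comparison would be done on canonicalized molecular representations; the plain equality test here is a simplification.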

Nevertheless, most existing approaches use only local information obtained from single-step retrosynthesis and do not consider the strategic decisions typical of multi-step syntheses conceived by human experts. Indeed, when planning a multi-step synthesis, chemists make various strategic decisions to streamline the process and optimize efficiency. One example is the strategic protection of functional groups. By selectively protecting certain groups early on, chemists can prevent undesired reactions and ensure that the desired transformations occur smoothly. In an ideal case, a chemist would introduce a variety of protecting groups and aim to remove them all in a single final step, saving time and minimizing the risk of side reactions. Another strategy involves using robust, high-yielding reactions for key transformations, which can significantly improve overall yields and simplify synthetic routes. In addition, strategic retrosynthetic disconnection plays an important role in planning a series of reactions. By identifying strategic bond breaks, chemists can design efficient synthetic routes and target specific intermediates or building blocks to assemble the final product. Finally, the selection of reagents, catalysts, and reaction conditions is also a strategic consideration. Choosing appropriate reaction parameters can improve selectivity, increase yield, and speed up the overall synthetic process. Collectively, these strategic decisions contribute to the successful execution of complex multi-step syntheses.32 Existing models rely on single-step predictions and therefore lack a comprehensive understanding of the key strategies used in multi-step retrosynthesis.

Recently, Thakkar et al.29 described an approach aimed at improving retrosynthetic prediction systems by giving chemists more control over the disconnections made during retrosynthesis tree exploration. This method enables user-defined disconnections and creates a “human-in-the-loop” component that combines expert knowledge with deep learning. Their approach increases the diversity of the predicted disconnections. It promotes better decision-making strategies, an improved experience for the chemist, and greater user engagement, bringing in knowledge that statistical and machine learning algorithms alone cannot encode because of insufficient training data and the resulting model bias. Another recent approach to improving search policies introduces the concept of goal-driven synthesis planning, which uses reinforcement learning to optimize multi-step synthesis routes towards specific building blocks.33

Furthermore, Chen et al.21 introduced a method called “Retro*” that uses a neural-guided tree search for chemical retrosynthesis planning. Their method uses an A*-like planning algorithm guided by a neural network trained on past retrosynthesis planning experience. The neural network learns the synthesis cost of each molecule and helps the search algorithm select the most promising molecule nodes for expansion. In addition, Ishida et al. presented “ReTReK,” a data-driven computer-aided synthesis planning (CASP) application that integrates retrosynthesis knowledge into the evaluation of search directions. They show that ReTReK successfully explores promising synthetic routes by incorporating tunable parameters based on retrosynthesis knowledge and favors routes designed according to that knowledge. The study addresses limitations of existing data-driven CASP applications by introducing a rule-based component and, through an evaluation on drug-like molecules, shows the potential of ReTReK to enhance current and future data-driven CASP applications.
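The general idea of an A*-like search guided by a cost estimate can be sketched as below. This is not the Retro* implementation: it is a simplified best-first search over sets of unsolved molecules, and `expand`, `is_purchasable`, and `estimate_cost` are hypothetical callables standing in for a single-step model, a building-block catalogue, and the learned cost network.

```python
import heapq
from itertools import count


def best_first_route(target, expand, is_purchasable, estimate_cost,
                     max_iterations=1000):
    """Return (total_cost, route) for the first fully solved state found,
    or None. A state is the set of molecules that still need a route.

    expand(molecule)         -> iterable of (step_cost, precursors)
    is_purchasable(molecule) -> bool
    estimate_cost(molecule)  -> estimated remaining cost (the heuristic)
    """
    tie_breaker = count()  # avoids comparing sets when priorities tie
    start = frozenset() if is_purchasable(target) else frozenset([target])
    frontier = [(sum(estimate_cost(m) for m in start),
                 next(tie_breaker), 0.0, start, [])]
    visited = set()
    for _ in range(max_iterations):
        if not frontier:
            return None
        _, _, cost_so_far, open_molecules, route = heapq.heappop(frontier)
        if not open_molecules:  # every leaf is purchasable
            return cost_so_far, route
        if open_molecules in visited:
            continue
        visited.add(open_molecules)
        molecule = min(open_molecules)  # deterministic pick of what to expand
        for step_cost, precursors in expand(molecule):
            unsolved = {p for p in precursors if not is_purchasable(p)}
            remaining = frozenset((open_molecules - {molecule}) | unsolved)
            new_cost = cost_so_far + step_cost
            # A*-style priority: cost so far plus estimated remaining cost.
            priority = new_cost + sum(estimate_cost(m) for m in remaining)
            heapq.heappush(frontier, (priority, next(tie_breaker), new_cost,
                                      remaining,
                                      route + [(molecule, tuple(precursors))]))
    return None
```

With a heuristic that always returns zero this reduces to uniform-cost search; the returned route is only guaranteed to be the cheapest when the estimate never overestimates and is consistent, which a learned network does not automatically satisfy.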

Recently, Pasquini and Stenta30 introduced LinChemIn, a toolkit that simplifies the manipulation of reaction networks, extends the operations available on synthetic routes, and facilitates the interaction between AI and human expertise in chemical analysis.

For a comprehensive overview, we refer the reader to the reviews by Zhong et al.34 and Jiang et al.35

However, the problem of designing a chemical retrosynthesis is much more complex than removing potential biases from single-step retrosynthesis models or directing routes towards specific precursors. It requires knowledge, experience, and a degree of creativity and intuition that goes beyond the state of the art of existing retrosynthesis algorithms. As in strategy games, evaluating multi-step solutions requires holistic planning and can be carried out more effectively by considering a sequence of steps rather than focusing solely on individual steps. Therefore, relying solely on the reliability of a single-step model to devise synthetic routes may miss important pathways, leading to suboptimal or even erroneous predictions.

Our algorithm not only addresses strategic decision making, but also extends its impact to improve the efficiency of the separation step in multi-step synthesis plans. Unlike traditional approaches that rely only on single-step retrosynthetic models, our method introduces a strategy that assembles single-step predictions into a coherent retrosynthetic pathway.

By considering the entire sequence of steps, our approach provides a broader view of the synthesis process, including the final step where separation efficiency is critical.

In this study, we present an algorithm that emulates human strategic decision-making in AI-driven retrosynthesis. This computational technique guides the traversal of the retrosynthesis tree. Retrosynthesis trees are constructed from conventional single-step machine learning predictions while leveraging chemical knowledge from a collection of existing multi-step retrosyntheses. In doing so, the algorithm makes use of the expertise of human chemists, which is readily accessible through retrosyntheses published in the literature. The proposed method targets the task of efficiently assembling single-step retrosynthetic predictions and does not require retraining the retrosynthesis model, because it relies on existing pre-trained models. The algorithm represents a sequence of chemical steps using embeddings. The sequence of predictions is then compared with the sequences of steps in existing datasets to prioritize retrosynthesis strategies. To represent a single-step chemical reaction, we build on the work of Schwaller et al.,36 who used language model embeddings to construct chemical reaction fingerprints (rxnfp). Such reaction fingerprints capture structural and chemical properties such as reactants, products, reaction context, and stereochemistry. These embeddings have proven very successful in relating chemical reactions to specific reaction classes,37 in predicting reaction yields,38 and even in discovering a new Heck reaction.39

Here we extend the concept of chemical reaction fingerprints to retrosynthetic routes and represent the sequence of steps in a published retrosynthesis as a multidimensional string in fingerprint space. The core idea of the proposed algorithm is to construct the retrosynthesis tree by growing a string that minimizes the distance between the predicted string and each section of an existing multidimensional string in the embedding space. This comparison can be extended to more complex scenarios in which the retrosynthesis is not a linear trajectory but a tree, described by the corresponding branching structure in fingerprint space.
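The string-growing idea can be illustrated with a small greedy sketch. It makes several assumptions that are not taken from the source: fingerprints are plain tuples of floats (real rxnfp vectors are much higher-dimensional), the distance between two strings is the mean Euclidean distance between corresponding steps, the candidates for each step are given up front rather than depending on earlier choices, and all function names are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two fingerprint vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def string_distance(predicted, reference_section):
    """Mean step-by-step distance between two equally long strings."""
    pairs = list(zip(predicted, reference_section))
    return sum(euclidean(p, r) for p, r in pairs) / len(pairs)


def grow_string(candidates_per_step, reference_strings):
    """Greedily grow a string of reaction fingerprints.

    candidates_per_step: list (one entry per step) of lists of
                         (label, fingerprint) candidate steps.
    reference_strings:   fingerprint sequences of known retrosyntheses.
    Returns the chosen labels and the grown string.
    """
    labels, grown = [], []
    for candidates in candidates_per_step:
        best_label, best_fp, best_score = None, None, math.inf
        for label, fingerprint in candidates:
            trial = grown + [fingerprint]
            # Compare the trial string with the matching section of every
            # reference that is long enough; keep the closest match.
            score = min(
                (string_distance(trial, ref[:len(trial)])
                 for ref in reference_strings if len(ref) >= len(trial)),
                default=math.inf,
            )
            if best_label is None or score < best_score:
                best_label, best_fp, best_score = label, fingerprint, score
        labels.append(best_label)
        grown.append(best_fp)
    return labels, grown
```

A greedy choice at each step is the simplest option; keeping several partial strings alive, as in a beam search, would be a natural alternative.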

This method performs better at finding retrosyntheses with fewer steps and at planning the protection and deprotection of functional groups over the entire length of the synthesis. The proposed approach makes better use of the variety of available reactions, favoring steps that can proceed under milder reaction conditions and avoiding the need for harsh reagents such as organometallics and strong oxidants. In the results section, we provide some applications that demonstrate the potential of this methodology.


