Performance and robustness of small molecule retention time prediction using molecular graph neural networks in industrial drug discovery campaigns.

Our study is based on 7552 compounds representing a diverse set of chemicals accumulated over several years of drug discovery campaigns. This dataset differs in its characteristics from the public benchmark dataset METLIN SMRT (Figure 1), which was a milestone in RT prediction and enabled improved solutions to RT prediction tasks.^{8, 9, 10, 11, 12, 13}

Although the importance of large public datasets should not be underestimated, such datasets have implicit biases and limitations that can reduce transferability when models are later trained on non-standard datasets. It is important to be aware that this can lead to a decline in model performance.¹⁴

We trained a series of models in combination with three sets of descriptors: (i) extended connectivity fingerprints (ECFPs), a set of binary substructure-based features that represent the presence or absence of distinct chemical substructures within a molecule; (ii) a set of 200 RDKit descriptors (i.e., a wide range of calculated physicochemical properties) from the DeepChem Python library; and (iii) ChemAxon LogD values calculated over a range of pH values. Calculated LogD has previously been shown to correlate well with RT.¹⁵,¹⁶

Four model types were considered: (1) XGBoost³, a gradient-boosted tree model; (2) AttentiveFP, a molecular graph neural network with an attention mechanism;¹⁷ (3) a fully connected neural network (FCNN); and (4) ChemProp, a molecular graph neural network based on directed message passing.⁴ XGBoost and ChemProp were each combined with the three descriptor sets: ECFP4, RDKit descriptors, and LogD. AttentiveFP relies only on the molecular graph representation and cannot use additional descriptors. Furthermore, we included the FCNN that Domingo-Almenara et al. applied to the METLIN SMRT dataset in their original report.⁸

Models were evaluated by five-fold cross-validation with hyperparameter optimization; the results are reported in Tables 1 and 2. The molecular graph neural network models (AttentiveFP and ChemProp) outperformed XGBoost and FCNN. The best-performing model under this validation scheme was ChemProp combined with RDKit descriptors.
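As a rough illustration of the five-fold cross-validation mentioned above, the following sketch generates train/validation index splits in plain Python. The function name `k_fold_indices`, the random shuffling, and the seed are assumptions made for illustration; the source does not describe how the folds were actually constructed.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_indices, validation_indices) for each of k folds.

    Hypothetical helper: fold construction in the study is not specified,
    so a simple shuffled split is assumed here.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Striding by k gives k folds whose sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation


# 7552 is the number of compounds reported in the text.
splits = list(k_fold_indices(7552, k=5))
for fold_number, (train, validation) in enumerate(splits, start=1):
    print(fold_number, len(train), len(validation))
```

Each compound appears in exactly one validation fold, so every model is scored on data it did not see during that fold's training.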

Table 1 Performance of common models.

Table 2 Statistical post-hoc tests, multiple comparisons of RT prediction models.

Individual drug discovery campaigns typically navigate different chemical spaces and explore a range of chemicals based on hit matter identified through different methods (e.g., screening DNA-encoded libraries). This can be a challenge for ML models, as the historical data on which they are trained can differ significantly from the chemical space currently being explored. To be of practical use in drug discovery campaigns, models need to generalize well to such unseen chemical spaces. To address this time-dependent performance decay, we next tested the robustness of our models using temporally partitioned data (rather than scaffold partitioning). The data were sorted by acquisition date and divided in half: the first half (T0) was used to train the models, and the second half was further divided into 10 equal bins (T1-T10), which represent the shift in the chemistry of interest over time and the corresponding reduction in chemical similarity to the training data (Figure 2). This training regime closely reflects the changing priorities and interests of ongoing drug discovery projects, where the focus moves to new targets and new chemical series with different properties.

Again, the molecular graph-based models (ChemProp and AttentiveFP) outperformed XGBoost and FCNN (Figure 3). In particular, ChemProp combined with RDKit descriptors appears to be very robust over time (Figure 3a,b). The RT model based on ChemProp with RDKit descriptors is therefore both accurate and robust for RT prediction tasks.
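The temporal partitioning described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the record layout (a dict with an `acquired` date), the function name `temporal_split`, and the handling of a remainder when the second half does not divide evenly are all assumptions.

```python
from datetime import date


def temporal_split(records, n_bins=10):
    """Sort by acquisition date; first half is T0, the rest is cut into n_bins bins."""
    ordered = sorted(records, key=lambda r: r["acquired"])
    half = len(ordered) // 2
    train, later = ordered[:half], ordered[half:]
    # Assumption: spread any remainder over the earliest bins so that
    # bin sizes differ by at most one compound.
    base, extra = divmod(len(later), n_bins)
    bins, start = [], 0
    for i in range(n_bins):
        size = base + (1 if i < extra else 0)
        bins.append(later[start:start + size])
        start += size
    return train, bins


# Toy records with made-up acquisition dates, deliberately out of order.
records = [
    {"id": i, "acquired": date(2015 + i // 12, i % 12 + 1, 1)} for i in range(40)
][::-1]

t0, test_bins = temporal_split(records)
print(len(t0), [len(b) for b in test_bins])
```

A model would then be trained once on `t0` and scored separately on each of `test_bins[0]` (T1) through `test_bins[9]` (T10) to follow how the error changes with time.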

Next, we investigated the applicability of this approach in a broader context by predicting RT for the METLIN SMRT dataset. The METLIN SMRT dataset differs significantly from ours in terms of both chemical diversity and chromatography system (Figure 1). Figure 4 shows the relationship between actual and predicted RT for a ChemProp model (using RDKit descriptors) trained on the METLIN SMRT dataset. The ChemProp model predicted RT with a mean absolute error (MAE) of 38.7 s, a root-mean-square error (RMSE) of 67.50 s, and R² = 0.84, which is comparable to recently reported MAE values of 34-39 s.^{9, 10, 11, 12, 13} It is important to note that our evaluation is based only on chromatographically retained compounds in the METLIN SMRT dataset.
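For reference, the three reported metrics can be computed as in the sketch below. The function name `regression_metrics` and the toy retention times are illustrative assumptions, and R² is taken here to be the coefficient of determination; the source does not state which definition of R² was used.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return (MAE, RMSE, R^2) for paired observed and predicted values."""
    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    ss_res = sum(e * e for e in errors)
    rmse = math.sqrt(ss_res / n)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    # Assumption: R^2 as the coefficient of determination, 1 - SS_res / SS_tot.
    r2 = 1.0 - ss_res / ss_tot
    return mae, rmse, r2


# Made-up retention times in seconds, for illustration only.
observed = [100.0, 200.0, 300.0, 400.0]
predicted = [110.0, 190.0, 320.0, 380.0]
mae, rmse, r2 = regression_metrics(observed, predicted)
print(f"MAE={mae:.1f} s, RMSE={rmse:.2f} s, R2={r2:.2f}")
```

Because RMSE squares each error before averaging, it is never smaller than MAE and grows faster when a few predictions are far off, which is consistent with the gap between the 38.7 s MAE and 67.50 s RMSE quoted above.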
