Performance and robustness of small molecule retention time prediction using molecular graph neural networks in industrial drug discovery campaigns.

Machine Learning


Our study is based on 7552 compounds representing a diverse set of chemicals accumulated over several years of drug discovery campaigns. This dataset has different characteristics from the public benchmark dataset METLIN SMRT (Figure 1), which was a milestone in retention time (RT) prediction and has facilitated improved solutions to RT prediction tasks [8-13].

Figure 1

Kernel density estimates of the distribution of observations in two datasets: Amgen's proprietary dataset and the public METLIN SMRT dataset. The retention times and five calculated descriptors are shown to illustrate the differences between the two datasets.

Although the importance of large public datasets cannot be overstated, such datasets have implicit biases and limitations that can reduce transferability when models are later trained on non-standard datasets. It is important to be aware that this can lead to a decline in model performance [14].

We trained a series of models in combination with three sets of descriptors. The first set is extended connectivity fingerprints (ECFPs), binary substructure-based features that represent the presence or absence of distinct chemical substructures within a molecule. The second is a set of 200 RDKit descriptors (i.e., a wide range of calculated physicochemical properties) from the DeepChem Python library. The third is ChemAxon LogD calculated over a range of pH values; calculated LogD has previously been shown to correlate well with RT [15, 16]. Four model types were considered: (1) XGBoost [3], a gradient-boosted tree model; (2) AttentiveFP, a molecular graph neural network with an attention mechanism [17]; (3) a fully connected neural network (FCNN); and (4) ChemProp, a molecular graph neural network based on directed message passing [4]. XGBoost and ChemProp were each combined with the three descriptor sets: ECFP4, RDKit descriptors, and LogD. AttentiveFP relies only on the molecular graph representation and cannot utilize additional descriptors. Furthermore, we included the FCNN from the original report by Domingo-Almenara et al., which was applied to the METLIN SMRT dataset [8]. Model evaluation was performed using five-fold cross-validation with hyperparameter optimization and is reported in Tables 1 and 2. The molecular graph neural network models (AttentiveFP and ChemProp) outperformed XGBoost and the FCNN. The best-performing model under this validation scheme was ChemProp combined with RDKit descriptors.

Table 1 Performance of common models.
Table 2 Statistical post-hoc tests, multiple comparisons of RT prediction models.
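
The five-fold cross-validation used for model evaluation can be illustrated with a minimal sketch. This is not the authors' pipeline: the function name `k_fold_indices`, the shuffling seed, and the way folds are assigned are assumptions made for illustration, and hyperparameter optimization is left out.

```python
import random

def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_idx, valid_idx) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    # Shuffle once so that folds are not tied to the original row order.
    random.Random(seed).shuffle(indices)
    # Striding distributes samples so fold sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        valid = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, valid

# 7552 is the number of compounds reported for the proprietary dataset.
splits = list(k_fold_indices(7552, k=5))
```

Each compound appears in exactly one validation fold, so every model is scored on data it did not see during training in that fold.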

Individual drug discovery campaigns typically navigate different chemical spaces and explore a range of chemicals based on hit matter identified through different methods (e.g., screening DNA-encoded libraries). This can be a challenge for ML models, as the historical data on which they are trained can differ significantly from the chemical space currently being explored. To be of practical use in drug discovery campaigns, models need to generalize well to such unseen chemical spaces. To address this time-dependent performance decay, we next tested the robustness of our models by training them on temporally partitioned data (rather than scaffold-partitioned data). To do this, we designed a new training regime that splits the data according to acquisition time. The data are sorted by acquisition date and divided in half: the first half (T0) is used to train the model, and the second half is further divided into 10 equal bundles (T1-T10) that represent the change in the chemistry of interest over time and the decrease in chemical similarity to the training data (Figure 2). This training regime closely reflects the changing priorities and interests of ongoing drug discovery projects, where the focus shifts to new targets and new chemical series with different properties. Again, the molecular graph-based models (ChemProp and AttentiveFP) outperformed XGBoost and the FCNN (Figure 3). In particular, ChemProp combined with RDKit descriptors appears to be very robust over time (Figure 3a,b). The RT model based on ChemProp with RDKit descriptors is therefore both accurate and robust for RT prediction tasks.
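
The temporal partitioning described above can be sketched as follows. The record layout (a dict with an `acquired` date), the function name `temporal_split`, and the toy data are hypothetical; only the procedure (sort by acquisition date, first half as T0, second half cut into 10 equal bundles) comes from the text.

```python
from datetime import date, timedelta

def temporal_split(records, n_bundles=10):
    """Sort by acquisition date; the first half is T0, the rest is cut into bundles."""
    ordered = sorted(records, key=lambda r: r["acquired"])
    half = len(ordered) // 2
    t0, later = ordered[:half], ordered[half:]
    # Evenly spaced boundaries; bundle sizes differ by at most one.
    bounds = [i * len(later) // n_bundles for i in range(n_bundles + 1)]
    bundles = [later[bounds[i]:bounds[i + 1]] for i in range(n_bundles)]
    return t0, bundles

# Toy data: 40 hypothetical compounds, deliberately listed in reverse date order.
records = [
    {"id": i, "acquired": date(2020, 1, 1) + timedelta(days=i)}
    for i in reversed(range(40))
]
t0, bundles = temporal_split(records)
```

Because the split is chronological, every compound in T1 to T10 was acquired after every compound in T0, which mimics predicting RT for chemistry that did not yet exist when the model was trained.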

Figure 2

Nearest-neighbor Tanimoto similarity of each compound in each time split (T1 to T10) to the T0 training dataset. Tanimoto similarity was calculated from ECFP4 fingerprints (1024 bits).
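
As a rough illustration of the quantity plotted in Figure 2, the sketch below computes nearest-neighbor Tanimoto similarity between a later time split and the training set. Fingerprints are represented here as plain Python sets of on-bit indices, which is an assumption made to keep the example dependency-free; the function names and toy fingerprints are hypothetical and do not come from the paper.

```python
def tanimoto(a, b):
    """Tanimoto similarity between two fingerprints given as sets of on-bits."""
    if not a and not b:
        return 0.0  # Convention chosen here for two empty fingerprints.
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)

def nearest_neighbor_similarity(query, training):
    """Highest Tanimoto similarity of each query fingerprint to any training one."""
    return [max(tanimoto(q, t) for t in training) for q in query]

# Hypothetical on-bit sets standing in for 1024-bit ECFP4 fingerprints.
t0_fps = [{1, 5, 9, 12}, {2, 5, 7}]
t1_fps = [{1, 5, 9}, {3, 4}]
sims = nearest_neighbor_similarity(t1_fps, t0_fps)
```

A low nearest-neighbor similarity means the compound has no close analog in the training data, which is where a decline in prediction quality would be expected.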

Figure 3

Performance of RT models trained on the T0 time split and evaluated on the 10 subsequent time splits (T1 to T10). Panels (a) and (b) show model performance at each time split, with the time split on the x-axis and R² and MAE (seconds) on the y-axis, respectively. Panels (c) and (d) show boxplots comparing the models aggregated across all time splits, with the model on the x-axis and R² and MAE (seconds) on the y-axis, respectively.

Next, we investigated the applicability of the approach in a broader context by predicting RT for the METLIN SMRT dataset. The METLIN SMRT dataset differs significantly from ours in terms of both chemical diversity and chromatography system (Figure 1). Figure 4 shows the relationship between actual and predicted RT for a ChemProp model (using RDKit descriptors) trained on the METLIN SMRT dataset. The ChemProp model predicted RT accurately, with a mean absolute error (MAE) of 38.7 s, a root mean squared error (RMSE) of 67.50 s, and R² = 0.84, which is comparable to recently reported MAE values of 34-39 s [9-13]. It is important to note that our evaluation is based only on chromatographically retained compounds in the METLIN SMRT dataset.
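
For reference, the three reported metrics can be computed as in the sketch below. It assumes R² is the coefficient of determination (the text does not say whether a squared correlation coefficient was used instead), and the function name and toy values are illustrative only, not data from the study.

```python
import math

def regression_metrics(actual, predicted):
    """Return (MAE, RMSE, R^2) for paired actual and predicted values."""
    n = len(actual)
    errors = [p - a for a, p in zip(actual, predicted)]
    mae = sum(abs(e) for e in errors) / n
    ss_res = sum(e * e for e in errors)
    rmse = math.sqrt(ss_res / n)
    mean_actual = sum(actual) / n
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    # Coefficient of determination: 1 minus residual over total sum of squares.
    r2 = 1.0 - ss_res / ss_tot
    return mae, rmse, r2

# Toy retention times in seconds (made-up values).
actual = [100.0, 200.0, 300.0, 400.0]
predicted = [110.0, 190.0, 320.0, 380.0]
mae, rmse, r2 = regression_metrics(actual, predicted)
```

RMSE penalizes large errors more heavily than MAE, so an RMSE well above the MAE, as in the reported 67.50 s versus 38.7 s, suggests that a minority of compounds are predicted with comparatively large errors.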

Figure 4

ChemProp model with RDKit descriptors trained on the METLIN SMRT dataset. Scatterplot showing predicted RT (in seconds) versus actual RT (in seconds) for compounds retained in the test split.


