Deriving general structure–activity/selectivity relationship patterns for different subfamilies of cyclin-dependent kinase inhibitors using machine learning methods

Machine Learning


Classification of compounds based on CDK inhibition activity

To discover the SAR patterns within the active and inactive sets of CDK1, CDK2, CDK4, CDK5, and CDK9 molecules, we developed five classification models using KNN, SVM, SKN and CPANN methods. These models classify the molecules into active and inactive groups and were constructed using the VIP-selected descriptors, effectively capturing the wide range of chemical diversity present in the dataset. The full names, brief definitions, and the subtypes of these descriptors are presented in Supplementary Tables S2–S6. The mean and standard deviation values within the active and inactive groups together with the Mann–Whitney p-values, for the selected descriptors are included in these Tables, as well. Most of the calculated p-values are below 0.05 and it shows the effectiveness of the VIP method in selecting discriminatory variables for developing binary active/inactive classifiers.

The statistical results for classification of molecules into active and inactive groups using CPANN and SKN models, along with the optimal parameters identified during the training, cross-validation, and testing processes, are concisely presented in Table 1. It is noteworthy that the statistical results for both CPANN and SKN methods are comparable and there is no superiority among them. The confusion matrices, applicability domains of the active/inactive classification models, and y-randomization results are presented in Tables S7, S8 and S9, respectively. The CPANN and SKN top maps together with the ROC curves and AUC values for five active/inactive classification models are shown in Supplementary Figs. S3 and S4, respectively. As can be seen, the similar molecules of the same activity classes were grouped together in the top maps and formed coherent classes. The trained SKN and CPANN models demonstrated logical separation between molecules based on their activity levels. High values of AUC were obtained (more than 0.87) which indicates the meaningful power of the developed models for discerning active molecules from inactive ones. It also shows the efficacy of the selected descriptors for building models. Moreover, the results on the ADs (see Table S8), together with the cross-validation, and the performance on an external validation test set collectively demonstrate the reliability of the predictions generated by the models in this section.

Table 1 The statistical results for the developed binary active/inactive classifiers for different CDK groups. The models were evaluated in terms of sensitivity, specificity, precision, accuracy, and Mathew-correlation coefficient (MCC).

As shown in Table S9, the error rates observed during the y-randomization test for both CPANN and SKN model are notably high across the training, cross-validation, and test sets. Also, the average AUC values for y-randomization test have decreased compared to the original models. This finding substantiates that the favorable outcomes attained by these models do not stem from fortuitous correlation, thus affirming the reliability of the developed models. According to the aforementioned results, the descriptors chosen utilizing the VIP method could proficiently represent the structural characteristics of molecules in correlation to their activity levels. To visualize the abstract space represented by the VIP-selected descriptors for active/inactive models, principal component analysis (PCA) and t-distributed stochastic neighbor embedding (t-SNE) algorithms were conducted, and the results are presented in Supplementary Fig. S5. The first three PCs explain 33.1%, 28%, 47.5%, 32.6%, and 34.5% of the total variance of the selected descriptors, for CDK1, CDK2, CDK4, CDK5, and CDK9 groups, respectively. The first three principal components account for approximately 30% of the total variance in the data and reveal significant clustering patterns within this level of explained variance. The inspection of Supplementary Fig. S5a–e shows an acceptable distinction between active and inactive molecules in PC spaces. The dispersions of the molecules in the tSNE spaces show similar separation patterns. This separation is prominently evident in both PC and tSNE plots for the CDK2, CDK5, and CDK9 groups. The results obtained from these plots are consistent with the findings from the CPANN and SKN models, confirming that the selected descriptors encode valuable information about the activity levels of the molecules. For further validation, comparisons were made with other linear and non-linear models, including KNN and SVM, and the results are provided in the supplementary Table S10. It can be observed that the results obtained with SKN and CPANN surpass those of the KNN and SVM models.

Common VIP descriptors in active/inactive classification models

As mentioned in the previous sections, we utilized the VIP approach to choose descriptors for the classification of active and inactive molecules within the five groups of CDK molecules. The presence of common descriptors across the discussed models implies their significance for describing CDK inhibitory activity. A Venn diagram showing common molecular descriptors for classifying active and inactive molecules for five groups of CDK molecules is shown in Fig. 1. The list of common descriptors between five active/inactive models are shown in Supplementary Table S11. As can be seen in Fig. 1 and Table S11, two molecular descriptors of Hy and nRCONHR appeared in active/inactive models for three groups of CDK1, CDK2, and CDK5. Hy represents the hydrophilic factor. This descriptor is defined by the following equation53:

$$Hy = \frac{{(1 + N_{Hy} ) \cdot log_{2} (1 + N_{Hy} ) + N_{C} \cdot \left( {\frac{1}{A} \cdot log_{2} \frac{1}{A}} \right) + \sqrt {\frac{{N_{Hy} }}{{A^{2} }}} }}{{log_{2} (1 + A)}}$$

(5)

where \(N_{Hy}\) is the number of hydrophilic groups (–OH, –SH, –NH). \(N_{C}\) is the number of carbon atoms and A is the number of hydrogen atoms53. The nRCONHR is the number of secondary amides (aliphatic)54. To gain a more comprehensive understanding of the distribution of the selected descriptors among different groups of active and inactive molecules, density, box, and beeswarm plots were employed. The distributions of the values of Hy across active and inactive molecules within the CDK1, CDK2, and CDK5 groups are shown in Fig. 2a–c. Upon examining this figure and Tables S2, S3, and S5, it becomes evident that the average values of Hy for the active molecules are higher than those of the inactive molecules for three CDK targets. This observation suggests that the increasing the value of this descriptor tends to shift molecules from an inactive to an active state. Qian et al. also reported in their recent study that an increase in this feature (Hy) is related to the activity of CDK inhibitors16. Figure 2d–f vividly illustrates the distribution of the values of nRCONHR within the active and inactive molecules for three groups of CDK1, CDK2, and CDK5 isoforms. Examining this figure and Tables S2, S3, S5 shows that average values of nRCONHR for active molecules are significantly higher than inactive molecules for the three mentioned targets. This observation suggests that the hydrophilic factor and the number of aliphatic amides of the second type may provide insights into designing active CDK1, CDK2, and CDK5 inhibitors. The molecular descriptor of C-043 appeared in three active/inactive models for CDK1, CDK4, and CDK9 groups. The C-043 shows the atom-centered fragment X–CR..X, where R shows any group linked through carbon, and X demonstrates any electronegative atom such as O, N, S, P, Se, and halogen55. In this notation, “–” denotes a single bond connecting the atom-centered fragment X with the central carbon atom (C), while “..” indicates a bond involving a delocalized or unspecified number of atoms or groups between the two X atoms46,53. Figure 2g–i illustrate the distribution of C-043 for CDK1, CDK4, and CDK9 active and inactive molecules. The average values of the C-043 descriptor are higher for active CDK4 and CDK9 inhibitors compared to their inactives, while the average value of this descriptor for the inactive CDK1 molecules is higher than those for active ones (see Fig. 2g–i and Tables S2, S4, and S6). This result suggest that this specific descriptor is likely to be effective in the design of selective CDK4 and CDK9 inhibitors. The descriptor nBnz, represents any aromatic rings, including both carbocyclic benzene and heterocyclic aromatic systems such as thiophene, pyridine, pyrazole, and others53. This descriptor was present in both active/inactive classification models for CDK4 and CDK5 isoforms. Examination of Fig. 3a–c and Tables S4 and S5 reveals that the numerical values of the nBnz are excessively high for inactive molecules of CDK4 and CDK5. Moreover, the mean values of this descriptor are comparatively lower for active inhibitors of CDK4 and CDK5 compared to inactive molecules. It suggests that increasing nBnz values may inversely impact the inhibition of CDK4 and CDK5 active molecules. In a recent research on CDK2 inhibitors, a benzene ring structure selected from Morgan fingerprints, showed a negative contribution to the model performance based on SHAP values for explaining the activity16. This emphasizes on the general importance of Benzene rings on the activity state of the CDK inhibitors and suggest lower benzene rings for better inhibition effect. The molecular descriptors TPSA(Tot), O-060, and nArOR are common between the features of CDK2 and CDK5 active/inactive models. TPSA(Tot) indicates the distribution of topological polar surface area using polar contributions N, O, S, P56. Figure 3d–f show the distribution of TPSA(Tot) among the CDK2 and CDK5 active and inactive molecules. The significantly higher numerical values of TPSA(Tot) in active CDK2 and CDK5 molecules suggest that this descriptor positively influences the activity of molecules for inhibiting these targets. Notably, a recent study had identified TPSA as an influential parameter for describing the activity of CDK2 inhibitors16. They showed that the TPSA, with an average positive SHAP value had a positive impact on the predicted outcome in the random forest (RF) and Deep Neural Network (DNN) models for modeling the activity of CDK2 inhibitors16. This is in harmony with our finding on higher average values of this descriptor within the active groups of CDK2 and CDK5 groups, compared to inactive ones. The density plot, box plot, and beeswarm plots of the distributions of the other common descriptors are shown in Figs. S6–S12 in the supplementary material.

Figure 1
figure 1

The Venn diagram of the VIP-selected descriptors used for developing active/inactive classifiers for five groups of CDK molecules: CDK1: blue, CDK2: crimson, CDK4: green, CDK5: yellow, CDK9: brown.

Figure 2
figure 2

The density plot, box plot, and beeswarm plot (ac) for Hy molecular descriptor for the active and inactive groups of CDK1, CDK2, and CDK5 molecules (df) for nRCONHR molecular descriptor for the active and inactive groups of CDK1, CDK2, and CDK5 molecules (gi) for C-043 molecular descriptor for the active and inactive groups of CDK1, CDK4, and CDK9 molecules.

Figure 3
figure 3

The density plot, box plot, and beeswarm plot (ac) for nBnz molecular descriptor for the active and inactive groups of CDK4, and CDK5 molecules (df) for TPSA(Tot) molecular descriptor for the active and inactive groups of CDK2 and CDK5 molecules.

To compare the ten groups of active and inactive molecules for CDK1, CDK2, CDK4, CDK5, and CDK9, the VIP-selected molecular descriptors were pooled from all five active/inactive classifiers, leading to the identification of a common set of 77 descriptors. The molecular descriptors for the ten groups were collected and scaled together. Scaling the data for all molecules together ensures a valid visualization since the molecules in newly scaled space can be compared together. The two- and three-dimensional PCA score and tSNE plots are shown in Supplementary Figs. S13 and S14, respectively. The first three PCs only explains 15.61% of the variance in the entire dataset, indicating a relatively low percentage of the variance coverage. A general view of the PCA plots shows overlapping between majority of the groups but detailed inspection exhibited reasonable clustering for some specific classes, including active molecules of CDK4, CDK5 and CDK9. In terms of the tSNE plots, the active CDK2, CDK4, and CDK9 groups show some distinct clustering patterns. The other groups of molecules do not exhibit suitable clustering patterns in both PCA and tSNE maps. This observation suggests a need for developing a distinct multiclass classifier for discriminating between active CDK inhibitors. This will lead to find the specific subspaces in chemical space in which the molecules will be clustered there based on their therapeutic targets. This modeling approach is discussed in subsequent section “Classification of compounds based on therapeutic targets”.

Classification of compounds based on therapeutic targets

In this section, 5011 active inhibitors were classified into five groups based on their therapeutic targets. This data includes 773, 1440, 875, 650 and 1273 active molecules for CDK1, CDK2, CDK4, CDK5, and CDK9 targets. Common compounds were removed between different groups of active molecules, and the models were developed based on molecules uniquely associated with each target. We classified active inhibitors based on their therapeutic targets and developed four multi-class classification models using KNN, SVM, CPANN and SKN approaches. To construct these models, 31 descriptors were selected using the VIP method. Tables S12, S13, and S9 provide a comprehensive overview of the statistical results, confusion matrix, AD, and y-randomization results for the CPANN and SKN models. The result in Table S8 shows that most molecules are within the ADs of the models indicating the reliability of the prediction made by these models. Both SKN and CPANN classification methods achieved prediction accuracy values exceeding 0.80. Detailed inspection of the results shows that the SKN model achieved better accuracies compared to the CPANN model. The SKN top map shown in Fig. S15 illustrates the optimized SKN model for classifying active inhibitors. As can be seen in this figure, the active molecules within each class fall within the same group and all five groups show strong clustering patterns on the map. For further evaluation, the results of SKN and CPANN models were compared with those of KNN and SVM approaches (see Table S14). This shows superior performance of SKN and CPANN compared to those of KNN and SVM approaches.

The abstract PCA space built using 31 VIP-selected descriptors is presented in Fig. S16. The first three PCs cover 24.1% of the whole variance of the selected descriptors. Since there are five groups of active molecules, it is hard to clearly see how they cluster in a three-dimensional score plot. To achieve clearer visualization, one-against-one PCA score plots for different active CDK inhibitors are provided in Fig. 4. Each section of this figure reveals clear separation between active CDK inhibitors in the subspace made by VIP-selected descriptors. Moreover, the tSNE plot using 31 descriptors is shown in Fig. S17. The active CDK4, CDK2 and CDK9 make clear clusters in the tSNE space of the 31 selected descriptors and a relative meaningful cluster is also seen for CDK1 group. The results of PCA and tSNE analysis on the selected descriptors suggest that the developed abstract space maps (i.e. PCA and tSNE) can be used for future drug design and synthesis projects as guiding tools to help the medicinal chemists focusing on compounds projected on specific cluster with desirable selectivity toward specific CDK targets.

Figure 4
figure 4

One-against-one visualization of the abstract PC space made by the 31 VIP-selected molecular descriptors used for multiclass classification of active CDK molecules (Dark blue: active CDK1, magenta: active CDK2, light blue: active CDK4, gold: active CDK5, orange red: active CDK9).

VIP selected descriptors in active classification models

The set of 31 VIP-selected descriptors for classification of active CDK inhibitors is provided in Table S15. This table comprises the full names, concise definitions, subtypes, mean values, standard deviations, and Kruskal–Wallis test p-values for these descriptors across five groups of active CDK inhibitors. Notably, all calculated p-values are below 0.05, showing the significant differences of the selected descriptors between five CDK groups. Among the 31 descriptors, the three descriptors with highest VIP values are described here and the distribution of the remaining features are presented in Supplementary Fig. S18. The first descriptor, Ms, represents the mean electrotopological state, which is associated with the polarity of atoms and steric accessibility. The distributions of this molecular descriptor in active inhibitors are shown in Fig. 5a–c. As can be seen in this figure, the distribution of the Ms for CDK4 group is skewed to lower values. Moreover, the distribution of this descriptor for CDK2 and CDK5 are inclined to higher values. This suggests that a molecule with high Ms value tends to potentially inhibit the CKD2 and CDK5 targets, while it is being less likely to inhibit a CDK4 target. As can be seen in the box plot (Fig. 5b), CDK2 and CDK9 have higher median values and CDK1 and CDK2 have wider interquartile ranges, indicating a greater variability in Ms values. The beeswarm plot (Fig. 5c) complements the box plot by displaying individual data points, avoiding overlap, and providing a clearer view of the data distribution. The next important descriptor is TPSA(Tot). The distributions of this descriptor across the active CDK inhibitors are shown in Fig. 5d–f. The value of this molecular descriptor is meaningfully higher for CDK5 inhibitor compared to other groups. Moreover, the distribution of this descriptor for CDK9 group is intensely skewed to lower values. This suggests that increasing and decreasing the value of TPSA(Tot) can potentially induce selectivity toward CDK5 and CDK9 targets, respectively. The next important descriptor is nCbH, representing the number of unsubstituted benzene rings. The nCbH descriptor exhibits higher values for the active CDK2 group compared to other groups. The positive correlation between this descriptor and activity highlights its significance in enhancing the effectiveness of the CDK2 active group, making it valuable for the design of selective drugs. The CDK2 and CDK4 model were further evaluated to assess the SAR patterns identified in this study. The detailed results of this evaluation are provided in Appendix B in the supplementary material.

Figure 5
figure 5

The density plot, box plot, and beeswarm plot (ac) for Ms (df) for TPSA(Tot) (gi) for nCbH molecular descriptor(s) within five different groups of active CDK inhibitors.

Two distinct reports about the design of selective CDK inhibitors were found in the literature authored by Qian et al.16 and Liang et al.17. The former study described the BiLAT framework and compared it with machine learning models, including random forest (RF), SVM, and XGBoost, as well as deep learning models such as DNN, and Attentive FP16. Their work involved a dataset comprising 17,986 compounds targeting CDK1-2, CDK4, and CDK 5–9, sourced from PubChem, ChEMBL, and Binding DB16. They further improved model performance through data augmentation16. The study focused on 20 descriptors for CDK2, including hydrophobic and hydrophilic effects (SlogPVSAk), TPSA, and benzene ring structure (Morgan_389 and Morgan_1199)16. The research employed interpretation tools like self-attention module and SHAP method to visualize and elucidate key substructures affecting CDK inhibitor activity16. The results showed that the multitask BiLAT model outperformed the single-task version, with data augmentation proving beneficial for both single-task and multitask models16. In the later work, the primary objective was to rapidly identify selective CDK4/6 kinase inhibitors featuring a novel structural framework by ensuring effective target specificity while minimizing off-target effects17. The authors employed a multitask model, comprising a GENERATOR and a PREDICTOR, to create and evaluate potential compounds for their selectivity toward CDK4/6 kinase17. Using a dataset of 1000 known CDK4/6 inhibitors, the researchers trained and tested their models, achieving impressive results17. The study employed various algorithms, including support vector classification (SVC), RF, naive Bayes (NB), KNN, and multi-layer perceptron (MLP), within the PREDICTOR model to evaluate the potential activity of generated compounds17. Notably, the study identified seven molecules with high selectivity scores, highlighting the potential of this method to efficiently generate novel CDK4/6 inhibitors and aiding researchers in early-stage drug discovery17. Some VIP-selected descriptors in our work also appeared in previous studies (i.e. TPSA, benzene ring structure, hydrophobicity) emphasizing the invaluable roles of these parameters for describing activity/selectivity of CDK inhibitors. The major novelty of our work compared to previous studies relies on development of the abstract chemical spaces using selected descriptors as visual activity/selectivity maps. The molecules with different activity levels and therapeutic targets were distinctly clustered in these maps and it made it much easier to interpret the developed multivariate models. Moreover, we distinctly compared the distribution of the important descriptors across different groups of molecules using non-parametric significant tests, which was not achieved before. The activity/selectivity maps together with significant tests provided versatile tools for interpreting SKN and CPANN models in this work. Furthermore, our work explored the practical usage of the developed models for virtual screening of a random subset of PubChem database including ~ 2 million molecules. Screening of such a large database was not reported by Qian et al. and Liang et al. in their reports. We also conducted an extensive comparison of SAR patterns within and between active and inactive CDK groups. The specific methodology and approach we employed, along with our thorough analysis of a substantial dataset, contributed significantly to the novelty and importance of our research in the field.

Virtual screening

Herein, we aim to demonstrate the practical applicability of the models developed in our study for the virtual screening of a large database. For this purpose, we randomly selected 2 million molecules from the PubChem database (see section “PubChem database”) which acted as the background database against which the active inhibitors were screened. The active inhibitors from each CDK group were added to the PubChem-R database, and the optimized classifiers were used to retrieve them by sorting the database.

The ROC curves for screening the PubChem-R database using the CPANN and SKN active/inactive classifiers are shown in Fig. S19a,b, respectively. Additionally, Table S16 provides the calculated EF and AUC values for these models. This performance is attributable to the fact that the majority of active CDK4 inhibitors are concentrated within the top 1% of the sorted database. The ROC curves for screening of PubChem-R database using SKN and CPANN models developed for discriminating active molecules are presented in Fig. S19c,d. The calculated EF and AUC values for these models are summarized in Table S17. The AUC, EF1%, and EF10% values for CDK4 and CDK9 groups are superior over other groups of active molecules. Figure 6 illustrates the distributions of the PubChem-R molecules and active CDK inhibitors in the chemical space made by distinct pairs of the VIP-selected molecular descriptors. Inspection of these figures reveals that the active CDK inhibitors are densely concentrated in specific regions of the selected chemical space of the PubChem database. As can be seen in Fig. 6, the active CDK4 molecules show higher values of topological distances between two nitrogen atoms (i.e. T(N..N)) in their structure compared to CDK2, CDK5, and CDK9 active molecules. Moreover, the values of the mean topological state (Ms) for CDK9 active molecules are significantly larger than those of CDK4 active molecules. These observations shed light on the distinct binding characteristics of CDK4 and CDK9, offering valuable information for drug design. The concept of “Chemography”, which involves exploring regions in chemical space occupied by molecules that bind to specific targets, was introduced in 2004 by Lipinski and Hopkins57. Figure 6 illustrates how this concept is achieved in this work where active CDK2, CDK4, CDK5 and CDK9 molecules are clustered in specific regions of the chemical space defined by VIP-selected descriptors. A significant difference exists between our results and the study reported by Qian et al.16 from the perspective of virtual screening. They assessed the model’s performance using Tox2158, SIDER59, and ClinTox60 databases, which consist of 7401, 1421, and 1377 compounds (10,199 in total)16, respectively. In contrast, our research involved virtual screening of an extensive dataset comprising 2,000,000 molecules randomly sourced from the PubChem database. To the best of our knowledge, this work is the first study to report the screening of a large database for retrieving active CDK inhibitors.

Figure 6
figure 6

The distribution and coordinates of the PubChem-R and active CDK molecules within the two-dimensional subspaces created by one-by-one comparison of four VIP-selected molecular descriptors (a) TPSA(Tot) versus T (N.. N), (b) TPSA(Tot) versus Ms, (c) TPSA(NO) versus T (N.. N), (d) TPSA(NO) versus T (N.. N).



Source link

Leave a Reply

Your email address will not be published. Required fields are marked *