AAGP integrates physicochemical and compositional features for machine learning-based prediction of anti-aging peptides


Amino acid composition and dipeptide composition

The amino acid compositions (AACs) for DS1_Main, DS2_Main, DS1_Indp, and DS2_Indp are illustrated in Fig. 2A,B. In general, the AACs of the main dataset (Fig. 2A) and the independent test dataset (Fig. 2B) are consistent. Anti-aging peptides are enriched in glycine residues compared to both random and antimicrobial peptides across the training and test datasets, consistent with previous literature findings13. The role of Gly in anti-aging has been reported and discussed in the literature; for example, glycine helps increase lifespans and improve health in both rodent and mammalian models by activating autophagy and mimicking methionine restriction53. In addition to Gly, the anti-aging peptides contain higher proportions of Gln, Met, and Pro than the random and antimicrobial peptides. These amino acids are associated with several discriminating dipeptide patterns, as illustrated in Supplementary Fig. S2. For example, the dipeptides MQ, GP, and PG are more abundant in positive sequences than in the two types of negative sequences.

The compositions of the branched-chain amino acids Leu and Ile are lower in the anti-aging peptides than in both random and antimicrobial peptides, as shown in Fig. 2. This is consistent with the depletion of II, IL, LI, and LL dipeptides in positive sequences compared to the two types of negative sequences (Fig. S2). Fig. 2 also shows that antimicrobial peptides and random peptides differ in the composition of certain amino acids, such as Asp, Glu, Lys, Gln, and Trp, suggesting that the machine learning models trained on DS1 and DS2 may learn different decision boundaries.
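As a minimal sketch of the two compositional descriptors discussed above, the following computes the amino acid composition and dipeptide composition of a single sequence. The function names and the example peptide are illustrative assumptions, not taken from the AAGP code, and the normalization (fractions over sequence length and over adjacent pairs) is the conventional one rather than a confirmed detail of the authors' implementation.

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

def amino_acid_composition(seq):
    """Fraction of each standard residue in seq."""
    counts = Counter(seq)
    return {aa: counts.get(aa, 0) / len(seq) for aa in AMINO_ACIDS}

def dipeptide_composition(seq):
    """Fraction of each of the 400 ordered pairs among adjacent positions."""
    pairs = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    total = max(len(seq) - 1, 1)  # avoid division by zero for length-1 input
    return {a + b: pairs.get(a + b, 0) / total
            for a, b in product(AMINO_ACIDS, repeat=2)}

peptide = "GPGMQGPL"  # made-up sequence, for illustration only
aac = amino_acid_composition(peptide)
dpc = dipeptide_composition(peptide)
print(round(aac["G"], 3), round(dpc["GP"], 3))
```

Averaging these per-sequence vectors over the positive set and over each negative set would give per-residue comparisons of the kind plotted in Fig. 2.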

Fig. 2

The amino acid composition of the negative sequences from DS1 (blue), the negative sequences from DS2 (red), and the positive sequences from DS1 and DS2 (denoted as DS12, green) for (A) the main dataset and (B) independent test set.

Sequence logo

We generated sequence logos for our datasets using the Two Sample Logo website54. Following a previous study55, we took the 5 amino acids (the minimum peptide length) from each of the N and C termini of every peptide and concatenated them to form a fixed-length sequence of 10 residues. The sequence logo shows the amino acids that are enriched or depleted at each position in the positive sequences relative to the negative ones. The sequence logos for DS1 and DS2 are illustrated in Supplementary Figs. S3A,B, respectively. For both DS1 and DS2, Gly, Gln, Met, and Pro are enriched in multiple positions. However, the positively charged residues Lys and Arg are enriched only in DS2. Leu and Ile are depleted in both DS1 and DS2, whereas Lys, Arg, and Trp are depleted only in DS1, and Ser is depleted only in DS2. In general, the sequence logos are consistent with the amino acid composition presented in Fig. 2. The results also suggest that DS1 and DS2 share certain common amino acid propensities but exhibit some fundamental differences.
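The terminal-window construction described above can be sketched as follows. The helper name `terminal_window` is hypothetical; the only details taken from the text are the window size of 5 and the concatenation of the N- and C-terminal segments.

```python
def terminal_window(seq, k=5):
    """Concatenate the first k and last k residues into a 2k-residue string."""
    if len(seq) < k:
        raise ValueError(f"sequence shorter than {k} residues: {seq!r}")
    return seq[:k] + seq[-k:]

# For a peptide of the minimum length (5), both windows cover the whole peptide.
print(terminal_window("GPGMQ"))         # GPGMQGPGMQ
print(terminal_window("ACDEFGHIKLMN"))  # ACDEFIKLMN
```

Because every peptide is reduced to the same length, the resulting strings can be submitted directly as aligned input to a position-wise enrichment tool.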

Selected feature subsets

Figure 3A and B show the average MCCs of the top 5 ML models across different feature numbers for DS1_Main and DS2_Main, respectively. The best feature number for both DS1_Main and DS2_Main is 50, as it yields the highest average MCC. The top five models for DS1 are GBC, CB, RF, LGBM, and ET, whereas those for DS2 are ET, RF, LGBM, GBC, and CB, all of which are more advanced ensemble ML methods. The 50 selected features for DS1_Main and DS2_Main are listed in Supplementary Tables S3 and S4, respectively. The selected features of DS1_Main and DS2_Main belong to 20 and 21 feature types, respectively, as shown in Supplementary Table S5. The selected feature subsets of DS1 and DS2 share 15 common feature types, including ABHPRK56, APAAC, CKSAAGP57, CKSAAP58, CTDC59, CTDD59, CTriad, DDE, Ez, GTPC60, MSW61, NMBroto62, PAAC, QSO63, and Z5. Among the top 10 feature types for DS1_Main and DS2_Main in Table S5, eight belong to the common feature types, indicating a certain degree of similarity. Among the features selected for DS1_Main, the physicochemical feature type ABHPRK is the most prevalent, with 7 features selected, while only 3 ABHPRK features are selected for DS2_Main. Conversely, DDE is the most prevalent feature type for DS2_Main, with 7 features selected, whereas only 4 DDE features are selected for DS1_Main. Similar differences are observed for the feature types CKSAAGP (6 for DS1_Main and 2 for DS2_Main) and CTDC (3 for DS1_Main and 1 for DS2_Main), revealing inherent differences between the two selected feature subsets.

The selected features for DS1_Main are associated with important attributes, including hydrophobicity, hydrophilicity, charge, and aliphaticity (observed from the feature types APAAC, CKSAAGP, CTDC, CTDD, and GTPC in Table S3), while the attributes for DS2_Main are only hydrophobicity and charge (Table S4). The selected features correspond to important amino acids mentioned in the previous sections, such as Gly (AAC_G for DS1_Main and APAAC for DS2_Main), Pro (DDE_PP for DS1_Main and APAAC for DS2_Main), and Ile (DDR_I for DS1_Main). The selected dipeptide-related features, such as AP and WT for DS1_Main and MQ, PG, RA, RP, RQ, TW, and GF for DS2_Main, are enriched in the positive sequences, as illustrated in Fig. S2. Conversely, the dipeptides WC (for DS1_Main) and NT (for DS2_Main) are enriched in the negative sequences.
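The selection criterion behind Fig. 3 (pick the feature number with the highest average MCC among the top-performing models) can be sketched as below. This is an illustration under stated assumptions: the top models are re-ranked at each feature number, the function name `best_feature_count` is hypothetical, and all MCC values are made-up placeholders, not results from the paper. The example uses `top_k=2` only to keep the table small.

```python
def best_feature_count(mcc_table, top_k=5):
    """mcc_table maps feature count -> {model name: cross-validation MCC}."""
    averages = {}
    for n_features, scores in mcc_table.items():
        top = sorted(scores.values(), reverse=True)[:top_k]
        averages[n_features] = sum(top) / len(top)
    best = max(averages, key=averages.get)
    return best, averages

# Placeholder MCCs for three hypothetical feature-subset sizes.
mcc_table = {
    30: {"ET": 0.60, "RF": 0.58, "KNN": 0.40},
    50: {"ET": 0.70, "RF": 0.66, "KNN": 0.45},
    70: {"ET": 0.68, "RF": 0.64, "KNN": 0.50},
}
best, averages = best_feature_count(mcc_table, top_k=2)
print(best, round(averages[best], 3))
```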

Fig. 3

The average MCCs of the top five performing models based on various feature number subsets on (A) DS1_Main and (B) DS2_Main. The best feature subset with the highest average MCC is circled in red.

Benchmark results of cross validation

The benchmark results of cross validation on DS1_Main and DS2_Main are shown in Table 1. The top four models (ET, GBC, CB, and RF) for DS1_Main generate accuracies between 0.953 and 0.956, F1-scores between 0.698 and 0.726, AUCs between 0.953 and 0.962, and MCCs between 0.693 and 0.715, indicating fairly accurate overall predictions. ET demonstrates superior performance across multiple metrics, outperforming other models in accuracy, precision, AUC, and MCC. The top two models, ET and GBC, demonstrate somewhat comparable accuracy, specificity, and MCC. The major difference between the two is that ET yields 11.1% higher precision than GBC, and yet GBC yields 7.4% higher recall than ET. MLP and QDA are considered the worst-performing models, judging by accuracy, AUC, and MCC. MLP suffers from a recall of 0.301, the lowest among all the models, whereas QDA suffers from a precision of 0.525, also the lowest among all the models.

For DS2_Main, the top 5 models (ET, GBC, CB, RF, and LGBM) generate accuracies between 0.937 and 0.943, F1-scores between 0.500 and 0.579, AUCs between 0.847 and 0.893, and MCCs between 0.530 and 0.580. These values are lower than those from DS1_Main, indicating that differentiating anti-aging peptides from random peptides is more challenging than differentiating them from antimicrobial peptides. The top 5 models excel in precision (above 0.8) but leave room for improvement in recall (below 0.46), suggesting that enhancing sensitivity could further improve their predictions. The bottom two models for DS2_Main, namely, QDA and LDA, generate recalls (0.424 and 0.344) comparable to those of the top 5 models (between 0.352 and 0.456), so their clear weakness lies in their precisions (0.550 and 0.547), which are significantly lower than those of the top 5 models (between 0.806 and 0.982).

Table 1 The benchmark results of cross validation on DS1_Main and DS2_Main.

Benchmark results of independent test

The benchmark results on DS1_Indp and DS2_Indp are shown in Table 2. The top 5 models on DS1_Indp (LGBM, CB, GBC, ET, and RF), which are more advanced ensemble ML methods, achieve reasonably accurate predictions, with accuracies ranging from 0.943 to 0.955, AUCs ranging from 0.959 to 0.963, and MCCs ranging from 0.660 to 0.692. Since the dataset is highly unbalanced (positive-to-negative ratio of 1:10), the large number of true negatives significantly contributes to the measures of accuracy, AUC, and specificity. LGBM is the top-performing model, yielding the highest accuracy (0.955), precision (0.885), and MCC (0.692) among all the models, and yet it suffers from a relatively lower recall of 0.575. In contrast, the second-best model, CB, achieves the highest recall of 0.775 with a relatively lower precision of 0.660. MLP’s poor performance, reflected in its low F1-score (0.320) and MCC (0.376) compared to other methods, is attributed to its particularly low recall of 0.200. The ROC curves of all the models on DS1_Indp are shown in Supplementary Fig. S4A.
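To make the effect of class imbalance on these metrics concrete, the sketch below computes the standard confusion-matrix metrics for a hypothetical test set at a 1:10 positive-to-negative ratio. The counts are invented for illustration and do not correspond to any row of Table 2; the formulas are the usual textbook definitions.

```python
import math

def binary_metrics(tp, fp, tn, fn):
    """Standard metrics from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "specificity": tn / (tn + fp) if tn + fp else 0.0,
        "f1": (2 * precision * recall / (precision + recall)
               if precision + recall else 0.0),
        "mcc": (tp * tn - fp * fn) / denom if denom else 0.0,
    }

# Hypothetical counts: 40 positives and 400 negatives (1:10).
m = binary_metrics(tp=18, fp=2, tn=398, fn=22)
for name, value in m.items():
    print(f"{name}: {value:.3f}")
```

In this made-up case fewer than half of the positives are recovered, yet accuracy stays above 0.94 because the 398 true negatives dominate the count. F1 and MCC, which depend on the positive class, are much lower.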

Similarly, the top 5 models on DS2_Indp (ET, CB, RF, LGBM, and GBC) are more advanced ensemble ML methods, achieving accuracies ranging from 0.932 to 0.941, AUCs ranging from 0.788 to 0.829, and MCCs ranging from 0.500 to 0.580. The accuracy, AUC, and MCC values for DS2_Indp are in general lower than those for DS1_Indp, consistent with our previous observation that differentiating anti-aging peptides from random peptides is more challenging. Similar to the cross-validation results of DS2_Main, all the methods suffer from low recall (between 0.3 and 0.45), suggesting that improving the models’ sensitivity can significantly enhance the overall prediction performance. KNN and QDA models (for DS2_Indp) showed comparable recalls (0.350 and 0.450) to the top 5 models, but their much lower precision values (0.560 and 0.439) result in poor overall prediction performance. The ROC curves of all the models on DS2_Indp are shown in Supplementary Fig. S4B.

Table 2 The benchmark results of independent test on DS1_Indp and DS2_Indp.

We analyzed the correlation between the prediction probability of a peptide and the true positive rate (TPR), which represents the likelihood that a peptide is an anti-aging peptide. Ideally, a higher prediction probability assigned by a model corresponds to a higher TPR if the model has good predictive capability. For each range of prediction probability, TPR was calculated as the number of anti-aging peptides divided by the number of peptide sequences whose predicted probability falls within that range. Fig. 4A,B show that our models in general exhibit strong positive correlations between prediction probability and TPR. The only curve that does not increase monotonically in Fig. 4B is that of QDA, the worst-performing method for DS2_Indp.
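The per-range calculation described above can be sketched as follows: sequences are grouped by predicted probability, and within each group the fraction of true anti-aging peptides is reported. The bin count, the function name `binned_positive_rate`, and the probabilities and labels are all illustrative assumptions; the bin width used for Fig. 4 is not stated in the text.

```python
def binned_positive_rate(probabilities, labels, n_bins=10):
    """Return (bin_low, bin_high, n_sequences, rate) for each probability bin."""
    counts = [0] * n_bins
    positives = [0] * n_bins
    for p, y in zip(probabilities, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        counts[idx] += 1
        positives[idx] += y
    rows = []
    for i in range(n_bins):
        rate = positives[i] / counts[i] if counts[i] else None  # None: empty bin
        rows.append((i / n_bins, (i + 1) / n_bins, counts[i], rate))
    return rows

# Made-up predictions: label 1 marks an anti-aging peptide.
probs = [0.05, 0.10, 0.15, 0.55, 0.62, 0.91, 0.95, 1.0]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
rows = binned_positive_rate(probs, labels, n_bins=5)
for low, high, n, rate in rows:
    print(f"[{low:.1f}, {high:.1f}): n={n}, rate={rate}")
```

A model whose probabilities are informative should show this rate rising from the low bins to the high bins.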

Fig. 4

The sequence number and true positive rate plotted with respect to the prediction probability for (A) DS1 and (B) DS2. Prediction probability is obtained from the best-performing model on each independent test dataset: LGBM for DS1_Indp and ET for DS2_Indp. The true positive rate is calculated as the number of anti-aging peptides divided by the total number of sequences predicted within each range of prediction probabilities.

Prediction accuracy with respect to peptide properties

We further analyzed the prediction results of the best models on independent tests, namely, LGBM for DS1_Indp and ET for DS2_Indp, with respect to multiple peptide properties characterized by the ratios of hydrophobic, hydrophilic, aliphatic, aromatic, charged, and uncharged residues within peptides. The groupings of amino acids based on their properties are listed in Supplementary Table S6, and the analysis results are illustrated in Fig. 5. Prediction accuracy positively correlates with the ratios of both hydrophobic and hydrophilic residues in DS1_Indp and DS2_Indp (Fig. 5A,B,G,H). It can also be observed from Supplementary Tables S3 and S4 that multiple features regarding hydrophobicity and hydrophilicity are selected for both DS1_Main and DS2_Main. Despite the importance of hydrophobicity for skin permeability, many anti-aging peptides are hydrophilic and require chemical conjugation or carriers for effective permeation of the skin epidermis64.

Peptides’ aliphaticity and aromaticity positively correlate with prediction accuracy for DS1_Indp (Fig. 5C,D) but show opposite trends for DS2_Indp (Fig. 5I,J). This is in good agreement with the observation from Supplementary Tables S3 and S4 that more features regarding aliphatic property are selected for DS1 compared to DS2, and features regarding aromaticity are only selected for DS1. It can also be observed from Fig. 2A,B that the AACs of aromatic residues Phe, Trp, and Tyr demonstrate larger differences between the positive and negative sequences for DS1 (blue bars vs. green bars) than those for DS2 (red bars vs. green bars). Similar situations can also be observed for aliphatic amino acids Leu, Ile, and Pro. In both DS1 and DS2, prediction accuracy shows a positive correlation with the ratio of charged residues and a negative correlation with the ratio of uncharged residues, consistent with a previous report that many of the anti-aging peptides are charged7. Several features regarding charge are also selected for both DS1_Main and DS2_Main (Supplementary Tables S3 and S4).
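A minimal sketch of this kind of analysis is given below: compute the fraction of residues in a property group for each peptide, then compare prediction accuracy between peptides with low and high ratios. The residue groupings here are assumptions, because Supplementary Table S6 is not reproduced in this section. The aromatic and aliphatic sets contain only the residues named in the text, and the charged set is a common convention. The records are invented.

```python
# Assumed groupings; the authors' full definitions are in Supplementary Table S6.
PROPERTY_GROUPS = {
    "aromatic": set("FWY"),   # residues named in the text
    "aliphatic": set("LIP"),  # residues named in the text; S6 may list more
    "charged": set("DEKR"),   # assumption, not taken from the paper
}

def property_ratio(seq, group):
    """Fraction of residues in seq that belong to the given property group."""
    members = PROPERTY_GROUPS[group]
    return sum(res in members for res in seq) / len(seq)

def accuracy_by_ratio(records, group, threshold=0.5):
    """records holds (sequence, true_label, predicted_label) tuples."""
    buckets = {"low": [], "high": []}
    for seq, y_true, y_pred in records:
        side = "high" if property_ratio(seq, group) >= threshold else "low"
        buckets[side].append(y_true == y_pred)
    return {side: (sum(hits) / len(hits) if hits else None)
            for side, hits in buckets.items()}

# Made-up sequences, labels, and predictions.
records = [("KKDE", 1, 1), ("KRGG", 1, 1), ("GGGA", 0, 1), ("AGSK", 0, 0)]
result = accuracy_by_ratio(records, "charged")
print(result)
```

Using more than two bins along the ratio axis would give curves of the kind plotted in Fig. 5.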

Fig. 5

Analysis of prediction accuracy against different peptide properties for the prediction results of DS1_Indp (panels A–F) and DS2_Indp (panels G–L) based on LGBM and ET, respectively.

Interpretation of the ML models using SHAP

SHAP (SHapley Additive exPlanations) analysis is a robust tool based on game theory that explains how different features contribute to the prediction output of a model. Positive SHAP values show that the features push the model toward predicting a positive outcome, whereas negative SHAP values show that the features push the model toward a negative prediction. Here, we analyzed LGBM and ET, our best models on DS1_Indp and DS2_Indp, respectively, judging by MCC. Figure 6A,B illustrate the SHAP values for the top 20 features for DS1_Indp and DS2_Indp, respectively. The two models share several common feature types, including compositional features DDE and CKSAAGP, and the physicochemical features CTDC and ABHPRK. Most of the features selected for DS1_Indp are physicochemical features, many of which are related to attributes of charge, aliphaticity, hydrophilicity, and hydrophobicity. A total of 7 of the top 20 features (2 for APAAC, 3 for CKSAAGP, and 2 for formula) for DS1_Indp are compositional features. On the other hand, the top 20 features for DS2_Indp include 12 compositional features (6 for DDE, 2 for CKSAAGP, 2 for CKSAAP, 1 for CTriad, and 1 for PAAC), considerably more than for DS1_Indp.
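To illustrate what a Shapley-based attribution measures, the sketch below computes exact Shapley values by enumerating all feature coalitions for a tiny hand-written scoring function. This is not the SHAP library and not the algorithm applied to LGBM or ET in this work; it is a brute-force illustration that is feasible only for a handful of features. The toy model, its coefficients, and the all-zero baseline are assumptions made for the example.

```python
from itertools import combinations
from math import factorial

def shapley_values(predict, x, baseline):
    """Exact Shapley values of x relative to baseline.

    Features outside a coalition are replaced by their baseline value.
    """
    n = len(x)

    def value(coalition):
        mixed = [x[i] if i in coalition else baseline[i] for i in range(n)]
        return predict(mixed)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi

def toy_model(f):
    # Hypothetical score: f[0] raises it, f[1] lowers it, f[0] and f[2] interact.
    return 0.1 + 0.5 * f[0] - 0.3 * f[1] + 0.2 * f[0] * f[2]

x, baseline = [1, 1, 1], [0, 0, 0]
phi = shapley_values(toy_model, x, baseline)
print([round(v, 3) for v in phi])
```

The attributions sum to the difference between the model output at `x` and at the baseline. A negative value, as for the second feature here, corresponds to a feature that pushes the prediction toward the negative class.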

The analysis suggests that the two models (LGBM and ET) learned to distinguish positive and negative sequences from the two datasets based on different characteristics. In DS2_Indp, the negative samples are random peptides selected from Swiss-Prot, with AAC values resembling the background sequence space. Intuitively, compositional features would be useful in distinguishing these random peptides from anti-aging peptides, which possess specific biological functions and distinct compositional patterns of amino acids. In contrast, the compositional features exhibit less pronounced differences between the positive samples (anti-aging peptides) and negative samples (antimicrobial peptides) from DS1_Indp, leading LGBM to rely more heavily on physicochemical features for classification. The results showcase the capability of our prediction pipeline to adaptively select informative features, contributing to accurate predictions. It has been reported that the composition of the training data, specifically the type of negative sequences, affects the decision boundary and final outcome of an ML model65. The experimental design based on the two datasets and two sets of trained ML models offers significant potential for advancing the field of anti-aging peptide identification.

Fig. 6

The beeswarm plots of SHAP values for the top 20 features based on (A) DS1 and (B) DS2.

Independent tests on datasets with varying positive-to-negative ratios

Independent tests on datasets with varying positive-to-negative (P/N) ratios facilitate evaluating model robustness and generalizability across different class imbalance scenarios that may be encountered in real-world applications. Random antimicrobial peptides and random peptides from the Swiss-Prot database were extracted to form new independent test datasets similar to DS1_Indp and DS2_Indp, respectively, at P/N ratios of 1:5, 1:8, 1:13, and 1:15. Our best trained models, namely, LGBM for DS1 and ET for DS2, were evaluated on these datasets, and the results are shown in Supplementary Table S7.
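One way to assemble test sets at the stated P/N ratios is to keep the positives fixed and draw the required number of negatives at random from a larger pool, as sketched below. The function name `build_test_set`, the fixed seed, and the placeholder sequence identifiers are assumptions for illustration; the authors' exact sampling procedure is not described beyond the text above.

```python
import random

def build_test_set(positives, negative_pool, negatives_per_positive, seed=0):
    """Pair the positives with a random draw of negatives at a 1:k ratio."""
    n_negatives = len(positives) * negatives_per_positive
    if n_negatives > len(negative_pool):
        raise ValueError("negative pool is too small for this P/N ratio")
    rng = random.Random(seed)  # fixed seed keeps the draw reproducible
    negatives = rng.sample(negative_pool, n_negatives)
    return [(seq, 1) for seq in positives] + [(seq, 0) for seq in negatives]

# Placeholder identifiers standing in for real peptide sequences.
positives = [f"POS{i}" for i in range(10)]
negative_pool = [f"NEG{i}" for i in range(200)]
test_sets = {k: build_test_set(positives, negative_pool, k)
             for k in (5, 8, 13, 15)}
for k, data in test_sets.items():
    print(f"1:{k} -> {len(data)} sequences")
```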

For both DS1 and DS2, model performance on datasets with P/N ratios of 1:5, 1:8, and 1:13 remains comparable to that achieved on the original independent test datasets (P/N ratio of 1:10) across MCC, AUC, and F1-score, demonstrating robust generalizability. However, at the more extreme P/N ratio of 1:15, performance degradation becomes evident. LGBM on DS1 exhibits decreases of 5% and 8% in MCC and F1-score, respectively, relative to the performance on DS1_Indp. Similarly, ET on DS2 shows decreases of 7% and 11% in MCC and F1-score, respectively, compared to the prediction on DS2_Indp. The performance decline is attributed to significantly reduced sensitivity (a decrease of more than 10% for both DS1 and DS2), indicating that future improvement is needed for prediction on highly imbalanced anti-aging peptide datasets.

Computational efficiency of AAGP

The computational efficiency of AAGP was evaluated by measuring the execution time for its six core components: feature encoding, feature ranking, heuristic algorithm for feature selection, hyperparameter optimization, cross validation, and independent test. All experiments were conducted on a standard workstation equipped with an Intel® Core™ i7-8700 CPU (3.20 GHz base frequency), 32 GB RAM, and no GPU acceleration. The total processing time for complete experiments on DS1 and DS2 was less than 30 and 40 min, respectively. Detailed timing breakdowns for individual processing steps are provided in Supplementary Table S8.
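Per-stage wall-clock timing of this kind can be collected with a small context manager, as in the sketch below. The stage names follow the text, but the workloads are trivial stand-ins and the helper `timed` is hypothetical. This is not the instrumentation used to produce Supplementary Table S8.

```python
import time
from contextlib import contextmanager

timings = {}

@contextmanager
def timed(stage):
    """Record the wall-clock duration of the enclosed block under `stage`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[stage] = time.perf_counter() - start

with timed("feature encoding"):
    encoded = [len(seq) for seq in ["GPGMQ", "AGSKK"]]  # stand-in workload

with timed("feature ranking"):
    ranked = sorted(encoded, reverse=True)  # stand-in workload

for stage, seconds in timings.items():
    print(f"{stage}: {seconds:.6f} s")
```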


