scHiCyclePred: a deep learning framework for predicting cell cycle phases from single-cell Hi-C data using multi-scale interaction information

Machine Learning


Overview of scHiCyclePred

The deep learning-based framework of scHiCyclePred consists of two crucial steps: the extraction of multiple feature sets and a CNN model based on multi-feature fusion (Fig. 1). The former extracts features of chromatin’s three-dimensional structure from the scHi-C data based on multi-scale interaction information. This step involves extracting the following feature sets: (1) CDD feature set from the overall cellular interaction information, (2) BCP feature set from the overall chromatin interaction information, and (3) SICP feature set from the intra-domain interaction information on chromatin. To integrate the knowledge of multi-scale interactions in cells and intuitively predict the cell cycle stage, we develop a CNN model based on multi-feature fusion that integrates the three feature sets generated by the convolution module.

Fig. 1: The framework of scHiCyclePred.
figure 1

a The extraction of multiple feature sets. scHiCyclePred combines read pair locus mapping file and chromatin interaction pair file to generate a unique chromosome contact matrices for each chromosome in every cell. To enhance cell cycle prediction accuracy and reveal variations in three-dimensional structure across different cell cycles, we extract features representing chromosome three-dimensional structure from diverse perspectives. Specifically, we extract three feature sets: contact probability distribution versus genomic distance (CDD), bin contact probability (BCP), and small intra-domain contact probability (SICP). b CNN model based on multi-feature fusion. We develop a deep learning model that combines convolution and feature fusion modules to accurately predict cell cycle phases. c The usage of scHiCyclePred. Directly apply the trained model to predict the cell data with unknown cell cycles.

In the CNN model based on multi-feature fusion, three feature vectors for each cell are input into the model, which generates three vectors in parallel after passing through two convolution modules composed of a Conv1d layer, BatchNorm layer, Maxpool layer, and Dropout layer, followed by a flattening process. These three generated vectors are then merged into a single vector. The scores from different categories are mapped using a linear layer and “log_softmax”, and the classification outcome is determined by the index with the highest score. In the following sections, we provide a detailed description of the workflows of the two steps in the scHiCyclePred framework.

Effectiveness evaluation of single feature set and multiple feature sets

To demonstrate the effectiveness of the multi-scale contact probability feature sets, we validate and analyze the performance of our extracted feature sets in this section. To accomplish this, we input each of the three feature sets into the network model independently and validate the classification performance obtained by using each feature set separately. To ensure the robustness of our results, we partition the Nagano_dataset into fifty independent training and testing sets by altering the random seeds, with the Nagano_dataset being the total of each training and testing set. Subsequently, we evaluate the prediction performance of the three feature sets using four evaluation metrics: accuracy (ACC), F1 Score (F1), area under the receiver operating characteristic curve (AUC), and area under the precision-recall curve (Average Precision, AP). Specifically, the AUC value and AP value represent the area under the ROC curve and the area under the Precision-recall (PR) curve, respectively (Fig. 2a).

Fig. 2: The effectiveness evaluation of the single feature set and the multiple feature sets.
figure 2

a Performance evaluation of five single feature sets. b The accuracy of five individual feature sets across cell cycle phases. c Performance comparison between single feature set and multiple feature sets. The lower and upper edges of the boxplot represent the minimum and the maximum values of the results, respectively. The bottom edge of the box represents the first quartile (Q1), while the top edge represents the third quartile (Q3). The median value is depicted by a black line. (*** indicates P < 1\(\times\)10−3, ** indicates P < 1\(\times\)10−2, * indicates P < 5\(\times\)10−2).

The results indicate that the five feature sets are effective in predicting the four cell cycle phases (Fig. 2b). Overall, the BCP and SICP features demonstrate superior accuracy and stability compared to the other three features. In contrast, the stability of the Insulation Score of Each Bin (INS) feature is relatively lower. Although the Pairs’ Contact Coverage (PCC) feature occasionally exhibits higher accuracy than CDD, its concatenation of chromosome segment information at three resolutions (100 kb, 500 kb, 1 Mb) results in computationally expensive operations, especially in deep models, potentially significantly increasing the model’s runtime. Additionally, the accuracy of the PCC feature is closely tied to the selection of hyperparameters, requiring different hyperparameters for different datasets, which demands a certain level of experience and methodology. Therefore, our model ultimately selects the relatively simpler CDD, BCP, and SICP features, which do not require manual parameter tuning, for further analysis in the subsequent sections. The detailed prediction results of the three feature sets for different phases show that: (1) In the G1, Early-S, and Late-S/G2 phases, the results of the three feature sets are similar; (2) In the Mid-S phase, the results of the SICP and BCP feature sets are significantly larger than those of the CDD feature set (Fig. 2b). The prediction performance of the Mid-S phase is not adequately captured based on the CDD feature set, indicating that the change in contact probability distribution is relatively minor. This observation aligns with the findings reported in CIRCLET.

To evaluate the effectiveness of our CNN model based on multi-feature fusion, we compare the performance of the three-feature-set fusion model with that of six other models constructed by retaining the corresponding feature extraction modules: three single-feature-set models (CDD, BCP, and SICP) and three two-feature-set fusion models (BCP-CDD, CDD-SICP, and BCP-SICP). We utilize the fifty independent training and testing sets mentioned in the previous section for this experiment. We use four evaluation metrics, namely ACC, F1, Precision, and balanced accuracy (BACC)24,25,26, to assess the prediction performance of each model (Fig. 2c).

The effectiveness of our extracted features is demonstrated by the fact that the three single-feature-set models yield higher performance (Fig. 2c). Furthermore, feature fusion enhances the performance of the three two-feature-set models, highlighting the importance of our feature fusion. The final results indicate that the CNN model based on multi-feature fusion exhibited the best performance in terms of ACC, F1, Precision, and BACC, with the least difference between the maximum and minimum values, i.e., the most stable result. Based on these comparison results, our CNN model based on multi-feature fusion can effectively fuse feature sets at all scales and deliver superior cell cycle phase prediction ability.

Performance evaluation of scHiCyclePred in predicting cell cycle phases

To demonstrate the superiority of our proposed scHiCyclePred method, we compare it to two cell cycle trajectory construction methods (Nagano_method and CIRCLET) using four evaluation metrics in this section. Although cell cycle trajectory construction is not primarily designed for predicting cell cycle phases, in order to evaluate Nagano_method’s and CIRCLET’s results efficiently, we generate predictive labels based on cell number and cycle order. As Nagano_method and CIRCLET methods do not require the division of training and testing sets, both methods performed predictions directly on the original Nagano_dataset. To compare scHiCyclePred with these two methods, we use them to calculate the average of the ACC, F1, and Precision metrics obtained from fifty testing sets. Additionally, we evaluate the effectiveness of our constructed deep learning model by employing three conventional machine learning methods, namely SVM27, Logistic Regression28, and Random Forest29,30. The same fifty independent training and testing sets mentioned earlier are utilized for this experiment (Fig. 3).

Fig. 3: Performance comparison of scHiCyclePred with other methods on different datasets.
figure 3

a Performance comparison of scHiCyclePred with Nagano_method and CIRCLET on Nagano_dataset. Each bar in the graph represents the average value, while the bottom of the vertical line indicates the minimum value, and the top represents the maximum value. b Performance comparison of scHiCyclePred with other machine learning methods on Nagano_dataset. c Performance comparison of scHiCyclePred with CIRCLET on Liu_dataset. The bar in the graph represents the average value, while the bottom of the vertical line indicates the minimum value, and the top represents the maximum value. d Performance comparison of scHiCyclePred with other machine learning methods on Liu_dataset. The lower and upper edges represent the minimum value and the maximum value of the results, respectively. The bottom edge of the box represents the first quartile (Q1), the top edge of the box represents the third quartile (Q3), and the black line represents the median value. (*** indicates P < 1\(\times\)10−3).

The results indicate that scHiCyclePred outperforms Nagano_method and CIRCLET in four metrics: ACC, F1, Precision, and Recall (Fig. 3a). Furthermore, scHiCyclePred still achieves optimal performance in terms of ACC, F1, Precision, BACC, and Recall metrics compared to using the three conventional machine learning methods (Fig. 3b). These findings demonstrate that the scHiCyclePred method achieves superior performance in predicting cell cycle phases.

In addition to comparing scHiCyclePred with CIRCLET on the Nagano_dataset, we also evaluated its performance on the complex tissue dataset (Liu_dataset) to validate its generalization and stability. It is worth noting that CIRCLET incorporates PCC features, and the choice of threshold significantly impacts its performance. Therefore, we conducted experiments to assess performance across different thresholds and selected the best-performing result for comparison (Supplementary Table 1). Additionally, to evaluate the effectiveness of our deep learning model, we employed three traditional machine learning methods: SVM, Logistic Regression (LR), and Random Forest (RF). These experiments utilized the same 50 independent training and testing sets as previously mentioned (Fig. 3).

The results indicate that scHiCyclePred achieved an accuracy of 0.76, surpassing CIRCLET’s accuracy of 0.37 by 0.39. Additionally, scHiCyclePred outperformed the best-performing method on the other three metrics by a significant margin (Fig. 3c). It is noteworthy that our model’s performance metrics substantially exceeded those of the three machine learning methods (Fig. 3d). In conclusion, this suggests that our model maintains optimal and stable performance across different datasets.

Robustness validation of scHiCyclePred

We evaluate the scHiCyclePred model and three machine learning methods using two distinct approaches to validate scHiCyclePred’s robustness on datasets for the following purposes: (1) Validating the effectiveness of scHiCyclePred on imbalanced datasets by downsampling the original Nagano_dataset. (2) Testing the effectiveness of scHiCyclePred on drop-processed datasets. For the drop experiment, the Nagano_dataset’s chromatin interaction pair file (raw_data file) is initially split into a training set (\({{train}}_{1}\)) and a testing set (\({{test}}_{1}\)) at a ratio of 80% to 20%, ensuring that the sum of \({{train}}_{1}\) and \({{test}}_{1}\) is equal to the number of raw_data_file. The model is then trained with \({{train}}_{1}\) to produce \({{model}}_{1}\), which is subsequently tested with \({{test}}_{1}\) The random seed is then altered, and four additional sets of training and testing data are generated in the same manner. Next, for each set of testing set, 5%y to 50%y (with an increment of 5%) of rows of data are randomly selected, where y represents the total number of rows in the testing set. Taking into account that interaction pairs with contact numbers of 1 or 2 account for 98.8% of the total number of interaction pairs (details regarding the number of interaction pairs with varying contact numbers, as well as their corresponding proportions are provided in Supplementary Fig. 1), we adjust the number of contacts between two segments (counting columns) for the selected rows using the following procedure: randomly adding or subtracting one from the number of contacts. Subsequently, each training set is matched to 10 testing sets, resulting in a total of 50 testing sets across the five training sets. Each group of experiments is trained using its corresponding training set, and only the 10 testing sets in the same group as the current training set are utilized to test the trained model to prevent data leakage. Taking all aspects into account, the comparison results of ACC, F1 score, Precision, and BACC metrics demonstrate the effectiveness of scHiCyclePred on the drop-processed datasets (Fig. 4a).

Fig. 4: Performance of scHiCyclePred on imbalanced datasets and drop-processed datasets.
figure 4

a Performance of scHiCyclePred and machine learning methods on the drop-processed datasets. The lower and upper edges of the boxplot represent the minimum and the maximum values of the results, respectively. The bottom edge of the box represents the first quartile (Q1), the top edge represents the third quartile (Q3), and the black line represents the median value. b Distribution of cells across cell cycle phases in nine imbalanced datasets (A-I). Further performance evaluation is conducted on this dataset in Fig. 4c, d. c, d Evaluation of scHiCyclePred and machine learning methods on an imbalanced dataset using ACC and Recall metrics.

For the downsampling experiment, we partition the Nagano_dataset into nine imbalanced datasets (labeled A to I) composed of different proportions (Fig. 4b). Using a ratio of 8:2, each of the nine datasets is split into two parts: a training set and a testing set. The training set is used to train the model, whereas the testing set is used to evaluate its performance. The comparison results of the four evaluation metrics (i.e., ACC, F1, Precision, and BACC) demonstrate the effectiveness of scHiCyclePred on imbalanced datasets (Fig. 4c, d).

Overall, these results underscore that scHiCyclePred outperforms other methods and exhibits robustness on drop-processed datasets (Fig. 4a). Moreover, scHiCyclePred demonstrates robust stability and generalization across various imbalanced datasets. In contrast, the three machine learning methods failed to achieve comparable results. (Fig. 4c, d and Supplementary Fig. 2). In summary, the results from both experiments demonstrate the effectiveness and robustness of scHiCyclePred in predicting cell cycle phases on various datasets.

Analysis of chromatin change patterns across various cell cycle phases

We further investigate the pattern of changes in the 3D structure of chromatin during different cell cycle phases. Specifically, we utilize the SHapley Additive exPlanations (SHAP) method27,29 to analyze the feature importance of the three feature sets during the four cycles. The SHAP graph displays the vertical axis representing features and the dots representing samples, with redder colors indicating higher feature values and bluer colors indicating lower feature values. The horizontal axis demonstrates the SHAP value, where positive values indicate a positive effect on the prediction, and negative values indicate a negative effect.

Regarding the CDD feature set, our analysis is centered on the top 20 significant features. The evaluation of the importance of the CDD feature set reveals that 12 features are prominently represented within the top 20 features across the four cell cycle phases: CDD61-70, CDD72, and CDD74 (Fig. 5). In the G1, Early-S, and Mid-S phases, there is a tendency for bluer dots to cluster in the positive semi-axis, indicating a positive effect, whereas during the Late-S/G2 phase, redder dots predominantly cluster in the positive semi-axis. This observation suggests that the proportion of short-distance chromatin interactions gradually increases with cell cycle progression, leading to a progressive tightening of the three-dimensional chromosome structure. Notably, this finding is consistent with the results reported by Ye et al6. In addition, CDD111, CDD117, and CDD119 appear in the 20 most important features of Early-S and Late-S/G2 cycles, indicating that there may be more significant changes in long-distance contacts from Early-S to Late-S/G2 cycles.

Fig. 5: Importance evaluation results of the first 20 features in the CDD feature set.
figure 5

a G1 phase. b Early-S phase. c Mid-S phase. d Late-S/G2 phase. The SHAP graph displays the SHAP value on the horizontal axis, representing the assigned importance to each feature in the given sample. The vertical axis represents features, and the dots represent samples, with redder colors indicating higher feature values and bluer colors indicating lower feature values. We analyze the top 20 significant features of the CDD feature set and find that 12 features are present across all four cell cycle phases: CDD61-70, CDD72, and CDD74. Features with the same color in the graph represent identical features.

For the BCP feature set and SICP feature set, we primarily analyze the top 50 important features. The results of the BCP feature set’s importance evaluation reveal that for fragments on different chromatin, their three-dimensional structure change patterns vary during the cell cycle phases. Based on the varying contact numbers, the evaluation of the BCP feature set’s importance indicates the following: (1) BCP41 of chr15 and BCP64 of chr4: In the G1 phase, the majority of bluer colors are concentrated in the negative range, indicating a negative effect. In contrast, during the Early-S phase, most bluer colors are concentrated in the positive range, signifying a positive effect. Overall, this suggests that during the Early-S phase, the chromatin contact numbers decrease compared to the G1 phase, indicating a gradual loosening of its three-dimensional structure; (2) BCP98 of chr6: In the G1 phase, the majority of bluer colors are concentrated in the positive range, indicating a positive effect. In contrast, during the Early-S phase, most bluer colors are concentrated in the negative range, signifying a negative effect. Overall, this suggests that during the Early-S phase, the chromatin contacts increase compared to the G1 phase, indicating a gradual tightening of its three-dimensional structure; (3) BCP142 of chr5: In the G1 phase, the majority of bluer colors are concentrated in the negative range, indicating a negative effect. In contrast, during the Early-S, Mid-S, and Late-S/G2 phases, most bluer colors are concentrated in the positive range, signifying a positive effect. In summary, we find that the chromatin contact numbers in the remaining three phases gradually decrease compared to the G1 phase, indicating a gradual loosening of their three-dimensional structure (Fig. 6a–d).

Fig. 6: Importance evaluation results of the first 50 features in the BCP feature set.
figure 6

a BCP feature set in G1 phase. b BCP feature set in Early-S phase. c BCP feature set in Mid-S phase. d BCP feature set in Late-S/G2 phase. The SHAP graph displays the SHAP value on the horizontal axis, representing the assigned importance to each feature in the given sample. The vertical axis represents features, and the dots represent samples, with redder colors indicating higher feature values and bluer colors indicating lower feature values. Features with the same color in the graph represent identical features.

The results of the importance evaluation for the SICP feature set reveal the following insights: (1) During the transition from the G1 to the Late-S/G2 phases, the majority of bluer color clusters gradually transform into redder color clusters and eventually transition back to bluer color clusters. This suggests that their three-dimensional structure undergoes a transition from loosening to tightening, subsequently followed by a trend towards loosening again. This phenomenon is particularly evident in the analysis of SICP127 and SICP128 of chr10; (2) The three-dimensional structure change patterns of adjacent bins exhibit similarities. For instance, SICP3-5 of chr11 and SICP87 and SICP88 of chr12, etc demonstrate similar patterns of change; (3) From the perspective of the three-dimensional structural changes in the bin neighborhood, the prominent contact patterns of the Mid-S cycle and the Late-S/G2 cycle are relatively similar (Fig. 7a–d). One possible reason for (3) may be that the degree of three-dimensional structure change of chromatin gradually decreases from the Mid-S cycle to the Late-S/G2 cycle. This finding is consistent with previous findings6.

Fig. 7: Importance evaluation results of the first 50 features in the SICP feature set.
figure 7

a SICP feature set in G1 phase. b SICP feature set in Early-S phase. c SICP feature set in Mid-S phase. d SICP BCP feature set in Late-S/G2 phase. The SHAP graph displays the SHAP value on the horizontal axis, representing the assigned importance to each feature in the given sample. The vertical axis represents features, and the dots represent samples, with redder colors indicating higher feature values and bluer colors indicating lower feature values. Features with the same color in the graph represent identical features.

In this study, we meticulously annotate the patterns associated with each stage of the cell cycle and link them to specific genes based on feature importance ranking. The number of these genes varies depending on the number of genes corresponding to the features. Specifically, as the CDD feature set captures information from all chromosomes and cannot extract corresponding chromosome segments (bin), we proceed to rank and select the top 50 features based on feature importance in the BCP and SICP features. These selected features correspond to chromosome segments (bin) that map to all genes in the reference genome, with each bin having a size of 1 Mb. To validate the accuracy of these patterns, we conduct an extensive analysis of relevant research literature concerning the cell cycle. Our findings are as follows: (1) In the G1 phase, we discover a significant correlation between the gene lists from papers31,32,33 and the genes annotated by our patterns, with the same p-value of 0.005858. (2) During the Early-S phase, we observe a strong correlation between the gene lists in papers34,35,36 and the genes annotated through our patterns, with p values of 2.388 × 10−5, 0.0003518, and 0.001689. (3) Similarly, for the Mid-S phase, there is a notable correlation between the gene lists from papers37,38,39 and our annotated genes, with p-values of 9.535 × 10−5, 0.00224, and 0.003723. (4) Finally, in the Late-S/G2 phases, we find a substantial correlation between the gene lists in papers36,40,41 and our annotated genes, with p values of 4.067 × 10−6, 0.0001014, and 0.0006912. In summary, our analysis consistently demonstrates a strong association between our annotated patterns and the cell cycle, reaffirming the precision and relevance of our approach.



Source link

Leave a Reply

Your email address will not be published. Required fields are marked *