Machine learning approach for prediction of TBM performance and risk of jamming in Himalayan geology using a cross-project tunnelling database

Model on BBDM project database

The regression models were trained and tested on the BBDM project dataset using an 80/20 split. The predictive performance of these models is summarized in Table 4. The R² values range from 0.957 to 0.980 on the training set and from 0.936 to 0.938 on the testing set, demonstrating high prediction accuracy.

Table 4 Prediction model performance in a single project (BBDM project) database.
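
The evaluation metrics used throughout this section can be illustrated with a minimal sketch. It assumes the standard definitions of R², MAE, RMSE, and MAPE, since the formulas are not restated here. The function name and the PRnet values are hypothetical and serve only as an illustration.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return R2, MAE, RMSE and MAPE (%) for paired observations."""
    n = len(y_true)
    mean_true = sum(y_true) / n
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(r * r for r in residuals)               # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)   # total sum of squares
    return {
        "R2": 1.0 - ss_res / ss_tot,
        "MAE": sum(abs(r) for r in residuals) / n,
        "RMSE": math.sqrt(ss_res / n),
        "MAPE": 100.0 * sum(abs(r / t) for r, t in zip(residuals, y_true)) / n,
    }

# Hypothetical PRnet values (mm/min), for illustration only.
observed = [20.0, 25.0, 30.0, 35.0, 40.0]
predicted = [22.0, 24.0, 29.0, 36.0, 38.0]
metrics = regression_metrics(observed, predicted)
print({name: round(value, 3) for name, value in metrics.items()})
```

Computing the same metrics separately on the training and testing partitions gives the paired values of the kind reported in the tables of this section.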

As presented in Table 4, all ensemble and ANN models perform well. However, the models were trained and validated on data with similar TBM features and geological conditions from the Siwalik region of the Nepal Himalaya. To assess their robustness and practical applicability, it is crucial to evaluate their prediction performance on tunnelling datasets from other projects characterized by different geological conditions and TBM configurations. To address this, three cross-project scenarios (as discussed in Sect. 2.5) were implemented.

Scenario 1

In this scenario, all data from the BBDM project were used to train the models with the corresponding optimized hyperparameters. The trained models were then tested on an independent and unseen SMDM project dataset from a similar Siwalik geological region, although the two projects used slightly different TBM features. The performance metrics for these models are summarized in Table 5.

Table 5 Prediction model performance on cross-project database (Scenario 1).

The results shown in Table 5 indicate that the R² values on the training set range between 0.914 and 0.979 for Scenario 1. On the other hand, the R² values on the testing set vary widely, ranging from 0.416 to 0.838. While the training performance remains consistent, the prediction accuracy is notably lower than that observed within the same project data. This is particularly the case for the RF and ANN regression models, whose testing sets yield relatively low R² values of 0.416 and 0.606, respectively. Meanwhile, the stacking, bagging, and XGBoost models performed relatively better, achieving fairly good R² values of 0.749, 0.823, and 0.838, respectively. These findings indicate that the predictive performance of the RF and ANN models is highly sensitive to variations in geological conditions and machine parameters. Both models tend to overfit the training project data, capturing site-specific patterns that do not generalize well to other TBM projects, resulting in poor cross-project performance compared to the other models. Hence, the results for Scenario 1 indicate the need for a more diverse database to enhance the reliability of the models.

Scenario 2

In Scenario 2, the Siwalik region datasets from the BBDM and SMDM projects were merged to create a combined dataset, which was then used to train the selected ML models. The trained models were subsequently tested on an independent, unseen dataset from the headrace tunnel of the Lesser Himalayan region in the SMDM project. The performance metrics for these models are summarized in Table 6.

Table 6 Prediction model performance on cross-project database (Scenario 2).

The results in Table 6 show that the R² values on the training set range between 0.961 and 0.983, while the R² values on the testing set range between 0.895 and 0.986. The results indicate that almost all models have demonstrated better performance than that in Scenario 1. These findings suggest that incorporating a broader range of TBM operational parameters in the ML process enhances model generalization, thereby improving prediction performance on unseen and diverse datasets.

Scenario 3

Since the merging strategy adopted in Scenario 2 demonstrated better performance, its application in Scenario 3 is expected to further enhance model accuracy. Therefore, datasets from both the BBDM and SMDM projects were combined using stratification to create a more diverse and representative dataset. This combined dataset encompasses varied operational conditions from both the Siwalik and Lesser Himalayan regions, as well as a wider range of geological and machine parameters.
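
A minimal sketch of how a combined dataset might be divided into stratified folds is given below. The stratification variable is not specified in this section, so the labels used here (project name combined with rock mass class) are an assumption, as are the function name, the sample counts, and the round-robin assignment.

```python
import random
from collections import defaultdict

def stratified_folds(strata, n_folds=5, seed=0):
    """Assign each sample index to a fold, balancing every stratum across folds."""
    by_stratum = defaultdict(list)
    for index, label in enumerate(strata):
        by_stratum[label].append(index)
    rng = random.Random(seed)
    fold_of = [None] * len(strata)
    for label in sorted(by_stratum):
        indices = by_stratum[label]
        rng.shuffle(indices)                      # randomize order within the stratum
        for position, index in enumerate(indices):
            fold_of[index] = position % n_folds   # deal round-robin into the folds
    return fold_of

# Hypothetical stratum labels: project name combined with rock mass class.
strata = ["BBDM-III"] * 20 + ["BBDM-V"] * 10 + ["SMDM-III"] * 30 + ["SMDM-V"] * 15
folds = stratified_folds(strata)
for k in range(5):
    test_idx = [i for i, f in enumerate(folds) if f == k]
    print(k, len(test_idx), sum(strata[i] == "SMDM-V" for i in test_idx))
```

Each fold then contains the same proportion of every stratum, so no fold is dominated by a single project or rock mass class.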

Following the cross-validation strategy described in Sect. 2.5 (Scenario 3), the five-fold mean R² and error metric values with corresponding 95% CI for the various ML models are presented in Fig. 6a, b. The mean R² value of each model is represented by a circular marker at the top of the vertical bar. The colored bars indicate the 95% confidence intervals of the performance metrics for XGBoost (red), RF (green), bagging (blue), stacking (orange), and ANN (purple). Vertical black lines denote the corresponding error bars for each model.

All selected models show comparable R² values of 0.960 or higher (Fig. 6a). The margin of error ranges from ± 0.0032 to ± 0.0039, which indicates a very narrow interval. This demonstrates that the models generalize well across folds with highly consistent performance. As reported by Timilsina et al.⁴⁷, a narrow margin of error indicates low variance in model performance across repeated experiments. The results further confirm that the observed model performance is statistically significant (p-value < 0.001).
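
The margin of error of a five-fold mean can be sketched as follows. The method used to construct the intervals is not stated in this section, so the Student-t interval below is an assumption, and the fold scores are hypothetical rather than the values behind Fig. 6a.

```python
import math
import statistics

# Hypothetical five-fold R2 scores, not the values behind Fig. 6a.
fold_r2 = [0.962, 0.966, 0.963, 0.968, 0.966]

T_CRIT_DF4 = 2.776  # two-sided 95% Student-t critical value for 4 degrees of freedom

def mean_with_margin(scores, t_crit):
    """Return the mean score and the t-based margin of error."""
    mean = statistics.mean(scores)
    std_err = statistics.stdev(scores) / math.sqrt(len(scores))  # sample standard error
    return mean, t_crit * std_err

mean_r2, margin = mean_with_margin(fold_r2, T_CRIT_DF4)
print(f"R2 = {mean_r2:.3f} +/- {margin:.4f}")
```

With only five folds, the critical value for four degrees of freedom is noticeably larger than the normal-approximation value of 1.96, so the choice of distribution affects the width of the interval.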

As seen in Fig. 6a, XGBoost and stacking achieved the highest mean R² value of 0.965, while ANN showed the lowest value of 0.960. Notably, the 95% CI for XGBoost across the five folds is narrower than that of the stacking model. Thus, despite the comparable mean R² values, the narrower CI of XGBoost reflects more consistent and robust performance. The error metrics of the selected ML models are presented in Fig. 6b. The XGBoost and stacking models exhibit the lowest and comparable MAE, RMSE, and MAPE values. The RF model shows higher MAE and MAPE, while the ANN model exhibits a higher RMSE. Overall, the findings indicate that the models reliably capture the relationship between input features and TBM PRnet. The adopted approach ensures robust predictions under diverse geological and operational conditions. Similar performance trends were observed across all selected models (Table 7).

Table 7 Prediction model performance on cross-project database (Scenario 3).

The results summarized in Table 7 show high predictive accuracy with low error values for Scenario 3. The R² values range from 0.960 to 0.989 on the training sets and from 0.960 to 0.965 on the test sets. Among the evaluated models, XGBoost and stacking achieved the highest R² value of 0.965, while the ANN model showed the lowest value of 0.960. Despite this small variation, all models demonstrated strong prediction capabilities with R² values of 0.960 or higher, which confirms their robustness in predicting PRnet.

In summary, all selected models demonstrate good performance. Among these, the XGBoost and stacking models show the highest performance. Notably, the XGBoost model exhibits a lower margin of error than the stacking model, despite comparable error metrics. Both models are suitable for further prediction; however, in this study, XGBoost was selected for subsequent analysis. Overall, the analysis shows that combining stratification with cross-validation enabled the models to effectively capture the complexity and diversity of geological and TBM operation parameters.

SHAP-based interpretability analysis

Model transparency is essential for quantifying the contribution of individual input features to an ML model's predictions. SHAP is an interpretability framework grounded in cooperative game theory. In this study, SHAP was employed to evaluate the relative importance and influence of the selected input features on the TBM PRnet. The global interpretation results generated using SHAP for the best-performing XGBoost model are presented in Fig. 7.
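
The game-theoretic idea behind SHAP can be illustrated with a brute-force sketch that averages each feature's marginal contribution over all feature orderings. This is not the algorithm used by the SHAP library for tree models, and the toy model, input vector, and baseline below are hypothetical.

```python
from itertools import permutations

def shapley_values(model, x, baseline):
    """Exact Shapley values by averaging marginal contributions over all feature orderings."""
    n = len(x)
    phi = [0.0] * n
    orderings = list(permutations(range(n)))
    for order in orderings:
        current = list(baseline)
        previous_output = model(current)
        for feature in order:
            current[feature] = x[feature]            # reveal this feature's actual value
            output = model(current)
            phi[feature] += output - previous_output
            previous_output = output
    return [total / len(orderings) for total in phi]

# Hypothetical toy model with an interaction term; not the XGBoost model of the study.
def toy_model(v):
    return 2.0 * v[0] + 1.0 * v[1] + 0.5 * v[0] * v[2]

x = [3.0, 2.0, 4.0]
baseline = [1.0, 1.0, 1.0]
phi = shapley_values(toy_model, x, baseline)
print(phi, sum(phi), toy_model(x) - toy_model(baseline))
```

The contributions sum to the difference between the prediction for `x` and the prediction for the baseline, which is the additivity property that allows SHAP values to be read in the units of the predicted quantity.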

The mean absolute SHAP value for each input feature across the entire database indicates its average contribution to the TBM PRnet (Fig. 7a). As seen in the figure, the mean SHAP values display descending order of importance following their relative magnitudes. In Fig. 7a, PRchd is the most influential variable, contributing on average 11.45 mm/min to the TBM PRnet. In contrast, rock strength is the least influential variable, with an average contribution of only 0.05 mm/min. The other parameters such as CRS, RMR, torque, thrust, and weathering show average contribution values of 4.32, 0.91, 0.44, 0.39, and 0.18 mm/min, respectively.
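
The global ranking described above amounts to averaging the absolute SHAP values of each feature over all observations. A minimal sketch follows, with a hypothetical SHAP matrix and a hypothetical function name.

```python
def mean_abs_shap(shap_rows, feature_names):
    """Rank features by mean absolute SHAP value (global importance)."""
    n_rows = len(shap_rows)
    importance = {
        name: sum(abs(row[j]) for row in shap_rows) / n_rows
        for j, name in enumerate(feature_names)
    }
    # Sort from the most to the least influential feature.
    return sorted(importance.items(), key=lambda item: item[1], reverse=True)

# Hypothetical SHAP values (mm/min) for four observations and three features.
features = ["PRchd", "CRS", "RMR"]
shap_rows = [
    [12.0, -4.0, 1.0],
    [-10.0, 5.0, -1.0],
    [11.0, -3.0, 0.5],
    [-13.0, 4.0, -1.5],
]
ranking = mean_abs_shap(shap_rows, features)
print(ranking)
```

Because absolute values are averaged, a feature that pushes predictions strongly in both directions still ranks high, even if its signed contributions cancel out on average.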

The SHAP beeswarm plot ranks the input features in descending order from top to bottom based on mean SHAP values for the entire database (Fig. 7b). In the figure, the horizontal axis illustrates the influence of each feature on the model’s prediction. The SHAP values of individual data points are distributed horizontally for each input feature. Data points on the right indicate positive SHAP values, meaning the feature increases the TBM PRnet, while data points on the left indicate negative SHAP values, meaning the feature decreases it. In addition, blue and red colors represent low and high feature values, respectively. Vertically stacked points reflect a higher density of SHAP values, highlighting regions where many observations have similar contributions.

As seen in Fig. 7b, PRchd has a stronger positive effect on PRnet than CRS and RMR. Among the response variables, thrust and torque exhibit a negative influence on the PRnet prediction. Weathering and rock strength demonstrate a neutral to slightly positive effect on PRnet.

In this study, dependence plots were employed to evaluate the effect of individual input features across the dataset. These plots illustrate the relationship between feature values and the model’s predicted outputs. The top three features, PRchd, CRS, and RMR, were selected to analyze their effect on the TBM PRnet (Fig. 8a, b, c). Each dependence plot displays the original feature values on the x-axis and the corresponding SHAP values on the y-axis. The relationship between SHAP values and original values differs across features. As seen in Fig. 8a, b, PRchd and CRS exhibit a clear positive, approximately linear trend in their SHAP values. RMR values up to 40 show positive SHAP values, whereas RMR values higher than 40 show a negative trend. This suggests that RMR values up to about 40 improve the TBM PRnet, whereas higher values tend to reduce it.

SHAP values above the horizontal reference line (y = 0) contribute positively to PRnet, whereas values below this line contribute negatively. For example, a PRchd value of around 9 mm/rev marks the transition from negative to positive contribution, shifting the model’s prediction toward higher PRnet (Fig. 8a). The vertical spread of SHAP values in each plot reflects the influence of interactions with other features. PRchd exhibits the widest SHAP value range, followed by CRS and RMR. For example, in Fig. 8c, an RMR value of 42 produces SHAP values ranging from 0 to −3 mm/min, depending on interactions with the other feature values associated with those observations.

TBM jamming risk assessment

In TBM excavation, unexpected jamming and TBM stuck events are among the most critical issues encountered when a tunnel passes through weak rocks and fault zones. Jamming and stuck events not only slow excavation progress but also increase project costs and completion time. As described earlier, TBM response parameters such as torque and thrust, together with the corresponding rock mass conditions, influence the TBM PRnet. The associated parameters from the cross-project database were selected to assess the potential risk of TBM jamming.

TBM parameter behaviour assessment

TBM jamming events from both the BBDM and SMDM projects were evaluated using the TBM parameters of the ten rings before each stuck section. The trends for all TBM jamming and stuck events in both projects are presented in Fig. 9a, b, c, d. At the BBDM project, the TBM jammed at two locations, which are designated as ST1.1 and ST1.2. Similarly, at the SMDM project, the TBM jammed at nine different locations, which are labeled from ST2.1 to ST2.9 in Fig. 9. The black dotted line in Fig. 9 represents the mean value of the respective TBM parameter, which can be used as a reference for comparative assessment.

As seen in Fig. 9a, torque values fluctuate noticeably when approaching a TBM jamming section. In most of the jamming cases, a sharp increase in torque is observed one or two rings before the jamming section. Most jamming events exhibited peak torque values significantly higher than the mean value of the combined database. The jamming events ST1.1 and ST1.2 generally follow the same rising pattern, although their torque magnitudes remain below the mean value. In contrast, the events ST2.3 and ST2.7 do not follow this trend. Figure 9b illustrates the behavior of thrust while approaching the jamming section. Similar to torque, a steep increase in thrust is observed two rings before the jamming section, with peak thrust values occurring at the jamming events and exceeding the mean thrust of the combined dataset.

As seen in Fig. 9c, PRnet exhibits significant fluctuations near the jamming section, with a sudden drop below the mean of the combined database two rings before jamming. Figure 9d highlights the surrounding rock mass quality conditions over the ten rings before the TBM jamming section. The field mapping results indicate that the rock mass quality suddenly dropped from fair rock mass class (class III) or poor rock mass class (class IV) to very poor rock mass class (class V) one or two rings before the jamming section. Subsequently, the TBM jammed in tunnel sections where very poor rock mass conditions exist.

Prediction of potential jamming events

As discussed earlier, TBM operational parameters show high fluctuations under class V rock mass conditions, especially when approaching sections prone to TBM jamming. These parameters tend to spike sharply at stuck sections. As reported by Katuwal and Panthi³³, lower PRnet values (below 25th percentile) combined with large fluctuations in torque and thrust (exceeding 75th percentile) serve as strong indicators of potential challenges, such as TBM getting stuck or the cutterhead becoming jammed.

To assess these variability patterns in greater detail, the statistical distributions of torque, thrust, and PRnet for class V conditions are presented in Fig. 10. The vertical axis represents the frequency of occurrence, while the horizontal axis shows the parameter range within class V. Percentile lines (P₁, P₅, P₁₀, P₂₅, P₅₀, P₇₅, P₉₀, P₉₅, and P₉₉) are presented on the histograms using different colors and line styles. These percentiles help visualize the variability characteristics of each parameter and provide practical cutoff thresholds for assessing potential jamming risks. Noticeable changes in the behavior of torque, thrust, and PRnet can be observed across these percentile intervals (Fig. 10). For example, the lower PRnet values in class V (Fig. 10c) appear to serve as indicators of TBM jamming risk, consistent with the findings presented earlier in Sect. 2.3 (Fig. 4). Based on these results, PRnet values below P₅ are classified as highly variable. Values between P₅ and P₂₅ are categorized as moderately variable, and those between P₂₅ and the mean are considered slightly variable. PRnet values exceeding the mean reflect relatively better TBM performance under class V conditions and are categorized as normal. This percentile-based threshold system is applied to evaluate PRnet variability within the class V rock mass condition.
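
The percentile-based classification of PRnet described above can be sketched as follows. The percentile convention (linear interpolation), the treatment of values lying exactly on a threshold, the function names, and the sample data are assumptions made for illustration only.

```python
def percentile(values, q):
    """Linear-interpolation percentile (q in 0..100) of a list of numbers."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * q / 100.0
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (rank - lower)

def prnet_variability_class(value, p5, p25, mean):
    """Classify a class V PRnet value using the thresholds described in the text."""
    if value < p5:
        return "highly variable"
    if value < p25:
        return "moderately variable"
    if value < mean:
        return "slightly variable"
    return "normal"   # assumption: a value equal to the mean is treated as normal

# Hypothetical class V PRnet sample (mm/min): 1, 2, ..., 41.
prnet_class_v = [float(v) for v in range(1, 42)]
p5 = percentile(prnet_class_v, 5)
p25 = percentile(prnet_class_v, 25)
mean = sum(prnet_class_v) / len(prnet_class_v)
for value in (2.0, 8.0, 15.0, 30.0):
    print(value, prnet_variability_class(value, p5, p25, mean))
```

Torque and thrust would need a mirrored set of upper-tail thresholds, since for those parameters it is unusually high rather than unusually low values that signal variability.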

Further, Katuwal and Panthi (2025)³⁴ reported that thrust and torque requirements are generally lower in class V compared to class IV and III. However, this analysis showed thrust and torque exhibiting high variability beyond the P₉₅, which appears contradictory to typical expectations. Similar observations are also discussed in Sect. 2.3 (Fig. 4). Based on these results, a percentile-based variability scoring system was developed to classify the probability of potential jamming during TBM tunnelling. The assigned variability thresholds for torque, thrust, and PRnet, along with the corresponding variable classes and scores, are presented in Table 8.

Table 8 Thresholds for variability classification and associated scoring system.

In this study, the variability in TBM parameters is categorized into four classes: highly variable, moderately variable, slightly variable, and normal, as summarized in Table 8. This scoring system was used to analyze the variability conditions along the actual TBM jamming or stuck sections. The resulting classifications are presented in Table 9.

Table 9 TBM parameter variability conditions in very poor rock mass class (class V).

As shown in Table 9, the torque values exhibit variability ranging from normal to high. In some tunnel sections, torque falls within the normal or slightly variable class, which would indicate no TBM jamming. However, such inconsistent torque results at actual TBM jamming sections make jamming risk evaluation difficult. Therefore, the torque parameter is excluded from the jamming risk prediction model. On the other hand, the thrust and PRnet parameters consistently exhibited high variability across all jamming sections. Thrust values with high variability and PRnet values ranging from moderate to high variability one ring prior to the jamming section are indicative of the risk level. In some jamming cases, such as ST1.2 and ST2.2, moderate to high variability is observed two rings before the stuck sections. For ST1.1, however, moderate to high variability is observed only at the jamming section itself, which can be attributed to the change in rock mass quality from class III or IV to class V; the rings concerned are denoted as not applicable (N/A). These findings indicate that variations in thrust and PRnet in the ring before the jamming section can serve as reliable predictors of impending TBM jamming. Hence, a combined jamming risk (CJR) score is proposed to classify the risk level into four categories: high risk, medium risk, low risk, and no risk. The corresponding CJR score values for each risk category are summarized in Table 10.

Table 10 Risk level classification under very poor rock conditions.

The proposed CJR score risk assessment was applied to actual TBM jamming sections to validate its predictive performance (Table 11). The CJR scoring system demonstrated fairly reliable performance by raising a high-risk warning at least one ring prior to each TBM jamming event. Table 11 also indicates that, in addition to the known jamming sections, several other tunnel sections carry a potential risk of TBM jamming, with high-risk flags appearing up to three rings before a jamming event.

Table 11 CJR score based on risk level classification.

The performance of the CJR scoring system was further evaluated using a binary classification approach. A total of 982 segmental sections were assessed for class V rock mass conditions. Sections with a high risk level correspond to actual TBM jamming locations or potential jamming zones and are therefore categorized as High Risk sections. Sections with medium, low, or no risk are categorized as No Risk sections. The binary classification results are presented in Fig. 11a.

In the binary confusion matrix (Fig. 11a), the high-risk category is considered the positive class, while the no-risk category is considered the negative class. The CJR scoring system correctly flagged all 11 actual TBM jamming sections as high risk, so no jamming section was missed. Thus, the true positive (TP) rate is 1.00, and the false negative (FN) rate is 0.00. Of the 971 actual non-jamming sections, 967 were correctly predicted as no risk, while 4 were incorrectly predicted as high risk. This corresponds to a true negative (TN) rate of 0.996 and a false positive (FP) rate of 0.004.

In TBM risk assessment, FN predictions are more critical than FP predictions due to their direct implications for operational safety. In the case of an FN, zones with an actual high risk of TBM jamming are incorrectly classified as low- or no-risk conditions. Subsequently, the tunnelling crew may continue boring operation under normal operating parameters, which may result in unexpected TBM jamming and significant operational disruptions. Conversely, in the case of FP, ground conditions that are actually safe are classified as high-risk. Under such conditions, the tunnelling crew may adopt preventative measures, including adjustments in TBM control parameters exploiting prior experience, observational judgment, or predefined empirical adjustment ranges. In addition, detailed ground investigations and temporary stabilization measures may be carried out if judged necessary. Although FP predictions may lead to a reduced advance rate and increased operational costs due to additional investigations and operations, the tunnelling process remains within a safe operational situation.

The positive class performance was further evaluated using the receiver operating characteristic (ROC) curve and area under the curve (AUC). As shown in Fig. 11b, the x-axis represents the false positive rate (1 − specificity), indicating how often no-risk sections are misclassified as high-risk sections. The y-axis represents the true positive rate (sensitivity or recall), indicating how often actual TBM jamming sections are correctly classified as high risk. The CJR scoring system achieved an ROC AUC of 0.978, demonstrating excellent performance.
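
The AUC can be interpreted as the probability that a randomly chosen jamming section receives a higher risk score than a randomly chosen non-jamming section. A minimal sketch of this rank-based computation is given below; the labels and scores are hypothetical and do not reproduce the data behind Fig. 11b.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a random positive outranks a random negative (ties count 0.5)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Hypothetical risk scores; 1 marks a jamming section, 0 a non-jamming section.
labels = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [6, 5, 4, 4, 3, 2, 2, 1]
auc = roc_auc(labels, scores)
print(round(auc, 3))
```

This pairwise form is equivalent to the area under the ROC curve obtained by sweeping the decision threshold over all score values.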

Overall, the system achieved an accuracy of 0.996, sensitivity (recall) of 1.00, specificity of 0.996, precision of 0.733, and F1-score of 0.846. The 95% CI for sensitivity and specificity were 1.00 and 0.99, respectively. These results indicate that the CJR scoring system reliably identifies high-risk sections while maintaining a low false-positive rate.
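
The reported metrics follow directly from the confusion-matrix counts given for Fig. 11a (11 true positives, 0 false negatives, 967 true negatives, and 4 false positives). The short check below assumes the standard metric definitions; the function name is hypothetical.

```python
# Counts reported for the binary confusion matrix (Fig. 11a).
tp, fn = 11, 0     # actual jamming sections flagged / missed
tn, fp = 967, 4    # non-jamming sections cleared / flagged as high risk

def classification_metrics(tp, fn, tn, fp):
    """Standard binary-classification metrics from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    return {
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1": 2 * precision * sensitivity / (precision + sensitivity),
    }

cjr_metrics = classification_metrics(tp, fn, tn, fp)
for name, value in cjr_metrics.items():
    print(f"{name}: {value:.3f}")
```

The comparatively low precision reflects the strong class imbalance: with only 11 positive sections among 982, four false alarms are enough to reduce precision to about 0.73 even though specificity stays close to 1.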

For the visualization of risk level in different tunnel sections, a color-coded scheme has been implemented along the entire tunnel alignment for both BBDM and SMDM projects. In class V, potential jamming zones are highlighted using yellow triangular markers with red borders, whereas actual TBM stuck sections are denoted by black X-shaped markers. The medium, low, and no-risk levels are represented by magenta star markers, blue square markers, and green circular markers, respectively. Additionally, data points corresponding to class IV and class V are represented by gray circles along the tunnel alignment. Furthermore, variability in torque, thrust, and PRnet is also illustrated in the background using corresponding thresholds. Percentile-based variability categories such as highly variable, moderately variable, slightly variable, and normal conditions are color-coded as red, orange, blue, and gray, respectively. A detailed risk level classification alongside corresponding data variability conditions is presented in Figs. 12 and 13.

As illustrated in Figs. 12 and 13, the torque data points largely fall within the normal variability zone, suggesting no indication of potential TBM jamming events. However, this prediction contradicts the actual tunnelling conditions and fails to identify the real TBM jamming sections. The variability pattern seen in thrust and PRnet provides a meaningful indication of upcoming jamming events, particularly in very poor rock mass conditions. Data points exhibiting medium to high variability correspond to high-risk conditions, which are distinctly flagged by yellow triangular markers with red borders. These alerts are typically observed at least one ring prior to the actual jamming sections. At the locations of actual TBM jamming, the high-risk conditions are marked by black X-shaped markers.

The results demonstrate that the color-coded alarm system offers an effective early warning mechanism for the TBM tunnel crew. It supports continuous risk monitoring and enhances decision-making processes by enabling the timely implementation of preventive measures to reduce the possibility of TBM jamming.

Empirical range of parameters

Safety is a primary prerequisite in the tunnelling process. During tunnel boring, TBM operators typically adjust control parameters based on historical performance data and real-time monitoring of TBM response to varying geological conditions. However, relying solely on prior experience and observational judgment may be insufficient in complex geological environments. As presented above, the CJR scoring system has demonstrated the ability to raise red flag warnings at least one ring in advance of potential jamming events. Notably, all recorded TBM jamming/stuck events occurred mainly in class V rock mass quality conditions.

In this regard, an empirical control system could be useful to map potentially hazardous tunnel sections. The statistical analysis indicated that medium to high risk levels are associated with moderate to high variability in geological and machine parameters under which TBM jamming was observed. Conversely, a low risk of TBM jamming was found in cases of slight variability. Therefore, operational conditions ranging from normal to slightly variable can be considered safe for tunnelling operations, and a threshold up to slight variability may be used to define safe working conditions. Based on the findings of this research, an empirical control range for key TBM input and response parameters has been established between the 25th and 75th percentile values derived from the cross-project TBM database, which are summarized in Table 12.

Table 12 Empirical range for TBM control and response parameters with corresponding rock mass class.
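
An interquartile control range per rock mass class can be derived as sketched below. The record layout, field names, thrust values, and the inclusive quantile convention are assumptions for illustration and do not reproduce the values in Table 12.

```python
import statistics
from collections import defaultdict

def empirical_ranges(records, parameter):
    """Return the (P25, P75) range of a parameter for each rock mass class."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record["rock_class"]].append(record[parameter])
    ranges = {}
    for rock_class, values in grouped.items():
        q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
        ranges[rock_class] = (q1, q3)
    return ranges

def within_range(value, bounds):
    """True when an operating value lies inside the empirical range."""
    low, high = bounds
    return low <= value <= high

# Hypothetical ring records; field names and thrust values (kN) are illustrative only.
records = [{"rock_class": "V", "thrust": float(t)} for t in (4000, 5000, 6000, 7000, 8000)]
records += [{"rock_class": "IV", "thrust": float(t)} for t in (6000, 7000, 8000, 9000, 10000)]
thrust_ranges = empirical_ranges(records, "thrust")
print(thrust_ranges)
print(within_range(6500.0, thrust_ranges["V"]), within_range(9500.0, thrust_ranges["V"]))
```

A check of this kind could be run ring by ring, so that an operating value drifting outside the range for the mapped rock mass class prompts a review of the control parameters.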

Utilizing these empirically defined ranges can support TBM operators in making adaptive adjustments and data-informed decisions during tunnel excavation in the challenging geological conditions of the Himalaya.
