Baseline characteristics and correlation patterns of clinical indicators
This study was conducted with 520 patients having severe burns. After preprocessing data and feature engineering, an overall dataset, including demographics, burn features, and biochemical indicators, was built (Fig. 1A). Total Body Surface Area (TBSA) exhibited a skewed distribution across the cohort with a broad range of injury severity typical of severe burn groups of patients (Fig. 1E).
Of the 520 participants, 324 (62.3%) were diagnosed with hypertrophic scarring (HS group), while 196 (37.7%) were not (Non-HS group). To mitigate the impact of this class imbalance, the scale_pos_weight hyperparameter was tuned during the XGBoost modeling process to penalize misclassifications of the minority class. The patients who developed HS were identified by a heightened condition of metabolic consumption and systemic inflammation (Fig. 1B). The HS group exhibited significantly elevated systemic inflammatory markers compared to the Non-HS group, specifically C-reactive protein (92.09 ± 28.69 vs. 54.53 ± 16.11 mg/L, P < 0.001) and White Blood Cell count (12.32 ± 3.05 vs. 9.64 ± 2.24 × 10⁹/L, P < 0.001). Additionally, serum albumin levels were markedly lower in the HS group (36.61 ± 3.92 vs. 39.93 ± 3.35 g/L, P < 0.001), reflecting a more severe catabolic state. In terms of the severity of injuries, the HS group recorded a much higher mean TBSA (45.18% vs. 33.00) and delayed the first surgery (9.47 vs. 8.18 days). A comprehensive comparison of baseline characteristics is presented in Table 1.
The categorical variables distribution demonstrated the risk factors of high priority (Fig. 1C). Infection of the wound became a highly discriminating variable; in the HS group (151 cases) the prevalence (or incidence) of infection was disproportionately higher than in the Non-HS group (10 cases), indicating persistent infectious inflammation is a major scarring causative agent. Moreover, joint-related burns as well as flames-related burns were more commonly found in the HS sample.
The analysis of the correlation matrices revealed a strong cluster of severity-related metrics (Fig. 1D). TBSA showed high levels of positive correlation with CRP (r = 0.74), and with full-thickness burn area (r = 0.68) and LDH (r = 0.63). Such interconnectivity implies that the large areas of the burn surface should not only be used as an anatomical index but be at the core of the systemic inflammatory cascade and tissue destruction. On the other hand, there was a high negative correlation between CRP and albumin (r = −0.61), which further supported the association between CRP and metabolic dysregulation in severe burns.

Study overview and baseline data distribution. (A) The workflow of the study, illustrating the process from patient enrollment, data preprocessing, feature engineering, and machine learning modeling to clinical application. (B) Violin plots comparing the distribution of key continuous clinical indicators (Age, Albumin, CRP, TBSA, Time to First Surgery, WBC) between the Hypertrophic Scar (HS) and Non-HS groups. (C) Stacked bar charts showing the proportional distribution of categorical variables (Burn Location, Cause of Burn, Sex, Wound Infection) between groups. (D) Pearson correlation matrix heatmap of clinical features, with red indicating positive correlation and blue indicating negative correlation. (E) Density plot showing the data distribution of Total Body Surface Area (TBSA) across the entire cohort.
Construction and performance benchmarking of machine learning models
In order to find out the best algorithm to predict the risk of hypertrophic scarring after severe burns, we conducted a stringent benchmarking of four standard machine learning algorithms with different computational architectures: Generalized Linear Model (GLM), Random Forest (RF), Support Vector Machine (SVM), and Extreme Gradient Boosting (XGBoost).
In the comparative analysis of Receiver Operating Characteristic (ROC) curves (Fig. 2A), XGBoost exhibited superior discriminatory power (AUC: 0.905, 95% CI: 0.865–0.945), which was statistically significantly higher than that of the Generalized Linear Model (AUC: 0.868, 95% CI: 0.820–0.916; DeLong test P = 0.042). While GLM and RF performed competitively, XGBoost demonstrated superior stability in maintaining high specificity in high-sensitivity regions, confirming its robustness in handling complex clinical data. Since medical data has an inherent class imbalance, we also assessed the performance of the models using Precision-Recall (PR) curves (Fig. 2B). Given the inherent class imbalance, these findings were further strengthened by the Precision-Recall analysis (Fig. 2B), where XGBoost achieved the highest AUPR of 0.945, significantly surpassing GLM (AUPR = 0.906) and SVM (AUPR = 0.930). This indicates that the model possesses high discriminatory capacity to reduce the number of false positives; simultaneously, it ensures that high-risk patients are rarely missed, thereby maximizing the use of medical resources.The confusion matrices gave a finer analysis of the performance of the classifier on the test set (Fig. 2C). XGBoost was able to successfully detect the overwhelming majority cases of scar (high true positive rate) with minimal rate of false alarms. It indicates that the model is not only statistically better but also highly practicable to simulated clinical diagnostic cases.
The learning curves were generated to verify the generalization capability of the model and eliminate the possibility of overfitting (Fig. 2D). This can be seen as the two training and testing sets experiencing a gradual convergence in the AUC scores as the sample size of the training increased with the difference in the scores being within a reasonable range. This tendency proves that the model has acquired underlying risk statistics, despite being able to memorise the training data. Moreover, hyperparameter optimization through Bayesian Optimization (Fig. 2E) has shown that the model performance converged after approximately 25 optimization iterations (best score > 0.98), which means that the values of the parameters are close to optimal ones.
This multidimensional assessment led to the choice of XGBoost as the dominant algorithm to be used in further interpretation analysis and clinical device creation as this evaluation method includes a range of diverse advantages in its precision, strength, and generalization.

Construction and multidimensional performance evaluation of machine learning models. (A) Comparison of Receiver Operating Characteristic (ROC) curves for four machine learning models (GLM, RF, SVM, XGBoost) on the test set, with XGBoost demonstrating the highest AUC. (B) Comparison of Precision-Recall (PR) curves to evaluate model performance on the imbalanced dataset. (C) Confusion matrices of the optimal model (XGBoost) on the testing and training sets. (D) Learning curve showing the trend of AUC on training and testing sets as the sample size increases, assessing potential overfitting. (E) Visualization of the Bayesian Optimization process; the red line tracks the best score (AUC) achieved across iterations.
Identification of key predictors and non-linear risk effects based on SHAP values
To explain the rationale behind the decision-making of the XGBoost model and measure the impact of each clinical feature on the prediction result, SHAP (SHapley Additive exPlanations) algorithm was used. Figure 3 A reveals that C-reactive protein (CRP) was the most important predictor of the global feature of importance and had the largest mean absolute SHAP (1.43). The next in the order was Full-thickness burn area (mean SHAP 0.41) and Total Body Surface Area (TBSA, mean SHAP 0.32), which proves the statement that systemic inflammation levels and injury severity are the fundamental component of the discrimination offered by the model (Table 2).
The directional relationship between feature values and predicted risk was also demonstrated using the SHAP summary plot (Fig. 3B). The high feature values (marked in red) distribution of CRP, TBSA, and WBC were located mainly in the positive portion of the SHAP value space, showing the positive association between the high level of these markers and the occurrence of HS. On the other hand serum albumins levels showed a significant negative correlation; the higher the albumin levels the more negative the SHAP value indicating that the ideal nutritional status may have a protective effect on scar formation.
The dependence plots on SHAP indicated that the key continuous variables had non-linear relations with risk. The dependence plot of CRP (Fig. 3 C) exhibited a distinct model-derived inflection point, as at lower levels of CRP (below 80 mg/L), the effect on model output was minimal; but, above the inflection point (around 120 mg/L) the dependence curve began to increase at a rapid rate and continued to increase gradually. This observation implies that CRP has the greatest marginal influence on prediction of risk in the range of 80–120 mg/L.
On the contrary, a positive correlation was found between the full-thickness burn area and SHAP values with a relatively linear concentration (Fig. 3D). There was gradual increase in the risk contribution with the rising number of full-thickness burns. The results showed that SHAP values were positive when the full-thickness of the area of burns exceeded 30%, meaning that extensive deep burns are great risk factors of HS.
The categorical variable Wound Infection of Fig. 3E was analyzed revealing that the SHAP values were significant with the cohort with infection (Present) having much higher values than the non-infected group (Absent), and that values generally ranged throughout the positive range. Such a clear differentiation measures the position of wound infection as a risk-enhancing factor in the pathogenesis of hypertrophic scarring.

Global model interpretation and key feature analysis based on SHAP values. (A) Global feature importance bar plot ranking features by mean absolute SHAP value, identifying CRP and Full Thickness Burn Area as the top predictors. (B) SHAP summary beeswarm plot illustrating how high or low feature values (color-coded) impact the model output (SHAP value on the x-axis). (C-E) SHAP dependence plots for top features (CRP, Full Thickness Burn Area, Wound Infection), revealing non-linear relationships and specific thresholds affecting HS risk.
Deep excavation of feature interaction effects and individualized risk Deconstruction
Although single-feature contribution analysis gives a background, risk profile, clinical factors may have complicated, non-linear, synergies. To reveal this latent pathophysiological network, we built a feature relationship heatmap using the computation of SHAP interaction values (Fig. 4A). The matrix showed important interaction effects (strength of interaction > 0.15) between Age, Sex, and Full-Thickness Burn Area, which indicated that the effect of demographic factors might have moderating effects on the pathogenic potential of the severity of the injury on scar formation.
Plots of bivariate dependence of interaction were produced to explain these modulatory processes. The interaction analysis between TBSA and Age (Fig. 4B) showed that SHAP values were increasing with the size of the burn in all age groups, but more so intense in younger patients (below 30 years, purple dots). The implication of this is that in relation to their similarly sized burn area, younger patients will be at a disproportionately higher risk of hypertrophic scarring probably due to their more active cellular proliferation and immune response capabilities.
Likewise, the status of wound infection drastically changed the pattern of risk as per the timing of the surgery (Fig. 4C). In the non-infected patients (red curve), SHAP values did not increase and even reduced with an increase in time of the first surgery (5 to 12 days). Conversely, with concurrent delay in surgery among patients with wound infection (blue curve), the SHAP values showed an upward trend fluctuating about a mean value. This observation indicates that early surgical intervention can be more important in breaking the pathological progression of scar formation in case of an infection.
The synergistic effect of TBSA and CRP was graphically illustrated further with joint risk surface analysis (Fig. 4D). The predicted risk probability Risk Prob was still low at values below 0.1 (blue region) when TBSA was below 40% and CRP was below 50 mg/L. But at jointly high levels of both indicators (TBSA above 60% and CRP above 100 mg/L), the predicted risk probability exceeded 0.9 (or deep red area). This non-additive synergistic improvement highlights the importance of simultaneous evaluation of extent of anatomical injury and systemic inflammatory load in patients with severe burns, such as dual monitoring.
Lastly, to indicate the ability of the model to interpret at a precise individual level, a typical high-risk patient (ID 3) was chosen to be locally explained (Fig. 4E). However, this patient presented with a moderately extensive burn (TBSA = 52.1%), but the level of CRP was extremely high (129.12 mg/L) with wound infection. This process was quantified by the SHAP waterfall plot: the critically high CRP added + 2.85 to the log-odds, which is much larger than that of TBSA (+ 0.90). In spite of the fact the location of burn (Limbs) provided even some protection (−0.35) it was not enough to overcome the acute inflammatory impulse and eventually the model was able to indicate that this patient was indeed at high-risk as it was.

Deep interaction analysis and local explanation. (A) Feature relationship map (SHAP correlation heatmap) revealing collaborative patterns in feature contributions. Abbreviations: TBSA: Total Body Surface Area; CRP: C-reactive Protein. (B) Interaction dependence plot of TBSA vs. Age, showing how age modulates the impact of burn area on risk (color scale indicates age). (C) Interaction analysis between Surgery Time and Wound Infection, illustrating how infection status alters the risk trend associated with surgical timing. (D) Joint risk surface plot of TBSA and CRP, where red intensity represents the predicted risk probability, highlighting the synergistic effect of severe burns and high inflammation. (E) Local explanation bar plot for Patient ID 3, decomposing the specific contribution of each feature to the final risk score (red bars increase risk, blue bars decrease risk).
Clinical decision analysis and evaluation of model utility
Notwithstanding an excellent statistical outcome of the XGBoost model, its application in clinical practice required translation evaluation through Decision Curve Analysis (DCA). Figure 5A shows that, machine learning model (red line) was always found to have a greater net benefit than the treat-all (gray line) and treat-none (black line) strategies when the threshold probability is under a comprehensive range of probabilities (0.01 to 0.85). This means that irrespective of the tolerance that a clinician has to missed diagnoses (i.e. whether one is aggressive or conservative on this treatment threshold) decision-making that is informed by this model gives more clinical utility than the conventional one-size-fits-all methods. In addition, the multidimensional ML model exhibited considerable incremental value, especially where the probability threshold exceeds 0.2, which validates the fact that combining multi-source clinical data is required to effect accurate risk stratification.
Additional efficacy of the model was tested through the Clinical Impact Curve (CIC, Fig. 5B) in terms of public health. The curve of the number of patients predicted as high-risk (red) and the curve of the number of detected positive cases (blue) exhibited high-concordance in the high-specificity region (threshold probability > 0.5). A convergence value of this means that as intervention thresholds are increased the model maintains a high positive predictive value, effectively isolating really high risk patients and minimizing the overtreatment of low-risk populations.
Stratified ROC analysis was performed to assure that the models could be robust in the wide range of patient profiles (Fig. 5C). It was found that the model could exhibit high AUC (> 0.85) in both of the age stratifications (> = 40 vs. <40 years) and size of burn (TBSA 40 vs. less) groups. This kind of consistency across subgroups indicates that the model is not skewed towards certain demographic or particular levels of injury, which suffices to suggest its broad useability in heterogeneous patients with burns.
Analysis of errors (Fig. 5D) identified the possible blind areas of the model by comparing the profiles of features of the correct and incorrect predictions. Misclassified cases (red outline) compared with correctly predicted cases (green outline) had lower normalized means on inflammatory markers (CRP and WBC) but did not have significant differences in TBSA. This implies that the model might have problems with non-typical cases that are fulfilled with massive burns but blunt inflammatory reactions, thereby making clinicians to be cautious of such cases by subjecting them to secondary validation.
Lastly, optimum classification cutoff based on Youden index (Fig. 5E) was used to find the optimum cutoff of 0.504. The most desirable trade-off based on the sensitivity and specificity is at this threshold that offers a statistically based reference standard on practical clinical use.

Clinical utility evaluation and decision analysis. (A) Decision Curve Analysis (DCA) comparing the net benefit of the ML model (red line), a reference model (TBSA only, blue dashed line), and “Treat All/None” strategies. (B) Clinical Impact Curve (CIC) showing the number of predicted high-risk patients (red curve) versus actual positive cases (blue curve) across probability thresholds. (C) Subgroup ROC analysis validating model robustness across different age (≥ 40 vs. <40) and TBSA stratifications. (D) Error analysis radar chart comparing normalized mean feature values between correct predictions (green) and mispredictions (red) to identify model weaknesses. (E) Sensitivity-Specificity trade-off plot, identifying the optimal classification threshold (0.504) based on the Youden index.
Construction and individualized application of the intelligent clinical decision support system (iCDSS)
To overcome this discrepancy between the complicated machine learning algorithms and clinical practice, Intelligent Clinical Decision Support System (iCDSS) based on the XGBoost kernel was designed. The prototype interface is displayed in Fig. 6A, in which the clinician is allowed to enter regular parameters about the patient (e.g., TBSA, Age, CRP) to receive real-time, quantified threat conditions. This end-user-friendly non-linear design facilitates easy incorporation of the advanced non-linear model within clinical workflows so as to offer point-of-care decision support.
Its system reliability was strictly tested through calibration analysis (Fig. 6B). The reliability diagram showed that there was an outstanding agreement of foresight in probabilities and event rates, and the points of calibration assumed a lot of fit with the optimal diagonal. It is interesting to note that the predicted probabilities closely aligned with observed frequencies along the optimal diagonal. Notably, in the highest-risk bins (Bins 5–10, where predicted probability > 0.9), the observed incidence of HS reached 100%, while in the lowest-risk bins (Bins 1–3, predicted probability < 0.1), the incidence was 0%, indicating excellent calibration. This high calibration is an indication that the probability outputs are not only discriminative rankings but also have true epidemiological meaning.
In order to reach the goal of precision profiling, SHAP-based deconstructed risk in the form of a person is introduced in the iCDSS. As with a representative high-risk patient (Fig. 6C), whilst Age (−0.22) and Sex (−0.13) did have some slight protective effects, they were, respectively, greatly surpassed by strong risk factors. In particular, burns located on the limbs contributed a disproportionately high risk increment (SHAP value + 3.28). This factor, synergizing with elevated platelet counts (+ 0.64) and blood glucose (+ 0.53), drove the patient’s cumulative risk toward the maximum. This illustration helps clinicians to identify the primary pathogenic drivers unique to every patient.
On the other hand, in the attribution analysis of low-risk case (Fig. 6D), for this low-risk patient, the time to first surgery exerted a decisive protective effect (SHAP value − 2.66), effectively neutralizing the modest potential risks associated with Age (+ 0.45) and BMI (+ 0.14). This indicates that timely surgical intervention was the most critical factor in arresting scar formation for this specific case. With the definition of this interaction of risk and protective variables on an individual level, the iCDSS will enable the paradigm shift of intervention measures based on an empirical approach to the one of data-driven personalized medicine.

Construction and individualized application of the Intelligent Clinical Decision Support System (iCDSS). (A) Prototype interface of the web-based intelligent risk calculator, displaying a real-time HS risk probability (82%) and high-risk alert based on input parameters. (B) Calibration curve (Reliability Diagram) of the XGBoost model, showing high consistency between predicted probabilities and observed frequencies (aligning with the diagonal). (C) Individualized explanation card for a high-risk case (Patient ID 3), visually highlighting the primary drivers of high risk (e.g., Burn Location, PLT). (D) Individualized explanation card for a low-risk case (Patient ID 1), showcasing protective features (blue bars).
