A causal and interpretable machine learning framework to support risk prediction and surgical decision making after cranioplasty



Ethics statement

This study was conducted in accordance with the Declaration of Helsinki and was approved by the Institutional Review Boards of Shandong University Qilu Hospital (KYLL-202407-041), Army Medical University Taiping Hospital [(2024) 293], and Tangdu Hospital (K202411-26). As a retrospective study, the requirement for informed consent was waived. The study is registered at ClinicalTrials.gov (NCT06740773; registered December 18, 2024) and was conducted in accordance with the Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis + Artificial Intelligence (TRIPOD+AI) guidelines26. Patients and the public were not involved in the design, conduct, reporting, interpretation, or dissemination of this study.

Study population

This multicenter retrospective cohort study included patients of all ages who underwent cranioplasty in the neurosurgery departments of three independent hospitals. The derivation cohort consisted of patients who underwent cranioplasty at two independent tertiary hospitals (Shandong University Qilu Hospital and Army Medical University Taiping Hospital) from January 1, 2015 to July 31, 2023. This cohort was used for model training and internal validation with 5-fold cross-validation. The geographic external validation cohort included patients who underwent cranioplasty at another independent tertiary hospital (Tangdu Hospital, Air Force Medical University) during the same period. The temporal external validation cohort included patients who underwent cranioplasty at Shandong University Qilu Hospital from August 1, 2023 to January 1, 2025. Individuals with a history of previous cranioplasty, severe comorbidities (including significant cardiac, hepatic, renal, or immune system dysfunction), congenital cranial defects, or substantial missing data were excluded from the study. The primary endpoint of this study was the occurrence of in-hospital complications after cranioplasty.

Sample size estimation

To determine the minimum sample size needed to develop a clinical prediction model, we followed a four-step approach using the pmsampsize package in R and the web tool BeyondEPV (https://mvansmeden.shinyapps.io/BeyondEPV)27. The following parameters were prespecified: number of prediction parameters = 9, shrinkage factor = 0.9, outcome prevalence = 0.2, minimum acceptable C-statistic = 0.8, and mean absolute percentage error (MAPE) = 0.05. Based on these criteria, the required minimum sample size was estimated to be 400.
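As a rough illustration of what these prespecified parameters imply (this is only a sanity check of the implied events-per-variable ratio, not the full Riley-criteria computation performed by pmsampsize):

```python
# Sanity check of the prespecified sample-size parameters; this does NOT
# reproduce the pmsampsize / BeyondEPV calculation, it only shows the
# event counts implied by the prespecified values.
n_params = 9       # number of candidate prediction parameters
prevalence = 0.2   # anticipated outcome prevalence
n_min = 400        # minimum sample size returned by the four-step approach

expected_events = n_min * prevalence           # anticipated complication events
events_per_variable = expected_events / n_params

print(expected_events)                 # 80.0
print(round(events_per_variable, 1))   # 8.9
```

With 400 patients and a 20% complication rate, roughly 80 events are expected, i.e. about 9 events per candidate parameter.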

Data collection and quality control

Candidate factors influencing postoperative complications after cranioplasty and the specific complication outcome variables were identified through a comprehensive literature review, including systematic reviews, meta-analyses, and primary studies, together with expert clinical opinion28,29,30. Postoperative complications were classified into six major categories: infection, intracranial hemorrhage, fluid retention, hydrocephalus, seizures, and pneumocephalus. In this study, only complications requiring active clinical intervention were counted as clinically significant; minor or self-limiting abnormalities were not considered complications. In addition, reoperation, defined as a return to the operating room to remove implanted material because of severe complications, was recorded as a secondary outcome. Data were extracted from patients' electronic medical records (EMRs). Detailed information on the factors and complications is provided in Table S1.

A standardized multicenter database was established to ensure data consistency and reliability. Data collectors were uniformly trained and clinical information was recorded using standardized forms. After data collection, an intercenter quality control process was implemented. A random 30% sample of records from each center was independently reviewed by neurosurgeons from other participating centers. Data accuracy of at least 90% was required. Centers that did not meet this threshold were required to re-evaluate and correct their data.
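The intercenter audit described above (a random 30% sample per center, with a 90% accuracy bar) can be sketched as follows; the record structure and the `accurate` flag are illustrative stand-ins for the reviewers' judgments, not the actual database schema:

```python
import random

def qc_sample(records, fraction=0.3, threshold=0.9, seed=42):
    """Draw a random audit sample from one center's records and check it
    against the accuracy threshold. Each record carries an 'accurate'
    flag set by the reviewing neurosurgeon (field name is illustrative)."""
    rng = random.Random(seed)               # fixed seed for reproducibility
    k = max(1, int(len(records) * fraction))
    audit = rng.sample(records, k)          # random 30% audit sample
    accuracy = sum(r["accurate"] for r in audit) / k
    return accuracy, accuracy >= threshold

# Two toy centers: one fully accurate, one that would fail the audit.
center_a = [{"accurate": True}] * 100
center_b = [{"accurate": False}] * 100      # would be asked to re-review

acc_a, pass_a = qc_sample(center_a)
acc_b, pass_b = qc_sample(center_b)
print(pass_a, pass_b)  # True False
```

Centers returning `False` would re-evaluate and correct their data before analysis.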

Data preprocessing

To handle missing data, the extent and pattern of missingness across all variables were first assessed (Fig. S1). The missingness mechanism for the skull defect area variables was first assessed by clinical experts and then statistically verified, supporting the assumption of missingness completely at random (Table S2)31,32. We then used cross-validation to evaluate six commonly used imputation methods, with the normalized root mean square error (NRMSE) as the primary evaluation metric. Among these, missForest33 was selected for subsequent analysis because of its superior performance in minimizing NRMSE (Fig. S2). Sensitivity analyses verified that the imputation process preserved the data distribution and did not introduce bias, as expected under these assumptions (Fig. S3). Outliers were identified using a four-layer autoencoder trained with mean squared error loss and the Adam optimizer; samples with reconstruction errors above a 2σ threshold were considered outliers and excluded from further analysis34 (Fig. S4). The impact of this filtering step on model performance was evaluated through a sensitivity analysis, as detailed in Table S3. Continuous variables were standardized using Z-score normalization, and categorical variables were transformed by one-hot encoding35. To avoid data leakage, all preprocessing steps were applied separately to the derivation cohort and the geographic external validation cohort. The temporal external validation dataset was left unprocessed to reflect real-world deployment conditions.
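The 2σ outlier rule can be sketched in a few lines; here the per-sample reconstruction errors are passed in precomputed (in the study they come from the trained autoencoder), and the data values are synthetic:

```python
import numpy as np

def filter_outliers(X, reconstruction_errors, n_sigma=2.0):
    """Drop samples whose autoencoder reconstruction error exceeds
    mean + n_sigma * std (the 2-sigma rule). `reconstruction_errors`
    would come from the trained autoencoder; here it is precomputed."""
    errors = np.asarray(reconstruction_errors)
    threshold = errors.mean() + n_sigma * errors.std()
    keep = errors <= threshold
    return X[keep], keep

# Toy data: 19 well-reconstructed samples and one anomalous sample.
X = np.arange(40).reshape(20, 2)
errors = np.array([0.1] * 19 + [5.0])

X_clean, keep = filter_outliers(X, errors)
print(X_clean.shape)  # (19, 2) -- the anomalous sample is removed
```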

Feature selection

To address multicollinearity among highly correlated features, we applied a correlation filter to remove redundancy, ensuring that all pairwise correlation coefficients were less than 0.6. We then calculated variance inflation factors to confirm there was no residual multicollinearity among the remaining variables (Fig. S5). Feature selection was performed using four different algorithms: Boruta36, LASSO37, random forest-based recursive feature elimination (RF-RFE)38, and a genetic algorithm (GA)39,40 (Fig. S6). The final feature set was determined as the intersection of the variables selected by all four methods and visualized with a Venn diagram41. Experienced neurosurgeons from the three centers then reviewed the selected features and finalized predictors with high face validity and ease of implementation.

Model development, comparison and evaluation

Fifteen ML models were developed to predict the risk of complications after cranioplasty: generalized additive model (GAM), logistic regression, gradient-boosted decision tree, k-nearest neighbors, Light Gradient Boosting Machine (LightGBM), rotation forest (RotF), Extreme Gradient Boosting, naive Bayes, Adaptive Boosting (AdaBoost), multilayer perceptron, support vector machine, decision tree, Extremely Randomized Trees (ExtraTrees), Gaussian process classifier, and random forest (RF). We performed 5-fold cross-validation on the derivation cohort to minimize overfitting and increase model robustness before external validation. The optimal cutoff for each model was determined by maximizing the Youden index (sensitivity + specificity − 1). 95% confidence intervals were estimated using the bias-corrected and accelerated (BCa) bootstrap method42.
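The Youden-index cutoff can be obtained directly from the ROC coordinates; a minimal sketch with toy labels and predicted probabilities:

```python
import numpy as np
from sklearn.metrics import roc_curve

def youden_cutoff(y_true, y_prob):
    """Probability threshold maximizing the Youden index
    J = sensitivity + specificity - 1 = TPR - FPR."""
    fpr, tpr, thresholds = roc_curve(y_true, y_prob)
    j = tpr - fpr
    return thresholds[np.argmax(j)]

y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
y_prob = np.array([0.1, 0.2, 0.3, 0.55, 0.5, 0.7, 0.8, 0.9])
print(youden_cutoff(y_true, y_prob))  # 0.7
```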

We defined a new metric, the \(A{B}_{score}\), to select the best predictive model for each complication: it combines the area under the receiver operating characteristic curve (AUROC) and the Brier score to evaluate both the discrimination and calibration of each model. Performance metrics from the training cohort were not used for model evaluation, to avoid the overly optimistic estimates derived from training data. To reliably assess the generalizability of each model, only internal cross-validation metrics and external validation metrics were considered.

The \(A{B}_{score}\) is given by Eq. (1):

$$A{B}_{score}=\alpha \cdot \overline{AUROC}+\alpha \cdot (1-\overline{Brier\,score})$$

(1)

where \(\overline{AUROC}\) and \(\overline{Brier\,score}\) denote the arithmetic means of the AUROC values and Brier scores over the internal cross-validation set and the geographic external validation set, respectively. \(\alpha\) is a weighting factor; we set \(\alpha = 0.5\), placing equal emphasis on discrimination and calibration in the \(A{B}_{score}\). The term \(1-\overline{Brier\,score}\) is used so that it is aligned with the positive direction of AUROC (higher is better).
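Eq. (1) reduces to a one-line computation; the metric values below are illustrative, not results from the study:

```python
def ab_score(auroc_internal, auroc_external,
             brier_internal, brier_external, alpha=0.5):
    """AB_score = alpha * mean(AUROC) + alpha * (1 - mean(Brier)),
    averaging over the internal CV and geographic external sets (Eq. 1)."""
    mean_auroc = (auroc_internal + auroc_external) / 2
    mean_brier = (brier_internal + brier_external) / 2
    return alpha * mean_auroc + alpha * (1 - mean_brier)

# Illustrative values: 0.5 * 0.84 + 0.5 * (1 - 0.13) = 0.855
print(ab_score(0.86, 0.82, 0.12, 0.14))
```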

The discriminatory ability of the final model was evaluated using receiver operating characteristic (ROC) and precision–recall (PR) curves, and calibration was assessed with Brier scores and calibration curves43. Decision curve analysis (DCA) was used to evaluate the net benefit of the model across decision thresholds44. To assess model fairness, we also evaluated performance across demographic subgroups, including age (<40 years vs ≥40 years) and sex (male, female), within the external validation cohort45.
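The net benefit plotted by DCA follows the standard Vickers–Elkin formula; a minimal sketch with toy predictions:

```python
import numpy as np

def net_benefit(y_true, y_prob, threshold):
    """Net benefit at a given risk threshold pt (decision curve analysis):
    NB = TP/N - FP/N * pt / (1 - pt)."""
    y_true = np.asarray(y_true)
    pred_pos = np.asarray(y_prob) >= threshold
    n = len(y_true)
    tp = np.sum(pred_pos & (y_true == 1))
    fp = np.sum(pred_pos & (y_true == 0))
    return tp / n - fp / n * threshold / (1 - threshold)

y_true = [1, 1, 0, 0, 0]
y_prob = [0.9, 0.6, 0.4, 0.2, 0.1]
print(net_benefit(y_true, y_prob, 0.5))  # 0.4: 2 TP, 0 FP among 5 patients
```

Sweeping `threshold` over a grid and comparing against the treat-all and treat-none strategies yields the decision curve.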

Complication-specific modeling

Separate machine learning models were developed for each major type of postoperative complication. To address the severe class imbalance in complication-specific modeling, the adaptive synthetic sampling approach (ADASYN)46 was applied to oversample the minority class and improve the models' sensitivity to rare outcomes.

Model explanation and causal inference analysis

To address the inherent opacity of machine learning models, we adopted a game-theoretic approach, SHapley Additive exPlanations (SHAP), and used SHAP values to quantify the contribution of individual features to the deviation from the average prediction47. In addition, we applied partial dependence plots (PDPs) to visualize interactions between the key features identified by SHAP and the impact of their combinations on model predictions48,49.
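The quantity SHAP decomposes (prediction minus average prediction) has a simple closed form for a linear model with independent features, φ_j = β_j (x_j − E[x_j]), which makes the additivity property easy to verify; the coefficients and data below are illustrative:

```python
import numpy as np

# For a linear model f(x) = beta . x with independent features, the SHAP
# value of feature j is phi_j = beta_j * (x_j - E[x_j]); the values sum
# to f(x) minus the average prediction (illustrative coefficients).
beta = np.array([0.8, -0.5, 0.3])
X = np.random.default_rng(0).normal(size=(500, 3))  # background data
x = np.array([1.0, 2.0, -1.0])                      # patient to explain

phi = beta * (x - X.mean(axis=0))                   # per-feature attributions
pred_dev = beta @ x - (X @ beta).mean()             # deviation from average

print(np.allclose(phi.sum(), pred_dev))  # True: attributions are additive
```

Tree-based models require the TreeSHAP algorithm instead, but the same additivity holds.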

A two-step approach was adopted to investigate the potential causal effects of modifiable surgical factors on postoperative complications. First, Diverse Counterfactual Explanations (DiCE)50 was used to simulate hypothetical adjustments of clinically actionable variables and to investigate whether such changes would alter the model-predicted risk of complications. We then applied causal inference techniques to quantify the influence of these variables: the average treatment effect (ATE) was estimated using double machine learning (DML)51, while the T-Learner framework was adopted to estimate conditional average treatment effects (CATE) across patient subgroups52. Sensitivity analyses were conducted to assess the robustness of the causal estimates.
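The T-Learner itself is straightforward to sketch with scikit-learn: fit separate outcome models on the treated and control arms and take the difference of their predictions. The synthetic data below has a known treatment effect of 2.0 so the recovered estimate can be checked (base learner and effect size are illustrative):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def t_learner_cate(X, treatment, y):
    """T-Learner: fit separate outcome models mu1 (treated) and mu0
    (control); CATE(x) = mu1(x) - mu0(x). ATE = mean of CATE."""
    mu1 = RandomForestRegressor(random_state=0).fit(X[treatment == 1],
                                                    y[treatment == 1])
    mu0 = RandomForestRegressor(random_state=0).fit(X[treatment == 0],
                                                    y[treatment == 0])
    return mu1.predict(X) - mu0.predict(X)

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
t = rng.integers(0, 2, size=300)                     # randomized "treatment"
y = X[:, 0] + 2.0 * t + rng.normal(scale=0.1, size=300)  # true effect = 2.0

cate = t_learner_cate(X, t, y)
print(round(cate.mean(), 2))  # ATE estimate, close to the true effect of 2.0
```

DML additionally orthogonalizes treatment and outcome against confounders, which matters in the observational setting of this study.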

Statistical analysis

Patient data were categorized into continuous and categorical variables. Normality of continuous variables was assessed using the Kolmogorov–Smirnov test. Normally distributed variables were expressed as mean ± standard deviation, whereas non-normally distributed variables were expressed as median with interquartile range (IQR). Categorical variables were reported as counts and percentages. Data analysis was performed using IBM SPSS Statistics for Windows (IBM Corp., 2019 release, version 26.0), Python (version 3.8.2), and R (version 4.2.2).


