Analysis of feature importance
The treatment of symptomatic ossification of the thoracic ligamentum flavum with posterior laminectomy is technically demanding and accompanied by surgical complications such as dural tear and CSFL, which increase operative difficulty and risk and adversely affect patients' prognosis and satisfaction with surgery. Our machine learning models leveraged diverse data modalities, including demographics, radiographic parameters, and surgical metrics, to predict CSFL. To capture both global and local drivers of risk, we applied three complementary interpretability techniques across two resampling strategies (SMOTE and ADASYN): model coefficients (linear weights for LR and SVM; gain or impurity reduction for tree ensembles), SHAP values, which quantify each feature's average contribution to model output, and LIME explanations, which reveal how small perturbations in individual cases shift the predicted CSFL probability.
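As a minimal, scikit-learn-only sketch of the coefficient route, with permutation importance standing in as a model-agnostic attribution method (the study itself used the SHAP and LIME libraries; the toy features below are illustrative, not cohort data):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
# Toy stand-ins for features such as multi-segment involvement, IBL, OT.
X = rng.normal(size=(200, 3))
y = (2.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(0, 0.5, 200) > 0).astype(int)

lr = LogisticRegression().fit(X, y)
# Global importance for a linear model: coefficient magnitudes
# (features must be on comparable scales for this to be meaningful).
coef_rank = np.argsort(-np.abs(lr.coef_[0]))

# Model-agnostic stand-in: shuffling an informative feature degrades
# the score, so its mean importance comes out higher.
perm = permutation_importance(lr, X, y, n_repeats=10, random_state=0)
print(coef_rank, perm.importances_mean.round(3))
```

SHAP and LIME follow the same workflow but decompose individual predictions rather than a single global score.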
Across all methods and sampling schemes, four features emerged as the most robust predictors:
Multi-segment involvement (MuS): consistently highest in coefficient and SHAP rankings, reflecting that ossification spanning ≥ 2 levels demands larger bony resections and more extensive dural manipulation, mechanically elevating tear risk.

Intraoperative blood loss (IBL): large positive coefficients and SHAP contributions indicate that heavy bleeding degrades visualization and increases tissue stress, corroborated by LIME's local weights showing that even moderate increases in IBL substantially raise CSFL probability.

Operation time (OT): prolonged surgeries similarly degrade operative conditions; OT ranks among the top three features in both global and local explanations, and surgical time has previously been studied explicitly in relation to complications after spinal decompression [8].

Spinal canal encroachment ratios (especially RrSCA, RrPD, RrDCM): cases with a residual canal area below 50% or a paramedian residual diameter under ~45% consistently pushed model outputs toward higher CSFL risk, as evidenced by pronounced SHAP shifts in both SMOTE and ADASYN models.
Beyond these primary drivers, several secondary factors showed moderate but nontrivial importance:
Duration of symptoms and diabetes history: although prior studies did not flag these as CSFL predictors, our models, especially under ADASYN, assigned positive weights to longer symptom duration and diabetic status, suggesting that prolonged cord compression and microvascular changes may subtly increase dural fragility [28]. In a multivariate regression analysis by Kinaci et al. [29], younger age, male sex, higher body mass index, and smoking history were associated with increased incisional CSFL risk, whereas duration of symptoms and diabetes have not previously been shown to be important predictors of postoperative CSFL. One possible explanation is that these variables are genuinely meaningful and that the ML algorithms captured complex, nonlinear associations between these features and CSFL that were not detected by previous works [30]. In addition, a longer duration of symptoms often implies a longer period of spinal cord compression, and progression of ossification tends to increase the risk of CSFL [31].

Dural ossification (DO): ranked lower in aggregate importance but featured among the top three predictors in XGBoost and LightGBM under SMOTE, highlighting model-specific sensitivity to combined ossification patterns. Notably, some studies [32, 33] have shown DO to affect postoperative outcomes; when DO and ligament ossification are present simultaneously, the risk of dural tear and CSFL rises significantly, greatly affecting the patient's prognosis.

Decompression instrument (DI-1, DI-2, DI-3): linear model coefficients and tree gains ranked traditional bone chisels and the high-speed drill above piezosurgery, a novel finding that warrants further clinical investigation.
Traditional bone chisels are associated with high labor intensity and prolonged decompression time, which may exacerbate mechanical compression-induced injury to the spinal cord within the canal [34]. In contrast, the high-speed drill significantly enhances operative efficiency while reducing operator workload; however, its use requires advanced technical proficiency, as improper handling may cause direct trauma to the spinal cord and dura mater [35]. Additionally, the high-speed drill carries risks of thermal injury to neural structures [36] and potential damage to adjacent soft tissues. In recent years, piezosurgery has gained widespread adoption in spinal surgery. This technique uses high-frequency micro-vibrations of the cutting blade to achieve precise and safe osteotomy, offering several theoretical advantages: exceptional cutting precision, superior tissue selectivity, minimal neural tissue trauma, reduced operative duration, and decreased intraoperative blood loss.
These insights suggest tailored strategies at each perioperative stage:
Preoperative planning: patients with MuS, deep stenosis (e.g., RrSCA < 50%), long symptom duration, or diabetes should receive more detailed preoperative risk stratification; such patients may also benefit from piezosurgery decompression to minimize the risk of dural tear.

Intraoperative monitoring: real-time tracking of cumulative blood loss and elapsed time against model-derived thresholds can trigger dural safeguard protocols, such as staged decompression or early dural inspection, when key metrics cross critical cutoffs.

Postoperative risk stratification: incorporating calibrated probability estimates (via isotonic regression under SMOTE) will refine individualized CSFL risk scores, supporting shared decision-making and targeted follow-up.
By uniting traditional coefficient analysis with SHAP’s game‑theoretic attributions and LIME’s local fidelity, this integrated assessment delivers a nuanced, actionable feature hierarchy that aligns mechanistic understanding with data‑driven risk prediction.
Analysis of the causes of model performance
In this study, we analyze theoretical and practical factors shaping model behaviour to understand the differing performances observed across models and sampling strategies. The inherent class imbalance in the dataset—only 31.8% of cases exhibited CSFL—necessitated the use of oversampling strategies to ensure adequate representation of the minority class during training. We applied two well-established techniques, SMOTE and ADASYN, and evaluated their impact across five classifiers.
Although oversampling nominally balances class distributions, the clinical context of CSFL prediction dictates that evaluation metrics must go beyond simple accuracy or AUC. In real-world practice, the cost of a false negative (i.e., missing a true CSFL case) is far higher than a false positive. Missed leaks can lead to meningitis, wound healing failure, or reoperation, while false positives may only prompt additional intraoperative inspection. Therefore, recall (sensitivity) and the F1 score, which balances recall with precision, are the most meaningful indicators of model value in this high-risk setting.
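The asymmetry between accuracy and recall is easy to make concrete. In this minimal sketch (toy labels, not study data), a classifier that misses two of three leaks still reports 80% accuracy, while recall and F1 expose the clinically costly misses:

```python
from sklearn.metrics import accuracy_score, f1_score, recall_score

# Toy example: 10 patients, 3 true CSFL cases (label 1).
y_true = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]
# This classifier detects only one of the three leaks.
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 0, 0]

acc = accuracy_score(y_true, y_pred)  # 0.8 despite two missed leaks
rec = recall_score(y_true, y_pred)    # 1/3: the clinically relevant miss rate
f1 = f1_score(y_true, y_pred)         # 0.5: harmonic mean of precision and recall
print(acc, rec, f1)
```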
The results showed that SVM under SMOTE yielded the best balance of sensitivity and precision (F1 = 0.8889, recall = 0.881), maintaining strong generalization with minimal drop under ADASYN. SVM leverages its margin‑maximization principle:
$$\mathop {\min }\limits_{\omega ,b} \frac{1}{2}\left\| \omega \right\|^{2} + C\sum_{i = 1}^{N} \xi_{i} ,\quad {\text{subject to }}\,y_{i} \left( {\omega^{T} \phi \left( {x_{i} } \right) + b} \right) \ge 1 - \xi_{i} ,$$
where \(\omega\) is the weight vector, \(b\) the bias term, \({\xi }_{i}\) the slack variable handling misclassified samples, and \(C\) a regularization parameter; this formulation enables the SVM to separate classes while minimizing misclassification errors effectively [20]. In other words, the robustness of SVM can be attributed to its margin-maximization principle, which inherently resists overfitting to noise and focuses on a sparse set of support vectors. When synthetic samples generated by SMOTE fill in the minority manifold uniformly, the SVM benefits from a clearer margin boundary, enhancing recall without sacrificing precision. In contrast, ADASYN concentrates synthetic data generation on minority instances near class boundaries. This approach can help when minority samples lie in sparse regions but may also inject instability into models like SVM, resulting in slightly degraded F1.
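The trade-off that \(C\) governs between margin width and slack penalties can be sketched with scikit-learn's SVC on toy two-cluster data (illustrative, not the study cohort):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Two overlapping Gaussian clusters standing in for no-CSFL / CSFL.
X = np.vstack([rng.normal(0, 1, (40, 2)), rng.normal(2, 1, (40, 2))])
y = np.array([0] * 40 + [1] * 40)

# C trades margin width against slack penalties: small C widens the
# margin and tolerates misclassification; large C penalizes slack.
clf = SVC(kernel="rbf", C=1.0).fit(X, y)

# Only the support vectors define the boundary, which is why the SVM
# stays sparse and comparatively robust to interpolated samples.
print(len(clf.support_), "support vectors out of", len(X), "samples")
```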
LR, by comparison, consistently underperformed across both sampling methods. Its linear decision boundary cannot model the nonlinear interactions among demographic, radiological, and surgical features. The issue is exacerbated with ADASYN, where synthetic examples in complex regions of feature space may lie outside LR’s representational ability, leading to underfitting and poor recall.
Among ensemble models, RF achieved the highest AUC under SMOTE (0.9462), reflecting its strength in capturing complex variable interactions (Fig. 2). However, RF’s reliance on bootstrapped decision trees makes it more susceptible to overfitting in small-sample, imbalanced settings. When synthetic samples are concentrated via ADASYN near noisy class boundaries, the model may overly adjust to spurious splits, thereby reducing precision and F1.
Similarly, XGBoost, a gradient‑boosting framework, closely followed RF, optimizing:
$${\mathcal{L}}\left( \theta \right) = \mathop \sum \limits_{i = 1}^{N} l\left( {y_{i} ,\widehat{{y_{i} }}} \right) + \mathop \sum \limits_{k = 1}^{K} {\Omega }\left( {f_{k} } \right),$$
where \(l({y}_{i},\widehat{{y}_{i}})\) is the loss function measuring classification error and \(\Omega ({f}_{k})\) is a regularization term controlling model complexity [17]. XGBoost, which incrementally fits new trees to correct prior errors, performed strongly under SMOTE but declined under ADASYN (SMOTE F1 = 0.8537; ADASYN F1 = 0.8312). Its greedy optimization process can overcompensate in the presence of borderline synthetic points, leading to miscalibrated probability estimates. LightGBM, while efficient on large-scale data, was modestly outperformed by RF and SVM. Its leaf-wise splitting strategy may fail to generalize in small datasets, especially when positive-class signals are diluted by oversampling artefacts.
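The stagewise fitting that this objective describes can be sketched with scikit-learn's GradientBoostingClassifier, an analogous gradient-boosting implementation rather than the XGBoost library itself (toy data and illustrative hyperparameters):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = (X[:, 0] + X[:, 1] ** 2 + rng.normal(0, 0.3, 300) > 1).astype(int)

# Each of the K trees f_k is fitted against the gradient of the loss on
# the current ensemble's predictions; learning_rate shrinks each step
# and max_depth bounds tree complexity, playing the role of Omega(f_k).
gb = GradientBoostingClassifier(
    n_estimators=100, learning_rate=0.1, max_depth=3
).fit(X, y)

print(round(gb.score(X, y), 3))
```

The greedy, residual-correcting loop is exactly what lets borderline synthetic points pull successive trees toward spurious boundaries.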
Crucially, these observations align with the underlying mathematical assumptions of SMOTE and ADASYN. SMOTE generates synthetic points by linear interpolation between minority-class neighbours, effectively regularizing the feature space and smoothing class boundaries. ADASYN, in contrast, assigns more synthetic points to “harder-to-learn” instances—those surrounded by majority-class neighbours. While this adaptive mechanism is beneficial in truly sparse minority regions, it may also amplify class overlap and introduce labelling ambiguity, which is particularly detrimental to variance-sensitive models like RF and XGBoost.
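SMOTE's interpolation rule, \(x_{new} = x_{i} + \lambda (x_{nn} - x_{i})\) with \(\lambda \sim U(0,1)\), can be written out directly. The following is a hand-rolled sketch of the mechanism only (in practice one would use imbalanced-learn's SMOTE implementation):

```python
import numpy as np

def smote_sample(X_min, k=5, n_new=10, seed=0):
    """Synthesize minority points by linear interpolation between each
    chosen point and one of its k nearest minority-class neighbours:
    x_new = x_i + lam * (x_nn - x_i), lam ~ U(0, 1)."""
    rng = np.random.default_rng(seed)
    # Pairwise distances within the minority class only.
    d = np.linalg.norm(X_min[:, None] - X_min[None, :], axis=-1)
    np.fill_diagonal(d, np.inf)                # exclude self-matches
    nn = np.argsort(d, axis=1)[:, :k]          # k nearest neighbours
    idx = rng.integers(0, len(X_min), n_new)   # base points
    nbr = nn[idx, rng.integers(0, k, n_new)]   # one neighbour each
    lam = rng.random((n_new, 1))
    return X_min[idx] + lam * (X_min[nbr] - X_min[idx])

X_min = np.random.default_rng(1).normal(size=(8, 3))
X_new = smote_sample(X_min, k=3, n_new=5)
# Synthetic points lie on segments between existing minority points,
# so they never leave the minority class's bounding region.
print(X_new.shape)
```

ADASYN uses the same interpolation step but skews the choice of base points toward minority instances with many majority-class neighbours, which is the source of the boundary-amplification behaviour discussed above.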
From a deployment perspective, these findings suggest that SVM combined with SMOTE sampling and isotonic calibration may be the most appropriate choice when minimizing missed CSFL cases is paramount: this combination demonstrated high recall, robust generalization, and well-calibrated probabilities. For scenarios prioritizing precision, such as when overtreatment carries a high clinical or economic cost, RF under SMOTE may be favoured. While ADASYN was less effective in this dataset, it may remain valuable when minority subclasses are poorly represented or exhibit atypical patterns.
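Pairing an SVM with isotonic calibration can be sketched with scikit-learn's CalibratedClassifierCV (synthetic two-class data standing in for the cohort; the oversampling step is omitted for brevity):

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 4)), rng.normal(1.5, 1, (100, 4))])
y = np.array([0] * 100 + [1] * 100)

# Raw SVC decision values are not probabilities; isotonic regression
# learns a monotone map from them onto calibrated probabilities that
# can serve as individual risk scores.
base = SVC(kernel="rbf", C=1.0)
cal = CalibratedClassifierCV(base, method="isotonic", cv=5).fit(X, y)

risk = cal.predict_proba(X)[:, 1]  # calibrated risk score per patient
print(risk.min(), risk.max())
```

Calibrated outputs are what make model-derived thresholds (for intraoperative safeguards or follow-up intensity) interpretable as actual probabilities rather than arbitrary scores.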
In conclusion, a model’s effectiveness in imbalanced medical prediction tasks depends not only on performance metrics but also on the interplay between data distribution, sampling strategy, and model structure. Our results emphasize the importance of aligning these factors to the clinical decision context and balancing sensitivity, interpretability, and reliability in high-stakes prediction.
Comparison with previous studies
Most previous studies predicting CSFL following thoracic decompression surgery have relied on traditional LR models, which offer limited flexibility in accounting for complex, non-linear relationships among patient characteristics, radiological findings, and intraoperative variables [7,8,9]. While applicable for exploratory analysis, such models may not adequately reflect the multifactorial nature of CSFL risk, particularly in patients with multi-segment disease or severe spinal canal stenosis.
In this study, we implemented a broader machine learning framework incorporating advanced sampling strategies to address the class imbalance, probability calibration to ensure clinical interpretability of risk scores, and model explainability tools to clarify the role of individual predictors. To our knowledge, this is the first investigation to systematically evaluate combinations of SMOTE and ADASYN with Platt scaling and isotonic regression, offering insight into how different modelling strategies affect predictive accuracy, reliability, and transparency—qualities critical to adoption in clinical settings.
Beyond technical performance, our findings have direct relevance to preoperative risk stratification. The best-performing model—SVM with SMOTE resampling and isotonic calibration—demonstrated high sensitivity and excellent discrimination, making it well-suited for identifying patients at elevated risk of CSFL. Importantly, the output of this model is not a binary classification but a calibrated probability score that can be readily integrated into clinical decision-making workflows. For example, patients identified as high-risk could be candidates for enhanced dural protection techniques, staged decompression, or intensified postoperative monitoring. Furthermore, the predictors identified as most influential—multi-segment involvement, spinal canal encroachment parameters, operative time, and blood loss—are measurable pre- or intraoperatively and modifiable to some extent through surgical planning.
It is also worth noting that while SVM with SMOTE emerged as the optimal combination in our cohort, this may not be universally true across all institutional contexts. In clinical settings where rapid inference, system resource constraints, or interpretability requirements differ, other models such as random forest, XGBoost, or LightGBM may offer advantages. Our findings thus provide a flexible foundation for tailoring predictive solutions to the needs of different surgical teams, patient populations, or health system infrastructures.
Nevertheless, this work remains a single-centre, retrospective study, and the generalizability of our results should be confirmed through external, multi-institutional validation. Moreover, while we focused on structured clinical and imaging features, future studies should explore whether advanced imaging analysis, such as radiomic texture features or deep learning–based segmentation, can improve prediction accuracy and support surgical precision.
In conclusion, by integrating robust machine learning approaches with clinically relevant evaluation strategies, our study offers a practical and interpretable tool for improving perioperative risk assessment in patients undergoing thoracic decompression. This approach holds promise for improving surgical outcomes and enhancing personalized care through data-driven decision support.
Limitations
Several limitations of this study warrant consideration. First, although the results demonstrate promising predictive performance, the model was developed and validated using data from a single academic centre. As such, its generalizability to other institutions with different surgical techniques, patient demographics, or imaging protocols remains uncertain. External validation across multiple centres and prospective clinical studies will be essential before routine clinical implementation can be considered.
Second, while the study incorporated a range of structured variables—including demographics, imaging-derived measurements, and intraoperative details—certain predictors, such as residual canal diameters and blood loss, may not be readily or consistently available across all settings. In particular, the reliance on high-resolution preoperative CT scans and standardized intraoperative documentation could limit the model’s scalability in resource-limited environments. Future iterations of this work may benefit from evaluating model performance using more universally accessible data inputs or automated imaging extraction techniques.
Third, although our machine learning framework outperformed traditional approaches, we did not include deep learning–based methods, such as convolutional neural networks or transformer-based architectures, which may be capable of capturing higher-order patterns in imaging or temporal surgical data. These methods may enhance predictive accuracy, particularly when combined with raw image inputs or intraoperative video data. However, their higher computational burden and limited interpretability pose challenges for near-term clinical use.
Fourth, the exclusion of 215 patients based on predefined criteria, while intended to ensure data consistency and quality, may have introduced selection bias. Differences in characteristics between included and excluded patients could lead to underestimation or overestimation of CSFL risk in certain subpopulations, thereby affecting the model's generalizability. Although we conducted internal comparisons of baseline characteristics and applied robust validation techniques such as cross-validation to mitigate this risk, the potential for bias remains. Future studies should systematically evaluate the impact of exclusion criteria on predictive performance.
Fifth, as a retrospective study, the absence of prospective validation limits the applicability of our model in real-world clinical workflows. While current constraints prevent immediate prospective analysis, we have enhanced internal validation and made our model publicly available to encourage external validation by other institutions. Additionally, we are actively planning multi-center prospective studies to further assess and refine our predictive framework.
Sixth, the model was trained using oversampling techniques to address the class imbalance, but some misclassification of CSFL cases remained—particularly in less typical presentations. While we employed established techniques such as SMOTE and ADASYN, the recall of rare variants of CSFL remains suboptimal, and future work could explore more advanced strategies, such as cost-sensitive learning, federated augmentation, or generative adversarial networks to simulate rare events.
Finally, the current model represents a static snapshot based on historical data. In real-world clinical practice, predictive tools must evolve alongside changing surgical standards, imaging quality, and patient characteristics. Mechanisms for periodic model updating and recalibration will be essential to maintaining relevance and reliability over time. Furthermore, ethical considerations—such as transparency, explainability, and equity across different patient subgroups—should remain a central focus in future deployment efforts.
In light of these limitations, our future work will focus on multi-centre collaboration, including richer data modalities and developing clinically integrated tools that can be adapted to diverse healthcare environments while maintaining interpretability and safety.
