Machine learning to predict penumbra-core mismatch in acute ischemic stroke using clinical note data

Research Design and Patient Population

This was a retrospective model derivation study using patient data from the Mount Sinai Health System (MSHS), a multisite tertiary care hospital network in New York City consisting of two academic medical centers and five community hospitals. The Mount Sinai Institutional Review Board approved the use of patient data for this study under IRB #19–00738. MSHS is one of the major academic stroke centers in the New York metropolitan area, with around 2400 admissions and 240 EVT procedures each year. The patient population in this derivation study was drawn from a separate, unpublished study in which a convolutional neural network (deep learning) model was developed to estimate CTP core and penumbra volumes directly from CTH images. That separate analysis consisted of patients who underwent acute CTP imaging within 30 minutes of CTH or CTA imaging for suspected acute stroke from May 14, 2019 to June 15, 2021.

We excluded patients who were not evaluated for a suspected stroke or who had a primary intracranial (intraparenchymal, subarachnoid, or subdural) hemorrhage. We also excluded patients without diagnostically interpretable CTP images, without an arterial occlusion, or without significant penumbra, measured as the volume of tissue with a time-to-maximum (TMAX) greater than 6 seconds. Finally, we excluded patients whose symptoms could not have been caused by the LVO and those who had a known chronic infarction in the culprit arterial territory.

Measurement

Viz.ai-generated CTP penumbra and core volumes for each patient were manually extracted using the institutional picture archiving and communication system (PACS) web viewer, and the penumbra-to-core (P:C) ratio was calculated.²⁰ Based on previously published work, “penumbra” was defined as the region on CTP with TMAX > 6 s compared to the contralateral brain hemisphere, and “core” was defined as the region with relative CBF of less than 30% of the contralateral hemisphere.²¹ We trained the ML models using structured data and free-text data to classify the P:C ratio into one of two binary categories (>= 1.8 or < 1.8).
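The outcome label described above can be sketched in a few lines. This is a minimal illustration, not the study's code: the function names are hypothetical, and treating a zero core volume as an infinite ratio is an assumption, since the text does not say how that case was handled.

```python
import math

PC_THRESHOLD = 1.8  # binary cut-point stated in the text


def pc_ratio(penumbra_ml: float, core_ml: float) -> float:
    """Penumbra-to-core ratio from the two CTP volumes (in mL)."""
    if core_ml == 0:
        # Assumption: a zero core volume is treated as an infinite ratio.
        return math.inf
    return penumbra_ml / core_ml


def pc_label(penumbra_ml: float, core_ml: float) -> int:
    """Return 1 when P:C >= 1.8 and 0 otherwise."""
    return int(pc_ratio(penumbra_ml, core_ml) >= PC_THRESHOLD)
```

For example, a patient with a 100 mL penumbra and a 50 mL core has a P:C ratio of 2.0 and falls in the >= 1.8 category.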

Predictors – Structured Data

Clinical and sociodemographic variables recorded at the time of assessment, including gender, race, preferred language, medical comorbidities, and NIH Stroke Scale (NIHSS) scores, were collected from the institutional data warehouse. Using each patient's history of International Classification of Diseases, 10th Revision, Clinical Modification (ICD-10-CM) codes, an Elixhauser comorbidity index was calculated from diagnoses coded before the scan time. Categorical variables were one-hot encoded.
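As a rough illustration of one-hot encoding for the categorical predictors, the sketch below uses pandas. The column names and values are hypothetical placeholders, not the study's actual variable coding.

```python
import pandas as pd

# Hypothetical structured predictors for three patients.
structured = pd.DataFrame(
    {
        "nihss": [4, 17, 9],
        "preferred_language": ["English", "Spanish", "English"],
    }
)

# Each category becomes its own 0/1 column; numeric columns pass through unchanged.
encoded = pd.get_dummies(structured, columns=["preferred_language"], dtype=int)
```

Here `preferred_language` is replaced by the indicator columns `preferred_language_English` and `preferred_language_Spanish`.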

Predictors – Unstructured Text

Free-text data from all notes written within one week of each patient's CTP scan time were extracted. To reduce irrelevant text (or “noise”), we limited the included text to a pre-specified character threshold. Because notes written closest to the time of neuroimaging are the most likely to contain clinically relevant information for predicting the outcome, we constructed a “patient-level corpus” by adding entire clinical notes in reverse chronological order, starting from the CTP scan time, until a character threshold was reached. Here, the term “corpus” refers to a large collection of text or speech data used for NLP tasks^22,23.

This process resulted in a character-limited subset of each patient's text. Whole notes were included without truncation, and if the note closest to the CTP scan exceeded the character threshold, only that note was incorporated into the model (Figure 3). We selected conservative character thresholds that captured the median note length in both the overall corpus and the notes written closest to the CTP scan.
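The corpus construction described above can be sketched as follows. This is an illustration under stated assumptions rather than the study's implementation: the function name and the `(timestamp, text)` note format are hypothetical, only notes written at or before the scan are considered, and the note that crosses the threshold is kept whole.

```python
from datetime import datetime, timedelta
from typing import List, Tuple

Note = Tuple[datetime, str]  # (time written, note text); hypothetical format


def build_patient_corpus(
    notes: List[Note],
    scan_time: datetime,
    char_threshold: int,
    window: timedelta = timedelta(days=7),
) -> str:
    """Concatenate whole notes, nearest to the scan first, until the
    running character count reaches char_threshold."""
    # Assumption: only notes written at or before the scan, within the window.
    eligible = [n for n in notes if scan_time - window <= n[0] <= scan_time]
    # Reverse chronological order: the note closest to the scan comes first.
    eligible.sort(key=lambda n: n[0], reverse=True)

    selected: List[str] = []
    total = 0
    for _, text in eligible:
        if total >= char_threshold:
            break
        selected.append(text)  # notes are never truncated
        total += len(text)
    return "\n".join(selected)
```

With this rule, a first note that is already longer than the threshold is the only note included for that patient.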

**Figure 3: Text processing pipeline for generating document embeddings.**

Additionally, all patient-level corpora were subjected to a series of routine free-text preprocessing procedures, including converting all letters to lowercase, removing stop words, removing punctuation, and stemming to reduce words to their base forms. To maximize the generalizability of the model and ensure that the algorithm did not learn patient-specific data (unique names, addresses, etc.), we excluded words unique to a single patient.
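A simplified sketch of these preprocessing steps is shown below. It covers lowercasing, punctuation removal, stop-word removal, and the exclusion of words that appear in only one patient's corpus. Stemming is omitted for brevity, and the stop-word list is a tiny illustrative placeholder, since the study's actual list is not given.

```python
import string
from collections import Counter
from typing import Dict, List

# Tiny illustrative stop-word list (assumption; not the study's list).
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "with", "was", "is", "in", "on"}


def tokenize(text: str) -> List[str]:
    """Lowercase, strip punctuation, and drop stop words."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return [tok for tok in cleaned.split() if tok not in STOP_WORDS]


def drop_single_patient_words(corpora: Dict[str, str]) -> Dict[str, List[str]]:
    """Tokenize each patient's corpus and remove words seen in only one patient."""
    tokens = {pid: tokenize(text) for pid, text in corpora.items()}
    # Document frequency: number of patients whose corpus contains each word.
    doc_freq = Counter(word for toks in tokens.values() for word in set(toks))
    return {pid: [w for w in toks if doc_freq[w] > 1] for pid, toks in tokens.items()}
```

Because names and addresses rarely recur across patients, filtering on document frequency removes most of them without needing an explicit list of identifiers.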

A combination of two NLP techniques was employed to convert the patient-level corpus of free text into a form that the model can ingest: term frequency-inverse document frequency (TF-IDF) and word embedding. TF-IDF is a mathematical method that outputs a score indicating how frequently a particular word occurs within a particular patient's text corpus compared to all other patients' clinical notes^24,25. In contrast, word embedding encodes each word in a text document as a vector, or a series of numbers, in high-dimensional space. This process ensures that words with similar meanings and usage are placed close to each other in this space. An effective analogy is the postal system: each word is assigned an “address,” and words with related meanings and frequent co-occurrences receive mathematically close “addresses” in semantic space. Word embedding therefore captures semantic relationships between words in a quantifiable way and is commonly used in NLP tasks^26,27,28.
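To make the TF-IDF score concrete, the sketch below computes it with a common textbook variant: term frequency as a word's share of the patient's tokens, and inverse document frequency as the natural log of the number of patients divided by the number of patients whose corpus contains the word. The exact weighting and smoothing used in the study are not stated, so this variant is an assumption.

```python
import math
from collections import Counter
from typing import Dict, List


def tf_idf(corpora: Dict[str, List[str]]) -> Dict[str, Dict[str, float]]:
    """Per-patient TF-IDF scores with tf = count / len(doc), idf = ln(N / df)."""
    n_docs = len(corpora)
    # Number of patients whose corpus contains each word.
    doc_freq = Counter(word for toks in corpora.values() for word in set(toks))
    scores: Dict[str, Dict[str, float]] = {}
    for pid, toks in corpora.items():
        counts = Counter(toks)
        scores[pid] = {
            word: (count / len(toks)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        }
    return scores
```

Under this variant, a word that appears in every patient's corpus receives a score of zero, while a word concentrated in one patient's notes scores highly for that patient.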

Due to the large variation in patient note formats, author types, abbreviations, and word counts, we employed a combination of approaches to represent the clinical text from patient notes in a consistent, low-dimensional format. We sought to combine the signal implicit in the meaning of individual words with the signal in how those words are distributed across all patient notes (Figure 3). To achieve this, each word from each patient-level corpus was first converted into a vector.²⁹

Document-level vectors were then generated by summing the word vectors in each patient-level corpus, with each vector weighted by its TF-IDF score. We then concatenated the structured data with the weighted document vectors and scaled all features using linear transformations. Missing values were imputed using a five-nearest-neighbors approach.
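The weighting and concatenation steps can be sketched with NumPy as below. The function names, the toy two-dimensional embeddings, and the choice to sum once per distinct word (since term frequency is already part of the TF-IDF score) are assumptions for illustration. Min-max scaling stands in for the unspecified linear transformation, and imputation is not shown.

```python
import numpy as np


def document_vector(tokens, tfidf_scores, word_vectors, dim):
    """Sum the word vectors of one patient's corpus, each weighted by its TF-IDF score."""
    doc = np.zeros(dim)
    for word in set(tokens):  # one term per distinct word (assumption)
        if word in word_vectors:
            doc += tfidf_scores.get(word, 0.0) * word_vectors[word]
    return doc


def min_max_scale(features):
    """Linearly rescale each column to [0, 1]; constant columns map to 0."""
    lo, hi = features.min(axis=0), features.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)
    return (features - lo) / span


# Toy 2-dimensional embeddings and TF-IDF scores (hypothetical values).
word_vectors = {"weakness": np.array([1.0, 0.0]), "aphasia": np.array([0.0, 2.0])}
tfidf_scores = {"weakness": 0.5, "aphasia": 0.25}

doc = document_vector(["weakness", "aphasia", "weakness"], tfidf_scores, word_vectors, dim=2)
structured = np.array([17.0, 1.0])  # e.g. NIHSS and one indicator column
features = np.concatenate([structured, doc])  # one patient's full feature row
```

In a full pipeline, `min_max_scale` would be applied to the matrix formed by stacking every patient's feature row.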

Model development and performance measurement

We trained 10 different model architectures on these features: K-Nearest Neighbors, Support Vector Classifier, Decision Tree, Random Forest, Adaptive Boosting, Gradient Boosting, Gaussian Naive Bayes, Linear Discriminant Analysis, Quadratic Discriminant Analysis, and Extreme Gradient Boosting (XGBoost).^{18,30,31,32,33} For each model, we determined the sensitivity, specificity, accuracy, and F1 score, as well as the area under the receiver operating characteristic curve (AUROC). The best-performing model was defined as the combination of model architecture and character threshold that maximized AUROC. We used a bootstrap approach with 1000 random, stratified 70/30 train-test splits to generate the distribution of each evaluated performance measure.

We also used an iterative approach to identify the decision threshold that maximized Youden's index (a measure of diagnostic test accuracy calculated as sensitivity + specificity - 1), and we report a confusion matrix of the model's average performance at that threshold across all bootstraps.

In addition to training the complete model on both structured and textual data, we performed an ablation analysis in which we trained two additional versions of the best-performing model, one using only structured data and one using only textual data as the feature set. To assess whether different minimum character thresholds affect model performance, we also trained three versions of the best-performing model, one at each of three different character thresholds.
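The repeated stratified 70/30 splitting can be sketched with scikit-learn as below. This is an illustration only: it uses synthetic data, a single architecture (Gaussian Naive Bayes), and 50 splits rather than 1000 to keep the example fast. The function name and the percentile interval are assumptions, not details from the study.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.naive_bayes import GaussianNB


def resampled_auroc(X, y, n_splits=1000, seed=0):
    """AUROC on the held-out 30% for each random stratified 70/30 split."""
    splitter = StratifiedShuffleSplit(n_splits=n_splits, test_size=0.3, random_state=seed)
    aucs = []
    for train_idx, test_idx in splitter.split(X, y):
        model = GaussianNB().fit(X[train_idx], y[train_idx])
        prob = model.predict_proba(X[test_idx])[:, 1]  # probability of class 1
        aucs.append(roc_auc_score(y[test_idx], prob))
    return np.array(aucs)


# Synthetic stand-in for the real feature matrix and binary P:C labels.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
aucs = resampled_auroc(X, y, n_splits=50)
low, high = np.percentile(aucs, [2.5, 97.5])  # spread of the AUROC distribution
```

Stratifying each split keeps the proportion of the two P:C categories the same in the training and test sets, which matters when the classes are imbalanced.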
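A threshold search of this kind can be sketched as follows, evaluating sensitivity + specificity - 1 at each candidate threshold and keeping the best one. The function name and the grid of 101 evenly spaced candidates are assumptions. The sketch also assumes both classes are present in `y_true`.

```python
import numpy as np


def youden_threshold(y_true, y_prob, candidates=None):
    """Return (threshold, J) maximizing J = sensitivity + specificity - 1."""
    y_true = np.asarray(y_true).astype(bool)
    y_prob = np.asarray(y_prob, dtype=float)
    if candidates is None:
        candidates = np.linspace(0.0, 1.0, 101)  # assumed search grid
    best_t, best_j = None, -np.inf
    for t in candidates:
        pred = y_prob >= t
        tp = np.sum(pred & y_true)
        fn = np.sum(~pred & y_true)
        tn = np.sum(~pred & ~y_true)
        fp = np.sum(pred & ~y_true)
        sensitivity = tp / (tp + fn)
        specificity = tn / (tn + fp)
        j = sensitivity + specificity - 1
        if j > best_j:  # keeps the lowest threshold among ties
            best_t, best_j = float(t), float(j)
    return best_t, best_j
```

The four counts computed inside the loop are the cells of the confusion matrix at that threshold, so the same loop can be reused to report the matrix at the selected cut-point.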

Analysis of factors that influence model performance

Post hoc analyses were performed to identify samples that the model could not classify correctly across all bootstraps and to determine whether there were interpretable factors that contributed to these misclassifications. To achieve this, we calculated the average predicted probability assigned to each sample's true class across all bootstraps. To account for class imbalance, we normalized these probabilities by scaling them within each class. This normalization ensured that intrinsic differences in sample distributions did not disproportionately reduce the probabilities of the minority class. These scores were then combined into a single “rank” score that spans both classes.

Next, we conducted a Kruskal-Wallis test to assess whether author type (such as registered nurses, residents, and attending physicians) was significantly associated with the combined rank scores, and therefore whether model performance was influenced by author type. Additionally, the combined rank scores were used to identify the most consistently misclassified samples, and we manually reviewed the clinical notes related to the five lowest-scoring cases. This qualitative review aimed to determine whether classification errors were linked to human-interpretable inconsistencies in note quality, such as missing information, ambiguous language, and variations in documentation style.
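One way to sketch this scoring is shown below. The text does not specify the scaling method or how the combined rank was formed, so min-max scaling within each class and a simple ordinal rank over the scaled scores are assumptions, as is the input layout (one row per bootstrap, NaN where a sample was not in that bootstrap's test set).

```python
import numpy as np


def combined_rank_scores(true_class_prob, y_true):
    """Average true-class probability per sample, scaled within class, then ranked.

    true_class_prob: (n_bootstraps, n_samples) array, NaN where the sample
    was not scored in that bootstrap (assumed layout).
    """
    y_true = np.asarray(y_true)
    mean_prob = np.nanmean(true_class_prob, axis=0)
    scaled = np.empty_like(mean_prob)
    for cls in np.unique(y_true):
        mask = y_true == cls
        lo, hi = mean_prob[mask].min(), mean_prob[mask].max()
        # Min-max scaling within the class (assumption); constant classes map to 0.5.
        scaled[mask] = (mean_prob[mask] - lo) / (hi - lo) if hi > lo else 0.5
    # Ordinal rank across both classes: 1 = most consistently misclassified.
    ranks = scaled.argsort().argsort() + 1
    return scaled, ranks
```

Scaling within each class puts the two classes on a common 0 to 1 range before ranking, so a systematically lower probability for one class does not push all of its samples to the bottom.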
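The author-type comparison can be illustrated with `scipy.stats.kruskal`, which takes one sample per group. The scores below are invented placeholder values, not study data, and the group names are hypothetical.

```python
from scipy.stats import kruskal

# Hypothetical combined rank scores grouped by note author type.
scores_by_author = {
    "registered_nurse": [0.12, 0.35, 0.41, 0.58, 0.77],
    "resident": [0.22, 0.31, 0.49, 0.63, 0.81],
    "attending_physician": [0.18, 0.44, 0.52, 0.69, 0.90],
}

# Null hypothesis: the score distribution is the same for every author type.
statistic, p_value = kruskal(*scores_by_author.values())
```

The Kruskal-Wallis test is rank-based, so it does not require the scores to be normally distributed within each author group.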
