Performance of machine learning models for predicting high-severity symptoms in multiple sclerosis



Overview of approach

We developed the MS Mosaic mobile application during a two-year collaboration between a professional software development team, MS care providers, data scientists, and MS participants. User experience research studies involving MS subjects informed the design of the app. We publicly launched the app in the United States on the Apple App Store and ran a prospective, case-control, site-less study for three years, from September 29, 2017, until December 19, 2020. After the study ended, the recorded data were retrieved and de-identified. Subsequently, for each subject, we partitioned the entire time period into sliding subject-time instances (across the time dimension), each containing the subject’s data for a specific period of time. The instance length was chosen to be seven days to accommodate the weekly surveys. For each instance (i.e., triggered at weekly intervals), we predicted whether the median self-reported severity of one of the five symptoms would be above a score of two (moderate disability) in the next three months. More specifically, when constructing the labels, we check whether the symptom exceeded the given threshold at any point within the prediction window and set the label to 1 (True) if so. In other words, at every time point we predict whether the symptom will appear at all, at the given severity, over the next three months. The symptoms were chosen such that successfully predicting them is clinically actionable. For model development, we only used subjects (N = 713) who used the app for more than three months. The dataset was divided randomly by participant into an 80% development cohort (N = 567) and a 20% blind test cohort (N = 146) for model development and testing, respectively. We trained and validated the models on the development cohort using 5-fold cross-validation.
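The windowing and labelling scheme described above can be sketched as follows. This is an illustrative reconstruction on synthetic data, not the study's actual pipeline; the column names (`date`, the symptom column) and the use of weekly medians within the prediction window are assumptions based on the description.

```python
import pandas as pd

def make_labels(df, symptom, threshold=2, step_days=7, horizon_days=90):
    """Slide a weekly instance across a subject's timeline; label an
    instance 1 if the weekly median severity of `symptom` exceeds
    `threshold` at any point in the following `horizon_days`."""
    labels = []
    start = df["date"].min()
    end = df["date"].max() - pd.Timedelta(days=horizon_days)
    t = start
    while t <= end:
        # Forward-looking 3-month prediction window for this instance.
        window = df[(df["date"] > t) &
                    (df["date"] <= t + pd.Timedelta(days=horizon_days))]
        weekly_medians = window.resample("7D", on="date")[symptom].median()
        labels.append((t, int((weekly_medians > threshold).any())))
        t += pd.Timedelta(days=step_days)
    return pd.DataFrame(labels, columns=["instance_start", "label"])
```

Each subject thus contributes one labelled instance per week, and the label answers: will this symptom reach moderate-or-worse severity at any point in the next three months?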
Three classical machine learning methods (logistic regression, multi-layer perceptron, and gradient boosting classifier) and two deep learning methods (recurrent neural network and temporal convolutional network) were trained separately to find the best-performing model.
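Because the data were split by participant, cross-validation should also group folds by subject so that no participant appears in both training and validation folds. A minimal sketch of this comparison on synthetic data (the classical models only; the subject grouping here is artificial and all hyperparameters are illustrative assumptions):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import GroupKFold
from sklearn.metrics import roc_auc_score

# Synthetic stand-in for the weekly instance features and labels.
X, y = make_classification(n_samples=600, n_features=20, random_state=0)
groups = np.repeat(np.arange(100), 6)  # 100 hypothetical subjects, 6 instances each

models = {
    "LR": LogisticRegression(max_iter=1000),
    "MLP": MLPClassifier(max_iter=500, random_state=0),
    "GBC": GradientBoostingClassifier(random_state=0),
}

# GroupKFold keeps all of a subject's instances in the same fold.
cv = GroupKFold(n_splits=5)
mean_auroc = {}
for name, model in models.items():
    aucs = []
    for tr, va in cv.split(X, y, groups):
        model.fit(X[tr], y[tr])
        aucs.append(roc_auc_score(y[va], model.predict_proba(X[va])[:, 1]))
    mean_auroc[name] = float(np.mean(aucs))
```

Grouping by subject prevents leakage of a participant's own history from training into validation, which would otherwise inflate the cross-validated AUROC.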

Characteristics of study population

A total of 1804 participants downloaded the MS Mosaic app and enrolled in the study. Since the prediction horizon was set to three months, only participants who used the app for more than three months were considered for this analysis. This led to a cohort of 713 participants, comprising 39.52% of users who downloaded the app. The amount of time each participant dedicated to the study varied, with the mean, median, and standard deviation reported in Table 1. On average, participants used the app for at least a year, with the lower quartile at 146 days for the development set and 162 days for the test set, and the upper quartile at 598 and 665 days, respectively. We also note that we saw a sharp drop in participation in March 2020 at the onset of the pandemic, which skews these averages; the study was stopped in December 2020, meaning there were roughly six months of low utilization.

As a part of the initial survey, 623 participants (87.37%) of the cohort chose to report their age (mean [SD], 46.27 [11.46]). The development cohort and blind test cohort were chosen randomly and consisted of 567 (mean [SD] age, 46.42 [11.51]) and 146 (mean [SD] age, 45.71 [11.29]) participants respectively. Table 1 includes characteristics for the development and blind test cohorts. The mean [SD] number of days the participants used the app were 378 [267] and 388 [262] for the development and blind test set respectively. The mean [SD] diagnosis age of MS was 30.64 [10.99].

Table 1 Baseline characteristics of the development and blind test cohorts.

Across the 623 participants considered in this study, the most frequently reported comorbidities were: vitamin D deficiency (318 [51.04%]), hypertension (117 [18.78%]), thyroid disease (82 [13.16%]), cancer (34 [5.46%]), diabetes (34 [5.46%]), and seizure or epilepsy (32 [5.14%]). We report further details on all comorbidities reported by the participants in the Supplemental material. In addition, 146 [23.4%] participants reported a family history of MS.

A total of 482 (77.37%) participants had relapsing-remitting, 37 (5.94%) had primary progressive, 33 (5.30%) had secondary progressive, 11 (1.77%) had progressive relapsing MS (a term the medical community has since moved away from, but for the purposes of staying true to the data that was collected we report it here separately), and 21 (3.37%) were control participants.

Out of the 19 symptoms included in the daily symptom surveys, fatigue was reported most frequently (33,754 [90.88%]), followed by weakness (31,126 [76.43%]) and bladder problems (30,874 [68.44%]). Cognitive changes or brain fog (30,433 [78.96%]), depression or anxiety (24,976 [77.14%]), and walking instability or coordination problems (24,667 [75.88%]) closely followed. The weekly relapse survey data showed that 1631 relapses were reported by 311 unique participants.

When looking at wearable data, a subset of the 713-participant cohort recorded step and sleep counts, representing 31% of the development set and 7% of the test set.

A more in-depth look at per-feature frequency, dropout, and imputation is presented in the supplementary material.

Model discrimination

Table 2 displays the key performance measures of all models on predicting whether the median severity of self-reported symptoms within a forward-looking three-month prediction window is above two (moderate to high severity). The Gradient Boosting Classifier (GBC) outperformed the other ML and DL models for all five symptoms. On the blind test cohort, GBC achieved an AUROC of 0.886 (95% CI 0.870–0.901) for cramps, 0.881 (95% CI 0.868–0.895) for depression or anxiety, 0.824 (95% CI 0.813–0.836) for fatigue, 0.899 (95% CI 0.883–0.915) for sensory disturbance, and 0.899 (95% CI 0.891–0.909) for walking instability or coordination problems. That GBC outperformed the deep learning algorithms is not surprising and has been observed previously on relatively small, tabular datasets19.
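The paper does not state how the AUROC confidence intervals were computed; a common choice, shown here as an assumption, is the percentile bootstrap over test instances:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

def auroc_ci(y_true, y_score, n_boot=1000, alpha=0.05):
    """Point estimate plus percentile-bootstrap (1 - alpha) CI for AUROC."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score)
    n = len(y_true)
    stats = []
    while len(stats) < n_boot:
        idx = rng.integers(0, n, n)  # resample instances with replacement
        if len(np.unique(y_true[idx])) < 2:
            continue  # AUROC undefined without both classes
        stats.append(roc_auc_score(y_true[idx], y_score[idx]))
    lo, hi = np.percentile(stats, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return roc_auc_score(y_true, y_score), lo, hi
```

Note that resampling instances rather than participants ignores within-subject correlation; a subject-level bootstrap would give wider, more conservative intervals.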

Based on AUROC, the best-performing deep learning model, and the second-best model overall, was TCN. GBC demonstrated an increase of 0.155, 0.136, 0.042, 0.143, and 0.125 in AUROC over TCN when predicting the incidence of high-severity cramps, depression or anxiety, fatigue, sensory disturbance, and walking instability or coordination problems, respectively.

With a 0.5 probability as the cutoff, GBC achieved a PPV and sensitivity of 0.771 [95% CI 0.736–0.815] and 0.459 [95% CI 0.427–0.498] for cramps, 0.800 [95% CI 0.772–0.827] and 0.538 [95% CI 0.508–0.573] for depression or anxiety, 0.731 [95% CI 0.704–0.752] and 0.593 [95% CI 0.573–0.616] for fatigue, 0.772 [95% CI 0.732–0.812] and 0.443 [95% CI 0.402–0.483] for sensory disturbance, and 0.763 [95% CI 0.735–0.794] and 0.527 [95% CI 0.504–0.562] for walking instability or coordination problems, respectively. We show the receiver operating characteristic curves for all models in Fig. 1, which further demonstrate the superiority of GBC.
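PPV and sensitivity at a fixed cutoff reduce to counts of true/false positives and false negatives; a minimal sketch:

```python
import numpy as np

def ppv_sensitivity(y_true, y_prob, cutoff=0.5):
    """Positive predictive value (precision) and sensitivity (recall)
    at a fixed probability cutoff."""
    y_true = np.asarray(y_true)
    y_pred = (np.asarray(y_prob) >= cutoff).astype(int)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    ppv = tp / (tp + fp) if tp + fp else float("nan")
    sens = tp / (tp + fn) if tp + fn else float("nan")
    return ppv, sens
```

The 0.5 cutoff is a convention rather than an optimum; in a clinical deployment the cutoff would be tuned to the desired trade-off between missed cases and false alarms.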

Table 2 Results obtained by machine learning models when predicting a symptom severity \(> 2\), 3 months in advance.
Figure 1

Receiver operating characteristic curves (left column) and calibration curves (right column) for machine learning models (gradient boosting classifier, logistic regression, multi-layer perceptron, recurrent neural network, and temporal convolutional network) for predicting whether the median value of a user-reported symptom will be high-severity (\(>2\)) in the next three months. The symptoms considered are (A) cramps, (B) depression or anxiety, (C) fatigue, (D) sensory disturbance, and (E) walking instability or coordination problems.

Model calibration

In addition to having good discriminative performance, machine learning models need to be well-calibrated. We use the Brier score, reported in Table 2, to evaluate model calibration. The trained Gradient Boosting Classifier models report the lowest Brier score for all five symptoms, indicating the best-calibrated predictions. We verify this by visually inspecting the calibration curves shown in Fig. 1, which demonstrate that the trained GBC models are well-calibrated for all symptoms.
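Both quantities are standard and available in scikit-learn; a small self-contained example on made-up predictions (the Brier score is the mean squared difference between predicted probability and outcome, so lower is better, with 0 being perfect):

```python
import numpy as np
from sklearn.metrics import brier_score_loss
from sklearn.calibration import calibration_curve

# Toy labels and predicted probabilities, for illustration only.
y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0, 1, 1])
y_prob = np.array([0.1, 0.3, 0.8, 0.7, 0.9, 0.2, 0.6, 0.4, 0.95, 0.85])

# Brier score: mean of (probability - outcome)^2.
brier = brier_score_loss(y_true, y_prob)

# Calibration curve: observed fraction of positives vs. mean predicted
# probability per bin, as plotted in the right column of Fig. 1.
frac_pos, mean_pred = calibration_curve(y_true, y_prob, n_bins=2)
```

A perfectly calibrated model would place the calibration curve on the diagonal: among instances assigned probability p, a fraction p actually experience the high-severity symptom.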

Subgroup analyses

To evaluate the model performance in different multiple sclerosis subtypes, we selected participants from the blind test cohort who self-reported one of the four common MS subtypes: relapsing-remitting (103 participants), primary progressive (3 participants), secondary progressive (6 participants), and progressive relapsing (2 participants). These subgroups were small because few participants in the full dataset met the criteria. We also acknowledge that the field has moved away from the term “progressive relapsing”; however, it was still in use at the time of the study design, so we report the data as collected. In future iterations these participants would be collapsed into a “progressive MS” group. The AUROCs obtained by the GBC model for these subtypes are shown in Fig. 2. Overall, performance on relapsing-remitting and secondary progressive participants was higher than on primary progressive and progressive relapsing participants, likely due to the lower number of participants available for these subtypes in both the development and blind test sets.

In addition, we partitioned the blind test cohort by age and report the performance obtained by GBC in Fig. 2. We followed the categorization of Table 1 but removed the “age \(> 70\)” subgroup, since it contained only one participant in the blind test cohort. The remaining subgroups, “age \(< 30\)”, “\(30 \le\) age \(< 50\)”, and “\(50 \le\) age \(< 70\)”, contained 9, 67, and 44 participants, respectively, in the blind test set. Overall, we observe AUROC to be in a similar range across age-based subgroups.

Figure 2

Area under the receiver operating characteristic curve (AUROC) obtained by Gradient boosting classifier (GBC) for predicting occurrence of high-severity symptoms on different subgroups of the data: (A) four subtypes of multiple sclerosis and (B) three age groups. We include the 95% confidence intervals as error-bars. The bar representing “age \(< 30\)” for Cramps is missing from the figure due to the absence of participants in that intersection.

Identification of key predictive features

Figure 3 showcases the Gradient Boosting Classifier’s performance when trained on distinct feature groups: symptoms, demographics (only age in our case), functional tests, passive signals, active features (the union of symptoms and functional tests), and finally a set with all features. Note that age is the only static feature in our feature set, while the others are collected continually.

The “all features” group achieves the highest performance of all the groups; however, the results show that symptoms alone account for the biggest performance contribution among the remaining groups in most cases. Among the sequential features, passive signals are found to be least predictive for all symptoms, with functional tests falling between symptoms and passive signals (for four out of five symptoms). Overall, the static “demographics” feature shows the lowest performance compared to the dynamic feature groups.

In addition to investigating the impact of categories of features, we identified the importance of individual features by performing permutation feature importance for the GBC model20. We found that the most predictive feature for a particular symptom is its own past trajectory, consistent across all five symptoms considered in Table 2. This is expected and serves as a sanity check for our resulting models. The next most predictive features do not follow a consistent pattern and can be other symptoms, functional tests, or passive signals. For example, in the case of cramps, the top five predictive features are all related in some way to spasticity, with bladder spasticity an interesting inclusion in the top five. We provide the top five features and their importance scores for each symptom in the Supplementary material.
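Permutation feature importance shuffles one feature at a time on held-out data and measures the resulting drop in the model's score. A minimal sketch on synthetic data (feature names and data are illustrative, not the study's):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic features standing in for symptoms, functional tests, etc.
X, y = make_classification(n_samples=400, n_features=8,
                           n_informative=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

gbc = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)

# Shuffle each feature n_repeats times and record the AUROC drop.
result = permutation_importance(gbc, X_te, y_te, scoring="roc_auc",
                                n_repeats=10, random_state=0)
ranking = np.argsort(result.importances_mean)[::-1]  # most important first
```

Because it is computed on held-out data against the model's actual score, this measure reflects what the model relies on at prediction time rather than what it merely memorized during training.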

Feature ablation studies

One important thing to note is that when the model predicts the future trajectory of a particular symptom, its past data is included in the feature set. First, we ran an ablation study in which, for each symptom label, we removed the predicted symptom from the feature set, so that when the model predicts whether a symptom will be severe three months in the future, it has no information about that symptom's past. Results are reported in Table 2. The AUROCs obtained were 0.755 [95% CI 0.734–0.773], 0.763 [95% CI 0.753–0.775], 0.744 [95% CI 0.727–0.761], 0.821 [95% CI 0.810–0.836], and 0.840 [95% CI 0.826–0.852] for cramps, depression or anxiety, fatigue, sensory disturbance, and walking instability, demonstrating drops of 0.131, 0.083, 0.071, 0.090, and 0.048, respectively. On one hand, the performance drop indicates that a symptom's past data is important when predicting its future trajectory, consistent with the feature importance analysis described in the “Identification of key predictive features” section. On the other hand, even without it, high performance could be achieved using data solely from other symptoms, functional tests, passive signals, and demographics. This is also consistent with that section, where we report that the predictive features beyond the topmost one do not follow a pattern. This highlights the importance of considering all available data collectively and underscores the need for methods that can analyze a wide range of data simultaneously.

Second, we included a simple rule-based algorithm as a baseline in Table 2, which produces a positive prediction if the median severity score of the symptom over the previous three months was above the threshold, i.e., it propagates the past. We term this rule the “symptom propagation method”; it substantially under-performs GBC. This serves as evidence that simply taking into account a symptom's past values is not enough for an accurate assessment of its future trajectory.
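The baseline rule as described is a one-liner; a sketch of how it could be implemented (input shape and naming are assumptions):

```python
import numpy as np

def symptom_propagation(past_severities, threshold=2):
    """Rule-based baseline: predict a positive outcome (severe symptom
    ahead) iff the median severity over the previous three months of
    observations exceeded the threshold."""
    past = np.asarray(past_severities, dtype=float)
    return int(np.median(past) > threshold)
```

Such a persistence baseline is a useful sanity check: any learned model must beat it to demonstrate that it captures more than the tendency of symptoms to stay at their recent level.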

Figure 3

Area under the receiver operating characteristic curve (AUROC) achieved by Gradient Boosting Classifier (GBC) while predicting the incidence of high-severity symptoms on different groups of features: symptoms, demographics (only included age in our case), functional tests, passive signals, active features (combination of symptoms and functional tests), and lastly a set with all features. 95% confidence intervals have been included as error-bars.


