A systematic review of machine learning approaches in cochlear implant outcomes

A comprehensive search was conducted across all relevant databases to identify studies examining the application of ML to predict CI outcomes. A total of 1580 articles were retrieved. After removing duplicates, 1537 abstracts were screened for eligibility based on pre-defined criteria. Following the abstract screening, 31 full-text articles were selected for a detailed evaluation. Ultimately, 20 articles were deemed pertinent for information extraction based on their relevance to the research question. Two investigators reviewed the methodology of the selected articles in detail using the Newcastle-Ottawa Scale adapted for cross-sectional studies for quality assessment¹⁸. The studies were categorized into four distinct groups based on the variables used to predict outcomes after CI surgery: (i) Brain imaging variables (studies: pediatric = 3, adult = 2); (ii) Neural function measures (studies: pediatric = 1, both age groups = 1); (iii) Clinical, audiological, and speech perception/production variables (studies: pediatric = 3, adult = 6); and (iv) Algorithms for speech enhancement (studies: adult = 2, simulation study = 1). All adult CI studies included participants with post-lingual hearing loss. Additionally, the performance of various ML algorithms used within the selected studies was compared to evaluate their effectiveness in predicting CI outcomes (Table 3).

Table 3 Overview of clinical, audiological, speech perception, and production-based studies for CI outcome prediction

Prediction based on studies of brain imaging measures in pediatric CI users

Several studies highlight the growing potential of combining neuroimaging techniques with ML models to predict CI outcomes in children^16,19,20. These studies show the role of early auditory brain development and the preservation of specific brain regions in predicting CI outcomes, demonstrating the potential of neuroimaging and ML techniques in personalizing CI candidacy and post-surgical expectations (Table 4).

Table 4 Machine learning-based speech enhancement algorithm

Tan et al.¹⁶ used functional magnetic resonance imaging (fMRI) to predict language skills in children with congenital sensorineural hearing loss (SNHL) before the CI surgery (n = 23, mean age: 20 months, range = 8–67 months). Children with normal hearing (NH) served as controls (n = 21, mean age: 12.1 months, range = 8–17 months). Language skills were assessed two years after surgery using the Clinical Evaluation of Language Fundamentals-Preschool, Second Edition (CELF-P2) for the 16 children who completed follow-up; seven children were unavailable for follow-up. Prediction accuracy was compared between supervised and semi-supervised support vector machine (SVM) classifiers. Input feature vectors for the models were generated using contrast maps of fMRI data, processed with a general linear model and a Bag-of-Words (BoW) approach. Results revealed that cortical activation patterns observed during fMRI in infancy correlated with language performance two years post-CI. The semi-supervised SVM model, using the BoW feature extraction approach, outperformed the supervised model, achieving a classification accuracy of 93.8% and an area under the curve (AUC) of 0.92, compared to the supervised model’s accuracy of 68.8% and AUC of 0.71. The study identified the left temporal gyri (superior and middle) as the most predictive brain region for distinguishing effective from ineffective CI users. However, regression analysis (controlling for age at implantation and preoperative hearing levels) showed that this region alone could not fully separate the two groups. When a second feature region, located in the right cerebellum, was included, the model successfully classified effective versus ineffective users. Remarkably, despite the small sample size, the study demonstrated that just two features derived from fMRI contrast maps (speech versus silence) were sufficient to classify the groups using a semi-supervised learning approach. This underscores the potential of fMRI-based biomarkers to predict CI outcomes early in development.
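Classification accuracy and AUC, the two metrics reported above, recur throughout the studies reviewed here. The sketch below shows one standard way to compute them, with AUC taken as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. The labels and scores are hypothetical toy values, not data from Tan et al., and the function names are illustrative.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of positive/negative pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def accuracy(labels, predictions):
    """Fraction of cases whose predicted class matches the true class."""
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / len(labels)


# Hypothetical toy data: 1 = effective CI user, 0 = ineffective CI user.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.1]
predictions = [1 if s >= 0.5 else 0 for s in scores]

print(roc_auc(labels, scores))
print(accuracy(labels, predictions))
```

Accuracy depends on the decision threshold (0.5 here, an assumption), whereas AUC summarizes ranking quality across all thresholds, which is why the two can diverge for the same model.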

Feng et al.¹⁹ applied SVM based on neural morphological data from MRI to predict speech perception abilities in individual children with CIs. In order to identify the brain structures impacted or unaffected by auditory deprivation, they compared neuroanatomical differences using voxel-based morphometry and multivoxel pattern similarity between bilateral SNHL (n = 37, mean age at implantation = 17.9 months, range = 8–38 months) and NH children (n = 40; mean age = 18 months, range = 8–38 months). Speech perception was assessed using the speech reception in quiet test, conducted before surgery and six months after CI activation. A linear SVM classifier was applied to distinguish CI candidates from NH participants. The classifier achieved high accuracy, with grey matter multivoxel pattern similarity and density reaching 95.9% and 97.3%, respectively, and white matter multivoxel pattern similarity and density achieving 97.3% and 91.9%, respectively. The study identified that brain regions unaffected by auditory deprivation, particularly those involved in auditory association and cognitive functions, were the most reliable predictors for classification. Additionally, the dorsal auditory network, a critical component for speech perception, emerged as the most robust predictor of future speech perception outcomes in children with CIs. These findings indicate the importance of preserved auditory and cognitive brain regions in predicting CI outcomes.

Song et al.²⁰ examined the utility of functional connections derived from fMRI combined with ML models to predict CI outcomes in children and to identify SNHL. The sample included 68 children with SNHL (mean age = 46.24 months, SD = 24.38 months) and 34 NH children (mean age = 45.62 months, SD = 27.63 months). Additionally, 52 children with SNHL who underwent CI were analyzed to build a model predicting postoperative auditory performance, measured by the categories of auditory performance score. They used kernel principal component analysis for dimensionality reduction of functional connections and three algorithms for classification: SVM, logistic regression, and k-nearest neighbor (kNN). A voting ensemble method, which averaged the predicted probabilities of the three classifiers, achieved an AUC of 0.84 for distinguishing SNHL from NH. For predicting categories of auditory performance scores after CI surgery, a multiple logistic regression model achieved an accuracy of 82.7%. Although the multiple regression model had a higher AUC compared to the individual classifiers, the differences were not statistically significant (p > 0.05, DeLong test). Similarly, the voting ensemble method achieved a higher AUC than individual classifiers, but this improvement was also not significant (p > 0.05). These findings indicate that fMRI-based functional connections combined with ML algorithms hold promise for the prediction of CI outcomes in children. However, the model did not support a clinical diagnosis of SNHL.
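The voting ensemble described above averages the predicted probabilities of several classifiers. A minimal sketch of that averaging step follows, assuming each model has already produced a probability of the positive class for every case. The probability values and the 0.5 decision threshold are hypothetical, and the sketch omits the kernel principal component analysis and model training stages.

```python
def soft_vote(probabilities_per_model):
    """Average positive-class probabilities across models, case by case."""
    n_models = len(probabilities_per_model)
    n_cases = len(probabilities_per_model[0])
    if any(len(p) != n_cases for p in probabilities_per_model):
        raise ValueError("every model must score the same number of cases")
    return [
        sum(model[i] for model in probabilities_per_model) / n_models
        for i in range(n_cases)
    ]


# Hypothetical outputs of three already-trained classifiers for three children.
svm_probs = [0.80, 0.40, 0.30]
logreg_probs = [0.70, 0.55, 0.20]
knn_probs = [0.60, 0.40, 0.40]

ensemble_probs = soft_vote([svm_probs, logreg_probs, knn_probs])
ensemble_labels = [1 if p >= 0.5 else 0 for p in ensemble_probs]
print(ensemble_probs, ensemble_labels)
```

In the second case one model votes above 0.5 and two vote below, so the averaged probability falls below the threshold; averaging lets confident models outweigh marginal ones, unlike a simple majority vote on class labels.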

Prediction based on studies of brain imaging measures in adult CI users

Sun et al.²¹ explored how accurately voxel-based morphometry could predict word recognition scores in adult CI users (n = 47) by analyzing gray matter density and structural changes in cortical regions. Adults with unilateral SNHL (n = 35) served as controls. They applied random forest (RF) and linear SVM regression models to MRI brain scans. The region of interest (ROI)-based method produced a higher mean absolute error (MAE) of 15.9. In contrast, the cluster-based approach, which combines clinical features with imaging data, yielded a more accurate prediction with a relatively lower MAE of 14.25 (a lower MAE indicates better model prediction). This finding highlights the advantage of integrating diverse data sources for improving word recognition score predictions. The study also showed that the right medial temporal cortex and right thalamus are key brain regions for predicting word recognition scores in CI users. Among clinical features, the duration of deafness had the strongest influence on predictions, followed by the age at CI surgery. Additionally, the RF model consistently outperformed the SVM model across both the ROI-based and cluster-based approaches. This study highlights the importance of combining brain structural imaging with clinical data for word recognition score predictions and demonstrates the higher accuracy of RF models over SVMs for outcome predictions in adults.
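For reference, MAE (used here) and root mean squared error (RMSE, reported in some of the regression studies below) can be computed as in the following sketch. The observed and predicted word recognition scores are hypothetical illustration values, not data from any reviewed study.

```python
import math


def mean_absolute_error(observed, predicted):
    """Average absolute difference between observed and predicted values."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)


def root_mean_squared_error(observed, predicted):
    """Square root of the average squared difference; penalizes large errors more."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed)
    return math.sqrt(mse)


# Hypothetical word recognition scores (percent correct).
observed = [60.0, 75.0, 40.0, 90.0]
predicted = [70.0, 65.0, 55.0, 85.0]

print(mean_absolute_error(observed, predicted))
print(root_mean_squared_error(observed, predicted))
```

Both metrics are expressed in the units of the outcome, so an MAE of about 15 on a percent-correct scale means predictions are off by roughly 15 percentage points on average.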

Kyong et al.²² examined the role of cortical cross-modal plasticity changes in predicting CI outcomes in adult CI users. They used electroencephalography (EEG) measures such as cortical auditory evoked potentials, cortical somatosensory evoked potentials, and cortical visual evoked potentials as biomarkers to predict CI outcomes. They evaluated CI outcomes in terms of features like sensor level (latency and amplitude), source level (current source density), and a combination of both. An SVM was applied to 13 datasets from 3 patients. Prediction accuracy differed between modalities; interestingly, tactile stimuli showed the highest accuracy (auditory: 82.71%, tactile: 98.88%, visual: 93.55%). Classification accuracy was generally higher when combining sensor and source features, compared to using sensor or source features alone—except for the tactile modality, where source-level features alone achieved comparable accuracy. These findings suggest that features from auditory, tactile, and visual stimulation at both the sensor and source levels, or a combination, can serve as inputs for ML models to predict CI outcomes. Additionally, the study supports the idea that cross-modal brain plasticity resulting from deafness may provide a basis for predicting CI outcomes.

Prediction based on neural function measures

There were two studies in this category: one focused on pediatric populations²³ and one on CI users from both age groups²⁴. Lu et al.²³ used an SVM classifier to predict postoperative outcomes in children with anatomically normal cochlea but cochlear nerve deficiency. A total of 70 children with CIs, with a mean age of 27.31 months (SD = 13.92 months), were included in the study. Multiple data types, such as demographic, radiographic, audiologic, and speech assessments, were included to build the model. The outcome measures included categories of auditory performance, speech intelligibility rating, and infant/toddler meaningful auditory integration scale after two years of CI use. Post-operative hearing and speech rehabilitation outcomes were classified using the SVM algorithm. They reported that children with a higher number of nerve bundles and a larger vestibulocochlear nerve area showed better CI outcomes in terms of both hearing and speech rehabilitation measures. A significant positive correlation was observed between categories of auditory performance scores and speech intelligibility rating at two years post-CI surgery, as well as the number of identifiable nerve bundles and the area of the vestibulocochlear nerve. The model predicted postoperative hearing with an accuracy of 71% and speech rehabilitation with an accuracy of 93%, suggesting that a relatively functional cochlear nerve tends to result in better CI outcomes.

Skidmore et al.²⁴ compared linear regression, SVM, and logistic regression models to predict auditory nerve function in bilateral CI users. The input variables were derived from electrically evoked compound action potentials (eCAP) refractory recovery and input/output (I/O) functions. Study participants were children with cochlear nerve deficiency (n = 23, mean age = 3.42 years), children with normal-sized cochlear nerves (n = 29, mean age = 3.18 years), and adults (n = 20, mean age = 69.22 years) with normal-sized cochlear nerves. The three models predicted two distinct distributions of cochlear nerve indices for cochlear nerve deficiency and normal-sized cochlear nerves, with classification accuracy of 0.93 for the linear model, 0.91 for SVM, and 0.95 for logistic regression. In adult CI users, although the models varied slightly, cochlear nerve indices were found to correlate with Consonant-Nucleus-Consonant word and AzBio sentence scores in quiet. These findings suggest that machine learning models can accurately predict auditory nerve function in bilateral CI users, with cochlear nerve indices correlating with speech recognition performance.

Prediction based on clinical, audiological, speech perception, and production data studies in pediatric CI users

Few studies have explored ML models to predict developmental and speech outcomes in children with CIs^25,26,27. By integrating clinical and audiological data into the ML models, these studies offer additional insights into the known factors that influence CI success.

Abousetta et al.²⁵ examined a scoring system to improve CI candidacy selection. Data from 100 children, with a mean age of 78.28 months (SD = 31.63 months), were collected from three rehabilitation centers following CI surgery. Statistical and ML approaches were applied to analyze the data. They used metrics related to language, phonological, and social deficits to quantify developmental delays (in months) in these areas. The classification predictive models achieved superior validation accuracy compared to linear regression in predicting phonological deficits (88.11%). However, the accuracy for language and social deficits was moderate (56.66% and 40.46%, respectively). In the regression analysis, the RF model outperformed the evaluated models in predicting language age and phonological deficit. The MAE for the RF model was 8.95 and 5.51 months for language and phonological deficits, respectively. The linear regression model had an MAE of 8.29 months for the social deficit. The average duration of auditory deprivation and family support emerged as significant factors affecting language, phonological, and social deficits following CI. These variables each contributed more than 17% to the overall model weight in predicting the outcomes. Thus, the scoring system incorporating statistical and ML approaches effectively predicted developmental deficits in children post-CI, with the RF model showing superior accuracy in predicting phonological and language deficits. At the same time, factors like auditory deprivation and family support played significant roles in the outcomes.

Byeon²⁶ used the RF model to investigate the factors affecting articulation accuracy in children with CI. The study involved 82 children (4 to 8 years, mean age = 6.3 years, SD = 3.1 years) who had been using a CI for at least one year but less than five years. Articulation accuracy was measured using a nine-sentence speech intelligibility test and was rated by two undergraduate students. The variables used were age, family income, gender, duration of CI use, vocabulary level, and corrected hearing. The model achieved a classification accuracy of 78.80% in predicting speech intelligibility. The analysis revealed that duration of CI use, vocabulary skills, household income, age, and gender significantly influenced speech intelligibility outcomes.

In a follow-up study, Byeon²⁷ used ML models to predict the intelligibility of speech produced by children with CIs. Their study included 91 children with CIs, and 80 college students evaluated their speech samples (speaking and reading). The RF model had the lowest MAE of 0.81 and the lowest root mean squared error (RMSE) of 0.108, indicating that it made the most accurate predictions among the methods tested (multiple regression analysis, SVM regression, and RF). In addition, duration of CI use, auditory training, corrected hearing, and age were some of the variables that influenced speech intelligibility, including pitch, loudness, and speech quality.

Prediction based on clinical, audiological, speech perception, and production data studies in adult CI users

Several studies have utilized ML techniques to predict CI outcomes based on a combination of clinical, audiological, and speech perception data^15,17,28,29,30,31. These studies show reasonable accuracy of ML algorithms in predicting CI outcomes, with RF models consistently outperforming other methods, and substantiate the importance of preoperative clinical and audiological factors in determining CI success.

Ramos-Miguel et al.²⁸ used data mining techniques, including the kNN model for classification and linear regression for estimating influential variables, to develop algorithms predicting the performance of adult CI users (n = 60) on disyllabic word tests. Input factors were categorized into demographics, hearing aid use, CI use, audiological data, and quality of life. The kNN algorithm achieved a 90.83% success rate in predicting test performance, while linear regression showed a strong positive correlation between predicted and actual scores (R² = 0.96), with a mean error of 5.99% (SD = 4.25%). Key predictors included disyllabic word scores in the first implanted ear, the time between CI surgeries, the type of hearing loss (prelingual or post-lingual), and residual hearing in the non-implanted ear. Their study demonstrated that data mining techniques could predict word test performance in adult CI users and identified relevant predictors. Similarly, Guerra-Jiménez et al.²⁹ applied a data mining technique (kNN algorithm) to predict the benefits of CI in terms of speech recognition and quality of life in 29 adult CI users. The Glasgow benefit inventory³² and the specific questionnaire³³ were used to evaluate the benefits and their association with quality of life. The kNN method achieved 80.7% accuracy for predicting speech recognition and quality of life, while decision tree analysis of Glasgow benefit factors reached 81% accuracy. Linear regression yielded 85% accuracy for speech recognition, 68% for Glasgow benefit inventory, and 71% for specific questionnaires. They identified factors such as age, duration of deafness, prior hearing aid use, and preoperative residual hearing as influencing speech recognition and quality of life.

Kim et al.¹⁵ compared the predictive accuracy of generalized linear models (GLMs) and an RF model for postoperative CI outcomes in 120 adults. The preoperative factors, such as duration of deafness, age at implantation, and duration of hearing aid use, were input as predictors for word recognition scores. The RF model significantly outperformed the GLM (p < 0.00001), achieving a correlation coefficient of 0.96 and an MAE of 6.10 (SD = 4.70), compared to the GLM’s coefficient of 0.70 and MAE of 15.60 (SD = 9.50). Adding principal component analysis to the RF improved the prediction with a coefficient of 0.97 and an MAE of 4.80 (SD = 4.40). The cross-validation of the RF model with a new dataset resulted in a higher MAE of 17.10. This discrepancy likely stemmed from differences in how word recognition scores were measured across the datasets. To address this, the authors assumed a linear bias and applied a post-hoc GLM correction, incorporating the test site as a covariate to combine data from all three sites. This adjustment reduced the MAE of the RF model to 9.60 (SD = 5.20) for the test cohort. Duration of deafness was the strongest preoperative predictor of postoperative word recognition scores, followed by hearing aid use duration and age at CI surgery. In contrast, preoperative hearing ability and word recognition thresholds showed weaker correlations with postoperative word recognition scores. Their findings suggest RF as a robust model for predicting CI outcomes and emphasize the effect of dataset variability on prediction accuracy.
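The idea of correcting a linear bias between sites can be illustrated with a deliberately simplified sketch. It is not the authors' GLM, which used test site as a covariate across pooled data; it is a single-site ordinary least squares fit of observed scores on model predictions, followed by applying the fitted line to recalibrate the predictions. All numbers and function names are hypothetical.

```python
def fit_linear_bias(predicted, observed):
    """Least squares fit of observed = intercept + slope * predicted."""
    n = len(predicted)
    if n != len(observed) or n < 2:
        raise ValueError("need two or more paired values")
    mean_p = sum(predicted) / n
    mean_o = sum(observed) / n
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    if var_p == 0:
        raise ValueError("predictions must not all be identical")
    cov = sum((p - mean_p) * (o - mean_o) for p, o in zip(predicted, observed))
    slope = cov / var_p
    intercept = mean_o - slope * mean_p
    return intercept, slope


def apply_linear_bias(predicted, intercept, slope):
    """Recalibrate raw predictions with the fitted line."""
    return [intercept + slope * p for p in predicted]


# Hypothetical scores at a new test site whose measurements are
# systematically lower than the model's predictions.
site_predicted = [40.0, 50.0, 60.0, 70.0]
site_observed = [30.0, 38.0, 46.0, 54.0]

intercept, slope = fit_linear_bias(site_predicted, site_observed)
corrected = apply_linear_bias(site_predicted, intercept, slope)
print(intercept, slope, corrected)
```

In practice the correction would be fitted on a calibration subset and evaluated on held-out cases from the same site; fitting and evaluating on the same cases, as this toy example does, overstates the benefit.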

Crowson et al.³⁰ predicted the postoperative outcome in adult CI users using various categorical (e.g., hearing loss cause) and numerical variables (e.g., pure tone average). A total of 282 preoperative variables were included in the ML model. The outcome variable was the Hearing in Noise Test (HINT) score after one year of CI use. The ML algorithms included neural networks and an XGBoost gradient-boosted tree algorithm. The numerical variables were given as input for the neural networks, and the prediction of HINT scores resulted in an RMSE of 57.0% (a lower RMSE is better) and a classification accuracy of 95.40%. Adding the categorical variables to the model reduced the RMSE to 25% (better) and classification accuracy to 73.30% (reduced). When the XGBoost algorithm was applied with only numerical variables, the HINT score RMSE prediction performance was 25.30%. The most crucial preoperative variable found was the HINT sentence score, followed by age at surgery. The XGBoost ensemble decision tree model could predict the association between the various preoperative measures and HINT scores. They also found an effect of subjective factors like quality of life and vestibular function in predicting postoperative performance.

Shafieibavani et al.¹⁷ assessed seven different ML models to predict postoperative word recognition scores in adult CI users and examined how well the outcomes from these models can be generalized to new datasets. Input to models considered various factors such as demographics, hearing test results, medical history, and causes of hearing loss. They used the RF model, extreme gradient boosting with linear models (XGB-Lin), extreme gradient boosting with random forest (XGB-RF), artificial neural networks (ANN), and three more baseline models (linear and RF models) to predict word recognition scores. Additionally, they examined the influence of sample size on the model’s accuracy. XGB-RF with all features achieved the best predictive performance (median MAE: 20.81), while the RF performed similarly (median MAE: 20.76). The performance of the XGB-RF model on the new datasets slightly varied across the three new datasets, and the MAE ranged from 17.90 to 21.80, depending on the dataset. Doubling the sample size improved the model performance by 3%. The XGB-RF achieved the best predictive performance among seven different ML models (XGB-RF, RF, ANN, XGB-Lin model, model A, model B, model C) that were compared, and it was statistically significant. Overall, this study found that the XGB-RF model provided the best predictive performance for postoperative word recognition scores in adult CI users, with slight variations across new datasets and improved accuracy with larger sample sizes.

Zeitler et al.³¹ examined supervised machine learning classifiers to predict postoperative acoustic hearing preservation in 761 adult CI users, utilizing variables such as standard pure tone average (SPTA), low-frequency pure tone average (LFPTA), and hearing preservation pure tone average (HPPTA). Their analysis involved two phases: statistical analysis using multivariate logistic regression to identify associations between covariates and outcomes, followed by feature engineering and evaluation of different supervised learning classifiers. They compared the relative performance of gradient boosting machine (GBM), AdaBoost, RF, SVM, kNN, and Gaussian naive Bayes (GNB) in predicting change in hearing thresholds before and one month after CI surgery. The RF model was found to be the superior classifier based on mean performance across validation cycles, and it achieved the highest average accuracy in predicting each variable on the validation set. For SPTA, the RF had a mean Matthews correlation coefficient (MCC) of 0.52 (SD = 0.11) and a mean AUC of 0.83 (SD = 0.05). Similar trends were observed for LFPTA (MCC mean = 0.42, SD = 0.12; AUC mean = 0.73, SD = 0.07) and HPPTA (MCC mean = 0.38, SD = 0.10; AUC mean = 0.76, SD = 0.02). GNB showed lower classification performance compared to the other algorithms (GBM, SVM, RF, kNN, and AdaBoost classifiers), and the difference was statistically significant (p < 0.001). The preoperative LFPTA and standard PTA were found to be important predictors of hearing preservation. A significant negative association was observed between the predictor variables, such as sudden hearing loss, noise exposure, aural fullness, and abnormal ear anatomy, and the response variable, one-month change in the lowest quartile of the SPTA. In contrast, a significant positive association was found between preoperative LFPTA and the one-month change in the lowest quartile of LFPTA. This study demonstrated that the RF model outperformed other ML classifiers in predicting postoperative acoustic hearing preservation in CI recipients, with preoperative LFPTA and standard PTA as important predictors of hearing outcomes.
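The Matthews correlation coefficient reported above is computed from the four cells of a binary confusion matrix and ranges from -1 (total disagreement) through 0 (chance level) to 1 (perfect prediction). The sketch below uses the standard formula; the confusion matrix counts are hypothetical and are not taken from Zeitler et al.

```python
import math


def matthews_corrcoef(tp, tn, fp, fn):
    """MCC from true/false positive and negative counts.

    Returns 0.0 when any marginal total is zero, a common convention
    for the otherwise undefined case.
    """
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0
    return numerator / denominator


# Hypothetical confusion matrix for a hearing preservation classifier.
mcc = matthews_corrcoef(tp=40, tn=35, fp=15, fn=10)
print(round(mcc, 3))
```

Unlike accuracy, MCC stays informative when the two classes are imbalanced, because it uses all four cells of the confusion matrix rather than only the correct predictions.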

Together, these studies underline the growing importance of ML for adult CI applications, especially in predicting speech recognition scores and other critical outcomes. The consistent performance of RF models, in particular, suggests that they are among the most reliable methods for predicting CI outcomes across different datasets and clinical settings. These findings support the notion that a combination of preoperative clinical data, audiological assessments, and speech perception measures can be effectively used in predictive modeling to optimize CI outcomes and guide clinical decision-making.

ML-based speech enhancement algorithm

ML techniques can be applied to the design of CI speech processors. To date, studies on speech enhancement algorithms based on ML have been conducted exclusively in adult CI users. Few studies have explored the integration of ML algorithms with CI strategies to address challenges faced by CI users, such as recognizing speech in noise and minimizing interference from multi-talker environments^34,35,36,37. Together, these studies emphasize the promising role of ML techniques in optimizing speech processing for CI users to improve communication in noisy environments.

Goehring et al.³⁴ compared speech-in-noise recognition using the Advanced Combination Encoder (ACE) strategy with and without a Neural Network Speech Enhancement (NNSE) algorithm. The NNSE algorithm was designed to improve speech intelligibility in noise by attenuating noise-dominated channels and preserving speech-dominated channels. They evaluated the algorithm by measuring the speech recognition threshold (SRT) in 14 adult CI users, employing three types of background noise: speech-weighted noise, multi-talker babble, and International Collegium of Rehabilitative Audiology (ICRA) noise. The NNSE algorithm integrated into the ACE strategy consisted of a feature extraction step and a neural network implementation. The features were extracted from the noisy speech signals and passed through a feedforward neural network. The SRT results showed that the NNSE-enhanced ACE strategy outperformed the unprocessed ACE for all noise types, with improvements ranging from 1.4 to 6.4 dB at different signal-to-noise ratios. They implemented a speaker-dependent and a speaker-independent version of the NNSE algorithm. The speaker-dependent NNSE provided higher gains (up to 6.4 dB in ICRA noise), while the speaker-independent version showed improvements in 2 out of 3 noise types, but to a lower extent. Thus, this study demonstrated that incorporating an NNSE algorithm into the ACE strategy could significantly improve speech-in-noise recognition for CI users, with the speaker-dependent version yielding the most substantial improvements across various noise conditions.
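The principle of attenuating noise-dominated channels while preserving speech-dominated ones can be sketched as a per-channel gain applied to the channel envelopes. In the sketch below the gains are hypothetical stand-ins for whatever an enhancement model would estimate. The gain floor, function name, and all values are assumptions for illustration and do not describe the NNSE implementation itself.

```python
def apply_channel_gains(envelopes, gains, floor=0.1):
    """Scale each channel envelope by a gain clamped to [floor, 1].

    A gain near 1 preserves a speech-dominated channel; a gain near the
    floor attenuates a noise-dominated channel without silencing it.
    """
    if len(envelopes) != len(gains):
        raise ValueError("one gain is required per channel")
    enhanced = []
    for envelope, gain in zip(envelopes, gains):
        clamped = min(1.0, max(floor, gain))
        enhanced.append(envelope * clamped)
    return enhanced


# Hypothetical envelope magnitudes for three channels in one time frame.
envelopes = [1.0, 0.5, 0.8]
# Hypothetical gains: channel 1 speech-dominated, channel 2 noise-dominated.
gains = [1.0, 0.0, 0.5]

enhanced = apply_channel_gains(envelopes, gains)
print(enhanced)
```

The floor is a design choice assumed here: it limits how much any channel can be suppressed, which trades some noise reduction for robustness to estimation errors in the gains.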

In a simulation study, Grimm et al.³⁵ compared the potential negative impact of channel interaction on speech perception between congenitally deaf and post-lingually deaf adults with CIs. They approximated the speech stimulus to simulate how it stimulates the normal cochlea versus CI. The neural networks were provided with high-resolution (32 channels), like intact cochlea, and low-resolution (16) linearly combined channels similar to CI-delivered speech. The low-resolution networks were designed to mimic the limitations of CI, specifically by introducing channel interaction into the speech data. The models were initially trained on high-resolution speech and then tested on modified, low-resolution speech with channel interaction. Channel interaction in low-resolution speech significantly influenced the performance of the networks. This suggests that spectral degradation due to channel interaction in CIs may impede auditory learning in post-lingual CI users. Overall, these findings provide further evidence for the effect of channel interaction on speech perception in CI users, emphasizing the challenges post-lingual users may face in learning.

Lai et al.³⁶ developed a deep neural network method that combined auditory and visual cues (lip movements) to enhance speech perception for individuals with CIs in noisy environments. Their proposed model, self-supervised learning-based audio-visual speech enhancement (SSL-AVSE), combines visual and auditory signals from the target speaker. The features of the AV-HuBERT model were extracted from the combined audio and visual data, which were then processed using a bidirectional long short-term memory model. The study included 80 participants, with 20 individuals allocated to each noise or sound condition to minimize cross-referencing bias. The SSL-AVSE method significantly improved speech enhancement performance, as measured by the perceptual evaluation of speech quality and short-time objective intelligibility tests. Additionally, the method was evaluated using a CI vocoder to verify its intelligibility. The SSL-AVSE method showed significant improvements in the presence of dynamic noise at different signal-to-noise ratio conditions, and the normal correlation matrix scores improved from 26.5% to 87.2% when compared to the baseline model. Thus, combining auditory and visual cues through the SSL-AVSE model significantly enhances speech perception in noisy environments for CI users.

Borjigin et al.³⁷ investigated the effectiveness of deep neural networks, specifically a recurrent neural network and SepFormer, in reducing multi-talker noise interference for CI users. The study used a custom dataset consisting of clean target speech and different noise types mixed at signal-to-noise ratios ranging from 1 to 10 dB. The recurrent neural network architecture consisted of an input layer of 512 units, two hidden long short-term memory layers of 256 units each, and a projection layer of 128 units, while the SepFormer used a single-layer convolutional network as an encoder to learn 2-dimensional features. The scale-invariant source-to-distortion ratio, short-time objective intelligibility, and perceptual evaluation of speech quality validated the performance of the models. The algorithms were tested on 13 adult CI users and showed that both deep neural network models significantly improved speech intelligibility in stationary and non-stationary noise conditions. These results highlight the potential of advanced neural network architectures, such as recurrent neural networks and SepFormer, to enhance auditory experiences for CI users in noisy environments.
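Training data for speech enhancement models of this kind are typically built by mixing clean speech with noise at a chosen signal-to-noise ratio. The sketch below shows the usual power-based scaling. It is a generic illustration rather than the authors' data pipeline, and the short sample lists are hypothetical placeholders for real audio.

```python
import math


def mean_power(signal):
    """Average squared amplitude of a signal."""
    return sum(x * x for x in signal) / len(signal)


def mix_at_snr(speech, noise, snr_db):
    """Scale noise so the speech-to-noise power ratio equals snr_db, then add."""
    if len(speech) != len(noise):
        raise ValueError("speech and noise must have the same length")
    noise_power = mean_power(noise)
    if noise_power == 0:
        raise ValueError("noise must not be silent")
    target_noise_power = mean_power(speech) / (10 ** (snr_db / 10))
    scale = math.sqrt(target_noise_power / noise_power)
    scaled_noise = [scale * n for n in noise]
    mixture = [s + n for s, n in zip(speech, scaled_noise)]
    return mixture, scaled_noise


def measured_snr_db(speech, noise):
    """SNR in dB from the power ratio of two signals."""
    return 10 * math.log10(mean_power(speech) / mean_power(noise))


# Hypothetical short signals standing in for audio samples.
speech = [1.0, -1.0, 1.0, -1.0]
noise = [0.5, 0.5, -0.5, -0.5]

mixture, scaled_noise = mix_at_snr(speech, noise, snr_db=10.0)
print(measured_snr_db(speech, scaled_noise))
```

Lower target values of `snr_db` yield louder scaled noise relative to the speech, so the low end of a 1 to 10 dB range is the hardest listening condition.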
