Participants
Participants included 10 healthy, age-, education-, and hearing-matched control participants, 10 individuals with svPPA, 10 individuals with nfvPPA, and 10 individuals with lvPPA (Table 1; note that control participants and participants with lvPPA are also presented in33). Participants with PPA were recruited as part of a speech-language intervention trial conducted by the Aphasia Research and Treatment Lab at the University of Texas at Austin35,36,37,38. Individuals with PPA were required to have a Mini-Mental State Exam39 score greater than 15 and to meet criteria for one of the canonical subtypes of PPA based on international consensus criteria1. Clinical diagnosis was based on comprehensive neurological and cognitive-linguistic assessment. Exclusion criteria for controls included a history of stroke, neurodegenerative disease, severe psychiatric disturbance, or developmental speech and language deficits. Because of the acoustic nature of the stimuli, hearing thresholds at 500, 1000, 2000, and 4000 Hz were collected for both ears; the pure tone average across frequencies and ears is reported in Table 1 for each participant group. The study was approved by the Institutional Review Board of the University of Texas at Austin, and participants provided written informed consent. The study was conducted in accordance with relevant guidelines and regulations. Because control participants were not recruited as part of the larger clinical trial, they were paid $15/hour for their participation. All participants were native English speakers who used English as their primary language.
Stimuli and task
Stimuli consisted of 15-minute segments from each of two audiobooks, Alice’s Adventures in Wonderland50 and Who Was Albert Einstein?51, the latter of which has been validated for use in stroke-induced aphasia52. Each audiobook was divided into 15 one-minute tracks, ensuring that each track started and ended with a complete sentence. Stimuli were presented binaurally using insert earphones (ER-3A, Etymotic Research, Elk Grove Village, IL). After listening to each track, participants were asked two multiple-choice questions to encourage close attention to the audiobook (accuracy presented in Table 1). These questions were not evaluated for their validity in assessing story comprehension, though we note that an analysis of variance revealed significant differences across the groups (F(3, 26) = 8.21, p < 0.001); post hoc comparisons performed using Tukey’s Honestly Significant Difference test indicated that individuals with lvPPA and svPPA performed significantly worse than control participants, and that individuals with svPPA also performed significantly worse than individuals with nfvPPA. To mitigate fatigue, participants were given the opportunity to take a break between tracks and were instructed to press the spacebar when they were ready to move on. For two participants with svPPA and five participants with nfvPPA, data were only available for the 15 tracks from Alice’s Adventures in Wonderland (see Supplementary Materials, Supplementary Table 1), creating an imbalance in samples between subtypes. We chose to use F1 for ranking classifier performance in order to minimize issues related to this imbalance (see “Analyzing model performance” for the definition of F1 and the justification for its use).
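For illustration, this group comparison can be reproduced in Python along the following lines; this is a minimal sketch using SciPy and statsmodels, with randomly generated stand-ins for the per-participant accuracy scores (all variable names and values here are hypothetical):

```python
import numpy as np
from scipy.stats import f_oneway
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# hypothetical per-group question accuracies (proportions), one value per participant
rng = np.random.default_rng(0)
controls, svppa, nfvppa, lvppa = (rng.uniform(0.5, 1.0, 10) for _ in range(4))

F, p = f_oneway(controls, svppa, nfvppa, lvppa)        # one-way ANOVA across groups
scores = np.concatenate([controls, svppa, nfvppa, lvppa])
labels = np.repeat(["control", "svPPA", "nfvPPA", "lvPPA"], 10)
print(pairwise_tukeyhsd(scores, labels).summary())      # Tukey HSD post hoc comparisons
```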
EEG data collection and preprocessing
While participants listened to the audiobooks, EEG data and audio were sampled at 25,000 Hz using a 32-channel (10–20 system) BrainVision actiCHamp active electrode system and BrainVision StimTrak, respectively (Brain Products, Gilching, Germany). The data were re-referenced offline using the common average reference. EEG data were preprocessed using EEGLAB 2019.153 in MATLAB 2016b (MathWorks Inc., Natick, MA, USA). Data were downsampled to 128 Hz, then filtered from 1 to 15 Hz using a non-causal, Hamming windowed-sinc FIR filter (high-pass filter cut-off = 1 Hz, filter order = 846; low-pass filter cut-off = 15 Hz, filter order = 212). Channels whose activity was > 3 standard deviations from surrounding channels were rejected and replaced via spherical spline interpolation. Large artifacts were suppressed using artifact subspace reconstruction54, with sixty seconds of manually defined clean data used as calibration data. Lastly, independent component analysis using the infomax algorithm was performed to correct for eye movement, muscle, and electrocardiographic artifacts, with components manually identified for correction. The cleaned EEG data were further filtered into the delta (1–4 Hz) and theta (4–8 Hz) bands, as these two frequency bands have been identified as important for speech processing but may support different aspects of processing: the delta band has been linked to processing longer speech units (e.g., words and phrases), whereas the theta band has been linked to processing shorter speech units (e.g., syllables)32.
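An approximate Python analogue of this pipeline, using MNE-Python rather than the EEGLAB tools actually employed in the study, is sketched below; the file name, component count, and excluded component indices are hypothetical, and artifact subspace reconstruction is omitted because core MNE provides no equivalent:

```python
import mne
from mne.preprocessing import ICA

# hypothetical file name; the study itself used EEGLAB in MATLAB, not MNE
raw = mne.io.read_raw_brainvision("sub-01.vhdr", preload=True)
raw.set_eeg_reference("average")                 # common average reference
raw.resample(128)                                # downsample to 128 Hz
raw.filter(1.0, 15.0, fir_design="firwin")       # non-causal windowed-sinc FIR, 1-15 Hz
# (artifact subspace reconstruction, performed in EEGLAB, has no core-MNE analogue)
ica = ICA(n_components=20, method="infomax").fit(raw)
ica.exclude = [0, 1]                             # artifactual ICs, identified manually
clean = ica.apply(raw.copy())
delta = clean.copy().filter(1.0, 4.0)            # delta band (1-4 Hz)
theta = clean.copy().filter(4.0, 8.0)            # theta band (4-8 Hz)
```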
Acoustic feature derivation
Cortical tracking of the speech envelope has proven sensitive to hearing impairment in neurotypical older adults55 and multiband envelope tracking has been shown to differ significantly between individuals with lvPPA and neurotypical older adults33. Thus, we investigated whether TRFs reflecting cortical tracking of acoustic features would be successful in PPA classification. Two acoustic features, the multiband speech envelope and broadband envelope derivative, were calculated for each of the audio tracks to be used for TRF modeling.
Multiband speech envelope
The multiband speech envelope reflects syllable, word, and phrase boundaries as well as prosodic cues56,57. To derive the multiband speech envelope, auditory stimuli from the audiobooks were first filtered through 16 gammatone filters to produce 16 bands58. The absolute value of the Hilbert transform in each of the 16 bands comprised the multiband stimulus envelope, which was then raised to a power of 0.6 to mimic the compression characteristics of the inner ear59. This resulted in 16 band-specific speech envelopes. TRFs were estimated for each of the 16 bands. The TRF beta weights were averaged across the 16 bands for ML classification.
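A minimal sketch of this derivation is given below; note that it substitutes 16 log-spaced Butterworth band-pass filters for the gammatone filterbank used in the study, so it approximates rather than reproduces the published feature (the signal `audio`, sampling rate `fs`, and band edges are hypothetical):

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt, hilbert

def multiband_envelope(audio, fs, n_bands=16, f_lo=80.0, f_hi=8000.0, power=0.6):
    edges = np.geomspace(f_lo, f_hi, n_bands + 1)    # log-spaced band edges
    envs = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        sos = butter(4, [lo, hi], btype="bandpass", fs=fs, output="sos")
        band = sosfiltfilt(sos, audio)               # zero-phase band-pass filtering
        env = np.abs(hilbert(band)) ** power         # Hilbert envelope, compressed (^0.6)
        envs.append(env)
    return np.stack(envs)                            # shape: (16, n_samples)
```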
Broadband envelope derivative
The broadband envelope derivative reflects acoustic onsets and offsets critical for identifying syllable, word, and phrase boundaries60. The auditory cortex, including the superior temporal gyrus, has been shown to be particularly sensitive to acoustic edges61. Considering that the superior temporal gyrus is a site of prominent atrophy in lvPPA62, we sought to determine whether cortical tracking of the broadband envelope derivative would be useful for PPA classification. We therefore took the first temporal derivative of the broadband envelope, retaining only its positive values, for use in TRF estimation.
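A minimal sketch of this feature, assuming a precomputed broadband envelope array:

```python
import numpy as np

def envelope_derivative(envelope):
    # first temporal derivative of the broadband envelope (length preserved)
    d = np.diff(envelope, prepend=envelope[0])
    # keep only positive values, emphasizing acoustic onsets
    return np.clip(d, 0.0, None)
```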
Linguistic feature derivation
Linguistic features were selected that correspond to the core language domains implicated in PPA and used for PPA subtype classification. Specifically, we selected features reflecting phonological processing (significantly impaired in lvPPA), semantic processing (significantly impaired in svPPA), and syntactic processing (significantly impaired in nfvPPA). Critically, the specific linguistic features we selected to represent each of these levels of processing have been demonstrated to have better-than-chance prediction accuracy in previous studies utilizing TRF modeling63,64,65. Prosodylab-Aligner66 was used to temporally align phonemes and words with the audio tracks (i.e., for identification of phoneme and word onsets and offsets), with manual correction by expert linguists and highly trained research assistants. Because of coarticulation, there is no “ground truth” for where one phoneme or word begins and another ends; we therefore emphasized consistency in alignment by having the first author review each track and make edits as needed. We note that although “errors” in alignment would impact the accuracy of TRF modeling, they would be consistent across participants and therefore should not impact classification performance. Phoneme and word onsets were subsequently used to temporally align linguistic features with the EEG responses.
Phonological feature: cohort entropy
Cohort entropy quantifies the degree of uncertainty regarding word identity at the current phoneme based on competition among words in the cohort (the list of words sharing the same phonemes up to that point in the word). It was derived at the phoneme level and mapped to phoneme onsets for TRF estimation. Notably, the first phoneme in each word lacks a feature value. A phoneme’s cohort entropy is defined as the Shannon entropy for the cohort of words consistent with the phonemic makeup up to that phoneme64. Each word contributes the product of its frequency and the natural log of its frequency, and the negated sum of these products over the cohort yields the entropy. Frequency here denotes the word’s corpus probability: the frequency count of the word in the SUBTLEX_us_2007 corpus67 divided by the total number of words in the corpus, forming a probability distribution among the words. For the ith phoneme in a word, the following formula was used to compute cohort entropy.
$$-\sum\limits_{word \, \in \, cohort} freq\left( word \right) \cdot \ln \left( freq\left( word \right) \right)$$
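A minimal sketch of this computation, assuming a hypothetical pronunciation lexicon (word → phoneme sequence) and a dictionary `p` of corpus probabilities:

```python
import math

def cohort_entropy(prefix, lexicon, p):
    # cohort: all words whose phoneme sequence begins with the phonemes
    # heard so far (the prefix up to and including the current phoneme)
    cohort = [w for w, phones in lexicon.items()
              if tuple(phones[:len(prefix)]) == tuple(prefix)]
    # per the formula above, with freq(word) the word's corpus probability
    return -sum(p[w] * math.log(p[w]) for w in cohort)
```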
Semantic features
These features were derived at the word level and were subsequently mapped to word onsets for TRF estimation.
Word frequency
Word frequency represents how frequently a word appears in the English language. As previously indicated, to derive word frequency, the frequency count of the word was determined based on the SUBTLEX_us_2007 corpus67 and then divided by the total number of words in the corpus, forming a probability distribution among the words; frequency is defined as the natural logarithm of each word’s probability, assuming no prior context64. For any word w, its word frequency can be mathematically formulated as its natural log probability, \(\ln \left( p\left( w \right) \right)\), where p represents probability as defined above, independent of context.
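A minimal sketch, assuming a hypothetical dictionary `counts` of SUBTLEX_us_2007 frequency counts:

```python
import math

def word_frequency(word, counts):
    # counts: word -> raw frequency count in the corpus (hypothetical input)
    total = sum(counts.values())
    return math.log(counts[word] / total)  # ln(p(w)), independent of context
```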
Semantic dissimilarity
Semantic dissimilarity represents how semantically dissimilar a word is compared to the preceding words in a sentence63. To calculate semantic dissimilarity, we first used the well-established NLP model GPT2 to derive a semantic feature vector for each word68. GPT2 was chosen because it is a widely used neural language model yielding contextualized word representations (i.e., “feature vectors”)69 that are sensitive to preceding context. Computations were run on Google Colab Pro’s GPUs and TPUs. Semantic dissimilarity was then derived by taking each word’s GPT2 feature vector and computing 1 minus the correlation coefficient between that vector and the mean of the vectors for all previous words in the sentence. As such, the first word of each sentence does not have a feature value. Dissimilarity values ranged from 0 to 2, with larger values reflecting greater dissimilarity. SciPy was used to compute the mean feature vector across words and NumPy was used to compute the correlations across feature vectors. For the ith word in a text, its semantic dissimilarity is mathematically formulated as
$$1 - r\left[ f\left( w_{i} \right),\; mean\left[ f\left( w_{i-1} \right), f\left( w_{i-2} \right), \ldots, f\left( w_{2} \right), f\left( w_{1} \right) \right] \right]$$
where r represents Pearson’s correlation and f(w) represents a word’s feature vector.
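The sketch below illustrates this computation with GPT2 hidden states obtained via the Hugging Face transformers library; the exact layer and word-to-token alignment used in the study are not specified here, so extracting the final hidden state of each word’s last sub-token is an assumption of this sketch:

```python
import numpy as np
import torch
from transformers import GPT2Model, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2")

def word_vectors(words):
    """One GPT2 feature vector per word: hidden state of its last sub-token."""
    vecs = []
    for i in range(len(words)):
        enc = tokenizer(" ".join(words[: i + 1]), return_tensors="pt")
        with torch.no_grad():
            vecs.append(model(**enc).last_hidden_state[0, -1].numpy())
    return vecs

def semantic_dissimilarity(vecs):
    out = [np.nan]  # the first word of a sentence has no feature value
    for i in range(1, len(vecs)):
        context_mean = np.mean(vecs[:i], axis=0)       # mean of preceding word vectors
        r = np.corrcoef(vecs[i], context_mean)[0, 1]   # Pearson correlation
        out.append(1.0 - r)                            # 1 - r, ranging from 0 to 2
    return out
```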
Syntactic feature: syntactic surprisal
Syntactic surprisal was derived at the word level and subsequently mapped to word onsets for TRF estimation. Syntactic surprisal represents how surprising the part of speech (POS) tag of the current word is given the preceding words. A word’s syntactic surprisal is defined as the log probability of its POS tag conditioned on the previous text65, where the next-word probability distribution was extracted using GPT270. As with semantic dissimilarity, GPT2 was chosen because of its contextualized word representations. To form the next-word probability distribution with GPT2, the text preceding the current word was fed into GPT2, which output logits; a softmax was applied to the logits to form a probability distribution. From this distribution, we decoded using the nucleus sampling algorithm70 with p = 0.9 (i.e., the smallest set of next-word predictions such that the cumulative probability was 0.9). Each word in this nucleus sample was then tagged with the POS tagger from SpaCy’s en_core_web_lg model (https://spacy.io/models/en#en_core_web_lg). From this, counts of each POS tag were computed and then normalized to form the POS tag probability distribution. For the ith word of a text, its syntactic surprisal can be mathematically formulated as
$$\ln \left[ p\left( pos\left( w_{i} \right) \mid w_{i-1}, w_{i-2}, \ldots, w_{2}, w_{1} \right) \right]$$
where \(p\left( pos\left( w_{i} \right) \mid w_{i-1}, w_{i-2}, \ldots, w_{2}, w_{1} \right)\) is computed from the nucleus sampling procedure outlined above.
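The sketch below illustrates this pipeline; it simplifies the procedure by considering only single-token continuations and by POS-tagging each candidate word in isolation, both of which are assumptions of this sketch rather than details of the published method:

```python
import math
import spacy
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
nlp = spacy.load("en_core_web_lg")

def syntactic_surprisal(context, next_word, p_nucleus=0.9):
    enc = tokenizer(context, return_tensors="pt")
    with torch.no_grad():
        logits = model(**enc).logits[0, -1]              # logits for the next token
    probs = torch.softmax(logits, dim=-1)                # next-word distribution
    sorted_p, sorted_ix = torch.sort(probs, descending=True)
    keep = torch.cumsum(sorted_p, dim=0) <= p_nucleus    # nucleus (top-p) set
    keep[0] = True                                       # always keep the top token
    tag_counts = {}
    for ix in sorted_ix[keep]:
        token = tokenizer.decode([int(ix)]).strip()
        if token:
            tag = nlp(token)[0].pos_                     # POS tag of the candidate
            tag_counts[tag] = tag_counts.get(tag, 0) + 1
    total = sum(tag_counts.values())                     # normalize tag counts
    p_tag = tag_counts.get(nlp(next_word)[0].pos_, 0) / total
    return math.log(p_tag) if p_tag > 0 else float("-inf")
```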
TRF modeling
TRF estimation was conducted using EEG data that were z-scored to each participant’s mean across channels. TRFs were constructed to map each track’s acoustic or linguistic features to a participant’s corresponding EEG data, with separate TRFs estimated for each acoustic and linguistic feature. For the multiband envelope, TRFs were estimated separately for each of the 16 frequency bands and then averaged across those bands. For the broadband envelope and linguistic features, a single TRF was estimated. Each TRF was estimated by minimizing the least-squares distance between EEG values predicted from a given feature and the participant’s observed EEG data, using time lags of −500 to 1000 ms. TRFs were derived using ridge regression and validated using leave-one-out cross-validation, implemented in the mTRF Toolbox71. Each resulting TRF is a vector of beta weights, which were then used as input to the ML algorithms described below.
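For intuition, a bare-bones Python analogue of lagged ridge-regression TRF estimation is sketched below (the study used the MATLAB mTRF Toolbox; this sketch uses circular shifts where the toolbox zero-pads edges, and omits cross-validated selection of the regularization parameter):

```python
import numpy as np

def trf_ridge(stim, eeg, fs=128, tmin=-0.5, tmax=1.0, lam=1.0):
    """Estimate TRF beta weights mapping a stimulus feature to one EEG channel."""
    lags = np.arange(int(tmin * fs), int(tmax * fs) + 1)
    # lagged design matrix: one column per time lag of the stimulus feature
    # (np.roll wraps circularly; the mTRF Toolbox zero-pads edges instead)
    X = np.column_stack([np.roll(stim, lag) for lag in lags])
    # ridge solution: beta = (X'X + lam*I)^(-1) X'y
    beta = np.linalg.solve(X.T @ X + lam * np.eye(len(lags)), X.T @ eeg)
    return lags / fs, beta  # lag times (s) and TRF beta weights
```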
Classification
Classification tasks
The broadest task was to classify each participant as either a control participant or an individual with PPA (controls vs. PPA). Differential classification across participant groups (four-way classification, controls vs. svPPA vs. lvPPA vs. nfvPPA) and by PPA subtype (three-way classification, svPPA vs. lvPPA vs. nfvPPA) was also pursued. Additionally, we sought to classify one type of PPA by ruling out the other two types (svPPA vs. nfvPPA and lvPPA; lvPPA vs. svPPA and nfvPPA; nfvPPA vs. svPPA and lvPPA), which would be clinically useful in cases where an overall PPA diagnosis has been conferred and one PPA subtype is suspected. This one-vs-rest approach is also a common methodology for multiclass classification that can improve the performance of ML classification algorithms. Lastly, we pursued pairwise (two-way) classification by PPA subtype (svPPA vs. lvPPA; svPPA vs. nfvPPA; and lvPPA vs. nfvPPA), which would be useful in cases where a PPA diagnosis has been conferred and narrowed down to one of two possible subtypes.
Reading in EEG and TRF data
All participants had EEG data from 30 EEG channels, but only data from channel Cz (10–20 electrode system placement72) were fed into the ML classification algorithms (see “ML classification algorithms”), because a single vector concatenating all channels (i.e., 30 channels × 8307 timestamps) would have exceeded our computational constraints. Channel Cz was selected based on its common use for analysis and display purposes in the previous TRF literature73,74. Further, it is less susceptible to bias from hemispheric differences, which is particularly important in a population like PPA, where neurodegeneration is asymmetric. Lastly, Cz has also been linked to language-related ERPs, such as the N40075,76. Participant-level data were reorganized into track-level data, resulting in 1095 tracks (33 participants with 30 tracks and 7 participants with 15 tracks) that were used for training and evaluating the ML classification algorithms. This number of data points (1095) exceeds that of any previous study in the literature on automated approaches to PPA classification. The number of data points for each subgroup is presented in Supplementary Table 1. The results reported in the main text reflect classification performance at the track level. In the Supplementary Materials, we also report the classification performance when track-level predictions are merged into individual-level predictions (Supplementary Table 2).
TRF beta weights were available for every audio track. As with the EEG data, each participant’s channel Cz TRF beta weights were used to build an ML-based classifier. Standardization of both TRF and EEG data is discussed in the “ML classification algorithms” section. We note that the input to the model was a single vector, both for each classification task’s TRF-based model and for the EEG-based model. The “Hyperparameter tuning” section describes the process used to select the single acoustic/linguistic feature and the single ML classification algorithm used in each classification task’s model.
ML classification algorithms
It is common practice to test a variety of classification algorithms to achieve the best classification performance77,78,79,80,81,82. In this study, we evaluated nine ML classification algorithms from the Python ScikitLearn package83: decision tree, random forest, extremely randomized trees (ExtraTrees), SVM, kNN, logistic regression, Gaussian Naive Bayes, AdaBoost, and multilayer perceptron (MLP). This is similar to the seven ML classification algorithms used by28 for PPA classification. Note that kNN, SVM, and MLP required prior scaling, as these algorithms are sensitive to the scale of (and, for kNN and SVM, the distances between) data points; scaling here refers to standardizing all input TRF/EEG data by subtracting the mean and scaling to unit variance. The other six ML classification algorithms did not require any preprocessing of the TRF beta weights or EEG data, as they are insensitive to feature scale.
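In scikit-learn terms, this distinction can be expressed as follows (a minimal sketch; the classifier choices here are illustrative):

```python
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# scale-sensitive algorithms (kNN, SVM, MLP) are wrapped with standardization
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier())
# tree-based algorithms take the raw TRF beta weights / EEG vectors directly
extra_trees = ExtraTreesClassifier(random_state=0)
```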
Cross validation
At the participant level, the data were split into 5 stratified outer folds, with 80% of each fold designated for training and 20% for testing. Special care was taken to ensure the split was made at the participant level rather than the track level so that results generalize across individuals; in other words, all tracks for a given participant were either in the training set or the test set (never both). The classifier’s predictions on each outer fold’s test set were merged to form a set of predictions for all data points, which were then compared to ground truth (see “Analyzing model performance”). This use of cross-validation ensures the reported results are applicable across all participants in our sample, in contrast to a single 80–20 train-test split, in which the classifier would be trained on 80% of the data and evaluated on only the remaining 20% (i.e., results would reflect only a fifth of the dataset). Our decision to use cross-validation instead of a train-test split was motivated by the small N of our dataset.
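A minimal sketch of participant-level stratified cross-validation using scikit-learn’s StratifiedGroupKFold (available in scikit-learn ≥ 1.0); all shapes, labels, and the classifier below are synthetic stand-ins:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedGroupKFold

rng = np.random.default_rng(0)
n_participants, n_tracks = 40, 15                        # simplified; real counts varied
X = rng.normal(size=(n_participants * n_tracks, 193))    # hypothetical TRF vectors
groups = np.repeat(np.arange(n_participants), n_tracks)  # participant ID per track
y = np.repeat(rng.integers(0, 2, n_participants), n_tracks)  # label per participant

preds = np.empty_like(y)
clf = RandomForestClassifier(random_state=0)
for train_ix, test_ix in StratifiedGroupKFold(n_splits=5).split(X, y, groups):
    # all of a participant's tracks fall entirely in train or test, never both
    clf.fit(X[train_ix], y[train_ix])
    preds[test_ix] = clf.predict(X[test_ix])             # pooled across outer folds
```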
Hyperparameter tuning
For each classification task’s model, we built a classifier for each possible combination of EEG frequency band, single acoustic/linguistic feature used to derive TRF weights, and single classification algorithm into which the TRFs were fed. The combination that resulted in the best nested cross-validation performance for each classification task is reported in Tables 2, 3, 4, 5, 6. The classification performance for all classifiers constructed (i.e., each combination of frequency band, acoustic/linguistic feature, and classification algorithm) is reported for each classification task in Supplementary Tables 3–12, and the best classification performance for the delta and theta bands, specifically, is reported in Supplementary Table 13. The percentage of classifiers outperforming the random sampling baseline is reported in Supplementary Table 14 (see “Analyzing model performance”). For each classifier built, we used 5-fold nested cross-validation to determine the internal hyperparameters of the ML classification algorithm. For each of the five outer folds, the training set was split into five stratified inner folds (i.e., a 5-fold cross-validation run on the outer fold’s training set, with 80% of each inner fold designated for training and 20% for validation). When evaluating a particular set of hyperparameters, classification performance was computed for each inner fold (i.e., trained on the inner fold’s training set and evaluated on its validation set) and then averaged. This process was repeated for several sets of hyperparameters, from which the best-performing set was identified; note that only the outer fold’s training set was used to determine the best hyperparameter combination. The best-performing hyperparameters were then used to train a model on all of the outer fold’s training data, which was evaluated on the outer fold’s test set (which was not seen or used in the hyperparameter tuning process, thus giving an unbiased estimate of the hyperparameters’ true performance). This process was repeated for each remaining outer fold, such that each outer fold could select different hyperparameters from its own training set, with evaluation on its own test set. Finally, each outer fold’s test set predictions were merged to form a set of predictions for all data points, which were then compared to ground truth (see “Analyzing model performance”). This nested cross-validation process allowed us to optimize each classifier’s hyperparameters without compromising the validity of its evaluation and generalization to new patients. In sum, for each classification task, the inner folds were used to select the model’s best hyperparameter combination and the outer folds were used for final evaluation of the model itself.
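A compact sketch of this nested procedure, again with synthetic stand-ins; the hyperparameter grid and algorithm below are illustrative, not those used in the study:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, StratifiedGroupKFold
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(600, 193))              # hypothetical TRF feature vectors
groups = np.repeat(np.arange(40), 15)        # participant ID per track
y = np.repeat(rng.integers(0, 2, 40), 15)    # hypothetical class labels

outer = StratifiedGroupKFold(n_splits=5)
inner = StratifiedGroupKFold(n_splits=5)
grid = {"C": [0.1, 1, 10]}                   # illustrative hyperparameter grid

preds = np.empty_like(y)
for tr, te in outer.split(X, y, groups):
    # hyperparameters are selected using the outer training set only
    splits = list(inner.split(X[tr], y[tr], groups[tr]))
    search = GridSearchCV(SVC(), grid, scoring="f1_macro", cv=splits)
    search.fit(X[tr], y[tr])                 # refits best params on the full training set
    preds[te] = search.predict(X[te])        # evaluated once on the held-out outer fold
```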
Analyzing model performance
Recall, precision, and F1 score were the metrics of interest. A class’s recall reflects the proportion of true positive cases predicted as positive relative to all true positive cases (e.g., how many individuals with PPA were classified as having PPA). Precision reflects the proportion of true positive cases predicted as positive relative to all predicted positive cases (e.g., of all samples classified as PPA, how many actually had PPA). Lastly, F1 score is the harmonic mean of precision and recall, ranging from 0 to 1, where 1 reflects perfect classification. F1 was used to evaluate each model’s performance in lieu of accuracy for two reasons. First, for many of the selected classification tasks, there was an uneven class distribution; for example, for the classification task of svPPA vs. lv/nfvPPA, there were twice as many lv/nfvPPA samples as svPPA samples. The macro (i.e., unweighted) average of each class’s F1 is well suited to situations with class imbalance because it gives equal weight to the dominant and non-dominant classes, avoiding artificial inflation of the score by the dominant class (which could potentially have a higher F1 score). Using accuracy, as many previous studies have done, can allow a classifier to achieve seemingly good performance by always predicting the dominant class; for example, given that there are three times as many PPA samples as controls, our classifier for controls vs. PPA would achieve 75% accuracy by classifying every sample as PPA, whereas the corresponding F1 would be much lower. Second, unlike accuracy, F1 balances the need for simultaneously good precision and recall. To show that our classifiers achieved meaningful, above-chance performance, baseline F1 scores were derived by randomly sampling each prediction from a uniform distribution and from the sample-label distribution (Supplementary Table 14). These baselines were computed using ScikitLearn’s DummyClassifier model, with its strategy parameter set to either “uniform” or “stratified”.
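A minimal sketch of the macro-F1 metric and the two chance baselines, using synthetic stand-ins for the data and prediction arrays:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, 200)                     # hypothetical ground truth
y_pred = rng.integers(0, 2, 200)                     # hypothetical model predictions
X = rng.normal(size=(200, 5))                        # features (ignored by dummies)

print(f1_score(y_true, y_pred, average="macro"))     # macro (unweighted) F1

for strategy in ("uniform", "stratified"):           # chance baselines
    dummy = DummyClassifier(strategy=strategy, random_state=0).fit(X, y_true)
    print(strategy, f1_score(y_true, dummy.predict(X), average="macro"))
```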
For all classification tasks, McNemar tests from the mlxtend package84 were used to compare the best EEG-only model (using EEG waveforms as input) against the best TRF-based model (using TRF beta weights as input) in order to determine whether the derivation of TRF beta weights provided additional benefit to classification performance. From the predictions of the best TRF-based and best EEG-only classifiers, a 2 × 2 contingency table was formed using the mcnemar_table function from the mlxtend package; from this contingency table, the McNemar test statistic and corresponding p-value were computed using the mcnemar function, also from mlxtend.
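A minimal sketch of this comparison, with hypothetical prediction arrays standing in for the two classifiers’ outputs:

```python
import numpy as np
from mlxtend.evaluate import mcnemar, mcnemar_table

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, 200)       # hypothetical ground-truth labels
trf_preds = rng.integers(0, 2, 200)    # predictions of the best TRF-based model
eeg_preds = rng.integers(0, 2, 200)    # predictions of the best EEG-only model

tb = mcnemar_table(y_target=y_true, y_model1=trf_preds, y_model2=eeg_preds)
chi2, p = mcnemar(ary=tb, corrected=True)   # McNemar statistic and p-value
```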
