Key findings
To the best of our knowledge, this is the first study to examine the value of speech-derived sentiment and linguistic features in detecting UHR. We found that features capturing sentiment variability (i.e., EVA), lexical sophistication (i.e., TAALES), and morphology (i.e., TAMMI) were the most valuable for detecting youths at UHR. Our factor analysis revealed five factors underlying the top features, namely (1) Sentiment Intensity and Variability, (2) Linguistic Register Alignment, (3) Phonographic Uniqueness and Recognizability, (4) Morphological Complexity and Imageability, and (5) Lexical Richness and Typicalness. Additionally, when trained on the top features, our ML models achieved good accuracy (mean AUC = 0.70) in identifying youths at UHR.
Comparison with related studies
Our model did not outperform similar models. Kizilay et al. trained RFCs on semantic similarity features and POS tags extracted from Thematic Apperception Test (TAT) transcripts to detect youths at UHR (AUC = 0.86)9. Corcoran et al. predicted conversion amongst youths at UHR (AUC = 0.87) using LR models trained on semantic cohesion features and POS tags extracted from Kiddie Formal Thought Disorder Story Game (K-FTDSG) transcripts12. Our model’s lower performance could be attributed to the length of our transcripts: HiSoC took 45 seconds, while the TAT and K-FTDSG took approximately 8 and 25 minutes, respectively. The shorter transcripts could have reduced the quality of the extracted features, which ultimately limited model performance. Nonetheless, we achieved a modest AUC of 0.70, which showcases the potential of using short speech samples to detect youths at UHR.
In contrast to similar studies, we did not identify cohesion features as top predictors for UHR detection. Previous studies have reported cohesion features (e.g., semantic similarity, semantic coherence) as top predictors for UHR detection and conversion, respectively9,12. One explanation is that cohesion features, including the presence and absence of semantic overlaps, may be more apparent and better captured in longer transcripts. Given the short speech samples analyzed here, we found that sentiment variability, lexical sophistication, and morphology were the most informative instead.
Sentiment and linguistic characteristics of UHR speech
All features from the Sentiment Intensity and Variability factor were negatively associated with UHR, suggesting that UHR speech is characterized by diminished positive sentiments and persistent periods of negative sentiments. This finding echoes that of Olson et al., who reported an association between UHR and the use of words with more negative emotional tones31. Diminished sentiment intensity and variability in speech could be a manifestation of anhedonia, a core negative symptom of schizophrenia32, which is also observed in UHR33,34. It is also important to note that anhedonia has been reported as a key predictor of social and occupational functioning in high-risk youths35.
All features from the Linguistic Register Alignment factor were negatively associated with UHR, suggesting that UHR speech deviates from typical spoken, fiction, and magazine registers. Divergence from normative linguistic patterns may point towards language disorganization, which may be seen in UHR36 and psychosis6. Aberrant linguistic styles, such as higher levels of peculiar word usage, peculiar sentence construction, and peculiar logic in speech, have been observed in UHR and manifest more strongly in psychosis36.
WN_SD_CW and OG_N from the Phonographic Uniqueness and Recognizability factor had positive and negative associations with UHR, respectively, suggesting that UHR speech is characterized by less recognizable and more phonographically unique words (i.e., words with few phonographic neighbors). The use of less recognizable words may point towards unusual and abnormal linguistic styles observed in psychosis37. Regarding phonographic uniqueness, OG_N captures the number of phonographic neighbors of each word relative to ELP’s lexicon instead of the transcript. Hence, we could not comment on the degree of phonological association within the transcripts. However, we would expect UHR speech to be characterized by phonologically similar words, as clanging is associated with individuals with schizophrenia38.
MRC_Imageability_CW and Inflected_Tokens from the Morphological Complexity and Imageability factor were negatively associated with UHR, suggesting that UHR speech contains fewer imageable words and is morphologically less complex. Our finding on word imageability is surprising, since the use of highly imageable words has been reported in individuals with schizophrenia39. However, this could be due to differences in the nature of the speech tasks. Our study employed HiSoC, where participants responded freely to an open-ended question. In the referenced study, participants took part in the Figurative Language 2 (FL2) task from the Assessment of Pragmatic Abilities and Cognitive Substrates test, where they had to explain idioms, metaphors, and proverbs39. FL2 was designed to assess an individual’s ability to infer non-literal meanings from figurative language40. Due to difficulties in understanding figurative language, individuals with schizophrenia scored poorly on FL2 and used more imageable words in their explanations39. In open-ended tasks, however, the use of imageable words amongst individuals with psychotic disorders is not widely studied, and more studies are required to elucidate associations between word imageability and psychotic disorders. On the other hand, previous studies have reported reductions in complex morphological structures in speech, such as verbs, adjectives, and adverbs, amongst patients with schizophrenia41, as well as their poor performance on morphology tasks such as past tense production42. Moreover, suffix_freq_per_cw from the Morphological Complexity and Imageability factor was positively associated with UHR, which further suggests that UHR speech is characterized by simpler morphology.
LD_Mean_Accuracy_CW and COCA_spoken_tri_2_MI from the Lexical Richness and Typicalness factor had negative and positive associations with UHR, respectively, suggesting that UHR speech contains less typical words and more predictable lexical structures (n-gram combinations). The use of atypical words could reflect underlying thought impairments, which have been observed in psychosis37. On the other hand, the use of predictable lexical structures suggests a lack of lexical richness, pointing towards poverty of content (POC). POC is associated with conversion amongst youths at UHR10 and has been identified as a key predictor of psychosis onset12. Furthermore, amongst patients with formal thought disorder, those with POC tend to have poorer long-term prognosis43.
Large language models for UHR detection
Large language models (LLMs) are powerful text-based models with immense potential for clinical decision support (CDS). LLMs have powerful deductive capabilities and are likely to supersede current NLP techniques, if they have not already. However, they are complex black boxes and prone to hallucination. While LLMs are surprisingly good at certain clinical tasks, it is unknown whether they are competent in performing mental health diagnoses directly from speech data. Using the same UHR classification task, we compared our Boruta model against the untuned Llama 3.1 8B-Instruct model. To coax the LLM into its decision support task, we developed three levels of prompt engineering. First, we prompted the LLM to act as a CDS tool for predicting whether an individual is at UHR based on the speech transcript; this achieved a balanced accuracy of 0.48 (slightly worse than chance). Next, we prompted the LLM to list the UHR criteria based on CAARMS and to use those criteria to evaluate the transcripts; the LLM correctly listed the criteria and scored a higher balanced accuracy of 0.50, but still performed no better than chance. Finally, we performed few-shot prompting by providing two transcript examples from each class (UHR positive and negative), which did not achieve much better results (balanced accuracy of 0.52). The structure and example of each prompt level are detailed in Supplementary Table 3.
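The three escalating prompt levels and the chance-level baseline can be sketched as follows. This is a minimal illustration only: the prompt wording and the `build_prompt` helper are hypothetical stand-ins (the actual prompts are given in Supplementary Table 3), and balanced accuracy is implemented by hand to show why 0.50 corresponds to chance performance under class imbalance.

```python
# Illustrative sketch: three escalating prompt levels and the balanced
# accuracy metric. The prompt wording is a hypothetical stand-in, not the
# exact prompts used in the study (see Supplementary Table 3).

def build_prompt(transcript, level, examples=None):
    prompt = ("You are a clinical decision support tool. Based on the speech "
              "transcript below, answer UHR-POSITIVE or UHR-NEGATIVE.\n")
    if level >= 2:  # level 2: ask the model to state and apply CAARMS criteria
        prompt += ("First list the CAARMS criteria for ultra-high risk, "
                   "then apply them to the transcript.\n")
    if level == 3 and examples:  # level 3: few-shot, e.g. two examples per class
        for text, label in examples:
            prompt += f"\nTranscript: {text}\nLabel: {label}\n"
    return prompt + f"\nTranscript: {transcript}\nLabel:"

def balanced_accuracy(y_true, y_pred):
    # Mean of per-class recall: unlike plain accuracy, a model that always
    # predicts the majority class still scores only 0.5 on imbalanced data.
    recalls = []
    for c in set(y_true):
        idx = [i for i, y in enumerate(y_true) if y == c]
        recalls.append(sum(y_pred[i] == c for i in idx) / len(idx))
    return sum(recalls) / len(recalls)

# A degenerate classifier that always answers "UHR-negative" (0):
print(balanced_accuracy([1, 1, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0]))  # 0.5
```

This is why balanced accuracy, rather than raw accuracy, is the appropriate yardstick here: with far more UHR-negative than UHR-positive transcripts, always predicting the majority class would look deceptively accurate otherwise.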
Although the LLM improved progressively with increasing levels of prompt engineering, its best accuracy was only marginally higher than that of a random classifier on our data. These early findings suggest that using out-of-the-box LLMs directly for speech-based diagnosis may be insufficient. We suspect that training corpora with explicit mental health labels may be lacking due to high sensitivity concerns, which also implies that such LLMs cannot be relied on for meaningful clinical explanations. Although disappointing and surprising given recent published literature, we believe this highlights an exciting gap for future research: developing new ways to train LLMs, possibly by adding more modalities and examples to finetune the models. There are also recently published local mental health LLMs trained on social media data that we can evaluate using our data44,45. Finally, given the deep domain expertise incorporated into existing NLP methods, there may be an opportunity to integrate these NLP features with LLMs to enhance model explainability and performance, bringing us closer towards speech-based CDS systems for mental health.
Limitations
Class imbalance issues
Amongst the 429 speech transcripts available, only 80 belonged to the UHR positive group. This major imbalance between UHR positive and negative samples could result in uneven data distributions during train-test partitioning, leading to model underfitting. To alleviate this issue, we used leave-one-out cross-validation to ensure that most of the UHR positive samples were included in the training data in each iteration. This prevents the model from training on radically small and divergent subsets of the UHR cohort, reducing model bias and performance instability.
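The leave-one-out scheme can be sketched as follows. The toy data and the 1-nearest-neighbour classifier are hypothetical stand-ins for our actual features and models; the sketch serves only to show the mechanics, namely that each iteration trains on all samples but one, so nearly every minority-class sample remains in every training fold.

```python
# Minimal leave-one-out cross-validation loop (toy data; illustrative only).
# Each iteration holds out exactly one sample, so all but one of the
# minority-class (UHR positive) samples remain in every training fold.

def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def knn_predict(train_X, train_y, x):
    # 1-nearest-neighbour classifier as a stand-in for the actual model
    nearest = min(range(len(train_X)), key=lambda i: euclidean(train_X[i], x))
    return train_y[nearest]

def leave_one_out(X, y):
    preds = []
    for i in range(len(X)):
        # Train on everything except sample i, then predict sample i
        train_X = X[:i] + X[i + 1:]
        train_y = y[:i] + y[i + 1:]
        preds.append(knn_predict(train_X, train_y, X[i]))
    return preds

X = [[0.1], [0.2], [0.9], [1.0], [0.15], [0.95]]
y = [0, 0, 1, 1, 0, 1]  # imbalance-free toy labels for illustration
preds = leave_one_out(X, y)
accuracy = sum(p == t for p, t in zip(preds, y)) / len(y)
print(accuracy)  # 1.0 on this trivially separable toy data
```

With 429 samples, this scheme yields 429 folds, each trained on 428 samples, at the cost of fitting the model once per sample.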
Cohort and data heterogeneity
While informative, the shortlisted features may not guarantee generalizable performance since our cohort is small and local. Additionally, participant behavior on the HiSoC task was highly varied, despite standardized instructions and a 45-second limit. Sparse speech content presented further challenges for our model. Moreover, some reference corpora (e.g., COCA) may be grammatically different from Singlish—the colloquial form of English used in Singapore. Such linguistic differences that are present in the transcripts cannot be captured by the extracted features. While LLMs pre-trained on Singlish texts may alleviate this issue, we ultimately opted for ML models trained on established NLP features as they are better equipped for model explainability. In future work, we can explore the integration of culturally or socially contextualized speech constructs to further enhance model performance.
Duration of HiSoC task
Participants were only given 10 s to prepare and 45 s to respond. These relatively short preparation and response times may serve to induce anxiety in participants and capture their responses to immediate stimuli. However, different participants may experience different levels of anxiety, which could, in turn, be a confounding variable. Furthermore, the short duration of the HiSoC task may limit our ability to fully assess the participants’ mood, thought patterns, and cognitive processes.
Limitations of extracted features
The WN_SD_CW values were derived from the ELP word naming task and do not represent the word naming latencies of our participants. Thus, we could not compare the word naming latencies between the UHR and control groups. Similarly, the OG_N values were retrieved from the ELP lexicon and do not capture the degree of phonological association within the participants’ transcripts. This information could be useful for detecting clanging, which is associated with individuals with schizophrenia38.
Impact and future work
Having tested and evaluated an extensive set of sentiment and linguistic features, we now know which representations of speech-transcribed text are most informative for UHR detection. While challenging, developing text-based technologies is highly rewarding given the prevalence of textual data in online interactions and real-world applications (e.g., social media platforms, online help portals). Upon achieving robust modeling of text features, we can incorporate other key communication modalities (e.g., facial expressions, gaze, body movements) to create more comprehensive and powerful multimodal models, which could provide a more holistic representation of each patient and give rise to clinically relevant explanations.
