Improving speech depression detection using transfer learning with wav2vec 2.0 in low-resource environments

Datasets description

The datasets employed in this study are the widely used Distress Analysis Interview Corpus with Wizard-of-Oz (DAIC-WOZ)39 and CMDC40.

The DAIC-WOZ dataset comprises 189 clinical interviews designed to support the diagnosis of psychological distress conditions, including anxiety, depression, and post-traumatic stress disorder. It is divided into a training subset (107 interviews), a development subset (35 interviews), and a test subset (47 interviews), amounting to a total of 50 h of data. Most prior studies validate on the development subset, so for the sake of comparison our experiments use the training subset and the development (validation) subset. The collected data are multimodal, encompassing text, images, and speech; we use the speech recordings as our experimental data. Each interview recording has an average length of 15 min, and a consistent sampling rate of 16 kHz is maintained throughout the dataset.

The CMDC dataset is a Chinese-language clinical depression dataset built from confirmed cases, intended to support the screening and assessment of severe depression in China. It also consists of semi-structured interviews covering visual, auditory, and textual features. Unlike DAIC-WOZ, each CMDC interview follows twelve fixed, predetermined questions. The dataset contains 78 samples: 26 patients with severe depression and 52 healthy individuals. CMDC is smaller in scale than DAIC-WOZ, which further highlights the scarcity of depression data.

Evaluation metrics

Each participant is annotated with a PHQ-8 score and a dichotomous label. The PHQ-8 score indicates the degree of depression for each subject, while the dichotomous label signifies whether the subject is classified as depressed. The central aim of this paper is to predict whether a subject is depressed. Consequently, the evaluation metrics used in this study are precision (P), recall (R), F1 score, and area under the curve (AUC). For each metric, a higher value indicates better performance.
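As a rough illustration of how precision, recall, and F1 score follow from the dichotomous labels, the sketch below computes them for the positive (depressed) class. The function name `binary_metrics` and the toy labels are hypothetical and not taken from either dataset, and whether the reported scores are positive-class or averaged values is an assumption not settled by the text.

```python
def binary_metrics(y_true, y_pred):
    """Return (precision, recall, F1) for the positive class (depressed = 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    # Guard against empty denominators so the function never divides by zero.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy labels for illustration only: 3 depressed and 5 non-depressed subjects.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
precision, recall, f1 = binary_metrics(y_true, y_pred)
print(precision, recall, f1)
```

With two true positives, one false positive, and one false negative, all three values come out at two thirds in this toy case.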

Experimental settings

All experiments were executed on the Linux operating system, using an NVIDIA V100 GPU, and implemented in the PyTorch framework. To fine-tune the audio pre-training model, we employed a small learning rate of 1e-5, while a learning rate of 0.006 was used for the downstream tasks. Optimization used the Adam optimizer with a weight decay of 0.001. The batch size of the training subset was set to 32, and the maximum number of training epochs was 200. Training terminated automatically when the model's performance on the validation set showed no significant improvement over 10 consecutive epochs. Additionally, we employed the openSMILE tool to extract the IS09 emotion acoustic feature set. This set comprises 16 low-level descriptors (LLDs), such as Mel-frequency cepstral coefficients and zero-crossing rate, together with their first-order differences, giving 32 LLDs in total. Applying 12 statistical functions to these 32 descriptors yields a 384-dimensional (32 × 12) sentence-level feature representation, which we used for comparison with our fine-tuned features.
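The automatic termination rule described above can be sketched as a small patience counter. This is a minimal, framework-independent illustration rather than the authors' implementation: the class name `EarlyStopping`, the `min_delta` threshold standing in for a "significant" improvement, and the validation curve are all assumptions made for the example.

```python
class EarlyStopping:
    """Signal a stop when the score has not improved for `patience` epochs."""

    def __init__(self, patience=10, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta  # assumed margin for a "significant" gain
        self.best = float("-inf")
        self.bad_epochs = 0

    def step(self, score):
        """Record one validation score; return True when training should stop."""
        if score > self.best + self.min_delta:
            self.best = score
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


MAX_EPOCHS = 200  # epoch cap stated in the text

# Made-up validation curve: improves for 5 epochs, then stays flat.
fake_scores = [0.50, 0.55, 0.60, 0.62, 0.63] + [0.63] * 195

stopper = EarlyStopping(patience=10)
stopped_at = None
for epoch, score in enumerate(fake_scores[:MAX_EPOCHS], start=1):
    if stopper.step(score):
        stopped_at = epoch
        break
print(stopped_at)
```

In this toy run the best score is reached at epoch 5, so the tenth epoch without improvement is epoch 15, well before the 200-epoch cap.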

Comparison with other methods

In this section, we report experiments on the DAIC-WOZ and CMDC datasets, which cover two different languages, and compare the effects of different input features under the same model, to verify the robustness and effectiveness of our approach.

Performance evaluation on the DAIC-WOZ dataset

Table 1 compares our proposed method with recent speech-based approaches to depression detection on the DAIC-WOZ dataset. Our method achieves the best precision and F1 score, at 84.49% and 79.00%, respectively. It surpasses Chlasta et al.41, who generate additional training samples by cutting and randomly sampling audio files, and Rejaibi et al.29, who adopt a transfer learning strategy by pretraining on the RAVDESS database. Othmani et al.42 address sparse data through audio augmentation techniques, yet our model outperforms them significantly, with an F1 score that is on average 16.62% higher. We attribute this to our use of the more generalizable pretrained model wav2vec 2.0, which is trained on large-scale datasets and captures key features in speech data more accurately. Compared with Ravi et al.43, who use wav2vec 2.0 as a feature extractor and employ adversarial learning, our model achieves an F1 score that is 9.8% higher. This underscores the effectiveness of fine-tuned features in enhancing performance.

Table 1 A comparison of the proposed method with other methods for SDD on DAIC-WOZ dataset. Boldface highlights the highest score.

Compared with Du et al.21, who extract MFCC and LPC features and use a 1D-CNN and LSTM, our structurally similar model improves precision, recall, and F1 score by 17.79%, 10.29%, and 4.4%, respectively. The confusion matrices in Fig. 4 reveal a notable pattern: our model produces more true positives, whereas the comparator model produces more false positives. This suggests that our model is more discerning and distinguishes non-depressive states more effectively. This discriminative ability makes the model more reliable in practical applications and contributes to a higher early detection rate for patients. It also highlights the value of introducing wav2vec 2.0 features to address the low-resource challenge and of incorporating a self-attention mechanism into the LSTM model so that the model can ignore redundant information. Finally, although Zhou et al.26 achieve the highest recall of 83% by fusing various descriptors, BoAW, functional features, and spectrograms, their precision and F1 score fall below those of our model. Their segmentation approach sacrifices the temporal information of the dialogue, whereas our model retains richer long-term information, resulting in superior precision and F1 score.

Figure 4

Comparison of confusion matrices for depression detection: the present study (left) and Du et al. (right). ND represents non-depression and D represents depression.

Performance evaluation on the CMDC dataset

Table 2 presents the comparison results of our proposed method with recent speech-based depression detection methods on the CMDC dataset. Our method achieved the best performance in terms of precision and F1 score, reaching 94.83% and 90.53%, respectively. Compared to methods using acoustic prosodic features extracted from IS09, our precision increased by 12.51%, recall increased by 10.16%, and F1 score increased by 10.17%.

Table 2 A comparison of the proposed method with other methods for SDD on CMDC dataset.

Comparing the binary classification performance of different acoustic features

In this section, we further compare the binary classification performance of two different features in the same model. We evaluate overall performance using the receiver operating characteristic (ROC) curve and the area under the curve (AUC). The ROC curve shows the performance of the classifier at different decision thresholds; the closer the curve lies to the upper-left corner, the better the classifier. Figure 5 shows that the curve for the fine-tuned wav2vec features lies further to the left and has a higher AUC. Notably, on the CMDC dataset the model reaches the maximum AUC of 1 even though some samples are misclassified. An AUC of 1 means that the model ranks every positive instance ahead of every negative instance; misclassifications can still occur because predicted labels depend on one fixed decision threshold, whereas AUC summarizes ranking quality across all thresholds.
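To make the relationship between AUC and thresholded predictions concrete, the following sketch computes AUC as the fraction of correctly ordered (positive, negative) score pairs. The function `roc_auc`, the scores, and the 0.5 threshold are hypothetical illustrations, not values from the paper; they only show how a perfect ranking can coexist with errors at a fixed threshold.

```python
def roc_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Tied pairs count as half a correct ranking.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy scores: both positives outrank every negative, but sit below 0.5.
y_true = [1, 1, 0, 0, 0]
scores = [0.45, 0.40, 0.30, 0.20, 0.10]

auc = roc_auc(y_true, scores)
preds = [1 if s >= 0.5 else 0 for s in scores]  # assumed fixed threshold
errors = sum(1 for t, p in zip(y_true, preds) if t != p)
print(auc, errors)
```

Here the ranking is perfect, so the AUC is 1.0, yet both positive samples fall below the assumed threshold and are counted as errors.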

Figure 5

ROC curves were generated for various feature inputs using the same model. The left side represents the DAIC-WOZ dataset, while the right side corresponds to the CMDC dataset.

Comparison of different acoustic features

To assess how well different acoustic features separate the two classes, we conducted a clustering analysis of three feature sets: the is09_emotion feature set, features extracted by wav2vec 2.0, and features extracted by the fine-tuned wav2vec 2.0. The is09_emotion feature set offers abundant prosodic features, and its clustering results are shown in Fig. 6(a). The clustering is unsatisfactory, with blurred boundaries between clusters, indicating that these features do not divide the data into meaningful groups. The results for features extracted by the raw wav2vec 2.0 model are presented in Fig. 6(b). They improve somewhat on the is09_emotion feature set, but many points are still assigned to the wrong cluster. The fine-tuned wav2vec 2.0 features cluster markedly better, as shown in Fig. 6(c): the feature points form two tight groups with a distinct boundary between them. This indicates that the fine-tuned wav2vec 2.0 model has a stronger speech representation capability and effectively distinguishes individuals with depression from healthy controls.

Figure 6

Clustering results of is09_emotion (a), raw-wav2vec2.0 (b) and fine-tuned wav2vec2.0 features (c).

Ablation analysis

In this section, we perform a thorough validation of each module’s functionality through an ablation study of the model modules. The ablation experiments are conducted with a consistent setup, where configurations remain uniform, and variations are constrained to the modules under scrutiny.

Comparison of fine-tuning strategies on depression detection performance

In this section, we compare fine-tuned and non-fine-tuned models on speech-based depression detection through two experiments, A and B. Experiment A employs a pre-trained model without fine-tuning on depression speech data, whereas Experiment B fine-tunes the model on the depression speech dataset. This design assesses the effectiveness of domain-specific fine-tuning against the direct application of pre-trained models in the target domain. The experimental results are presented in Table 3.

Table 3 Comparison of fine-tuning strategies on depression detection performance.

First, experiments A and B show a noteworthy performance improvement of fine-tuned models over non-fine-tuned models. This aligns with expectations: fine-tuning captures depression-related speech features more effectively and thereby improves performance on SDD. Second, the large model outperforms the base model, likely because its larger parameter count allows more comprehensive learning of features in the target domain. This is consistent with the prevailing view in deep learning that larger models typically perform better on complex tasks. Third, the wav2vec 2.0 model fine-tuned on the IEMOCAP emotion analysis dataset, with its last layer used as feature input, achieves good precision but relatively lower recall and F1 score. This underscores the importance of fine-tuning on the depression detection task itself, so that the model adapts to the speech characteristics of the target domain. Finally, within the fine-tuning strategy, fine-tuning all layers outperforms fine-tuning only the last layer. This indicates that adjusting features at deeper levels captures depression-related information in speech more comprehensively, whereas fine-tuning only the last layer may not capture enough domain-specific features, which limits the performance gain.

Comparison with different pooling strategies

In addition to fine-tuning, we compared several pooling strategies. Figure 7 shows that attention pooling outperformed max pooling and average pooling in F1 score by 4.69% and 2.26%, respectively. Average pooling has proven effective in summarizing the entire speech segment, and max pooling highlights the most prominent features within the segment, yet attention pooling performed best. Unlike average pooling, which weights every frame equally, attention pooling lets the model focus on the important frames within a speech segment, which contributes to higher accuracy. For depression detection, weighting frames by their relevance therefore appears to use the information in a speech segment more fully.
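The three pooling strategies can be contrasted with a small sketch that reduces a sequence of frame vectors to one segment-level vector. The exact form of the attention pooling layer is not given in this section, so the version below (a softmax over the dot product of each frame with a scoring vector `w`) is an assumed, commonly used formulation; in a trained model `w` would be learned, whereas here it is fixed by hand.

```python
import math


def mean_pool(frames):
    """Average each feature dimension over all frames."""
    n = len(frames)
    return [sum(col) / n for col in zip(*frames)]


def max_pool(frames):
    """Keep the largest value of each feature dimension."""
    return [max(col) for col in zip(*frames)]


def attention_pool(frames, w):
    """Weight each frame by softmax(w . frame), then sum the weighted frames."""
    scores = [sum(wi * xi for wi, xi in zip(w, f)) for f in frames]
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    alphas = [e / total for e in exps]
    dim = len(frames[0])
    pooled = [sum(a * f[d] for a, f in zip(alphas, frames)) for d in range(dim)]
    return pooled, alphas


# Three toy frames with two features each (illustrative values only).
frames = [[0.0, 1.0], [2.0, 1.0], [4.0, 1.0]]
w = [1.0, 0.0]  # hypothetical scoring vector; learned in a real model

pooled, alphas = attention_pool(frames, w)
print(mean_pool(frames), max_pool(frames), pooled, alphas)
```

Because the attention weights sum to one, the pooled vector is a weighted average that leans toward the highest-scoring frame, sitting between the mean-pooled and max-pooled results in this toy case.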

Figure 7

The impact of different pooling methods on model performance.

Comparison with and without self-attention mechanism

To evaluate the effectiveness of the self-attention module in selecting valid information in speech segments, we conducted an ablation study by excluding this module from our proposed method. Specifically, in the absence of the self-attention module, we utilized the output of the last step of the LSTM model and connected it to a fully connected classification layer to obtain the classification result. Table 4 illustrates that the model integrated with self-attention surpasses the performance of the model lacking self-attention. This outcome suggests that, in the task of speech-based depression detection, emotional expression may concentrate in specific speech segments, and the self-attention mechanism proves more effective in capturing these crucial pieces of information.

Table 4 Comparison between the model with and without self-attention mechanism.

Comparison of different audio lengths

We analyzed audio segments of 4 to 9 s to explore the influence of segment length on model performance; this range is commonly used for segmentation in the current literature. For each segment length, we applied the aforementioned fine-tuning method and the optimal model structure for validation. Each experiment was repeated 5 times, and the results were averaged. The experimental results are depicted in Fig. 8. Model performance rises with segment duration up to 7 s and reaches a plateau at around 7 to 8 s. This suggests that shorter speech segments may disrupt the continuity of emotions, while excessively long segments may leave too few samples. Considering the impact of audio length on the overall sample size and on computational efficiency, we selected 7 s as the optimal duration.
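The trade-off between segment length and sample count can be seen in a short sketch that cuts a waveform into fixed-length pieces at 16 kHz. The helper `split_into_segments` is hypothetical, and its choices (non-overlapping segments, dropping the incomplete tail) are assumptions, since the segmentation details are not specified in this section; the 60 s silent recording is a placeholder.

```python
SAMPLE_RATE = 16_000  # Hz, the sampling rate stated for DAIC-WOZ


def split_into_segments(samples, segment_seconds, sample_rate=SAMPLE_RATE):
    """Cut a waveform into non-overlapping fixed-length segments.

    The trailing remainder shorter than one segment is dropped (an assumption).
    """
    seg_len = int(segment_seconds * sample_rate)
    if seg_len <= 0:
        raise ValueError("segment length must be positive")
    n_full = len(samples) // seg_len
    return [samples[i * seg_len:(i + 1) * seg_len] for i in range(n_full)]


# Placeholder recording: 60 s of silence instead of real interview audio.
recording = [0.0] * (60 * SAMPLE_RATE)

# Number of segments obtained for each length in the 4 to 9 s range.
counts = {sec: len(split_into_segments(recording, sec)) for sec in range(4, 10)}
print(counts)
```

Under these assumptions a 7 s segment holds 112,000 samples, and longer segments yield fewer training samples from the same recording, which is the sample-size effect noted above.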

Figure 8

Comparison of performance across different segment lengths.

Discussion and limitation of our work

In this study, we conducted an extensive exploration of the potential application of the audio pre-training model wav2vec 2.0 in the context of SDD. Through comparisons with traditional methods, we validated that the wav2vec model, after transfer learning on tasks with limited speech data, significantly outperforms traditional acoustic feature representations, demonstrating advanced feature representation. This underscores the feasibility of employing speech-based depression detection in low-resource scenarios. Moreover, our implementation of ablation experiments unveiled a critical insight: not all depressed patients exhibit obvious depressive characteristics in their speech, emphasizing the necessity of extracting key information from dialogues. Concurrently, we observed that traditional feature representations often overlook the temporal relationships between frames. To address this, we introduced an attention pooling structure, which, in comparison to traditional statistical functions, more effectively captures the temporal relationships between frames, yielding more expressive sentence-level vector representations for downstream tasks.

Despite these advancements, our work is not without limitations. Firstly, the integration of multiple acoustic features remains an area for improvement. While our study generates depression acoustic features based on wav2vec through transfer learning, the potential benefits of effectively fusing various acoustic features to enhance model performance and robustness cannot be overlooked. Secondly, the real-time aspect of depression detection systems requires addressing. With the prevalence of smart devices and the Internet of Things, future research should prioritize the advancement of real-time speech analysis systems for immediate and personalized depression risk assessment. The key challenge lies in achieving the real-time deployment of complex machine learning technologies45, such as large pre-trained models like wav2vec 2.0. We must explore embedding these large models into real-time analysis solutions and ensure their effectiveness in real-time environments through adaptive data transformations. Solving this issue is crucial for the practical application of depression detection technology in real-world scenarios.

Conclusion and future work

In the realm of speech-based depression detection, this study has yielded significant results through thorough research and optimization of the wav2vec 2.0 model. The comparison between fine-tuned and non-fine-tuned models revealed that fine-tuned models excel in capturing speech features related to depression, consequently enhancing detection performance. Particularly noteworthy is the finding that, within the fine-tuning strategy, fine-tuning all layers surpasses the performance of fine-tuning only the last layer, underscoring the importance of adjusting features at a deeper level to adapt to the task. Regarding model structure, our exploration of different pooling strategies indicated that attention pooling achieves a higher F1 score compared to max pooling and average pooling. The incorporation of attention mechanisms proved instrumental in enhancing model accuracy. Furthermore, the ablation study confirmed the efficiency of the self-attention module in capturing key information within speech segments. This study not only provides guidance for the task of SDD but also imparts valuable experience and insights for employing deep learning in the domain of speech emotion analysis. Our work has not only achieved superior performance in acoustic feature extraction but has also presented an effective approach to address the issue of data sparsity.

Future endeavors will delve into exploring more effective feature extraction methods and strive to integrate multiple acoustic features efficiently, thereby further improving the accuracy and robustness of speech-based depression detection. Additionally, efforts will be directed towards overcoming the challenge of real-time implementation by investigating approaches such as lightweight models or employing model pruning techniques. Finally, because of the high temporal resolution, non-invasiveness, and harmlessness of electroencephalography (EEG)46, we plan to incorporate EEG signals into our considerations and conduct comprehensive analysis in combination with acoustic features. This approach is expected to lead to a more comprehensive and accurate depression detection method, which will provide strong support for early diagnosis, treatment, and intervention of depression, and thereby improve patients’ medical experience and quality of life.


