Trade-offs between machine learning and deep learning for mental illness detection on social media

This section outlines the methodological framework of our study, covering data collection, preprocessing, model construction, and evaluation metrics. All experiments were conducted using Python 3. We leveraged key libraries such as pandas for data processing, scikit-learn and lightgbm for ML, PyTorch for DL, and Transformers for pre-trained language models. These tools facilitated efficient data handling, systematic hyperparameter tuning, and rigorous performance evaluation. All models were trained on Google Colab, utilizing a high-RAM configuration powered by an NVIDIA T4 GPU, which provided the computational capacity required for training, especially for the DL models. Complete code, including preprocessing scripts, dataset splits, hyperparameter settings, and reproducibility instructions (e.g., Colab notebooks), is available on GitHub. The following sections detail each stage of our approach.

Data preparation

An extensive and varied dataset is fundamental for effective mental health detection via ML. We employed the ‘Sentiment Analysis for Mental Health’ dataset available on Kaggle, chosen for its comprehensive coverage of mental health conditions including depression, anxiety, stress, bipolar disorder, personality disorders, and suicidal ideation. This dataset was selected because it aggregates data from multiple social media platforms (e.g., Reddit, Twitter, and Facebook), thereby capturing a wide range of linguistic styles and demographic variations that closely reflect real-world scenarios. Data were primarily obtained from these platforms where individuals discuss personal experiences and mental health challenges. The data acquisition process involved using platform-specific APIs and web scraping, followed by removing duplicates, filtering out spam or irrelevant content, and standardizing mental health labels. Personal identifiers were also removed to adhere to ethical standards, resulting in a well-structured CSV file with unique identifiers for each entry.

The dataset was compiled and cleaned by the original authors on Kaggle, who aggregated data from multiple sources, removed duplicates and personal identifiers, and standardized mental health labels. After acquiring the dataset, we applied an automated preprocessing pipeline using Python to further clean the text by removing HTML tags, URLs, and special characters, converting text to lowercase, and lemmatizing tokens. These steps were fully automated using established NLP tools (e.g., NLTK), with no manual relabeling or filtering applied.

Despite its diversity, the dataset presents challenges for natural language processing due to its varying demographics and language styles (e.g., slang and colloquialisms), which our preprocessing pipeline was specifically designed to address. These preprocessing steps, including normalization, lemmatization, and stopword removal, help reduce lexical variation and standardize informal and colloquial language across user-generated text. However, we acknowledge that such preprocessing cannot fully address deeper linguistic nuances such as sarcasm, irony, or contextually implied meaning, which remain challenging for both traditional and deep learning models. Overall, the dataset’s extensive coverage and inherent real-world diversity not only present a rigorous challenge for both ML and DL methods but also make it an ideal benchmark for systematically comparing these approaches. The variability in language and demographic factors allows us to assess the strengths and limitations of explicit feature engineering in ML as well as the hierarchical representation capabilities of DL, thereby enhancing the robustness and generalizability of our performance evaluations.

We applied a consistent preprocessing pipeline to prepare the dataset for both ML and DL models. Initially, we cleaned the text by removing extraneous elements such as URLs, HTML tags, mentions, hashtags, special characters, and extra whitespace. The text was then converted to lowercase to maintain consistency. Next, we removed common stopwords using the NLTK stopword list¹⁹ to eliminate non-informative words. Finally, lemmatization was used to reduce words to their base forms, ensuring that different forms of a word are treated uniformly. The processed dataset was randomly split into training, validation, and test sets, with 20% allocated for testing. The remaining data were further divided into training (75%) and validation (25%) sets, an overall 60/20/20 split, with the validation set used for model tuning.
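The cleaning and splitting steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the study's actual script: the small `STOPWORDS` set is a stand-in for the NLTK stopword list, lemmatization is omitted to keep the example dependency-free, "special characters" is interpreted as any non-letter character, and the function names and the seed value are hypothetical.

```python
import random
import re

# Illustrative stand-in for the NLTK English stopword list (assumption).
STOPWORDS = {"a", "an", "and", "the", "is", "am", "i", "to", "of", "at", "so"}


def clean_text(text: str) -> str:
    """Remove markup and noise, lowercase the text, and drop stopwords."""
    text = re.sub(r"<[^>]+>", " ", text)                # HTML tags
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # URLs
    text = re.sub(r"[@#]\w+", " ", text)                # mentions and hashtags
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)               # non-letter characters
    tokens = [t for t in text.split() if t not in STOPWORDS]
    return " ".join(tokens)                             # also collapses whitespace


def three_way_split(items, test_frac=0.20, val_frac=0.25, seed=42):
    """Hold out test_frac for testing, then val_frac of the rest for validation."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seed value is a hypothetical choice
    n_test = round(len(shuffled) * test_frac)
    test, rest = shuffled[:n_test], shuffled[n_test:]
    n_val = round(len(rest) * val_frac)
    val, train = rest[:n_val], rest[n_val:]
    return train, val, test


cleaned = clean_text("I am SO tired <br> of this!! @friend #help https://example.com")
train, val, test = three_way_split(range(100))
print(cleaned)
print(len(train), len(val), len(test))
```

Taking 25% of the remaining 80% for validation is what produces the 60/20/20 proportions overall.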

For classification, the dataset labels were structured in two distinct ways. In the multi-class scenario, the original labels in the Kaggle dataset were directly used, consisting of six categories: Normal, Depression, Suicidal, Anxiety, Stress, and Personality Disorder. For binary classification, all non-Normal categories were grouped under a single ‘Abnormal’ label.
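As a small illustration of the binary relabeling, the mapping could look like the sketch below. The function name and the case-insensitive comparison are assumptions made for the example; the original label strings in the dataset may be formatted differently.

```python
def to_binary_label(status: str) -> str:
    """Collapse every non-Normal status into a single 'Abnormal' class."""
    return "Normal" if status.strip().lower() == "normal" else "Abnormal"


multi_class = ["Normal", "Depression", "Suicidal", "Anxiety", "Normal"]
binary = [to_binary_label(s) for s in multi_class]
print(binary)
```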

In natural language processing, feature extraction depends on the model type. ML models require structured numerical representations, while DL models can process raw text sequences or dense vector embeddings.

For ML models, text is commonly converted into numerical features using techniques such as the bag-of-words (BoW) model²⁰, which represents documents as token count vectors but treats all words equally. To address this limitation, Term Frequency-Inverse Document Frequency (TF-IDF)²¹ enhances BoW by weighting words based on their importance, emphasizing informative terms while downplaying common ones. In this study, we employed TF-IDF vectorization to extract numerical features, incorporating unigrams and bigrams and limiting the feature space to 1,000 features to optimize computational efficiency and mitigate overfitting.
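A configuration matching this description can be sketched with scikit-learn's `TfidfVectorizer`. The toy corpus is invented for illustration, and fitting the vectorizer on the training texts only is an assumption about good practice rather than a detail stated above; all other vectorizer settings are left at their defaults.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus standing in for the preprocessed training posts (assumption).
train_texts = [
    "feel tired hopeless every day",
    "great day friend park",
    "cannot sleep worry every night",
    "enjoy quiet evening book",
]

# Unigrams and bigrams, capped at the 1,000 most frequent features.
vectorizer = TfidfVectorizer(ngram_range=(1, 2), max_features=1000)
X_train = vectorizer.fit_transform(train_texts)     # learn vocabulary and IDF weights
X_new = vectorizer.transform(["worry every day"])   # reuse the fitted vocabulary

print(X_train.shape, X_new.shape)
print("every day" in vectorizer.vocabulary_)
```

The result is a sparse matrix with one row per document and one column per retained unigram or bigram.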

Model development

A variety of ML and DL models were developed to analyze and classify mental health statuses based on textual input. Each model was selected to capture different aspects of the data, ranging from simple linear classifiers to complex non-linear relationships. For ML models, we used TF-IDF-based feature engineering, which is commonly employed in text classification tasks for its interpretability and computational efficiency. For DL models, we adopted raw text inputs with embedded representations, allowing these architectures to learn contextual features directly from the data. We did not incorporate hybrid approaches (e.g., using pre-trained embeddings with ML classifiers) because our primary aim was to benchmark standard, widely adopted ML and DL pipelines without introducing additional architectural complexity. This decision allowed us to maintain a clean and interpretable comparison between the two modeling paradigms.

Within the DL category, we selected ALBERT and GRU to represent two distinct neural architectures. ALBERT, a lightweight and efficient variant of BERT, was chosen for its strong performance and lower computational cost, making it well-suited for our Colab-based experimental environment. GRU was selected over LSTM due to its simpler gating mechanism and faster training time, while still being effective at capturing sequential dependencies in text. Although alternative models such as standard BERT or transformer-based models like T5 offer powerful capabilities, our selected models reflect a practical trade-off between performance, interpretability, and resource efficiency within the context of this comparative study. The following subsections outline the methodology of each model and its performance in binary and multiclass classification.

Logistic regression

Logistic regression is a fundamental classification technique widely used in social science and biomedical research²². It models the probability of a categorical outcome based on a weighted linear combination of input features. Despite its simplicity, logistic regression is still effective when applied to high-dimensional data, such as term frequency-based representations in natural language processing.

In this study, logistic regression served as an interpretable model that integrated various predictors (e.g., term frequencies) to estimate the probability of different mental health outcomes. The binary model predicts the likelihood of a positive case, while the multi-class extension accommodates multiple categories.

To prevent overfitting, model parameters were optimized using cross-entropy loss with regularization. A grid search was employed to fine-tune hyperparameters, including regularization strength, solver selection, and class weights, with the weighted F1 score guiding the selection process. The logistic regression models were implemented using the LogisticRegression class from scikit-learn.
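The tuning procedure can be sketched as below. The grid values, the synthetic data from `make_classification`, and the use of 3-fold cross-validation are all assumptions for illustration; only the tuned hyperparameter names and the weighted F1 criterion come from the description above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic imbalanced data standing in for the TF-IDF features (assumption).
X, y = make_classification(
    n_samples=300, n_features=20, n_informative=5,
    weights=[0.8, 0.2], random_state=0,
)

# Hypothetical grid over the three hyperparameters named in the text.
param_grid = {
    "C": [0.1, 1.0, 10.0],               # inverse regularization strength
    "solver": ["liblinear", "lbfgs"],
    "class_weight": [None, "balanced"],
}

search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid,
    scoring="f1_weighted",  # weighted F1 guides model selection
    cv=3,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```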

Support vector machine (SVM)

Support Vector Machines (SVMs) are effective classifiers that identify an optimal decision boundary (hyperplane) to maximize the margin between classes²³. Unlike probabilistic models such as logistic regression, SVMs utilize kernel functions to map input data into higher-dimensional spaces, allowing them to model both linear and non-linear relationships. Due to the high-dimensional and sparse nature of text-based features, we evaluated both linear SVMs and non-linear SVMs with a radial basis function (RBF) kernel. Model selection was based on the weighted F1 score. Hyperparameter optimization was conducted via grid search, including regularization strength, class weighting, and \(\gamma\) for RBF kernels.

The final models were implemented using the SVC class from scikit-learn. For multi-class classification, we used the One-vs-One (OvO) strategy, which is the default in SVC: it constructs a binary classifier for each pair of classes and determines the final label by majority voting.
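To make the pairwise construction concrete, the sketch below fits an RBF-kernel `SVC` on synthetic four-class data and inspects the pairwise decision scores. The data and hyperparameter values are assumptions. Note that `SVC` always trains one-vs-one internally; setting `decision_function_shape="ovo"` only exposes the pairwise scores instead of the aggregated default output.

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic four-class data standing in for the TF-IDF features (assumption).
X, y = make_classification(
    n_samples=200, n_features=10, n_informative=6,
    n_classes=4, n_clusters_per_class=1, random_state=0,
)

clf = SVC(
    kernel="rbf", C=1.0, gamma="scale",   # hypothetical hyperparameter values
    class_weight="balanced",
    decision_function_shape="ovo",        # return one score per class pair
)
clf.fit(X, y)

n_classes = len(clf.classes_)
pairwise_scores = clf.decision_function(X[:1])
print(n_classes, pairwise_scores.shape)
```

With k classes there are k(k-1)/2 pairwise classifiers, so four classes yield six score columns.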

Tree-based models

Classification and Regression Trees (CART) are widely used for categorical outcome prediction in classification tasks. The algorithm constructs a binary decision tree by recursively partitioning the dataset based on predictor variables, selecting splits that optimize a predefined criterion. Common impurity measures, such as Gini impurity and entropy, assess split quality, with lower values indicating greater homogeneity within a node²⁴. The tree expands iteratively until stopping conditions, such as a minimum node size, maximum depth, or impurity reduction threshold, are met.
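The two impurity measures can be written out directly. The sketch below is a plain-Python illustration with hypothetical function names; it scores a candidate split by the size-weighted impurity of its two child nodes, which is the quantity a tree learner tries to minimize.

```python
import math
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def entropy(labels):
    """Shannon entropy in bits: sum of p * log2(1 / p) over classes."""
    n = len(labels)
    return sum((c / n) * math.log2(n / c) for c in Counter(labels).values())


def weighted_impurity(left, right, measure):
    """Impurity of a split, weighting each child node by its share of samples."""
    n = len(left) + len(right)
    return len(left) / n * measure(left) + len(right) / n * measure(right)


mixed = ["Normal", "Normal", "Depression", "Depression"]
print(gini(mixed), entropy(mixed))
print(weighted_impurity(mixed[:2], mixed[2:], gini))
```

A perfectly mixed two-class node has the maximum impurity for two classes, while a split that separates the classes completely drives the weighted impurity to zero.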

To prevent overfitting, pruning techniques²⁵ reduce tree complexity by removing splits with minimal predictive value, enhancing generalizability. However, standalone CART models often overfit, making them less suitable for complex classification tasks. Instead, this study employed ensemble methods, such as Random Forests and Gradient Boosted Trees, to improve robustness and predictive performance.

Random Forests

Random Forests aggregate multiple decision trees to enhance classification performance. Each tree is trained on a bootstrap sample, ensuring diversity, while a random subset of features is considered at each split to reduce correlation and improve generalization²⁶. Unlike individual trees, Random Forests do not require pruning, with complexity managed through hyperparameters such as the number of trees, tree depth, and minimum sample requirements.

Hyperparameter tuning via grid search optimized the number of estimators, tree depth, and minimum split criteria, using the weighted F1 score as the primary evaluation metric to address class imbalance. The best-performing binary classification model effectively distinguished between Normal and Abnormal mental health statuses. For multi-class classification, the same hyperparameter grid was used with a refined search scope for efficiency, ensuring balanced classification performance across mental health categories.

Beyond predictive accuracy, feature importance analysis provided insights into key variables influencing classification decisions, enhancing model interpretability. Random Forest models were implemented using RandomForestClassifier from scikit-learn, with hyperparameter tuning via grid search on the validation set.
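Feature importance ranking of this kind can be sketched as follows. The synthetic data, the placeholder feature names, and the hyperparameter values are assumptions; in the study the columns would correspond to TF-IDF terms.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data standing in for the TF-IDF matrix (assumption).
X, y = make_classification(
    n_samples=300, n_features=12, n_informative=4, random_state=0,
)
feature_names = [f"term_{i}" for i in range(X.shape[1])]  # placeholder names

forest = RandomForestClassifier(
    n_estimators=100, max_depth=10, min_samples_split=2,  # hypothetical values
    random_state=0,
)
forest.fit(X, y)

# Impurity-based importances, one per feature, ranked from high to low.
ranked = sorted(
    zip(feature_names, forest.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, score in ranked[:5]:
    print(f"{name}: {score:.3f}")
```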

Light Gradient Boosting Machine (LightGBM)

LightGBM is an optimized gradient-boosting framework designed for efficiency and scalability, particularly in high-dimensional datasets. Unlike traditional Gradient Boosting Machines (GBMs), which sequentially refine predictions by correcting errors from prior models, LightGBM employs a leaf-wise tree growth strategy, enabling deeper splits in dense regions for improved performance²⁷. Additionally, histogram-based feature binning reduces memory usage and accelerates training, making LightGBM faster and more resource-efficient than standard GBMs²⁸.

Grid search was used to optimize hyperparameters, including the number of boosting iterations, learning rate, tree depth, number of leaves, and minimum child samples. To address class imbalance, the class weighting parameter was tested with both ‘balanced’ and ‘None’ options. Model selection was guided by the weighted F1 score, ensuring balanced classification performance.
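For reference, the ‘balanced’ option weights each class inversely to its frequency, using n_samples / (n_classes × class_count). The plain-Python sketch below reproduces that formula on an invented label list; the function name is hypothetical.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


# Invented label distribution for illustration only.
labels = ["Normal"] * 6 + ["Depression"] * 3 + ["Anxiety"] * 1
weights = balanced_class_weights(labels)
for cls, weight in weights.items():
    print(f"{cls}: {weight:.3f}")
```

Rare classes receive larger weights, so their misclassifications contribute more to the training loss; with ‘None’, every class is weighted equally.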

For binary classification, LightGBM effectively distinguished between Normal and Abnormal statuses. For multi-class classification, it predicted categories including Normal, Depression, Anxiety, and Personality Disorder. Evaluation metrics included precision, recall, F1 scores, confusion matrices, and one-vs-rest ROC curves. LightGBM’s built-in feature importance analysis further enhanced interpretability by identifying key predictors. The models were implemented using LGBMClassifier from the lightgbm library, with hyperparameter tuning via grid search on the validation set.

A lite version of bidirectional encoder representations from transformers (ALBERT)

ALBERT²⁹ is an optimized variant of BERT³⁰ designed to enhance computational efficiency while preserving strong NLP performance. It achieves this by employing parameter sharing across layers and factorized embedding parameterization, significantly reducing the total number of model parameters. Additionally, ALBERT introduces Sentence Order Prediction (SOP) as an auxiliary pretraining task to improve sentence-level coherence. These architectural refinements make ALBERT a computationally efficient alternative to BERT, particularly well-suited for large-scale text classification applications such as mental health assessment.

In this study, ALBERT was fine-tuned for both binary and multi-class classification. The binary model was trained to differentiate between Normal and Abnormal mental health statuses, while the multi-class model classified inputs into categories such as Normal, Depression, Anxiety, and Personality Disorder. The pretrained albert-base-v2 model was utilized, and hyperparameter optimization was conducted using random search over 10 iterations, tuning learning rates, dropout rates, and training epochs. Model performance was evaluated using the weighted F1 score as the primary metric. For the multi-class task, the classification objective was adjusted to predict seven categories, with weighted cross-entropy loss applied to address class imbalances.

ALBERT’s architecture effectively captures long-range dependencies in text while offering substantial computational advantages. Performance optimization was conducted using random hyperparameter tuning within the Hugging Face Transformers framework, leveraging AlbertTokenizer and AlbertForSequenceClassification for implementation.

Gated recurrent units (GRUs)

Gated Recurrent Units (GRUs) are a variant of recurrent neural networks (RNNs) designed to model sequential dependencies, making them well-suited for natural language processing tasks such as text classification³¹. Compared to Long Short-Term Memory networks (LSTMs), GRUs provide greater computational efficiency by simplifying the gating mechanism. Specifically, they merge the forget and input gates into a single update gate, reducing the number of parameters while effectively capturing long-range dependencies.

In this study, GRUs were employed for both binary and multi-class mental health classification. The binary model differentiated between Normal and Abnormal mental health statuses, while the multi-class model predicted categories such as Normal, Depression, Anxiety, and Personality Disorder.

The GRU architecture consisted of three primary components:

Embedding Layer: Maps token indices to dense vector representations of a fixed size.
GRU Layer: Processes sequential inputs, preserving contextual dependencies, with the final hidden state serving as the input to the classifier.
Fully Connected Layer: Transforms the hidden state into output logits corresponding to the classification categories.

To mitigate overfitting, dropout regularization was applied, and weighted cross-entropy loss was used to address class imbalance.

Hyperparameter tuning was conducted via random search, optimizing key parameters such as embedding dimensions, hidden dimensions, learning rates, and training epochs. The weighted F1 score was used for model selection, ensuring robust performance on both validation and test data.
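Random search of this kind can be sketched in plain Python. The search space values, the number of iterations, the seed, and the function names are all hypothetical, and `score_fn` is a placeholder for "train a model with this configuration and return its validation weighted F1".

```python
import random

# Hypothetical search space over the hyperparameters named in the text.
search_space = {
    "embedding_dim": [64, 128, 256],
    "hidden_dim": [64, 128, 256],
    "learning_rate": [1e-4, 5e-4, 1e-3],
    "epochs": [3, 5, 10],
}


def sample_configs(space, n_iter, seed=0):
    """Draw n_iter configurations uniformly at random from the space."""
    rng = random.Random(seed)
    return [
        {name: rng.choice(values) for name, values in space.items()}
        for _ in range(n_iter)
    ]


def random_search(space, score_fn, n_iter=10, seed=0):
    """Return the sampled configuration with the highest score."""
    return max(sample_configs(space, n_iter, seed), key=score_fn)


def dummy_score(config):
    """Stand-in for a real training run; prefers hidden_dim close to 128."""
    return -abs(config["hidden_dim"] - 128)


candidates = sample_configs(search_space, n_iter=10)
best = random_search(search_space, dummy_score, n_iter=10)
print(best)
```

Unlike grid search, the cost is fixed by the number of sampled configurations rather than by the size of the full grid, which matters when each evaluation is a full training run.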

Overall, GRUs effectively captured sequential patterns in text, enabling the extraction of linguistic features relevant to mental health classification. While less interpretable than tree-based models, their efficiency and ability to model long-range dependencies make them well-suited for text classification. The models were implemented using PyTorch’s torch.nn module, incorporating nn.Embedding, nn.GRU, and nn.Linear layers. Optimization was performed using torch.optim.Adam, with class imbalances handled through class weights passed to nn.CrossEntropyLoss.
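The three-component architecture and training setup can be sketched in PyTorch as below. Layer sizes, the dropout rate, the class weights, and the random input batch are assumptions for illustration. The sketch also takes the final hidden state without packing padded sequences, which is a simplification.

```python
import torch
import torch.nn as nn


class GRUClassifier(nn.Module):
    """Embedding -> GRU -> dropout -> linear classifier."""

    def __init__(self, vocab_size, embedding_dim, hidden_dim, num_classes,
                 dropout=0.3, pad_idx=0):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=pad_idx)
        self.gru = nn.GRU(embedding_dim, hidden_dim, batch_first=True)
        self.dropout = nn.Dropout(dropout)
        self.fc = nn.Linear(hidden_dim, num_classes)

    def forward(self, token_ids):
        embedded = self.embedding(token_ids)      # (batch, seq_len, embedding_dim)
        _, hidden = self.gru(embedded)            # hidden: (1, batch, hidden_dim)
        return self.fc(self.dropout(hidden[-1]))  # logits: (batch, num_classes)


torch.manual_seed(0)
model = GRUClassifier(vocab_size=1000, embedding_dim=32, hidden_dim=64, num_classes=2)

# Hypothetical class weights; the minority class gets the larger weight.
criterion = nn.CrossEntropyLoss(weight=torch.tensor([0.6, 3.0]))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# One training step on a random batch of 4 sequences with 12 token ids each.
batch = torch.randint(1, 1000, (4, 12))
targets = torch.tensor([0, 1, 0, 0])
logits = model(batch)
loss = criterion(logits, targets)
optimizer.zero_grad()
loss.backward()
optimizer.step()
print(tuple(logits.shape), round(loss.item(), 4))
```

For the multi-class task, only `num_classes` and the length of the weight tensor would change.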

Evaluation metrics

Classifying mental health conditions, such as depression or suicidal ideation, often involves imbalanced class distributions, where the ‘positive’ class (e.g., individuals experiencing a mental health condition) is significantly underrepresented compared to the ‘negative’ class (e.g., no reported issues). In such cases, traditional metrics like accuracy can be misleading, as a model predicting only the majority class may still achieve high accuracy despite failing to detect minority-class cases. To provide a more comprehensive assessment of classification performance, the following evaluation metrics were used:

Recall (Sensitivity): Captures the proportion of actual positive cases correctly identified. High recall is crucial in mental health detection to minimize false negatives and ensure individuals in need receive appropriate intervention³². However, excessive focus on recall may increase false positives, leading to potential misclassifications.
Precision: Measures the proportion of predicted positive cases that are actually positive. High precision is critical in mental health classification, as false positives can lead to unnecessary concern, stigma, and unwarranted interventions³². However, optimizing for precision alone may cause the model to miss true positive cases, limiting its usefulness.
F1 Score: Represents the harmonic mean of precision and recall, offering a balanced performance measure³³. This metric is particularly useful for imbalanced datasets, ensuring that neither precision nor recall is disproportionately optimized at the expense of the other.
AUC: Assesses the model’s ability to distinguish between positive and negative cases across various classification thresholds. Although AUC provides an overall measure of discrimination performance, it may be less informative in severely imbalanced datasets, where the majority class dominates³⁴.
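The definitions of precision, recall, and F1, together with the support-weighted F1 used for model selection, can be sketched in plain Python. The function names and the toy labels are illustrative; the example scores a baseline that always predicts the majority class.

```python
from collections import Counter


def per_class_scores(y_true, y_pred, cls):
    """Precision, recall, and F1 for one class treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == cls and p == cls for t, p in pairs)
    fp = sum(t != cls and p == cls for t, p in pairs)
    fn = sum(t == cls and p != cls for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def weighted_f1(y_true, y_pred):
    """Average of per-class F1 scores, weighted by each class's true count."""
    support = Counter(y_true)
    total = sum(per_class_scores(y_true, y_pred, c)[2] * n for c, n in support.items())
    return total / len(y_true)


# Majority-class baseline on an imbalanced toy sample.
y_true = ["Normal"] * 8 + ["Abnormal"] * 2
y_pred = ["Normal"] * 10
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

print("accuracy:", accuracy)
print("Abnormal (precision, recall, F1):", per_class_scores(y_true, y_pred, "Abnormal"))
print("weighted F1:", round(weighted_f1(y_true, y_pred), 4))
```

The baseline reaches 80% accuracy while its recall on the Abnormal class is zero, which is the failure mode that motivates reporting recall, precision, and F1 alongside accuracy.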