Detection of Parkinson disease using multiclass machine learning approach



Material

This study leverages a publicly accessible dataset housed in the University of Oxford (UO) repository, created in collaboration with the National Centre for Voice and Speech. Originally designed for general voice disorder research, the dataset encompasses voice recordings from 31 individuals: 23 diagnosed with Parkinson’s Disease (PD) and 8 active controls (AC). Within the PD group, 16 are male and 7 are female, while the AC group consists of 3 males and 5 females. The dataset comprises 195 voice recordings, each captured for 36 s in a sound-treated booth using a calibrated microphone positioned 8 cm from the individual’s mouth. Each recording is characterized by 24 biomedical voice measurements [33]. An average of six recordings were made per participant, with 22 individuals providing six recordings and nine providing seven. The age of the PD participants ranged from 46 to 85 years (mean: 65.8, standard deviation: 9.8), with time since diagnosis ranging from 0 to 28 years. For clarity, the “status” column in the dataset labels individuals with PD as “1” and healthy controls as “0”, which facilitates the analysis and differentiation between the two groups (https://archive.ics.uci.edu/dataset/174/parkinsons). Table 1 provides a detailed description of the voice measures in the UCI dataset.

Table 1. Detailed description of the voice measures in the UCI dataset, from [30].
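As a quick orientation to the data, the following lines are a minimal sketch of how the recordings can be inspected with Pandas. The file name parkinsons.data refers to the CSV distributed at the UCI URL above and is an assumption of this sketch rather than part of the study's description.

import pandas as pd

# Load the UCI Parkinsons voice dataset (assumed to be the "parkinsons.data" CSV
# available at the UCI repository URL above).
df = pd.read_csv("parkinsons.data")

# "status" encodes the diagnosis: 1 = Parkinson's disease, 0 = healthy control.
print(df.shape)                      # (number of recordings, number of columns)
print(df["status"].value_counts())   # recordings per class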

Methods

The proposed ensemble method for classifying Parkinson’s Disease (PD) integrates various machine and deep learning models to enhance classification accuracy and robustness. Initially, meticulous data pre-processing ensures data quality and consistency, followed by the careful selection of relevant features to optimize model performance and mitigate overfitting. Addressing potential class imbalances, the Synthetic Minority Over-sampling Technique (SMOTE) strategically augments the minority class, ensuring a balanced dataset distribution. Subsequently, the RandomizedSearchCV algorithm systematically optimizes the hyperparameters of selected models, including K-Nearest Neighbor (KNN), Support Vector Machine (SVM), and Feed-forward Neural Network (FNN), maximizing predictive power. Evaluation metrics such as accuracy, precision, recall, and F1-score are rigorously employed to assess individual model performance. Through ensemble model construction, leveraging predictions from multiple models, the ensemble method capitalizes on the strengths of each constituent model to improve overall PD classification accuracy, offering a promising avenue for more effective patient diagnosis and treatment. Figure 1 depicts the proposed model architecture.

Figure 1. Proposed model architecture.

(i) Data Collection: The dataset used in this study consists of voice recordings from individuals with and without Parkinson’s disease. These recordings have been converted into structured CSV format, capturing various vocal features such as pitch, jitter, shimmer, and harmonic-to-noise ratio.

(ii) Preprocessing: To address class imbalance in the dataset, we applied the Synthetic Minority Over-sampling Technique (SMOTE). This technique generates synthetic samples for the minority class, ensuring a balanced distribution of Parkinson’s and healthy cases in the training set.

(iii) Feature Selection: We employed Recursive Feature Elimination (RFE) to identify the most relevant vocal features for Parkinson’s disease classification. RFE iteratively removes the least important features based on the performance of a Support Vector Machine (SVM) model, ultimately selecting a subset of features that contribute most to the classification task.

(iv) Model Development: We developed two primary models: K-Nearest Neighbor (KNN) and Feed-forward Neural Network (FNN).

(v) KNN: This model classifies individuals based on the majority class of their nearest neighbors in the feature space.

(vi) FNN: This neural network consists of an input layer, one or more hidden layers, and an output layer. We optimized the network architecture and parameters using RandomizedSearchCV.

(vii) Hyperparameter Tuning: For both models, we conducted hyperparameter tuning using RandomizedSearchCV, which randomly samples a wide range of parameter combinations to identify the optimal settings that maximize model performance.

(viii) Evaluation: The performance of the models was evaluated using accuracy, precision, recall, and F1-score. These metrics provide a comprehensive assessment of the models’ ability to correctly identify individuals with and without Parkinson’s disease. Figure 2 depicts the operational flow diagram of the proposed model, and an illustrative end-to-end sketch of steps (i)-(viii) follows the figure.

Figure 2. Operational flow diagram of the proposed model.
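To make steps (i)-(viii) concrete, the sketch below strings them together with scikit-learn and imbalanced-learn. It is illustrative only: the hyperparameter ranges, the soft-voting ensemble of KNN, an RBF-kernel SVM and an MLP-based FNN, the random seeds, and the choice to oversample only the training split are assumptions of this sketch rather than the exact configuration used in the study.

import pandas as pd
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.metrics import classification_report
from imblearn.over_sampling import SMOTE

# (i) data collection: the UCI voice-feature CSV
df = pd.read_csv("parkinsons.data")
X = df.drop(columns=["name", "status"]).values
y = df["status"].values

# 70:30 split, then (ii) oversample only the training portion with SMOTE
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, stratify=y, random_state=42)
X_tr, y_tr = SMOTE(random_state=42).fit_resample(X_tr, y_tr)

# (iii) keep the eight most informative features, then standardize them
selector = SelectKBest(f_classif, k=8).fit(X_tr, y_tr)
X_tr, X_te = selector.transform(X_tr), selector.transform(X_te)
scaler = StandardScaler().fit(X_tr)
X_tr, X_te = scaler.transform(X_tr), scaler.transform(X_te)

# (iv)-(vii) tune KNN with RandomizedSearchCV, then vote together with an SVM and a small FNN
knn_search = RandomizedSearchCV(
    KNeighborsClassifier(),
    {"n_neighbors": list(range(1, 30)), "weights": ["uniform", "distance"]},
    n_iter=15, cv=5, random_state=42).fit(X_tr, y_tr)
ensemble = VotingClassifier(
    estimators=[("knn", knn_search.best_estimator_),
                ("svm", SVC(kernel="rbf", probability=True, random_state=42)),
                ("fnn", MLPClassifier(hidden_layer_sizes=(32,), max_iter=2000, random_state=42))],
    voting="soft").fit(X_tr, y_tr)

# (viii) evaluate with accuracy, precision, recall and F1-score
print(classification_report(y_te, ensemble.predict(X_te)))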

Pre-processing the Data

Preprocessing is a critical aspect of data processing that helps the model learn the features of the data effectively and remove unnecessary information [34]. Handling missing values is crucial to ensure the integrity of the dataset and the effectiveness of subsequent analyses. In our study, after importing the dataset into the Google Colab platform as a CSV file using the Pandas package, we conducted a thorough examination for duplicates and missing entries. Missing values can significantly degrade model performance if not handled properly. To address this, we employed a combination of imputation strategies based on the nature and distribution of the missing data. For numerical features, we utilized mean imputation, replacing missing values with the mean value of the respective feature. This approach is effective when the data is deemed to be missing at random, and replacing it with the mean helps maintain the overall distribution of the data. For categorical features, we applied mode imputation, replacing missing values with the most frequent category; this ensures that the imputed values are representative of common trends in the dataset and preserves the integrity of the categorical variables.

Following the imputation process, we observed that the dataset was imbalanced, with 147 recordings from individuals with Parkinson’s disease (PD) and 48 from healthy controls (HC), representing approximately 75% and 25% of the dataset, respectively. To mitigate potential issues related to model performance, such as underfitting or overfitting, we partitioned the dataset into a 70:30 train/test split, allowing the model to be trained on a sufficient amount of data while reserving a portion for independent validation. Moreover, each feature was individually scaled using StandardScaler, a preprocessing technique that standardizes features by removing the mean and scaling to unit variance. This step ensures that all features contribute equally to the model and prevents features with larger magnitudes from dominating the learning process.

$$Standardization=\frac{a-\alpha }{\beta }$$

(1)

Here, a denotes a feature value, α the mean of that feature, and β its standard deviation. Seaborn and Matplotlib formed the cornerstone of data visualization in Python, allowing the creation of 2D graphs when coupled with Pandas and NumPy. Also central to our toolkit was Scikit-learn, Python’s machine learning package, whose consistent interface provides tools for dimensionality reduction, clustering, regression, and classification. In essence, our approach encompassed a comprehensive data processing pipeline, leveraging these libraries to ensure effective model training and analysis.
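The lines below sketch this preprocessing stage under the assumptions above (Pandas for loading, mean/mode imputation, a 70:30 split, and StandardScaler applying Eq. (1)); the file name, the UCI column layout with a "name" identifier column, and the random seed are illustrative choices rather than values reported in the study.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("parkinsons.data").drop_duplicates()

# Mean imputation for numerical columns, mode imputation for any categorical ones.
for col in df.columns:
    if pd.api.types.is_numeric_dtype(df[col]):
        df[col] = df[col].fillna(df[col].mean())
    else:
        df[col] = df[col].fillna(df[col].mode()[0])

X = df.drop(columns=["name", "status"])
y = df["status"]

# 70:30 train/test split, then standardize each feature as (a - mean) / std, Eq. (1).
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y, random_state=42)
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)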

Feature selection

In this phase of our data processing pipeline, we incorporated the SelectKBest dimensionality reduction technique, which selected the eight most informative features. This decision aimed to enhance the efficiency of subsequent modelling steps by focusing on the most relevant variables. SelectKBest is the second most widely adopted technique in dimensionality reduction, accounting for roughly 30% of overall usage; its popularity underscores its effectiveness and acceptance within the data science community. SelectKBest ranks features using a univariate scoring function (such as the ANOVA F-statistic or the chi-squared test) that gauges the relevance of each feature to the target variable, and retains the k highest-scoring features. The primary advantage of employing SelectKBest is its capability to streamline the dataset, ensuring that only the most relevant and influential features are retained for further analysis and model training. This selection not only enhances model interpretability but also significantly reduces training times. The specific features selected by SelectKBest were determined by their individual contribution to the predictive task: features with the highest scores were prioritized for inclusion in the final dataset, while less informative features were excluded. This process was guided by the principle of maximizing predictive power while minimizing redundancy and overfitting. By retaining the most crucial information, SelectKBest transformed our data into a more concise and efficient representation, laying a solid foundation for the subsequent stages of our data processing workflow.
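A minimal sketch of this step, assuming X and y hold the voice features and status labels from the preprocessing stage and using the ANOVA F-value as the scoring function (one common choice; the study does not state which score it used):

from sklearn.feature_selection import SelectKBest, f_classif

# Score each voice feature against the PD/healthy label and keep the eight best.
selector = SelectKBest(score_func=f_classif, k=8)
X_selected = selector.fit_transform(X, y)

# Inspect which features survived the selection and their scores.
kept = selector.get_support()
print(X.columns[kept].tolist())
print(selector.scores_[kept])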

SMOTE

SMOTE, which stands for Synthetic Minority Over-sampling Technique, is a powerful and widely used method for imbalanced machine learning datasets. Designed to address the challenge posed by an uneven class distribution, SMOTE focuses on the minority class by generating synthetic instances, effectively balancing the representation of the classes in the dataset [35]. The core idea behind SMOTE is to alleviate the bias introduced by imbalanced datasets, where the minority class may be underrepresented. Rather than relying solely on the existing instances of the minority class, SMOTE creates synthetic examples by interpolating between the feature vectors of minority class instances. This is achieved by selecting a minority class instance and its nearest neighbours, and then generating new instances along the line segments connecting the instance to those neighbours. By introducing synthetic samples, SMOTE not only increases the number of minority class instances but also contributes to a more robust and balanced training set. This aids machine learning models in learning patterns from the minority class, preventing biases and improving overall predictive performance. It is particularly beneficial in scenarios where the minority class carries important and meaningful information that might otherwise be overshadowed by the dominance of the majority class. SMOTE has become a standard tool for data scientists and machine learning practitioners dealing with imbalanced datasets; its application helps mitigate the challenges associated with skewed class distributions, ultimately leading to more accurate and reliable models across various domains and applications.

$${z}_{j}^{\prime}={z}_{j}+\lambda \left({z}_{i}-{z}_{j}\right)$$

(1)

$$SMOTE\left({D}_{min},N,k\right)$$

(2)

Here, the components of the equation are defined as follows:

Dmin: the minority class instances in the dataset that require over-sampling.
N: the number of synthetic instances to be generated for each minority class instance.
k: the number of nearest neighbours considered for each minority class instance during synthetic sample generation.

In the interpolation equation above, zj is a minority class instance, zi is one of its k nearest minority class neighbours, and λ is a random number drawn from [0, 1], so that each synthetic sample z′j lies on the line segment between zj and zi. The SMOTE equation captures the core elements of the algorithm, emphasizing the targeted over-sampling of minority class instances (Dmin) through the generation of synthetic instances. The parameters N and k play crucial roles in determining the quantity and characteristics of the synthetic instances introduced into the dataset. This technique is instrumental in addressing imbalances in class distribution, promoting a more equitable representation of minority and majority classes for improved machine learning model performance.
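As a brief sketch, using the imbalanced-learn implementation and the X_train/y_train arrays from the earlier preprocessing sketch (k = 5 neighbours is the library default and an assumption here):

from collections import Counter
from imblearn.over_sampling import SMOTE

# Interpolate synthetic minority-class (healthy) samples between each minority
# instance and one of its k nearest minority-class neighbours.
smote = SMOTE(k_neighbors=5, random_state=42)
X_balanced, y_balanced = smote.fit_resample(X_train, y_train)

print(Counter(y_train))      # class counts before over-sampling
print(Counter(y_balanced))   # both classes equal after over-sampling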

RandomizedSearchCV: hyperparameter tuning

RandomizedSearchCV, short for Randomized Search Cross-Validation, is a robust strategy for efficient hyperparameter refinement in machine learning. Unlike conventional grid search, which exhaustively traverses every hyperparameter combination within a predefined search space, RandomizedSearchCV introduces randomness into the process: hyperparameters are specified as distributions rather than fixed values, and candidate values are randomly sampled from these distributions over a predetermined number of iterations. This allows RandomizedSearchCV to explore a varied landscape of hyperparameter combinations efficiently, presenting a computationally frugal alternative to exhaustive grid searches [36].

Integrated into the scikit-learn library, RandomizedSearchCV works with any machine learning model featuring adjustable hyperparameters. By leveraging random sampling, it streamlines the hyperparameter tuning process, enhances model performance, and facilitates better generalization on unseen data. The randomness also adds adaptability, allowing the algorithm to navigate the hyperparameter space dynamically and refine models iteratively in a more resource-conscious manner.

$$RandomizedSearchCV (M, P, scoring, cv, {n}_{iter})$$

(3)

Here, the components of the equation are defined as follows:

M: the machine learning model under consideration; this could be any scikit-learn compatible estimator, such as a classifier or a regressor.
P: the hyperparameter space to be explored. Unlike grid search, which enumerates all possible combinations, P defines probability distributions for each hyperparameter.
scoring: the evaluation metric used to assess the performance of the model for each hyperparameter combination. Common choices include accuracy, precision, recall, or custom scoring functions.
cv: the cross-validation strategy; this could be an integer (for k-fold cross-validation) or a cross-validation splitter object.
n_iter: the number of random combinations to sample from the hyperparameter space, which controls the balance between exploration and exploitation.

The RandomizedSearchCV equation captures the dynamic exploration of the hyperparameter space, introducing randomness to search efficiently for optimal model configurations while maintaining computational efficiency. Figure 3 depicts the RandomizedSearchCV algorithm.

Figure 3. Algorithm for RandomizedSearchCV for hyperparameter optimization.
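The call below is a minimal sketch of Eq. (3) in scikit-learn form; the SVM estimator, the particular parameter distributions, and the value of n_iter are illustrative assumptions, and X_train/y_train again come from the preprocessing sketch.

from scipy.stats import uniform
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

# M = an SVC, P = distributions over C, gamma and the kernel, scoring = accuracy,
# cv = 5-fold cross-validation, n_iter = 25 random draws from P.
param_distributions = {
    "C": uniform(0.1, 100),
    "gamma": uniform(1e-4, 1.0),
    "kernel": ["rbf", "poly"],
}
search = RandomizedSearchCV(SVC(), param_distributions, n_iter=25,
                            scoring="accuracy", cv=5, random_state=42)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)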

Classification methods

In our study, we leverage a combination of Machine Learning (ML) and Deep Learning (DL) models to discern between individuals classified as healthy and those diagnosed with Parkinson’s disease (PD). The diverse set of models includes the Kernel Support Vector Machine (KSVM), Random Forest (RF), Decision Tree (DT), K-Nearest Neighbor (KNN), and Feed-forward Neural Network (FNN). These models collectively operate on voice signal features, utilizing the distinctive patterns and characteristics embedded in the audio data to make accurate predictions about the health status of individuals. This ensemble of ML and DL models reflects a comprehensive approach to health classification based on voice signals. By harnessing the strengths of each model, we aim to create a sophisticated and accurate system capable of distinguishing between healthy individuals and those with Parkinson’s disease, contributing to advancements in medical diagnosis and treatment.

Kernel Support Vector Machine

The Kernel Support Vector Machine (KSVM) is a powerful machine learning algorithm commonly used for classification and regression tasks. It belongs to the family of Support Vector Machines (SVMs) and is particularly effective when dealing with non-linearly separable datasets. The primary objective of a KSVM is to find a hyperplane in a high-dimensional space that best separates different classes of data. KSVM finds applications in various fields, including image classification, bioinformatics, and speech recognition, among others. Its ability to handle non-linear relationships in data and make accurate predictions even in high-dimensional spaces contributes to its popularity in the machine learning community. The formulation of the Kernel Support Vector Machine (KSVM) involves the optimization of a decision function that defines a hyperplane in a transformed feature space. The general equation for the decision function of a binary KSVM is expressed as follows:

$$f\left(x\right)=sign\left[{\sum }_{i=1}^{N}{\alpha }_{i}{y}_{i}K\left({x}_{i},x\right)+b\right]$$

(4)

Here, the components of the equation are defined as follows:

\(f\left(x\right)\): the decision function that classifies a new instance x based on the sign of the summation.
N: the number of support vectors in the training dataset.
\({\alpha }_{i}\): the Lagrange multipliers associated with each support vector, determined during the optimization process.
\({y}_{i}\): the class label of the ith support vector.
\(K\left({x}_{i},x\right)\): the kernel function, evaluating the similarity between the ith support vector \({x}_{i}\) and the input instance x in the transformed space.
b: the bias term, also known as the threshold, which shifts the decision boundary.

The kernel function \(K\left({x}_{i},x\right)\) is a crucial element of the KSVM, determining the transformation applied to the input data. Common choices include the radial basis function (RBF) kernel and the polynomial kernel, among others. The KSVM Eq. (4) illustrates how the decision function combines the contributions of the support vectors, weighted by their Lagrange multipliers and class labels, to make predictions in the transformed feature space. This formulation allows the KSVM to handle non-linear relationships in the data effectively.
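As a small sketch with scikit-learn's SVC (the C and gamma values below are library defaults, not tuned settings from the study), the decision_function output corresponds to the bracketed sum in Eq. (4) before the sign is taken:

from sklearn.svm import SVC

ksvm = SVC(kernel="rbf", C=1.0, gamma="scale")
ksvm.fit(X_train, y_train)

print(ksvm.n_support_)                      # number of support vectors per class
print(ksvm.decision_function(X_test[:5]))   # signed distances from the hyperplane
print(ksvm.predict(X_test[:5]))             # sign of the decision function -> class label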

Random forest

Random Forest is an ensemble learning algorithm that operates by constructing a multitude of decision trees during training and outputs the mode of the classes for classification tasks or the average prediction for regression tasks. The fundamental idea behind Random Forest is to introduce randomness in both the data and the features used for constructing the individual trees, thereby promoting diversity and improving the overall predictive performance. RF is widely used across various domains due to its flexibility, robustness, and ability to handle complex datasets. It is effective for both classification and regression tasks and is particularly popular in machine learning applications where interpretability, scalability, and high predictive accuracy are essential. The prediction equation for a Random Forest in the context of classification tasks is as follows:

$$\widehat{Y}=mode ({Y}_{1},{Y}_{2},{Y}_{3},\dots ,{Y}_{T})$$

(5)

Here, the components of the equation are defined as follows:

\(\widehat{Y}\): the predicted class for a new instance.
\({Y}_{1},{Y}_{2},{Y}_{3},\dots ,{Y}_{T}\): the individual predictions from each tree in the Random Forest.
T: the total number of trees in the ensemble.

In an RF, each tree is built independently on a bootstrap sample (drawn with replacement) from the training dataset. Additionally, at each split in each tree, only a random subset of features is considered, introducing randomness and decorrelating the trees. The final prediction is determined by a majority vote among the predictions of all the trees; in the case of regression tasks, the average of the predicted values from all trees is taken. The RF algorithm excels at leveraging the collective wisdom of diverse decision trees, each trained on a different subset of the data and features. This ensemble approach leads to robust predictions, mitigating the risk of overfitting associated with individual trees and enhancing the model’s generalization performance.
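A compact sketch with scikit-learn (the number of trees and the square-root feature subset are assumed values, and X_train/y_train come from the preprocessing sketch):

from sklearn.ensemble import RandomForestClassifier

# T = 200 trees, each grown on a bootstrap sample; only a random subset of
# features (sqrt of the total) is considered at every split.
rf = RandomForestClassifier(n_estimators=200, max_features="sqrt", random_state=42)
rf.fit(X_train, y_train)

print(rf.predict(X_test[:5]))    # majority vote across the trees, as in Eq. (5)
print(rf.feature_importances_)   # relative importance of each voice feature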

Decision tree

A Decision Tree is a supervised machine learning algorithm used for both classification and regression tasks. It operates by recursively partitioning the data into subsets based on the values of input features, ultimately assigning a class label or predicting a continuous value at each leaf node. The decision tree Eq. (6) captures the essence of the algorithm, emphasizing the recursive decision-making process based on feature values to partition the data and make predictions. Despite their simplicity, decision trees serve as building blocks for more complex ensemble methods, such as Random Forests and Gradient Boosting.

$$DT \left(X,Y\right)=Node (X,Y)$$

(6)

Here, the components of the equation are defined as follows:

DT (X, Y): the decision tree function that recursively constructs nodes based on the input features (X) and target labels (Y).
Node (X, Y): a decision node in the tree, representing a split point based on the features. This node is constructed by selecting the best feature on which to split the data, and it leads to further recursive calls to the DT function for the subsets of data created by the split.

DT construction involves selecting a feature at each node to create a decision point that partitions the data into subsets. This process is repeated recursively until a stopping criterion is met, such as reaching a maximum depth, having a minimum number of samples in a node, or achieving perfect homogeneity.
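A brief sketch of such a tree with explicit stopping criteria (the maximum depth and minimum leaf size below are arbitrary illustrative choices):

from sklearn.tree import DecisionTreeClassifier, export_text

dt = DecisionTreeClassifier(max_depth=4, min_samples_leaf=5, random_state=42)
dt.fit(X_train, y_train)

print(export_text(dt))        # the learned sequence of splits, node by node
print(dt.predict(X_test[:5]))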

K-nearest Neighbour

The KNN algorithm is a supervised machine learning algorithm used for both classification and regression tasks. It operates on the principle of proximity, making predictions for a new data point based on the majority class (for classification) or the average value (for regression) of its k-nearest neighbours in the feature space. The equation for the KNN algorithm can be summarized as follows:

$$\widehat{Y}=majority vote ({Y}_{1},{Y}_{2},{Y}_{3},\dots ,{Y}_{k})$$

(7)

\(\widehat{Y}\) is the predicted class for a new instance. \({Y}_{1},{Y}_{2},{Y}_{3},\dots ,{Y}_{k}\) are the class labels or values of the k-nearest neighbours in the feature space. The choice of the distance metric and the value of k are crucial parameters in the KNN algorithm. Common distance metrics include Euclidean distance, Manhattan distance, and Minkowski distance. KNN is a simple and intuitive algorithm, but its performance can be sensitive to the choice of these parameters and the distribution of the data. It is a non-parametric and instance-based algorithm, meaning it does not make assumptions about the underlying data distribution and relies on the entire dataset for making predictions.
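A short sketch using scikit-learn's implementation (k = 5 and the Euclidean metric, i.e. Minkowski with p = 2, are assumed choices; the standardized features from the preprocessing sketch are reused because distance-based models are sensitive to feature scale):

from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=5, metric="minkowski", p=2)
knn.fit(X_train, y_train)

# Each prediction is the majority vote of the 5 nearest training samples, as in Eq. (7).
print(knn.predict(X_test[:5]))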

Feed-forward Neural Network

A Feed-forward Neural Network (FNN), also known as a multilayer perceptron (MLP), is a type of artificial neural network designed for supervised learning tasks, such as classification and regression. It consists of an input layer, one or more hidden layers, and an output layer. The term “feed-forward” refers to the flow of information through the network, where data travels from the input layer, passes through the hidden layers, and produces an output without forming cycles or loops. The mathematical representation of a feed-forward neural network involves a series of matrix operations, activation functions, and weight adjustments. Let’s consider a simple two-layer network with one hidden layer:

$${Z}^{1}=X.{W}^{1}+ {B}^{1}$$

(8)

$${A}^{1}=active ({Z}^{1})$$

(9)

$${Z}^{2}={A}^{1}.{W}^{2}+ {B}^{2}$$

(10)

$$\widehat{Y}=active ({Z}^{2})$$

(11)

Here, the components of the equations are defined as follows:

X: the input features.
\({W}^{1}\): the weights of the connections between the input layer and the hidden layer.
\({B}^{1}\): the bias terms for the hidden layer.
\({Z}^{1}\): the weighted sum of inputs at the hidden layer.
\({A}^{1}\): the output of the hidden layer after applying the activation function.
\({W}^{2}\): the weights of the connections between the hidden layer and the output layer.
\({B}^{2}\): the bias terms for the output layer.
\(\widehat{Y}\): the predicted output after applying the final activation function.

Feed-forward neural networks can have multiple hidden layers (creating deep neural networks) and different activation functions, allowing them to model complex relationships in data. The choice of hyperparameters, such as the number of hidden layers, the number of neurons in each layer, and the activation functions, is critical in designing an effective FNN.
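To make Eqs. (8)-(11) concrete, the sketch below performs a single forward pass with NumPy; the layer sizes, the sigmoid activation, and the randomly initialized (untrained) weights are assumptions chosen purely for illustration.

import numpy as np

rng = np.random.default_rng(42)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy dimensions: 8 selected voice features, 16 hidden units, 1 output unit.
n_features, n_hidden = 8, 16
W1 = rng.normal(scale=0.1, size=(n_features, n_hidden))   # input -> hidden weights
B1 = np.zeros(n_hidden)                                    # hidden-layer biases
W2 = rng.normal(scale=0.1, size=(n_hidden, 1))             # hidden -> output weights
B2 = np.zeros(1)                                           # output-layer bias

def forward(X):
    Z1 = X @ W1 + B1          # Eq. (8)
    A1 = sigmoid(Z1)          # Eq. (9)
    Z2 = A1 @ W2 + B2         # Eq. (10)
    return sigmoid(Z2)        # Eq. (11): predicted probability of PD

X_batch = rng.normal(size=(5, n_features))   # five standardized feature vectors
print(forward(X_batch).ravel())              # values in (0, 1); threshold at 0.5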


