Once the features are extracted, the colon cancer classification is performed using machine learning and metaheuristic classifiers. In our research, we utilize a diverse set of seven classifiers to cast a wide net for the most suitable approach. The machine learning classifiers utilised are GMM, DFA, NBC, and SVM (RBF). These are probabilistic modelling used to identify clusters of similar gene expressions and patterns within the data. Further to uncover the potentially hidden dynamics inside the data, we employ the metaheuristic integrated machine learning classification namely PSO-GMM, Firefly-GMM, and FPO-GMM. By utilizing a diverse range of algorithms, we aim to capture the multifaceted nature of the data and identify the most accurate and robust approach for colon cancer classification.
Gaussian mixture model (GMM)
The Gaussian Mixture Model is a popular unsupervised learning technique that groups together similar data points to perform tasks like image classification and pattern recognition. A set of Gaussian distributions are joined linearly in the PDF (Probability Density Function) of GMM, which makes the classification of the data easier. So the mixture classifier creates a probability distribution from microarray gene expression levels for both the classes with the help of a combination of Gaussian distributions. After probability distribution, the class prediction is based on the highest probability value (Bayes’ theorem). For every class ‘c’, and feature ‘f’, GMM assumes that the data is drawn from a mixture of ‘G’ Gaussian coefficients. This can be expressed in the following way:
$$P\left(f\right|y=c)= \sum_{g-1}^{G}{M}_{cg}N (f |{ \Upsilon}_{cg} , { \text{C}\cdot }_{cg})$$
(8)
Here \({\text{M}}_{cg}\), \({\Upsilon}_{cg}, { \text{C}\cdot }_{cg}\) indicates coefficient of mixture, mean vector, and covariance matrix respectively, under mixture component ‘g’ in class ‘c’. The \({\text{M}}_{cg}\) signify the ratio of each constituent in the class. From training data, \({\text{M}}_{cg}\), \({\Upsilon}_{cg}, { \text{C}\cdot }_{cg}\) parameters are gained for each constituent in the class with ‘N’ as the total number of samples, total classes ‘C’, and i = 0,1,2,…N.
$${\Upsilon}_{c} = \frac{1}{{N}_{c}}\sum_{i=1}^{N}{{(y}_{i}=c)}_{1}. {f}_{i}$$
(9)
$$\text{C}\cdot =\frac{1}{N-G}\sum_{c=1}^{C}\sum_{i=1}^{N}{{(y}_{i}=c)}_{1} ({f}_{i} – {\Upsilon}_{c}){({f}_{i} – {\Upsilon}_{c})}^{T}$$
(10)
Also, \(P (y=c)\) indicating prior probability is also obtained for every class ‘c’. \(P (y=c)\) is given as:
$$p \left(y=c\right)= \frac{Number \; of \;samples \; with \;class \;c}{Total \;number \;of \;samples}$$
(11)
A new feature, \({f}_{new}\) classification is performed by likelihood estimation under mixture model using Bayes’ theorem in the following way.
$$p\left(y=c|{f}_{new}\right)\propto p \left(y=c\right)\cdot p\left({f}_{new}|y=c\right)$$
(12)
$$p\left({f}_{new}|y=c\right) = \frac{1}{{\left(2\pi \right)}^{D/2}{\left|\text{C}\cdot \right|}^{1/2}}\text{ exp }\left(-\frac{1}{2}{\left({f}_{new}- {\Upsilon}_{k}\right)}^{T}{\text{C}\cdot }^{-1}({f}_{new}- {\Upsilon}_{k})\right)$$
(13)
Here ‘D’ represents the dimensionality of the data, that is nothing but the number of features.
Particle swarm optimization-GMM (PSO-GMM)
PSO utilizes the collective intelligence of a “swarm” to optimize the classification performance. Consider a flock of birds searching for food, constantly refining their flight paths based on individual discoveries and interactions. PSO can effectively perform classification by searching for the most clustered and hidden subset of genes from the large pool of gene expression data. Moreover, PSO can handle the complex and non-linear relationships among feature-extracted gene expressions. A combination of PSO with GMM (PSO-GMM) can intricate relationships that might be missed by static GMM. The technique potentially provides a more accurate cluster identification and classification. Here’s the crux of PSO-GMM classification, Nair et al.42.
$${v}_{i}(t+1) = w{v}_{i}
(14)
where \({v}_{i}
(15)
where \({I}_{i}\) is the attractiveness, \({{I}_{i}}^{0}\) is the initial attractiveness, γ is the light absorption coefficient, \({r}_{ij}\) is the distance between fireflies i and firefly j. Fireflies move towards brighter neighbours, gradually refining their positions towards better solutions. The movement of a firefly i towards a brighter firefly j is determined by the attractiveness and randomness given by
$${{x}_{i}}^{t+1}= {{x}_{i}}^{t}+ \beta \left({{I}_{j}}^{t}- {{I}_{i}}^{t}\right)+ \alpha (rand\left( \right)-0.5)$$
(16)
Here, \({{x}_{i}}^{t+1}\) is the new position of firefly i at time step t + 1, \({{x}_{i}}^{t}\) is the current position of firefly i at time step t, β is the attraction coefficient, α is the randomization parameter, and \(rand\left(\right)\) ∈ [0, 1], is a generated random number.
The collaboration of Firefly with GMM can potentially uncover intricate relationships in gene expression data that might be missed by standard GMM. Firefly combined with GMM can also overcome the limitations of the dimensionality problem. The Firefly GMM thus gives a clustered and segregated data distribution to GMM for classification by varying the mean, variance and other statistical parameters of the extracted features.
Detrended fluctuation analysis (DFA)
Detrended fluctuation analysis (DFA) efficiently solves problems involving complex and non-stationary data. DFA captures both short and long-range correlations based on how the data fluctuates around the data distribution with the help of a scaling exponent to classify the data. The scaling nature of DFA algorithm is described by the root mean square fluctuation of the integrated time series and detrended input data. For DFA, the inputs are analysed in the following way.
$$y\left(k\right)={\sum }_{i=1}^{k}[B\left(i\right)-\overline{B }]$$
(17)
where \(B\left(i\right)\) and \(\overline{B }\) are the ith sample of the input data and mean value of the input data respectively, thus \(y\left(k\right)\) denotes the estimated value of the integrated time series. Now, the fluctuation of integrated time-series and detrended data for a window with the scale of ‘n’ is determined by
$$F\left(n\right)=\sqrt{\frac{1}{N}{\sum }_{k=1}^{N}[y\left(k\right)-{y}_{n}(k){]}^{2}}$$
(18)
Here, \({y}_{n}(k)\) is the kth point on the trend computed using the predetermined window scale, and ‘N’ is the normalization factor.
Therefore using Detrended Fluctuation Analysis (DFA) as a classifier for microarray gene expression data can effectively capture the long-range correlations present in microarray gene expression data. Unlike traditional methods focusing on short-range correlations, DFA evaluates the scaling behaviour of fluctuations across different time scales. This capability is particularly valuable in gene expression data analysis, where genes may exhibit complex patterns of co-regulation and interactions across various biological processes and time scales.
Naive Bayes classifier (NBC)
NBC is a probabilistic classification algorithm based on the Bayes theorem and feature independence assumption. This straightforward assumption allows Naive Bayes classifiers to efficiently handle large feature spaces, making them computationally efficient and scalable for microarray data analysis. NBC starts by calculating the posterior probability of a class using the prior probability and likelihood. For a given class C and extracted features x1, x2… xn, posterior probability \(p\left(C|{x}_{1},{x}_{2},\dots ., {x}_{n}\right)\) is expressed in Fan et al.43 as:
$$p\left(C|{x}_{1},{x}_{2},\dots ., {x}_{n}\right)= \frac{p\left(C\right).p({x}_{1},{x}_{2},\dots ., {x}_{n}|C)}{p({x}_{1},{x}_{2},\dots ., {x}_{n})}$$
(19)
For class C, \(p\left(C\right)\) represents the prior probability, \(p({x}_{1},{x}_{2},\dots ., {x}_{n}|C)\) is the likelihood, and \(p({x}_{1},{x}_{2},\dots ., {x}_{n})\) is the evidence probability. As mentioned, in the Naive Bayes approach, the features are conditionally independent for the class. This assumption simplifies the calculation of the likelihood as follows:
$$p({x}_{1},{x}_{2},\dots ., {x}_{n}|C) = p({x}_{1}|C) . p({x}_{2}|C) \dots p({x}_{n}|C)$$
(20)
where \(p({x}_{i}|C)\) is the probability of feature \({x}_{i}\) of the class C. The \(({x}_{i}|C)\) is estimated from the fraction of class C training examples with the feature value \({x}_{i}\). Then the prior probability (C) can be estimated as the fraction of training examples belonging to class C. Finally, to predict the class label for the features \({x}_{i}\), the algorithm calculates the posterior probability for each class and assigns the instance to the class with the highest probability. Thus using a Naive Bayes classifier for microarray gene expression data can efficiently handle large amount of data because they assume independence between features given the class label, which greatly reduces computational complexity.
Support vector machine (radial basis function)
Support Vector Machine with a Radial Basis Function (SVM RBF) can handle complex, non-linear relationships between gene expression levels. SVM can effectively handle non-linear separability by mapping the input data into a high-dimensional feature space, where non-linear relationships become linearly separable. This is possible by constructing decision boundaries between normal and cancerous samples by handling non-linear relationships effectively. RBF is the kernel used in SVM to perform the nonlinear mapping of the input features into a higher-dimensional. The RBF kernel \({K}_{RBF}\left(x,z\right)\) that is used to compute the similarity between feature vectors in the input space is given by:
$${K}_{RBF}\left(x,z\right)=\text{exp}\left(-\frac{{\Vert x-z\Vert }^{2}}{2{\sigma }^{2}}\right)$$
(21)
where σ is the kernel width parameter that regulates the influence of each training sample, and |x − z| is the Euclidean distance between feature extracted vectors ‘x’ and ‘z’. The SVM RBF classification decision function is also described as a linear combination of the kernel evaluations between the input feature vector and the support vectors with the bias term, as performed by Zhang et al. 48. The decision function \({f}_{RBF}\left(x\right)\) is given by:
$${f}_{RBF}\left(x\right)=\sum_{i=1}^{N}({\alpha }_{i}{y}_{i}){K}_{RBF}({x}_{i, }x)+b$$
(22)
Here \({y}_{i},\) and \({\alpha }_{i}\) are Lagrange multipliers and class labels respectively associated with each support vector. The \({K}_{RBF}\left(x,z\right)\) computes the computes the similarity or distance between two feature-extracted vectors \({x}_{i}\) and \(x\). \(b\) is the bias term that shifts the decision boundary away from the origin allowing the SVM to classify data points that may not be separable by a hyperplane in the original feature space.
Flower pollination optimization (FPO) with GMM
FPO is a metaheuristic optimization algorithm inspired by the pollination behaviour of flowering plants. In nature, pollination happens through two major forms: abiotic and biotic. The biotic cross-pollination is observed in 90% of the cases where the insects are supporting the pollination process. The insects thus take long distance steps, in which the motion can be drawn from a levy distribution. The FPO starts with the discovery of solution space and subsequent movements.The insect motion and the levy distribution are expressed in the following steps as provided in Yang et al. 49.
$${{x}_{i}}^{t+1}= {{x}_{i}}^{t}+ \delta L \left(\lambda \right) ({g}_{best}- {{x}_{i}}^{t})$$
(23)
where \({{x}_{i}}^{t}\) is a pollen representing the solution vector \({x}_{i}\) at the \(t\)th iteration, and \({g}_{best}\) is the global best solution discovered so far. The step size is denoted by factor, and \(L \left(\lambda \right)\) is the Levy flight step size denoting the success of pollination.
The Levy distribution is given by:
$$L \sim \frac{\lambda \Gamma\left(\lambda \right)\text{sin}\left(\pi \lambda /2\right)}{\pi } \frac{1}{{S}^{1+\lambda }}, ({s} \gg {s}_{0} >0)$$
(24)
Where \(\Gamma\left(\lambda \right)\) the step size for is large steps where \(({s}>0)\), drawn from a gamma distribution. However, for smaller pseudo-random steps, that correctly follow Levy distribution, the \(s\) value is drawn from two Gaussian distributions U and V (Mantegna algorithm 50),
$$s= \frac{U}{{\left|V\right|}^{\left(1/\lambda \right)}}, U \sim N \sim N \left(0,{\sigma }^{2}\right), V \sim N (\text{0,1})$$
(25)
Here, \(U\) is drawn from a normal distribution with mean = 0 and variance = \({\sigma }^{2}\), and \(V\) is drawn from a normal distribution with mean = 0 and variance = 1. The variance is given by
$${\sigma }^{2}= {\left\{ \begin{array}{ccc}\frac{\Gamma (1+\lambda )}{\lambda \Gamma [(1+\lambda )/2] }& .& \frac{\text{sin}\left(\pi \lambda /2\right)}{{2}^{(\lambda -1)/2}}\end{array}\right\}}^{1/\lambda }$$
(26)
The remaining 10% of the abiotic pollination observed in the plant and flower community is regarded as a Random Walk, that sometimes brings out solutions from the unexplored search space \({x}_{i}\).
$${{x}_{i}}^{t+1}= {{x}_{i}}^{t}+ \varepsilon ({{x}_{j}}^{t}- {{x}_{k}}^{t})$$
(27)
Here, \( {{x}_{i}}^{t}\), and \({{x}_{k}}^{t}\) are considered to be pollen transformations within the same plant species, from the same population. \(\varepsilon\) is the step for the random walk drawn from the uniform distribution [0,1].
Thus, FPO allows for a global search of the solution space, which is crucial for effectively exploring the high-dimensional feature space of gene expression data. FPA identifies the most relevant genes that can contribute to the classification task and reduces the dimensionality of the feature extracted data. FPA also changes the means, covariance, and mixing coefficients of data distribution, so that GMM can perform better with the optimized data that contains complex and non-linear relationships among genes. The next section discusses the selection of classifier targets for the Adeno and Meso classes.
Selection of target
Selecting the target is a crucial step in defining the objective of the classification methodology. The clear definition of target improves the classifier model’s predictive power by focusing on the most informative aspects of the data. The noise, outliers, and nonlinearity aspects of the data decide the selection of classifier targets. Moreover, the dataset used in this research is imbalanced and selecting the target must be strategized to deliver the maximum classifier performance. The binary classification performed in this research where the classification of feature extracted microarray data samples are classified into normal and colon cancer classes. Therefore two targets are selected namely \({\text{T}}_{{\varvec{N}}}\) and \({\text{T}}_{{\varvec{C}}}\). For a normal class feature set of N elements, the class target for the Normal class is defined as:
$$\frac{1}{N} \sum_{i=1}^{N}{M}_{i} \le {T}_{N}$$
(28)
where \({M}_{i}\) is the average of the feature extracted vectors of the Normal class, and \({T}_{{\varvec{N}}}\) target follows the constraint \({T}_{{\varvec{N}}}\)∈ [0, 1]. For a Colon Cancer class feature set of M elements, the class target for the Colon Cancer class is defined as:
$$\frac{1}{M} \sum_{j=1}^{N}{M}_{j} \le {T}_{C}$$
(29)
where \({M}_{i}\) is the average of the feature extracted vectors of Colon cancer class, and \({T}_{C}\) target follows the constraint \({T}_{C}\) ∈ [0, 1]. The Eucledian distance between the Targets of the binary class must also follow the constraint:
$$\Vert {\text{T}}_{N} – {\text{T}}_{C} \Vert \ge 0.5 $$
(30)
Based on the Eqs. (28–30), the class targets \({T}_{N}\) and \({T}_{C}\) are chosen as 0.85 and 0.1, respectively. The classifier performance will be monitored with the help of MSE criteria. The next section discusses the training and testing of classifiers.
Training and testing of classifiers
Before moving to the final classification step, training of classifiers is performed utilizing the labelled microarray dataset to adjust the classifier model’s parameters. This step enables the learning of complex patterns and relationships in the gene expression data and optimizes the classifier’s performance by minimizing the discrepancy between predicted and actual outcomes. After training, testing evaluates the trained classifier’s performance on an independent dataset not used during training. So the testing phase assesses the model’s generalization ability and provides insights into its effectiveness on other datasets. Mean Square Error (MSE) serves as a critical evaluation metric during both the training and testing phases. In training, MSE quantifies the disparity between predicted and actual gene expression values, guiding the optimization process to minimize prediction errors. During testing, MSE provides insights into the model’s predictive accuracy and its ability to generalize to unseen data. Minimizing MSE ensures that the classifier effectively captures the underlying relationships within the gene expression data, enhancing its predictive performance. The MSE is given by calculating the average squared difference between the predicted values and the actual values in a dataset.
$$MSE= \frac{1}{N}{\sum }_{j=1}^{N}({A}_{j}-{P}_{j}{)}^{2}$$
(31)
Here ‘N’ is the total number of extracted features, \({A}_{j}\) represents the actual gene expression value of jth instance, and \({P}_{j}\) represents the predicted gene expression value of the jth instance. A lower MSE indicates better performance, as it signifies smaller discrepancies between predicted and actual values. Conversely, a higher MSE suggests poorer performance and potentially larger prediction errors. Thus, MSE acts as a parameter that continuously checks the performance of the classifier.
We also perform K-Fold Cross Validation as performed in Fushiki et al.44 to validate the classifier model. In this approach, the dataset is divided into K subsets (folds), with each fold serving as a validation set while the remaining data is used for training. This process is repeated K times, ensuring that each data point is used for validation exactly once. By averaging the performance metrics across multiple validation iterations, K-Fold Cross Validation provides a more reliable estimate of the classifier’s performance and helps mitigate overfitting, thus enhancing the generalization ability of the model. In this research, we have varied K value in the range 5–20, and is finally the value is fixed with 10, as higher values are providing similar results.
In this research for the binary classification problem of colon cancer data, the confusion matrix is described as shown in Table 4. It is used to describe the performance of a classification model during the training and testing phase. The confusion matrix has four possible combinations of predicted and actual class labels: True Positives (TP), True Negatives (TN), False Positives (FP), and False negatives (FN). These values help to evaluate the overall classifier performance metrics like Accuracy, F1 Score, Error Rate, MCC, and Kappa.
Table 5 shows the obtained training and testing MSE of classifier for various feature extraction methods. The training MSE is reported between 10−01 and 10−9. The testing MSE is reported between 10−5 to 10−8.The training process is evaluated with 2000 iterations. The FPO-GMM with STFT feature extraction attained the lowest training and testing MSE of 7.29 × 10–09 and 6.44 × 10–07, respectively. For LASSO feature extraction, SVM (RBF) attained the lowest training and testing MSE of 1.6 × 10–07 and 9 × 10–08, respectively. For EHO feature extraction SVM (RBF) classifier attained the lowest training MSE of 1.67 × 10–07 and the FPO-GMM classifier attained the lowest testing MSE of 9 × 10–08.
Based on the observations from Table 5, the classifier parameters are selected for the various employed classifiers in this research as provided in Table 6.