Machine learning-driven multi-targeted drug discovery in colon cancer using biomarker signatures

This research aims to improve drug response prediction, enhance biomarker identification, and personalize treatment regimens. Gene Expression (GE) data are gathered for multi-targeted drug discovery in colon cancer (CC). The data are pre-processed using the Robust Multi-array Average (RMA) and Microarray Suite (MAS) approaches. The ClusterProfiler tool is used to perform functional enrichment analysis. A Protein-Protein Interaction (PPI) network is employed to provide more information on the functional connections among the differentially expressed genes (DEGs). Validation and survival analysis are then carried out for the hub genes. Finally, the ABF-CatBoost method is described in detail. Figure 13 shows the methodological framework.

Data collection

The GE data are gathered from the open-source platform Kaggle [46]. This dataset, acquired via the ColoCare Project, uses Illumina Human HT12v4 gene chips and includes the GE profiles of 117 mucosa tissues and 77 tumor tissues. It also includes 107 tumor and 108 mucosa samples with matching DNA methylation profiles, combining data from GSE1017764. These combined datasets are used to examine the intricate relationship between DNA methylation and GE.

Data preprocessing

The GE dataset is pre-processed to ensure accurate analysis. DEG analysis is then carried out on the GE data to discover genes that are significantly related to CC. The analysis is performed in an R environment using Bioconductor tools. These tools make it easier to pre-process, background-correct, and statistically evaluate GE profiles in tumor and normal tissue samples.

To verify the consistency and stability of expression values, the data are normalized using the RMA and MAS approaches. Of the two, the MAS-normalized data are chosen for downstream analysis because their values are closer to the median distribution. Only genes with statistically significant expression differences and biologically relevant fold changes between malignant and normal tissues are selected as DEGs. The identified DEGs provide a basis for the subsequent functional enrichment analysis and biomarker development in CC.
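As a minimal sketch of the selection step described above, the following Python snippet keeps only genes that pass both a significance filter and a fold-change filter. The cutoffs (adjusted p-value below 0.05, absolute log2 fold change of at least 1.0), the function name `select_degs`, and the toy values are illustrative assumptions; the exact criteria used in the study are not stated here.

```python
from typing import Dict, List, Tuple


def select_degs(
    stats: Dict[str, Tuple[float, float]],
    max_adj_p: float = 0.05,      # assumed cutoff, not taken from the paper
    min_abs_log2fc: float = 1.0,  # assumed cutoff, not taken from the paper
) -> List[str]:
    """Return genes passing both the significance and fold-change filters.

    `stats` maps a gene symbol to (log2 fold change tumor vs. normal,
    adjusted p-value).
    """
    return sorted(
        gene
        for gene, (log2fc, adj_p) in stats.items()
        if adj_p < max_adj_p and abs(log2fc) >= min_abs_log2fc
    )


# Toy values for illustration only (not results from the study).
toy_stats = {
    "GENE_A": (2.3, 0.001),   # up-regulated and significant
    "GENE_B": (-1.7, 0.010),  # down-regulated and significant
    "GENE_C": (0.4, 0.002),   # significant, but fold change too small
    "GENE_D": (3.1, 0.200),   # large fold change, but not significant
}
degs = select_degs(toy_stats)
```

Both conditions are required together, so a gene with a large fold change but weak statistical support (or the reverse) is excluded.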

Functional enrichment analysis

The ClusterProfiler tool in R (version 3.12.0) is used to perform functional enrichment analysis on the marker genes to investigate how these genes influence the development of CC [18]. ClusterProfiler is an R package designed to compare biological themes among gene clusters, including those defined in Disease Ontology (DO), KEGG, and Gene Ontology (GO).

It is used to investigate the biological significance of the discovered DEGs and their role in CC pathogenesis. The KEGG database is used to uncover overrepresented biological pathways and functional categories. KEGG pathway analysis sheds light on the molecular interaction and reaction networks in which DEGs play essential roles. A pathway is considered significantly enriched at a p-value < 0.05. The analysis helped to identify critical biological pathways and signaling cascades that are involved in CC progression. This strategy gave important insights into the functional roles of DEGs and supported the identification of possible therapeutic targets and biomarkers for CC treatment.
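Over-representation analysis of this kind is commonly based on a hypergeometric test. The sketch below, written in Python rather than R for illustration, computes the probability of observing at least `k` pathway genes among `n` DEGs by chance. The function name and all counts are hypothetical, and multiple-testing correction is left out.

```python
from math import comb


def enrichment_p_value(N: int, K: int, n: int, k: int) -> float:
    """Upper-tail hypergeometric probability P(X >= k).

    N: number of background genes
    K: background genes annotated to the pathway
    n: number of DEGs tested
    k: DEGs annotated to the pathway
    """
    upper = min(K, n)
    total = comb(N, n)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / total


# Hypothetical counts: 1000 background genes, 20 in the pathway,
# 50 DEGs of which 6 fall in the pathway (about 1 expected by chance).
p = enrichment_p_value(N=1000, K=20, n=50, k=6)
is_enriched = p < 0.05  # threshold stated in the text
```

In practice the p-values of all tested pathways would also be adjusted for multiple comparisons before applying the threshold.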

Protein-Protein Interaction (PPI)

A PPI network is built to better understand the functional connections among the DEGs. The detected DEGs are submitted to the STRING database, which combines known and predicted protein-protein interactions from diverse sources. The interaction confidence score is set to medium or high to ensure that biologically significant connections are covered. The PPI network is analyzed to identify hub genes, which have the highest levels of interaction within the network. Centrality metrics such as degree, betweenness, and closeness are used to identify essential proteins that perform significant functions in the molecular pathways of CC. These hub genes are then chosen for additional functional testing and biomarker confirmation. The discovered hub genes are Tumor Protein p53 (TP53), Kirsten Rat Sarcoma Viral Oncogene Homolog (KRAS), and Cyclin A2 (CCNA2), which are important targets in CC pathology.
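To illustrate how hub genes can be ranked, the following sketch computes normalized degree centrality on an undirected interaction list and returns the most connected nodes. The node names, the edge list, and the helper names are hypothetical placeholders, not STRING output; betweenness and closeness are omitted for brevity.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def degree_centrality(edges: Iterable[Tuple[str, str]]) -> Dict[str, float]:
    """Normalized degree: number of neighbors divided by (nodes - 1)."""
    neighbors: Dict[str, Set[str]] = defaultdict(set)
    for a, b in edges:
        if a != b:  # ignore self-interactions
            neighbors[a].add(b)
            neighbors[b].add(a)
    n = len(neighbors)
    if n < 2:
        return {node: 0.0 for node in neighbors}
    return {node: len(nbrs) / (n - 1) for node, nbrs in neighbors.items()}


def top_hubs(edges: Iterable[Tuple[str, str]], top_k: int = 3) -> List[str]:
    """Return the top_k nodes by degree centrality (ties broken by name)."""
    scores = degree_centrality(list(edges))
    return sorted(scores, key=lambda g: (-scores[g], g))[:top_k]


# Hypothetical toy network for illustration only.
toy_edges = [("G1", "G2"), ("G1", "G3"), ("G1", "G4"), ("G2", "G3"), ("G4", "G5")]
scores = degree_centrality(toy_edges)
hubs = top_hubs(toy_edges, top_k=1)
```

Here `G1` interacts with three of the four other nodes, so it is ranked first.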

Hub gene validation and survival analysis

The discovered hub genes are validated using Gephi clustering, followed by additional enrichment analysis with Enrichr. The prognostic importance of the hub genes is assessed with the Gene Expression Profiling Interactive Analysis (GEPIA) tool, which compares expression levels in tumor and normal tissues. Furthermore, Overall Survival (OS) and Disease-Free Survival (DFS) analyses are performed to measure the prognostic significance of the discovered biomarkers in patients with CC.

Classification using Adaptive Bacterial Foraging optimization–CatBoost (ABF-CatBoost) method

The research introduces the ABF-CatBoost model to enhance predictive accuracy in biomarker identification and drug response classification. The CatBoost technique, a gradient-boosting decision tree model tailored for categorical data, is used to classify patients according to their molecular profiles. The ABF technique is used to optimize important hyper-parameters by simulating adaptive chemotactic behavior, which narrows the search space and enhances model performance. CatBoost handles categorical data effectively, limits overfitting, and produces strong predictions on intricate, high-dimensional biomedical datasets. ABF adapts flexibly to intricate search spaces, accelerating convergence and helping the optimization avoid local minima.

CatBoost

CatBoost was selected for its capacity to handle high-dimensional and categorical information without requiring extensive preprocessing. Its use of ordered boosting reduces overfitting and enhances generalization, two crucial aspects for complicated biomedical datasets. Using the ABF-CatBoost framework, patients were categorized according to molecular profiles, key biomarkers were identified, and drug efficacy, toxicity, and metabolic pathways were predicted. Figure 14 represents the architecture of CatBoost.

The validated data are classified using CatBoost. The CatBoost algorithm improves categorical feature handling and incorporates advanced regularization methods. It offers better hyper-parameter tuning, increased computational efficiency, and improved handling of imbalanced data. These enhancements lead to more accurate, stable, and efficient models, making CatBoost a robust choice for complex datasets. Domain-specific features $x$ related to molecular properties are incorporated. For example, when predicting drug response, features such as gene expression, mutation profile, and protein interaction score are included, as given in Eq. (2).

$${x}_{new}=f\left({x}_{i},\,\text{gene expression},\,\text{mutation profile},\,\text{protein interaction score}\right)$$

(2)

Here, $f$ represents a function that combines these features to capture their interactions. In this context, the loss function is the Mean Squared Error (MSE), calculated using Eq. (3).

$${MSE}=\frac{1}{n}\mathop{\sum }\limits_{i=1}^{n}{({y}_{i}-{\hat{y}}_{i})}^{2}$$

(3)

Here, ${y}_{i}$ is the true value, ${\hat{y}}_{i}$ is the predicted value, and $n$ is the number of samples. Regularization is applied to avoid overfitting and enhance model generalization. The L2 regularization term is incorporated into the loss function as given in Eq. (4).

$${Loss}_{regularized}={Loss}\left(y,\hat{y}\right)+\lambda \mathop{\sum }\limits_{j=1}^{p}{w}_{j}^{2}$$

(4)

Here, $\lambda$ is the regularization parameter, $p$ is the number of features, and ${w}_{j}$ are the model weights. This term penalizes large weights and helps to prevent overfitting. An adaptive learning rate is used to adjust the learning rate $\eta$ across iterations, as given in Eq. (5).

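$${\eta }_{t+1}={\eta }_{t}\times {schedule}$$

(5)

Here, ${\eta }_{t}$ is the learning rate at iteration $t$ and ${schedule}$ is the factor that adjusts it from one iteration to the next. Feature importance is computed from the sensitivity of the loss to each feature, as given in Eq. (6).

$${Importance}\left({x}_{j}\right)=\frac{\partial {Loss}}{\partial {x}_{j}}$$

(6)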

Here, $\partial$ denotes the partial derivative and ${x}_{j}$ denotes the ${j}^{th}$ feature. Feature importance helps in reducing dimensionality and focusing on the most impactful features for drug response prediction. This helps to fine-tune the training procedure. Furthermore, the method incorporates optimization techniques to improve its ability to identify the most valid biomarkers. Together, these improvements to the CatBoost algorithm provide a robust and accurate method for predicting drug responses in complex biomedical applications.
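The quantities above can be sketched in a few lines of Python: the MSE of Eq. (3), the penalized loss (data loss plus $\lambda$ times the sum of squared weights, as also written in Algorithm 1), and a multiplicative learning-rate update in the spirit of Eq. (5). The constant decay factor, the function names, and the toy numbers are assumptions made for illustration; this is not CatBoost's internal implementation.

```python
from typing import List, Sequence


def mse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean squared error, Eq. (3)."""
    n = len(y_true)
    return sum((y - y_hat) ** 2 for y, y_hat in zip(y_true, y_pred)) / n


def regularized_loss(
    y_true: Sequence[float],
    y_pred: Sequence[float],
    weights: Sequence[float],
    lam: float,
) -> float:
    """MSE plus an L2 penalty on the weights, Eq. (4)."""
    return mse(y_true, y_pred) + lam * sum(w * w for w in weights)


def decayed_learning_rates(eta0: float, decay: float, steps: int) -> List[float]:
    """Apply eta_{t+1} = eta_t * schedule with a constant factor (assumed)."""
    rates = [eta0]
    for _ in range(steps):
        rates.append(rates[-1] * decay)
    return rates


# Toy numbers for illustration only.
y_true = [1.0, 0.0, 1.0, 0.0]
y_pred = [0.5, 0.5, 1.0, 0.0]
base = mse(y_true, y_pred)                                    # 0.125
penalized = regularized_loss(y_true, y_pred, [1.0, 2.0], 0.1)  # 0.125 + 0.1 * 5
rates = decayed_learning_rates(eta0=0.1, decay=0.5, steps=2)
```

The penalty grows with the squared magnitude of the weights, so two models with the same fit are ranked by how small their weights are.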

CatBoost is a high-performance ML algorithm based on gradient-boosting decision trees, specifically optimized for handling categorical features without extensive preprocessing. It incorporates ordered boosting and advanced regularization techniques to reduce overfitting and improve generalization. In this research, CatBoost is used to classify CC patients by analyzing complex molecular profiles, enhancing biomarker identification, and predicting drug response. It manages high-dimensional, imbalanced datasets well and yields robust, accurate predictions. Its integration with Adaptive Bacterial Foraging optimization further refines its hyperparameters, improving model precision and reliability for multi-targeted therapy applications in CC treatment.

Adaptive Bacterial Foraging (ABF) optimization

ABF explores complicated biomarker and drug-response search spaces, optimizing feature selection and model parameters to increase prediction accuracy. It is an optimization technique, inspired by the foraging behavior of bacteria, that uses adaptive strategies to modify its parameters dynamically, enhancing convergence speed and solution accuracy in complicated, multidimensional search spaces. Here, the classification is optimized using the ABF algorithm. ABF builds on the Bacterial Foraging Optimization (BFO) algorithm, a bio-inspired optimization algorithm designed to replicate how Escherichia coli forages for food in the human intestine. The general BFO approach to an optimization problem comprises first generating an initial population of candidate solutions, evaluating the fitness function, and then optimizing through interaction within the population. In the BFO model, the fitness value of the evaluation function at a bacterium's position in the search space represents the quality of that candidate solution. The BFO algorithm consists of three steps: chemotaxis, reproduction, and elimination/dispersal. The method, however, has certain disadvantages: (1) the BFO algorithm has a strong global search capability, but its convergence is slow; (2) the BFO approach has low stability; (3) the BFO reproduction procedure generates an offspring population identical to the parent population. The BFO method is enhanced to overcome these flaws. The ABF algorithm is computed using the following steps:

The chemotaxis step is improved using an adaptive step-size approach. Large step sizes are used early in the algorithm to enable a rapid global search. As the process progresses, the step sizes are reduced to improve the accuracy of the algorithm. Equation (7) defines the adaptive dynamic step size.

$${B}_{c}=\frac{{B}_{\max }}{i\cdot l\cdot k}\cdot E\cdot {rand}$$

(7)

In chemotaxis, ${B}_{c}$ represents the bacterium's step size in the ${c}^{th}$ dimension. The initial step size in that dimension, ${B}_{\max }$, is the difference between the upper and lower limits of the search range. The scaling factor $E$ takes a random value between 0 and 0.5. The indices $i$, $l$, and $k$ denote the current numbers of chemotaxis, reproduction, and elimination-dispersal operations, respectively, and rand generates a random number from the interval [0, 1]. If the bacteria are close to the optimal position early in the procedure, the random factor reduces the step size, preventing the search from overshooting it. The idea of Particle Swarm Optimization (PSO) is borrowed to improve the speed and capability of the ABF search. Equation (8) updates the direction vector $\phi \left(j,i+1\right)$ used in chemotaxis by comparing the current position with the global optimum ${H}_{best}$.

$$\begin{array}{l}\phi \left(j,i+1\right)=z\cdot \phi \left(j\right)+{b}_{1}\cdot rand{\cdot }\left({H}_{best}-{O}_{current}\right)\\ \qquad\qquad\qquad+\,{b}_{2}\cdot rand{\cdot }\left({O}_{best}-{O}_{current}\right)\end{array}\,$$

(8)

Here, $\phi \left(j\right)$ represents the bacterium's current chemotactic direction vector, and $z$ denotes the inertia weight, which controls the influence of the previous direction (commonly set to around 0.9). ${O}_{current}$ is the bacterium's current position, ${H}_{best}$ is the global best position found by the population, and ${O}_{best}$ is the bacterium's own best position. In Eq. (8), the coefficient ${b}_{1}$ scales the attraction toward the global best position ${H}_{best}$ (the social component), while ${b}_{2}$ scales the attraction toward the personal best position ${O}_{best}$ (the cognitive component). The weight $z$ is dynamically adjusted to balance the inertia of the previous direction against the influence of the current best solutions.
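The two chemotaxis formulas can be sketched as follows. The snippet evaluates the adaptive step size of Eq. (7) and the PSO-style direction update of Eq. (8) for one bacterium. Starting the counters `i`, `l`, `k` at 1, drawing a fresh random number per dimension, and all function and argument names are assumptions made for this illustration.

```python
import random
from typing import List, Sequence


def adaptive_step(b_max: float, i: int, l: int, k: int, rng: random.Random) -> float:
    """Eq. (7): B_c = B_max / (i * l * k) * E * rand, with E drawn from [0, 0.5].

    i, l, k are the current chemotaxis, reproduction, and elimination-dispersal
    counters, assumed to start at 1 so the denominator is never zero.
    """
    e = rng.uniform(0.0, 0.5)
    return b_max / (i * l * k) * e * rng.random()


def update_direction(
    phi: Sequence[float],
    current: Sequence[float],
    personal_best: Sequence[float],
    global_best: Sequence[float],
    z: float,
    b1: float,
    b2: float,
    rng: random.Random,
) -> List[float]:
    """Eq. (8): inertia term plus pulls toward H_best (b1) and O_best (b2)."""
    return [
        z * p + b1 * rng.random() * (h - o) + b2 * rng.random() * (ob - o)
        for p, o, ob, h in zip(phi, current, personal_best, global_best)
    ]


rng = random.Random(0)
early_step = adaptive_step(b_max=10.0, i=1, l=1, k=1, rng=rng)  # at most 5.0
late_step = adaptive_step(b_max=10.0, i=2, l=2, k=2, rng=rng)   # at most 0.625
# With the bacterium already at both best positions, only inertia remains.
direction = update_direction([1.0, -2.0], [0.0, 0.0], [0.0, 0.0], [0.0, 0.0],
                             z=0.9, b1=2.0, b2=2.0, rng=rng)
```

Because the counters appear in the denominator, the upper bound on the step shrinks as the search progresses, which matches the coarse-to-fine behavior described in the text.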

Roulette wheel selection is employed to pick parent individuals and so increase population diversity. The selection probability of each individual is calculated as ${\rho }_{y}\left(j\right)={fit}({y}_{j})/{\sum }_{i=1}^{t}{fit}({y}_{i})$. The cumulative probability of each individual is then ${\rho }_{by}\left(j\right)={\sum }_{i=1}^{j}{\rho }_{y}\left(i\right)$ $(j=1,2,\ldots ,t)$. Finally, a random number rand with a uniform distribution in the range [0,1] is generated. If rand $\le {\rho }_{by}\left(1\right)$, ${y}_{1}$ is chosen. If ${\rho }_{by}\left(j-1\right)<$ rand $\le {\rho }_{by}\left(j\right)$, ${y}_{j}$ is chosen. These steps are repeated $t/2$ times to obtain the parent individuals for the crossover procedure. The crossover equation is given in Eq. (9).

$$y\left(j\right)=\rho {y}_{{best}}+\left(1-\rho \right)y$$

(9)

Here, $y\left(j\right)$ is the new position of the bacterium after crossover, $\rho$ is a random number in the interval [0,1], ${y}_{best}$ is the position of the parent, and $y$ is the original position of bacterium $j$. The dynamic elimination-dispersal probability ${\rho }_{{dc}}^{* }$ is then used to guide the elimination and dispersal of bacteria. This boosts the bacteria's ability to take part in a global search for the best solutions, reduces the chance of becoming trapped in local optima, and helps the algorithm converge quickly while increasing population diversity. Equation (10) gives the improved elimination and dispersal probability.

$${\rho }_{{dc}}^{* }=\frac{{I}_{{health}}^{j}-{I}_{{health}}^{{first}}}{{I}_{{health}}^{{last}}-{I}_{{health}}^{{first}}}\cdot {\rho }_{{dc}}$$

(10)

${I}_{health}^{j}={\sum }_{i=1}^{Mb}I(j,i,l,k)$ is a health (fitness) function that measures the foraging capacity of the ${j}^{th}$ bacterium. It is expressed as the sum of the fitness values over all positions of bacterium $j$ after ${Mb}$ chemotaxis steps. ${I}_{{health}}^{{first}}$ and ${I}_{{health}}^{{last}}$ denote the health of the individuals with the largest and smallest values in the population, respectively. ${\rho }_{{dc}}$ represents the initial elimination and dispersal probability. By imitating adaptive search behavior, ABF improves parameter tuning efficiency, predictive accuracy, and biomarker identification, making it well suited to optimizing difficult classification tasks in CC analysis. By optimizing biomarker selection and parameter tuning, ABF increases the accuracy, resilience, and flexibility of methods used in complicated biological data processing.
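The reproduction and dispersal formulas can be illustrated with a short Python sketch covering roulette wheel selection, the crossover of Eq. (9), and the scaled dispersal probability of Eq. (10). It assumes non-negative fitness values with a positive sum and falls back to the base probability when all health values are equal; these choices and all names are illustrative assumptions.

```python
import random
from typing import List, Sequence


def roulette_select(fitness: Sequence[float], rng: random.Random) -> int:
    """Pick an index with probability proportional to its fitness."""
    total = sum(fitness)
    r = rng.random()
    cumulative = 0.0
    for j, f in enumerate(fitness):
        cumulative += f / total  # cumulative selection probability
        if r <= cumulative:
            return j
    return len(fitness) - 1  # guard against floating-point round-off


def crossover(y: Sequence[float], y_best: Sequence[float], rho: float) -> List[float]:
    """Eq. (9): y(j) = rho * y_best + (1 - rho) * y, per dimension."""
    return [rho * b + (1.0 - rho) * v for v, b in zip(y, y_best)]


def dispersal_probability(
    health_j: float, health_first: float, health_last: float, rho_dc: float
) -> float:
    """Eq. (10): scale rho_dc by the bacterium's normalized health."""
    span = health_last - health_first
    if span == 0.0:
        return rho_dc  # assumed fallback when all health values are equal
    return (health_j - health_first) / span * rho_dc


rng = random.Random(1)
# Only index 1 has non-zero fitness, so it is always selected.
picks = [roulette_select([0.0, 1.0, 0.0], rng) for _ in range(20)]
child = crossover(y=[0.0, 2.0], y_best=[4.0, 6.0], rho=0.25)
p_disp = dispersal_probability(health_j=5.0, health_first=10.0,
                               health_last=2.0, rho_dc=0.25)
```

With Eq. (10), the individual labeled "first" receives a dispersal probability of zero and the one labeled "last" receives the full base probability, so dispersal concentrates on one end of the health ranking.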

The ABF-CatBoost approach has major advantages because it combines efficient hyper-parameter tuning with strong classification capabilities. ABF improves parameter optimization, which increases model precision and resilience. CatBoost handles categorical and multidimensional data while limiting overfitting via ordered boosting. This integration allows for precise multi-target prediction and enhanced therapy outcome classification in CC. The model is computationally efficient, scalable, and adaptable to a variety of datasets, making it well suited to precision medicine. It also addresses class imbalance, manages noisy features, and facilitates individualized treatment plans by using patient-specific molecular profiles. Algorithm 1 shows the ABF-CatBoost method.

Algorithm 1

Adaptive Bacterial Foraging optimization–CatBoost (ABF-CatBoost) method

Start

Step 1: Initialize CatBoost Model:

– Set initial parameters: learning rate η₀, depth d, L2 regularization λ, number of estimators n

– Define loss function: Loss=MSE (y,ŷ)+λ×∑(w²)

Step 2: Initialize ABFO Parameters:

– Set population size $N$, max chemotaxis steps $C$, reproduction steps $R$, elimination/dispersal steps $E$

– Set adaptive step size ${B}_{\max }$, learning rate η₀, and PSO coefficients z, b₁, b₂

Step 3: ABFO Optimization Loop:

For each generation:

For each bacterium:

– Chemotaxis Phase:

• Compute adaptive step: ${B}_{c}=\frac{{B}_{\max }}{i\cdot l\cdot k}\cdot E\cdot {rand}$

• Update direction: $\phi \left(j,i+1\right)=z\cdot \phi \left(j\right)+{b}_{1}\cdot {rand}\cdot \left({H}_{{best}}-{O}_{{current}}\right)+{b}_{2}\cdot {rand}\cdot \left({O}_{{best}}-{O}_{{current}}\right)$

• Evaluate fitness using CatBoost loss with current parameters

– Reproduction Phase:

• Select the top 50% of bacteria and clone them

– Elimination–Dispersal Phase:

• Calculate dispersal probability ${\rho }_{{dc}}^{* }$ using normalized fitness

• Replace low-fitness bacteria randomly

Step 4: Final Model Training:

– Select the best hyper-parameter set from ABFO

– Train the final CatBoost model on training data using optimal parameters

– Apply adaptive learning rate ${\eta }_{t+1}={\eta }_{t}\times {schedule}$

Step 5: Method assessment:

– Predict on the test set using the final CatBoost model

– Compute performance metrics

– Compute feature importance: $\partial {Loss}/\partial {x}_{j}$

Step 6: Output: Final predicted outcome and feature importance
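The loop structure of Algorithm 1 can be sketched in plain Python under several simplifying assumptions: a toy quadratic function stands in for the CatBoost validation loss, the move direction is a random sign per dimension instead of Eq. (8), a move is kept only if it lowers the loss, and a fixed dispersal probability replaces Eq. (10). The function `abf_minimize`, the toy objective, and the parameter ranges are hypothetical and serve only to show the nesting of the three phases.

```python
import random
from typing import Callable, List, Sequence, Tuple

Bounds = Sequence[Tuple[float, float]]


def abf_minimize(
    objective: Callable[[Sequence[float]], float],
    bounds: Bounds,
    n_bacteria: int = 10,   # assumed even, so cloning keeps the population size
    n_chem: int = 8,
    n_repro: int = 3,
    n_disp: int = 2,
    rho_dc: float = 0.25,
    seed: int = 0,
) -> Tuple[List[float], float]:
    rng = random.Random(seed)

    def sample() -> List[float]:
        return [rng.uniform(lo, hi) for lo, hi in bounds]

    def clip(x: Sequence[float]) -> List[float]:
        return [min(max(v, lo), hi) for v, (lo, hi) in zip(x, bounds)]

    pop = [sample() for _ in range(n_bacteria)]
    best = list(min(pop, key=objective))
    best_cost = objective(best)

    for k in range(1, n_disp + 1):              # elimination-dispersal loop
        for l in range(1, n_repro + 1):         # reproduction loop
            health = [0.0] * n_bacteria
            for i in range(1, n_chem + 1):      # chemotaxis loop
                for j in range(n_bacteria):
                    # Adaptive step per dimension, following Eq. (7).
                    trial = clip([
                        v + rng.choice((-1.0, 1.0)) * (hi - lo) / (i * l * k)
                        * rng.uniform(0.0, 0.5) * rng.random()
                        for v, (lo, hi) in zip(pop[j], bounds)
                    ])
                    if objective(trial) < objective(pop[j]):
                        pop[j] = trial          # keep only improving moves
                    cost = objective(pop[j])
                    health[j] += cost           # accumulated loss as health
                    if cost < best_cost:
                        best, best_cost = list(pop[j]), cost
            # Reproduction: clone the half with the lowest accumulated loss.
            order = sorted(range(n_bacteria), key=lambda j: health[j])
            keep = [pop[j] for j in order[: n_bacteria // 2]]
            pop = [list(x) for x in keep] + [list(x) for x in keep]
        # Elimination-dispersal with a fixed probability (simplification).
        pop = [sample() if rng.random() < rho_dc else x for x in pop]
    return best, best_cost


def toy_loss(params: Sequence[float]) -> float:
    """Stand-in for a validation loss over (learning rate, depth)."""
    learning_rate, depth = params
    return (learning_rate - 0.1) ** 2 + ((depth - 6.0) ** 2) / 100.0


search_bounds = [(0.01, 0.3), (2.0, 10.0)]
best_params, best_loss = abf_minimize(toy_loss, search_bounds)
```

In a real run, each call to the objective would train and validate a CatBoost model with the candidate hyper-parameters, so evaluations would be far more expensive than here, and integer-valued parameters such as tree depth would need rounding.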

The ABF-CatBoost approach combines feature optimization with classification: ABF finds the best parameters and the most pertinent biomarkers, while CatBoost uses molecular data to classify patient subgroups and forecast treatment response. Combined, they allow for multi-targeted, accurate, and personalized drug design in CC. By choosing pertinent biomarkers and tuning model parameters effectively, the ABF-CatBoost approach improves predictive performance, which results in better accuracy, sensitivity, and specificity in drug response prediction. In addition to supporting individualized, multi-targeted cancer treatments with strong generalizability across datasets, it addresses drug resistance by modeling adaptive processes. ABF-CatBoost thus combines powerful classification with adaptive optimization to enhance biomarker selection, patient stratification, and drug response prediction in CC, overcoming shortcomings of traditional drug discovery methods and handling complex data effectively.