This section outlines the comprehensive approach employed to model and predict the release behavior of the target variable using a dataset comprising 155 data points, as reported previously7,12. The dataset is available on GitHub (https://github.com/y-babdalla/coating_release)7.
These data have also been used by AlOmari and Almansour12 and by Yadav et al.13 to build machine-learning regression models. The methodology, visualized in Fig. 1, integrates data preprocessing, feature reduction, and the application of advanced ML models7,12,13. The data come from a targeted colonic delivery formulation of 5-aminosalicylic acid (5-ASA) coated with polysaccharide-based materials. The data were collected from previous publications, and further details regarding the measurements can be found elsewhere7,14. Raman spectroscopy was used to detect and quantify the API concentration, using a Renishaw InVia Raman spectrophotometer at 785 nm with an acquisition time of 120 s7. In this work, machine learning models, together with preprocessing, were applied to the drug release data to predict the release of 5-ASA at three time points: 2, 8, and 24 h. Drug release is therefore the sole output predicted by the ML models.
Initially, the dataset undergoes preprocessing steps, including dimensionality reduction through Principal Component Analysis (PCA) to address the high dimensionality of the Raman spectra collected on the samples. Categorical input variables are encoded using Leave-One-Out (LOO) encoding, followed by outlier removal using the Isolation Forest technique. Normalization is applied via Min-Max scaling to ensure uniformity in feature scales. The processed dataset is subsequently used to train the machine learning models, namely KRR, K-ELM, and QR. To enhance predictive accuracy, hyperparameter optimization is carried out using the Sailfish Optimizer (SFO), which systematically refines model configurations for improved performance. The models are evaluated using metrics such as R², MSE, and MAE, ensuring a robust comparison across training and test sets.

Overall methodology utilized for the prediction of drug release.
Data description
This study utilizes a dataset comprising 155 data points. The target variable, denoted “release”, represents the amount of drug released at different times and is to be forecast from a combination of diverse inputs. The dataset contains more than 1,500 Raman spectral features, which provide detailed information about the vibrational characteristics of the samples. It also includes a numerical variable, “time”, indicating the duration of the experiment. Furthermore, two categorical variables, “polysaccharide name” and “medium”, describe the environmental context and compositional makeup that may influence the output variable. Four different media were considered in the modeling: Control, Patient, Rat, and Dog7. This diverse set of inputs provides a comprehensive basis for modeling and analyzing the release behavior.
Pre-processing
Pre-processing is an essential phase in preparing raw datasets for machine learning applications. It ensures the data are well structured for analysis, significantly enhancing the model’s accuracy and effectiveness. During this stage, various techniques, including dimensionality reduction, outlier identification, and normalization, are applied to improve data integrity. The following sections detail the specific methods used in this process.
PCA for dimensionality reduction
PCA is frequently applied to reduce the data dimensionality while retaining a significant amount of its variance. High-dimensional data can lead to computational inefficiencies and increase the risk of overfitting. PCA addresses this issue by converting the data into a series of orthogonal components known as principal components, capturing the highest variance present in the data15. By choosing the foremost principal components, we can diminish the number of features while maintaining the fundamental patterns and trends in the dataset. This method is particularly useful in applications where datasets have many correlated variables, as PCA removes redundancy and ensures efficient representation.
The PCA process begins by standardizing the dataset, followed by the calculation of the covariance matrix of the features. Eigenvalues and eigenvectors are then calculated, with the eigenvectors indicating the directions of maximum variance. The top-k eigenvectors associated with the largest eigenvalues are chosen to create the reduced feature set. This technique not only simplifies the dataset but also reduces the computational cost of training models.
To determine the number of components in PCA, we tested different values and, based on the cumulative explained variance in each case, selected 9 components for the PCA process. Figure 2 illustrates the scatter plot of the first two principal components (PCA1 and PCA2) derived from the Principal Component Analysis (PCA) conducted on the dataset. These components capture the maximum variance in the data, with PCA1 capturing the most variance and PCA2 the second most. Each point corresponds to a data point in the transformed, lower-dimensional space, providing a visual representation of how the data are distributed. This plot helps identify potential clusters or patterns in the data while retaining as much information as possible in a reduced form.

First two principal components in the PCA process.
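The standardize-then-project pipeline described above can be sketched with scikit-learn. The spectra below are random stand-ins for the actual Raman features (the real data are available from the repository cited above); only the shapes (155 samples, ~1,500 features, 9 retained components) follow the study.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Synthetic stand-in for the Raman spectra: 155 samples x 1500 features
rng = np.random.default_rng(0)
spectra = rng.normal(size=(155, 1500))

# Standardize each feature, then project onto 9 principal components
scaled = StandardScaler().fit_transform(spectra)
pca = PCA(n_components=9)
components = pca.fit_transform(scaled)

print(components.shape)                     # (155, 9)
print(pca.explained_variance_ratio_.sum())  # cumulative explained variance of the 9 components
```

In practice the cumulative explained variance of the retained components would be inspected, as described above, to confirm that 9 components preserve enough of the spectral variance.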
Leave-one-out (LOO) encoding for categorical inputs
LOO is used for converting categorical variables into numerical values, particularly useful in models sensitive to feature scales, like regression models. This technique addresses the limitations of traditional encoding methods, such as one-hot encoding, by directly capturing the relationship between the categorical inputs and the output without significantly increasing the dimensionality16. LOO encoding calculates the mean target value for each category, excluding the current observation to prevent data leakage. This means that for each category in the dataset, the encoded value is the average of the target variable’s values associated with that category, excluding the specific row being processed. This ensures that the model is not influenced by the target value it is trying to predict during training17. Formally, for a categorical variable Ci in a dataset, the LOO encoded value for the j-th observation is calculated as:
$${\text{LOO}}_{j} = \frac{{\mathop \sum \nolimits_{{k \ne j,\,c_{k} = c_{j} }} y_{k} }}{{\left| {\left\{ {k:c_{k} = c_{j} } \right\}} \right| - 1}}$$
Where yk represents the target value of the k-th observation, cj is the category value for the j-th observation, and the sum runs over all other observations sharing the same category; the denominator is the number of observations in that category minus one. This ensures that each observation’s encoding is independent of its own target value.
LOO encoding is particularly useful in scenarios where categorical variables have a high cardinality, as it reduces the risk of overfitting while efficiently capturing the influence of each category on the output. In the context of this study, it was applied to the categorical inputs, “medium” and “polysaccharide name,” to enhance the model’s predictive power by encoding these variables in a way that reflects their impact on the “release” output.
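The LOO formula above can be computed vectorially with pandas: subtract each row's own target from its category's sum and divide by the category count minus one. The tiny "medium"/"release" table below is illustrative, not taken from the actual dataset.

```python
import pandas as pd

def loo_encode(df, cat_col, target_col):
    """Leave-one-out encode: for each row, the mean of the target over
    all OTHER rows sharing the same category value."""
    grp = df.groupby(cat_col)[target_col]
    total = grp.transform("sum")    # per-row category sum
    count = grp.transform("count")  # per-row category size
    return (total - df[target_col]) / (count - 1)

df = pd.DataFrame({
    "medium": ["Rat", "Rat", "Dog", "Dog", "Dog"],
    "release": [10.0, 20.0, 30.0, 40.0, 50.0],
})
df["medium_loo"] = loo_encode(df, "medium", "release")
print(df["medium_loo"].tolist())  # [20.0, 10.0, 45.0, 40.0, 35.0]
```

For example, the first "Dog" row is encoded as (40 + 50) / 2 = 45, the mean of the other two "Dog" targets, so its own release value never leaks into its encoding.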
Figure 3 presents a stacked bar chart illustrating the relationship between two categorical variables: “Polysaccharide Name” and “Medium”. The chart breaks down each medium category into the respective polysaccharide names, visually representing the distribution and frequency of each polysaccharide within different media. The stacked nature of the bars provides a clear comparison, showing how polysaccharide compositions vary across different environments, offering insights into the experimental conditions and their impact on the release behavior.

Stacked bar chart showing the distribution of polysaccharide names across different media categories.
Isolation forest for outlier removal
Isolation Forest (IF) is designed to isolate anomalies rather than to model the normal data distribution. It works by randomly partitioning the data using decision trees, where outliers are isolated more quickly than normal points. The logic behind Isolation Forest is that outliers are few and far from the majority of the data, thus requiring fewer splits to be isolated18,19.
In pre-processing, this method is applied to detect and remove outliers that could negatively impact model performance. Isolation Forest is particularly effective for high-dimensional datasets and provides a fast, scalable solution for identifying anomalies. By eliminating outliers, the dataset becomes cleaner, improving the generalization ability of the model13.
Outlier removal was performed using the IF algorithm, with the contamination rate set to 0.058. This rate, which represents the proportion of outliers in the dataset, led to the removal of 9 datapoints from the original 155, since a contamination rate of 0.058 flags approximately 5.8% of the data as outliers. This threshold was selected to balance sensitivity and dataset integrity; for comparison, a contamination rate of 0.05 would remove approximately 8 datapoints, while a rate of 0.1 would remove about 16. To ensure that this process did not introduce bias, we examined the distribution of the removed datapoints and found them to be evenly distributed across the dataset. This balanced distribution confirms that the dataset remains representative after cleaning, preserving its integrity for subsequent analysis.
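A minimal sketch of this cleaning step, using scikit-learn's `IsolationForest` with the contamination rate from the study (0.058). The feature matrix is synthetic, with a few points deliberately shifted to act as obvious outliers; on the real data the input would be the preprocessed feature set.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic stand-in feature matrix with a few injected outliers
rng = np.random.default_rng(42)
X = rng.normal(size=(155, 12))
X[:4] += 8.0  # shift 4 rows far from the bulk of the data

# contamination=0.058 flags roughly 5.8% of rows as outliers
iso = IsolationForest(contamination=0.058, random_state=42)
labels = iso.fit_predict(X)  # -1 = outlier, 1 = inlier

X_clean = X[labels == 1]
print(X.shape[0] - X_clean.shape[0])  # number of removed datapoints
```

On 155 points this removes about 9 rows, matching the count reported above; the exact number depends on the score threshold the estimator derives from the contamination parameter.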
Min-max scaler for normalization
Normalization is crucial to prevent features with varying scales from unduly affecting the model. The Min-Max scaler is a simple and popular method for normalization, which transforms the data to fit within a fixed range, commonly between 0 and 1. This process rescales each feature independently based on its minimum and maximum values in the dataset20.
The equation for Min-Max scaling is:
$$X^{\prime} = \frac{{X - X_{{min}} }}{{X_{{max}} - X_{{min}} }}$$
In this equation, X stands for the original value, Xmin and Xmax are the minimum and maximum values of the feature, and X’ stands for the normalized value. Min-Max normalization standardizes feature values to a common scale, ensuring that all variables have equal influence during model training.
Through this scaling, models are less sensitive to feature magnitude differences, leading to improved performance and faster training times. By applying these pre-processing techniques, the dataset becomes more manageable, and the learning algorithm can focus on capturing meaningful patterns rather than being distracted by irrelevant variance, outliers, or unscaled features13.
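The equation above maps each column independently onto [0, 1]; a short illustration with scikit-learn's `MinMaxScaler` on a toy two-feature matrix:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Two features on very different scales
X = np.array([[2.0, 100.0],
              [4.0, 300.0],
              [6.0, 500.0]])

scaler = MinMaxScaler()             # default feature_range=(0, 1)
X_scaled = scaler.fit_transform(X)  # (X - X_min) / (X_max - X_min), per column
print(X_scaled)
# [[0.  0. ]
#  [0.5 0.5]
#  [1.  1. ]]
```

Note that the scaler's minima and maxima are learned on the training set and then reused (via `transform`) on the test set, so test values can fall slightly outside [0, 1].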
Sailfish optimizer (SFO)
The performance of ML algorithms is heavily influenced by hyper-parameter tuning. The selection of these parameters plays an important role in both the accuracy and generalization of such algorithms. Lately, optimization techniques inspired by nature have become increasingly popular for this purpose. One such method, the SFO, is a metaheuristic algorithm modeled after the hunting strategies and teamwork displayed by real sailfish. These remarkable creatures showcase collective intelligence by coordinating and herding schools of prey. SFO leverages this behavior to efficiently navigate the hyperparameter space in machine learning models21.
The core tenets of the SFO algorithm draw inspiration from the social behaviors observed in sailfish. Within a school of sailfish, individuals alternate between exploration and exploitation tactics, facilitating a harmonious equilibrium between exploring novel hunting areas and exploiting familiar resources. Leveraging this natural balance, the SFO algorithm tackles the intricate challenge of hyper-parameter optimization. By employing mathematical formulations, the Sailfish Optimizer dynamically adjusts the hyper-parameters in each iteration, mimicking the agile movements of sailfish. The equation that defines the position update for the i-th sailfish in D-dimensional space is as follows22:
$$X_{i}^{{t + 1}} = X_{i}^{t} + V_{i}^{{t + 1}}$$
Here, \(X_{i}^{t}\) denotes the position of the i-th sailfish at iteration t, while \(V_{i}^{{t + 1}}\) signifies the velocity vector of the i-th sailfish in the subsequent iteration. The velocity of each sailfish is influenced by various components, including its personal optimal position, the collective optimal position within the group, and the social interactions among the sailfish23. The equation for updating velocity can be stated as22:
$$V_{i}^{{t + 1}} = V_{i}^{t} + A_{i}^{{t + 1}} + C_{i}^{{t + 1}} + S_{i}^{{t + 1}}$$
In this context, \(A_{i}^{{t + 1}}\), \(C_{i}^{{t + 1}}\), and \(S_{i}^{{t + 1}}\) represent the attraction, cohesion, and separation elements that emulate the social interactions observed in sailfish.
SFO excels in hyper-parameter optimization for various machine learning algorithms. Its ability to balance exploration and exploitation while evading local optima makes it a strong candidate for this task.
SFO dynamically alternates between exploration (searching for new hyper-parameter values) and exploitation (refining promising solutions), which lets it navigate the hyper-parameter space efficiently. Unlike grid search, which exhaustively tests all combinations at high computational cost, or random search, which lacks adaptive focus, SFO adjusts its search strategy based on prior evaluations. By employing K-fold cross-validation as its fitness function, SFO uses a robust performance metric to guide the search, selecting hyper-parameters that maximize generalizability and minimize overfitting. This makes it particularly effective in complex, high-dimensional hyper-parameter landscapes, where conventional approaches become inefficient, and it typically reaches good configurations with fewer evaluations than commonly used optimization techniques.
The hyperparameter tuning process entailed a K-fold cross-validation technique, with K configured to 5. In each fold, the performance of every candidate model was evaluated using common metrics including R², MSE, and MAE. The results of these evaluations were used as the fitness function in the SFO algorithm. This ensured that the optimizer was driven by robust metrics that reflected the models’ abilities to generalize well across unseen data.
In our implementation, the SFO was configured with a population size of 30 sailfish and a maximum of 120 iterations. The attack power, which controls the exploration-exploitation trade-off, was set to 0.45, and the elite sailfish ratio, which determines the proportion of top-performing sailfish retained in each iteration, was set to 0.15. These configuration values were selected based on preliminary experiments and can be adjusted depending on the specific requirements of the dataset and models. The hyperparameter search was conducted using 5-fold cross-validation to ensure the robustness and generalizability of the selected parameters.
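The position and velocity updates in the equations above can be sketched as a simple population loop. The code below is a heavily simplified illustration, not the full SFO: the elite-sailfish mechanism, attack-power schedule, and prey dynamics are omitted, the attraction/cohesion terms are reduced to pulls toward personal and group bests with random coefficients, and a toy 2-D sphere function stands in for the cross-validation fitness. Only the population size (30) and iteration count (120) follow the study's configuration.

```python
import numpy as np

def sphere(x):
    # Stand-in fitness; in the study this would be the 5-fold CV score
    return float(np.sum(x ** 2))

rng = np.random.default_rng(0)
pop, iters, dim = 30, 120, 2                 # 30 sailfish, 120 iterations
X = rng.uniform(-5, 5, size=(pop, dim))      # positions X_i^t
V = np.zeros_like(X)                         # velocities V_i^t
pbest = X.copy()                             # personal best positions
pbest_f = np.array([sphere(x) for x in X])
gbest = pbest[pbest_f.argmin()].copy()       # group best position

for _ in range(iters):
    r1, r2 = rng.random((pop, dim)), rng.random((pop, dim))
    # velocity update: inertia + attraction to personal best + cohesion to group best
    V = 0.7 * V + 1.5 * r1 * (pbest - X) + 1.5 * r2 * (gbest - X)
    X = X + V                                # X_i^{t+1} = X_i^t + V_i^{t+1}
    f = np.array([sphere(x) for x in X])
    better = f < pbest_f
    pbest[better], pbest_f[better] = X[better], f[better]
    gbest = pbest[pbest_f.argmin()].copy()

print(sphere(gbest))  # best fitness found (near 0 on this toy problem)
```

Replacing `sphere` with a function that trains a candidate model on each fold and returns the average validation error would turn this loop into the CV-driven hyper-parameter search described above.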
Kernel ridge regression (KRR)
An advanced robust non-linear method called KRR builds on the ideas behind linear ridge regression. Employing the kernel trick, KRR can proficiently represent intricate, non-linear relationships within the data. KRR is a versatile and potent instrument for various applications, as it improves the model’s capacity to identify complex patterns while preserving the regularization advantages of ridge regression24. By controlling model complexity, these techniques effectively mitigate overfitting and contribute to improved generalization and predictive performance25,26.
This approach employs kernel functions to project the input data into a higher-dimensional feature space, enabling the capture of complex, nonlinear relationships. The main objective is to predict a continuous target using a set of explanatory features. Unlike traditional linear regression, KRR allows for modeling complex and non-linear relationships27. The KRR model has a mathematical form as follows28:
$$\:f\left(X\right)={\sum\:}_{i=1}^{N}{{\upalpha\:}}_{i}K\left(X,{X}_{i}\right)$$
In this equation, f(X) represents the predicted output for an input X, N signifies the size of the training dataset, and \(\:{{\upalpha\:}}_{i}\) denotes the regression coefficient associated with the i-th training point. The kernel function \(\:K\left(X,{X}_{i}\right)\) measures the similarity between the input X and the i-th training instance \(\:{X}_{i}\)29.
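A minimal KRR sketch using scikit-learn's `KernelRidge` with an RBF kernel. The data are a synthetic noisy non-linear curve standing in for the release measurements, and the `alpha` and `gamma` values are illustrative placeholders for the hyperparameters that an optimizer such as SFO would tune.

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge

# Synthetic non-linear target with additive noise
rng = np.random.default_rng(1)
X = rng.uniform(0, 24, size=(100, 1))                # e.g. time in hours
y = np.sin(X[:, 0] / 4.0) + 0.1 * rng.normal(size=100)

# RBF-kernel ridge regression: alpha is the ridge penalty,
# gamma the RBF width; both are tunable hyperparameters
model = KernelRidge(kernel="rbf", alpha=0.1, gamma=0.5)
model.fit(X, y)
print(model.score(X, y))  # R^2 on the training data
```

Internally the fitted model realizes exactly the expansion above: a weighted sum of kernel evaluations between the query point and every training point.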
K-ELM regression
K-ELM is a modification of the Extreme Learning Machine (ELM) aimed at enhancing its generalization capabilities through the utilization of kernel methods30. Unlike traditional ELM, which randomly selects the input weights and biases in a single-layer feedforward network, K-ELM applies the “kernel trick,” enabling it to model more complex, nonlinear relationships. This approach enables efficient management of nonlinearity, thereby enhancing the effectiveness of K-ELM for regression tasks31.
In K-ELM regression, a kernel function, including the Radial Basis Function (RBF) or polynomial kernel, is employed to assess the similarity between input samples. This approach improves the ability of ELM to generalize and handle noisy data. The result of K-ELM regression is obtained by resolving a system of linear equations utilizing the kernel matrix, which represents the interrelations among the data points in the transformed feature space. The model’s training process is optimized by the simplicity of ELM’s architecture, while the application of kernel functions improves its efficacy with intricate datasets.
K-ELM has been effectively utilized in diverse domains, such as time series prediction, system identification, and financial forecasting, where nonlinear relationships are common. The model’s efficacy is attributed to its rapid training duration, scalability, and resilient performance, even with constrained data, rendering it a formidable asset in contemporary machine learning applications.
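Since K-ELM reduces to solving one linear system in the kernel matrix, it can be sketched in a few lines. The closed form used below, beta = (I/C + K)^-1 y with predictions k(x, X_train) @ beta, is the standard K-ELM solution; the RBF kernel, the regularization constant C, and the synthetic data are illustrative choices, not the study's settings.

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    # Pairwise squared distances turned into RBF similarities
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * d2)

class KELM:
    """Minimal kernel extreme learning machine for regression."""
    def __init__(self, C=10.0, gamma=1.0):
        self.C, self.gamma = C, gamma
    def fit(self, X, y):
        self.X = X
        K = rbf_kernel(X, X, self.gamma)
        # Solve (I/C + K) beta = y -- the K-ELM output weights
        self.beta = np.linalg.solve(np.eye(len(X)) / self.C + K, y)
        return self
    def predict(self, Xq):
        return rbf_kernel(Xq, self.X, self.gamma) @ self.beta

rng = np.random.default_rng(2)
X = rng.uniform(-3, 3, size=(120, 1))
y = np.tanh(X[:, 0]) + 0.05 * rng.normal(size=120)
model = KELM(C=100.0, gamma=0.5).fit(X, y)
pred = model.predict(X)
print(np.mean((pred - y) ** 2))  # training MSE
```

The single linear solve is what gives K-ELM its fast training relative to iteratively trained networks, while the kernel matrix supplies the non-linearity.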
Quantile regression (QR)
Differing from classic regression techniques that primarily estimate the conditional mean, QR provides a comprehensive insight by elucidating the impact of predictors on different quantiles within the response distribution. This allows for a more nuanced analysis, revealing how predictors influence the variable not just on average, but at various levels of its distribution32.
Consider there is a dataset containing n datapoints, where each datapoint i consists of a set of input variables \(\:{x}_{i}=\left({x}_{i1},{x}_{i2},\dots\:,{x}_{ip}\right)\) and a corresponding target variable \(\:{y}_{i}\). The goal of QR is to estimate the conditional quantile function for a specified quantile \(\:{\uptau\:}\), which is defined as a value ranging from 0 to 1. This function offers a prediction of the \(\:{\uptau\:}\)-th quantile of \(\:{y}_{i}\) conditioned on the input variables \(\:{x}_{i}\), capturing the correlation between the features and the response at various positions of the distribution, rather than solely focusing on the mean. Formally, the quantile regression model is defined as33,34:
$$\:{Q}_{{\uptau\:}}\left(y|x\right)={x}^{T}{{\upbeta\:}}_{{\uptau\:}}$$
Here, \(\:{Q}_{{\uptau\:}}\left(y|x\right)\) denotes the \(\:{\uptau\:}\)-th quantile of the target y given x, x stands for the vector of inputs, and \(\:{\beta\:}_{\tau\:}\) represents the vector of QR coefficients35.
Formulated as a minimization issue, the quantile regression coefficients estimate can be found as36:
$$\:\underset{{{\upbeta\:}}_{{\uptau\:}}}{\text{min}}{\sum\:}_{i=1}^{n}{{\uprho\:}}_{{\uptau\:}}\left({y}_{i}-{x}_{i}^{T}{{\upbeta\:}}_{{\uptau\:}}\right)$$
Here, \(\:{{\uprho\:}}_{{\uptau\:}}\left(u\right)\) denotes the check function which can be expressed as32:
$$\:{{\uprho\:}}_{{\uptau\:}}\left(u\right)=u\left({\uptau\:}-I\left(u<0\right)\right)$$
The indicator function \(\:I\left(\cdot\:\right)\) equals 1 if (u < 0), and 0 otherwise. This plays a key role in the check function, which implements an asymmetric loss function. The observed asymmetry facilitates the quantile regression model in adeptly capturing the diverse relationships between input parameters and the various quantiles of the response target output. By weighting residuals differently above and below the quantile, the model adjusts its focus depending on the chosen quantile (\(\:{\uptau\:}\)), leading to more flexible and precise estimates.
Estimating the coefficients in this model requires solving the optimization problem above. The resulting estimates provide valuable insight into how the input variables affect different quantiles of the response variable, enabling a more comprehensive understanding of their effects across the distribution rather than only at the mean35.
Software and programming libraries
To ensure the reproducibility of our study, we provide details of the software platforms and programming libraries used, along with their respective version numbers. This information is critical for researchers aiming to replicate or extend our work, as it ensures compatibility and consistency across computational experiments.
The following software and libraries were utilized:
These versions were current at the time of the study and were selected to ensure stability and compatibility across the computational experiments conducted.
