Development of several machine learning-based models for determining small molecule drug solubility in binary solvents at different temperatures.

A structured, multi-step approach was used to investigate how solvent composition, temperature, and mass fraction affect the solubility of rivaloxaban. This systematic method allows the effect of each factor on solubility to be analyzed in detail. The dataset used in this study is similar to that used in other works on the correlation of rivaloxaban solubility in mixed solvents, and the same procedure was followed⁴,⁶,⁷. Source data and descriptions of the features have been reported elsewhere⁵.

This study used a multi-solvent dataset whose input variables are the mass fraction of dichloromethane (w), the temperature (T), and the solvent type. The target variable x is the drug solubility in binary mixtures of dichloromethane with one of several alcohols (ethanol, methanol, propanol, and butanol)⁵. The solubility of the API is expressed as a mole fraction (mol/mol) throughout the modeling and is therefore reported as a dimensionless quantity in the results. To represent the solvent categories without implying an order, one-hot encoding was applied, and each solvent was treated as an independent binary feature. This encoding strategy supports multivariate modeling by allowing the interactions between temperature, solvent identity, and mass fraction (w) to be captured in terms of solubility⁴.

The dataset for this study consists of 220 data points giving the drug solubility in binary mixtures of dichloromethane with methanol, ethanol, n-propanol, or n-butanol. Each solvent system is evenly represented, with 55 data points per solvent, covering 11 mass fractions (0 to 1 in increments of 0.1) at five temperature levels (283.15 K to 308.15 K)⁵. Solubility values (x) were measured for each unique combination of temperature and mass fraction, allowing comprehensive multivariate analysis. Solvent types were one-hot encoded to preserve the categorical structure without introducing order bias. This balanced, systematically structured dataset supports robust modeling and a meaningful interpretation of the effects of temperature and solvent composition on API solubility.

Preprocessing methods

Encoding categorical features using one-hot encoding

In machine learning, converting categorical variables to a numerical format is essential for models that require numerical input to work effectively. One-hot encoding is widely used for this transformation and represents each category with binary values. This ensures that the model handles each category individually¹².

Because the dataset contains the categorical variable "solvent", one-hot encoding was used to prepare it for ML modeling. Each unique category was represented as a separate binary feature, with a value of 1 indicating that the solvent is present in a given instance and 0 indicating that it is not⁴. This encoding removes any implicit hierarchical relationship between solvent types, so that the model treats each category as an independent attribute. As a result, categorical data can be incorporated into the feature set without imposing order-related assumptions or biases.
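The encoding described above can be illustrated with a minimal sketch. The function name `one_hot`, the column order in `SOLVENTS`, and the sample labels are assumptions made for illustration only; the source does not state which implementation was used.

```python
import numpy as np

# Hypothetical column order for the four alcohol co-solvents named in the text.
SOLVENTS = ["methanol", "ethanol", "n-propanol", "n-butanol"]


def one_hot(labels, categories):
    """Return an (n_samples, n_categories) 0/1 matrix, one column per category."""
    labels = np.asarray(labels)
    categories = np.asarray(categories)
    unknown = ~np.isin(labels, categories)
    if unknown.any():
        raise ValueError(f"unknown categories: {sorted(set(labels[unknown]))}")
    # Compare every label against every category; exactly one match per row.
    return (labels[:, None] == categories[None, :]).astype(int)


# Illustrative labels, not taken from the dataset.
encoded = one_hot(["ethanol", "n-butanol", "methanol"], SOLVENTS)
```

Each row contains a single 1, so no solvent is ranked above another, which is the property the text relies on.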

Data normalization

In machine learning, data normalization is an important step for standardizing feature scales across the dataset, making model training more efficient and stable. In models that are sensitive to the scale of the input features, unnormalized data can disproportionately emphasize certain variables over others, thereby reducing overall model effectiveness.

Using the min-max scaling technique, all values are scaled uniformly to the range between 0 and 1. Equation (1) represents this process¹³:

$$ x_{\text{scaled}} = \frac{x - x_{\text{min}}}{x_{\text{max}} - x_{\text{min}}} $$

(1)

Scaling the features to a consistent range helps the model converge more efficiently during training, because it prevents features with larger ranges from dominating those with smaller ranges. It also improves the stability and performance of many ML algorithms⁴.
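As a minimal sketch of min-max scaling applied column by column, the following may help. The function name `min_max_scale` and the handling of constant columns are assumptions for illustration, and the sample values are placeholders rather than data from the study.

```python
import numpy as np


def min_max_scale(x):
    """Scale each column of x to [0, 1] using (x - min) / (max - min)."""
    x = np.asarray(x, dtype=float)
    x_min = x.min(axis=0)
    x_max = x.max(axis=0)
    span = x_max - x_min
    # Assumption: a constant column has zero span and is mapped to 0.
    span = np.where(span == 0, 1.0, span)
    return (x - x_min) / span


# Columns: temperature in K and mass fraction (illustrative values only).
features = [[283.15, 0.0], [295.65, 0.5], [308.15, 1.0]]
scaled = min_max_scale(features)
```

In practice the minimum and maximum would usually be computed on the training split only and reused for the test split, although the source does not say how this was handled.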

Outlier detection

The elliptic envelope is a statistical method that assumes a multivariate normal distribution to identify anomalies. It models the central tendency and variability of the data points by defining an elliptical region that captures most of the samples. Observations outside this region are treated as outliers¹⁴.

The procedure begins by calculating the mean vector and the covariance matrix. Using these estimates, the algorithm constructs an ellipse intended to enclose the core of the data distribution. Mathematically, the volume of the resulting ellipsoid is optimized subject to enclosing a chosen fraction of the data, typically at least 95% or 99%. Points outside the boundary of the fitted ellipse are identified as anomalies.
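The steps above can be sketched with the sample mean, the sample covariance, and the squared Mahalanobis distance. This is a simplified illustration: the function name `elliptic_outliers`, the `coverage` parameter, and the empirical-quantile threshold are assumptions. Library implementations such as scikit-learn's `EllipticEnvelope` use a robust covariance estimate, and the source does not specify which implementation was used.

```python
import numpy as np


def elliptic_outliers(X, coverage=0.95):
    """Flag points outside the ellipse that encloses `coverage` of the data.

    Returns a boolean outlier mask and the squared Mahalanobis distances.
    """
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    cov = np.atleast_2d(np.cov(X, rowvar=False))
    diff = X - mean
    # Squared Mahalanobis distance: diff_i^T * cov^{-1} * diff_i for every row i.
    d2 = np.einsum("ij,jk,ik->i", diff, np.linalg.inv(cov), diff)
    # Assumption: the ellipse boundary is set at an empirical quantile of d2.
    threshold = np.quantile(d2, coverage)
    return d2 > threshold, d2


# Synthetic demonstration data, not the study's dataset.
rng = np.random.default_rng(0)
points = np.vstack([rng.standard_normal((200, 2)), [[10.0, 10.0]]])
is_outlier, distances = elliptic_outliers(points, coverage=0.95)
```

Because the plain mean and covariance are themselves affected by extreme points, a robust estimator is generally preferable when many outliers are expected.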

Bayesian Neural Network (BNN)

In contrast to deterministic neural networks, a BNN treats the network weights as probability distributions rather than fixed values, which enables reliable estimates of both the predictive mean and its uncertainty. A prior distribution, usually Gaussian, is assigned to the weights $w$ and biases $b$ of each layer, $p(w) \sim \mathcal{N}(0, \sigma^{2} I)$, where $\sigma$ is a hyperparameter that controls the prior variance¹⁵. During training, the posterior distribution $p(w|D)$ over the weights given the training data $D$ is approximated using variational inference. A mean-field variational distribution $q(w|\theta)$, parameterized by $\theta$, is adopted, and the Kullback-Leibler (KL) divergence between the true posterior and $q(w|\theta)$ is minimized. This is achieved by maximizing the evidence lower bound (ELBO), expressed as follows:

$$ \text{ELBO} = \mathbb{E}_{q(w|\theta)}\left[\log p(D|w)\right] - \text{KL}\left(q(w|\theta) \,\|\, p(w)\right) $$

(2)

The first term measures the likelihood of the data, and the second regularizes the variational distribution toward the prior¹⁶. Dropout is integrated as a Bayesian approximation, and the predictive variance is scaled to account for model uncertainty¹⁷. The output layer produces a predictive distribution $p(y|x, D) \approx \mathcal{N}\left(\mu(x; w), \sigma^{2}(x; w)\right)$, where $\mu$ and $\sigma^{2}$ are learned, providing both point predictions and uncertainty estimates. Training uses the Adam optimizer, with hyperparameters tuned via the SFS method. This framework supports robust generalization and is well suited to tabular regression where uncertainty matters, as tested in previous work (see Figure 1)¹⁸.
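To make the two ELBO terms concrete, the following sketch estimates them for a single linear layer with a mean-field Gaussian $q(w|\theta)$ and a zero-mean Gaussian prior. The linear model, the fixed observation noise `noise_std`, and all function names are assumptions chosen for illustration; they do not reproduce the network used in the study.

```python
import numpy as np


def kl_to_gaussian_prior(mu_q, sigma_q, sigma_p):
    """Closed-form KL( N(mu_q, diag(sigma_q^2)) || N(0, sigma_p^2 I) )."""
    var_ratio = (sigma_q / sigma_p) ** 2
    return 0.5 * np.sum(var_ratio + (mu_q / sigma_p) ** 2 - 1.0 - np.log(var_ratio))


def elbo_estimate(X, y, mu_q, sigma_q, sigma_p=1.0, noise_std=0.1,
                  n_samples=64, seed=0):
    """Monte Carlo ELBO for y ~ N(X w, noise_std^2) with w ~ q(w | theta)."""
    rng = np.random.default_rng(seed)
    # Reparameterization: w = mu + sigma * eps, with eps ~ N(0, I).
    eps = rng.standard_normal((n_samples, mu_q.size))
    w = mu_q + sigma_q * eps                      # shape (n_samples, n_features)
    residual = y - w @ X.T                        # shape (n_samples, n_points)
    # Gaussian log-likelihood of the data for each sampled weight vector.
    log_lik = -0.5 * np.sum(
        (residual / noise_std) ** 2 + np.log(2.0 * np.pi * noise_std ** 2), axis=1
    )
    # ELBO = expected log-likelihood minus KL to the prior.
    return log_lik.mean() - kl_to_gaussian_prior(mu_q, sigma_q, sigma_p)


# Toy data generated by y = 2x (illustrative only).
X_toy = np.array([[0.0], [1.0], [2.0]])
y_toy = np.array([0.0, 2.0, 4.0])
elbo_good = elbo_estimate(X_toy, y_toy, np.array([2.0]), np.array([0.01]))
elbo_bad = elbo_estimate(X_toy, y_toy, np.array([-2.0]), np.array([0.01]))
```

A variational mean close to the data-generating weight gives a higher ELBO than one far from it, while the KL term penalizes distributions that drift from the prior.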

Neural Oblivious Decision Ensembles (NODE)

NODE integrates the interpretability of decision trees with the adaptability of neural networks for tabular regression tasks. It consists of an ensemble of oblivious decision trees, in which the same split decision is shared across all nodes at a given level of a tree, reducing variance and increasing stability. In the implementation, the ensemble comprises $T$ trees, each of fixed depth $L$ (for example, $L = 3$), organized into $K$ levels and optimized to capture feature interactions. Each node in a tree applies a split function $s(x;\theta)$, parameterized by neural network weights $\theta$, that maps the input features $x$ to binary decisions based on thresholds. The split function can be expressed as $s(x;\theta) = \sigma\left(w^{\top} x + b\right)$, where $\sigma$ is the sigmoid activation, $w$ is the weight vector, and $b$ is a bias that determines whether a data point moves left or right within the tree. The final prediction for input $x$ is the ensemble average, $\widehat{y}(x) = \frac{1}{T}\sum_{t=1}^{T} f_{t}(x;\theta_{t})$, where $f_{t}$ is the output of the $t$-th tree. Training minimizes the mean squared error (MSE) loss, $\mathcal{L} = \frac{1}{n}\sum_{i=1}^{n}\left(y_{i} - \widehat{y}(x_{i})\right)^{2} + \lambda \left\| \theta \right\|_{2}^{2}$, using stochastic gradient descent with the Adam optimizer, where $\lambda$ is the regularization parameter. The hyperparameters $T$, $L$, and $K$ are tuned via cross-validation. As shown in previous work, this framework balances interpretability and predictive power and can outperform traditional trees¹⁹. The procedure is shown schematically in Figure 2.
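The forward pass and loss described above can be sketched as follows, using the sigmoid split function given in the text. The function names, the `(W, b, leaf_values)` parameter layout, and the random parameters are assumptions for illustration; this is not the authors' implementation, and the gradient-based training step is omitted.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def oblivious_tree_forward(x, W, b, leaf_values):
    """Soft oblivious tree: one shared split s(x) = sigmoid(w^T x + b) per level.

    x: (n, d), W: (depth, d), b: (depth,), leaf_values: (2**depth,)
    """
    p_right = sigmoid(x @ W.T + b)                # (n, depth)
    leaf_prob = np.ones((x.shape[0], 1))
    for level in range(W.shape[0]):
        p = p_right[:, level:level + 1]
        # Every path so far branches left with (1 - p) and right with p.
        leaf_prob = np.concatenate([leaf_prob * (1.0 - p), leaf_prob * p], axis=1)
    # Prediction is the probability-weighted average of the leaf values.
    return leaf_prob @ leaf_values


def node_predict(x, trees):
    """Ensemble average over T trees, as in the prediction formula above."""
    return np.mean([oblivious_tree_forward(x, *tree) for tree in trees], axis=0)


def node_loss(x, y, trees, lam=1e-3):
    """MSE plus an L2 penalty on all tree parameters."""
    mse = np.mean((y - node_predict(x, trees)) ** 2)
    l2 = sum(np.sum(param ** 2) for tree in trees for param in tree)
    return mse + lam * l2


# Random illustrative parameters: T = 4 trees of depth L = 3 on 3 features.
rng = np.random.default_rng(0)
x_demo = rng.random((5, 3))
trees_demo = [
    (rng.normal(size=(3, 3)), rng.normal(size=3), rng.normal(size=8))
    for _ in range(4)
]
y_hat = node_predict(x_demo, trees_demo)
```

Because each level uses a single split for all of its nodes, a tree of depth $L$ needs only $L$ split functions and $2^{L}$ leaf values, which keeps the model compact.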