Predictive optimization of curcumin nanocomposites using hybrid machine learning and physics informed modeling

Machine Learning


Data collection and curation

Seventy-four (n = 74) distinct curcumin nanocomposite formulations were systematically compiled from peer-reviewed scientific literature, encompassing polymeric, lipidic, metallic, and hybrid delivery systems. Data mining followed a structured and reproducible workflow based on previously reported literature-mining protocols for nanomaterials4. Following a PRISMA-guided protocol12, electronic databases including Scopus, PubMed, Web of Science, and ScienceDirect were comprehensively searched for articles published until March 2025 using the terms “curcumin nanocomposite”, “curcumin nanocarrier”, “encapsulation efficiency”, “loading efficiency”, “drug delivery”, and “nanoparticle formulation”. Boolean operators (“AND”, “OR”) were applied to refine search specificity and maximize coverage. Records retrieved (n = 241) were exported to Zotero for reference management and then filtered through a three-step data mining process: title-level filtering to remove duplicates and irrelevant records, abstract-level filtering for experimental relevance, and semi-automated full-text extraction in Python (BeautifulSoup and pandas) of numerical data fields such as particle size, zeta potential, Loading Efficiency (LE%), and Encapsulation Efficiency (EE%). Pattern-matching and text-mining algorithms were employed to extract and normalize numeric values from tables and text, and uncertain entries were checked manually. Data were stored in structured CSV format for preprocessing, and each record was verified for completeness, the requirement being a minimum of two physicochemical descriptors in addition to LE% or EE%. Following automated extraction, data consistency was validated by range checking (particle size 1–1000 nm, zeta potential −100 to +100 mV) and duplicate cross-checking across studies. The initial dataset of 241 records was reduced to 121 unique formulations across 86 studies following these validation checks.
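The range and completeness checks described above could be expressed roughly as follows. This is a minimal sketch rather than the authors' pipeline: the column names (`size_nm`, `zeta_mV`, `pdi`, `LE_pct`, `EE_pct`) and the toy records are hypothetical, and only the numeric limits and the "two descriptors plus LE% or EE%" rule are taken from the text.

```python
import numpy as np
import pandas as pd

# Hypothetical column names; the paper does not list its CSV schema.
DESCRIPTORS = ["size_nm", "zeta_mV", "pdi"]
RESPONSES = ["LE_pct", "EE_pct"]

def validate_records(df: pd.DataFrame) -> pd.DataFrame:
    """Keep rows that pass the range and completeness checks."""
    # Range checks from the text; a missing value is not counted as a violation.
    size_ok = df["size_nm"].isna() | df["size_nm"].between(1, 1000)
    zeta_ok = df["zeta_mV"].isna() | df["zeta_mV"].between(-100, 100)
    # Completeness: at least two descriptors and at least one of LE% / EE%.
    enough_descriptors = df[DESCRIPTORS].notna().sum(axis=1) >= 2
    has_response = df[RESPONSES].notna().any(axis=1)
    return df[size_ok & zeta_ok & enough_descriptors & has_response]

# Toy records (invented for illustration only).
records = pd.DataFrame({
    "size_nm": [150.0, 2500.0, 80.0, np.nan],
    "zeta_mV": [-32.0, -20.0, np.nan, -15.0],
    "pdi":     [0.21, 0.30, np.nan, 0.18],
    "LE_pct":  [12.0, 9.0, 15.0, np.nan],
    "EE_pct":  [85.0, 70.0, np.nan, np.nan],
})
clean = validate_records(records)
```

In this toy table only the first record survives: the second fails the size range, the third has a single descriptor, and the fourth reports neither LE% nor EE%.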
Further filtering based on inclusion and exclusion criteria for quantitative completeness and comparability yielded 74 high-quality records that met all modeling requirements. Studies were excluded if they did not report quantitative LE% or EE% data, provided qualitative descriptions only, tested non-curcumin drugs or non-nanocomposite systems, or were reviews, conference abstracts, patents, or computational-only reports without experimental verification. Only studies that reported both LE% and EE% under well-characterized physicochemical conditions (particle size, zeta potential, and material composition) were included; formulations with non-standard units of measure or partial descriptors were not. Parameters were rescaled to common measurement units, and missing values were imputed by multiple imputation using chained equations (MICE). In summary, screening initially yielded 121 candidate formulations from 86 publications, of which 74 were retained for modeling after checks of data integrity, consistency, and completeness of the independent variables. This total is a quality-controlled and statistically adequate sample with balanced representation across nanocarrier classes (polymeric = 33%, lipidic = 27%, metallic = 19%, hybrid = 21%). The filtering was intended to ensure that each entry contributes to a robust and generalizable machine learning model while minimizing noise from incomplete or non-comparable studies.

Although the dataset provides an overview of the reported curcumin formulation space, it pools data from different sources with variable synthesis protocols, analytical conditions, and reporting styles. Such heterogeneity may introduce systematic bias into feature–response relations through variability in measurement practice, solvent system, or assay calibration standards. To counter these potential biases, several corrective measures were taken: all quantitative descriptors (particle size, zeta potential, LE%, and EE%) were normalized to standard units and measurement ranges; multiple imputation and robust scaling were used to reduce the impact of missing or outlying values; and leave-one-group-out cross-validation was applied, with each group corresponding to a single publication, to prevent over-representation of any given experimental dataset. In addition, physics-informed regularization was added to constrain model behavior to mechanistic first principles of mass balance, DLVO interactions, and diffusion limits, reducing dependence on dataset-specific artifacts. Collectively, these steps mitigated heterogeneity effects, improved model generalization, and yielded predictive trends that reflect intrinsic physicochemical relationships rather than publication-related bias. The curated dataset of nanocomposite composition, physicochemical characteristics, cytotoxicity profiles, and loading/encapsulation efficiencies after preprocessing is provided in Table S1.
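Leave-one-group-out cross-validation with one group per publication, combined with robust scaling, might look like the sketch below. The data are synthetic, the publication IDs are invented, and the choice of a gradient boosting regressor with default settings is an assumption made only to have something to evaluate.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import LeaveOneGroupOut, cross_val_predict
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler

rng = np.random.default_rng(0)

# Synthetic stand-in: 40 formulations drawn from 8 hypothetical publications.
n_samples, n_features = 40, 4
X = rng.normal(size=(n_samples, n_features))
y = 60 + 10 * X[:, 0] - 5 * X[:, 1] + rng.normal(scale=2.0, size=n_samples)
groups = np.repeat(np.arange(8), 5)  # publication ID of each formulation

# Robust scaling (median / IQR) followed by the regressor, refit inside each fold.
model = make_pipeline(RobustScaler(), GradientBoostingRegressor(random_state=0))

# Each fold holds out every formulation that came from one publication.
logo = LeaveOneGroupOut()
y_pred = cross_val_predict(model, X, y, groups=groups, cv=logo)
rmse = float(np.sqrt(np.mean((y - y_pred) ** 2)))
n_folds = logo.get_n_splits(X, y, groups)
```

Putting the scaler inside the pipeline means its statistics are estimated on the training publications only, so the held-out publication does not leak into preprocessing.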

Data preprocessing and feature engineering

The raw dataset contained inconsistent reporting conventions and missing values, which required a rigorous preprocessing procedure. Missing data were treated through multiple imputation with chained equations (MICE) for continuous variables and mode imputation for categorical variables. Natural language processing (NLP) methods were employed to transform qualitative cytotoxicity reports into quantitative descriptors13,14. All measurements were normalized for consistent representation across studies. Additional descriptors were engineered using domain expertise, including the surface area-to-volume ratio (from spherical geometry), a stability index derived from zeta potential measurements, composition categories (polymer, lipid, metal, or hybrid), and binary markers for PEGylation or targeting ligands15,16.
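A rough sketch of these preprocessing steps is given below. Several parts are assumptions: scikit-learn's `IterativeImputer` is used as a chained-equation imputer that returns a single completed dataset (not the multiple datasets of full MICE), particle size is treated as a sphere diameter, the column names are invented, and the stability index shown (|ζ| relative to 30 mV, capped at 1) is a hypothetical placeholder because the text does not give its definition.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (activates IterativeImputer)
from sklearn.impute import IterativeImputer

# Toy records with gaps (invented values, hypothetical column names).
df = pd.DataFrame({
    "size_nm": [120.0, 200.0, np.nan, 90.0, 310.0],
    "zeta_mV": [-35.0, np.nan, -18.0, -42.0, -25.0],
    "carrier": ["polymer", "lipid", np.nan, "polymer", "metal"],
    "pegylated": [1, 0, 0, 1, 0],
})

# Chained-equation imputation for the continuous columns.
num_cols = ["size_nm", "zeta_mV"]
df[num_cols] = IterativeImputer(random_state=0, max_iter=10).fit_transform(df[num_cols])

# Mode imputation for the categorical carrier type.
df["carrier"] = df["carrier"].fillna(df["carrier"].mode()[0])

# Sphere of diameter d: area / volume = (pi d^2) / (pi d^3 / 6) = 6 / d.
df["sa_to_vol_per_nm"] = 6.0 / df["size_nm"]

# Hypothetical stability index, not the paper's definition.
df["stability_index"] = np.minimum(df["zeta_mV"].abs() / 30.0, 1.0)

# One-hot encode the composition category for use in regression models.
features = pd.get_dummies(df, columns=["carrier"])
```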

Machine learning modeling

Several supervised regression models were developed and implemented in Python (version 3.11; Python Software Foundation, Wilmington, DE, USA; https://www.python.org/) using open-source libraries obtained from the Python Package Index (PyPI). Classical regression models were constructed using Scikit-learn (version 1.3.0; https://scikit-learn.org/), while the Physics-Informed Neural Network (PINN) architecture was implemented using TensorFlow–Keras (version 2.13.0; https://www.tensorflow.org/). Model interpretability was achieved through the SHapley Additive exPlanations (SHAP) library (version 0.43.1; https://shap.readthedocs.io/). Data mining and preprocessing employed BeautifulSoup (version 4.12.2; https://www.crummy.com/software/BeautifulSoup/) and pandas (version 2.1.1; https://pandas.pydata.org/). All computational procedures were executed in a Windows 10 Pro (64-bit) environment on a workstation equipped with an Intel Core i9 processor, 64 GB RAM, and an NVIDIA RTX 4080 GPU; these details are reported for reproducibility and transparency. Bayesian optimization coupled with fivefold cross-validation was used to tune hyperparameters for optimal predictive capability (best R²)17,18. The hybrid architecture was adopted not as an alternative to Python, TensorFlow, or PyTorch, but as an implementation strategy within them. The rationale for this modeling framework was threefold:

Incorporation of physics-based constraints

Whereas PyTorch and TensorFlow provide general-purpose deep-learning backbones, the hybrid framework enabled direct inclusion of problem-specific physical equations (conservation of mass, DLVO potential, diffusion kinetics) within the loss function, a capability outside the scope of generic regression or black-box neural models.

Data interpretability and efficiency

Given the limited size of the dataset (n = 74), deep neural networks of the kind typically built in PyTorch or TensorFlow would tend to overfit if used alone. The Gradient Boosting Regressor (GBR) implemented in scikit-learn handles small heterogeneous datasets better and can be interpreted through SHAP analysis. The PINN module, implemented in TensorFlow, adds value by regularizing learning with physical principles rather than relying on the data alone.

Computational feasibility

The chosen layout leverages Python-based interoperability: data preprocessing and statistical learning were conducted in scikit-learn, and mechanistic neural modeling in TensorFlow. PyTorch was also explored but not used, since it required more GPU memory for symbolic differentiation in the physics-informed regularization.

Physics-informed neural networks (PINNs)

A physics-informed neural network (PINN) was specifically developed to incorporate quantitative physicochemical constraints explicitly into learning19,20,21,22. The total loss function (\(L_{total}\)) combined the data-driven loss and physics-inspired regularization terms:

$$L_{total} = \lambda_{d} L_{data} + \lambda_{m} L_{mass} + \lambda_{DLVO} L_{DLVO} + \lambda_{diff} L_{diff} + \lambda_{thermo} L_{thermo}$$

where \(L_{data}\) is the mean squared error between predicted and experimental LE% or EE%, \(L_{mass}\) enforces mass conservation, \(L_{DLVO}\) constrains the electrostatic–van der Waals balance according to colloidal stability theory, \(L_{diff}\) represents diffusion limitation based on particle size, and \(L_{thermo}\) penalizes predictions exceeding the solubility limit. The coefficients \(\lambda_{d}\), \(\lambda_{m}\), \(\lambda_{DLVO}\), \(\lambda_{diff}\), \(\lambda_{thermo}\) are empirically optimized weighting factors.

  • Mass conservation term \(\left( {L_{mass} } \right)\) required the total mass of encapsulated and free curcumin to equal the initial dose, given by

    $$L_{mass} = \left\| {\left( {m_{input} - m_{encap} - m_{free} } \right)/m_{input} } \right\|^{2}$$

    ensuring that predictions are within 0–100% encapsulation limits.

  • DLVO constraint \(\left( {L_{DLVO} } \right)\) encoded the electrostatic–van der Waals balance through

    $$L_{DLVO} = \left\| {\frac{{A_{vdw} }}{{h^{2} }} - \frac{{64\pi n_{0} k_{B} T\tanh^{2} \left( {e\zeta /4k_{B} T} \right)e^{ - kh} }}{\epsilon } - E_{pred} } \right\|^{2}$$

    where \(E_{pred}\) is the predicted colloidal interaction energy, favoring stable predictions for zeta potentials between −30 and −50 mV.

  • Diffusion limitation term \(\left( {L_{diff} } \right)\) bounded the predicted loading efficiency by the Fickian diffusion limit

    $$L_{diff} = \left\| {D_{eff} \nabla^{2} C - \frac{\partial C}{{\partial t}}} \right\|^{2}$$

    where \(D_{eff} \propto 1/r^{2}\), correlating particle size to encapsulation kinetics.

  • Thermodynamic constraint \(\left( {L_{thermo} } \right)\) penalized physically impossible loading predictions beyond the theoretical solubility limit \(S_{max}\)

    $$L_{thermo} = \left\| {max\;\left( {0, LE_{pred} – S_{max} } \right)} \right\|^{2}$$
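The four penalty terms above can be written as plain NumPy functions, as in the sketch below. It is not the TensorFlow implementation used in the study. The assumptions are: all inputs are in mutually consistent units with ζ in volts, the exponent in the DLVO term is read as an inverse screening length `kappa` times the separation `h`, the diffusion residual is evaluated by finite differences on a 1-D concentration grid, and `d_eff` is passed in directly because the proportionality constant in \(D_{eff} \propto 1/r^{2}\) is not given.

```python
import numpy as np

K_B = 1.380649e-23          # Boltzmann constant, J/K
E_CHARGE = 1.602176634e-19  # elementary charge, C

def mass_loss(m_input, m_encap, m_free):
    """Mean squared relative mass-balance residual."""
    return float(np.mean(((m_input - m_encap - m_free) / m_input) ** 2))

def dlvo_loss(e_pred, zeta, h, a_vdw, n0, kappa, eps, temp=298.15):
    """Mean squared residual of the DLVO expression as written in the text."""
    attraction = a_vdw / h**2
    repulsion = (64 * np.pi * n0 * K_B * temp
                 * np.tanh(E_CHARGE * zeta / (4 * K_B * temp)) ** 2
                 * np.exp(-kappa * h) / eps)
    return float(np.mean((attraction - repulsion - e_pred) ** 2))

def diffusion_loss(conc, d_eff, dx, dt):
    """Mean squared residual of D_eff * d2C/dx2 - dC/dt on interior grid points.

    conc has shape (n_times, n_positions); edges are dropped so that only
    central differences enter the residual.
    """
    lap = (conc[:, 2:] - 2 * conc[:, 1:-1] + conc[:, :-2]) / dx**2
    dcdt = np.gradient(conc, dt, axis=0)[:, 1:-1]
    residual = d_eff * lap[1:-1] - dcdt[1:-1]
    return float(np.mean(residual ** 2))

def thermo_loss(le_pred, s_max):
    """Penalize only the part of the prediction that exceeds s_max."""
    return float(np.mean(np.maximum(0.0, le_pred - s_max) ** 2))

# Check the diffusion residual on an exact solution of Fick's second law:
# C(x, t) = exp(-D pi^2 t) sin(pi x) on 0 <= x <= 1.
x = np.linspace(0.0, 1.0, 101)
t = np.linspace(0.0, 0.2, 21)
d_eff = 0.1
conc = np.exp(-d_eff * np.pi**2 * t)[:, None] * np.sin(np.pi * x)[None, :]
diff_residual = diffusion_loss(conc, d_eff, dx=x[1] - x[0], dt=t[1] - t[0])
```

For a concentration field that satisfies the diffusion equation, `diff_residual` is close to zero up to discretization error, which is the behavior the penalty is meant to reward.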

The weighting coefficients were optimized empirically (\(\lambda_{1}\) = 0.3, \(\lambda_{2}\) = 0.2, \(\lambda_{3}\) = 0.2, \(\lambda_{4}\) = 0.3) to minimize the total loss on the validation data. This explicit formulation constrained model predictions to known physical principles while retaining high predictive accuracy, which improves generalization and reduces unphysical extrapolations.
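Combining the terms into \(L_{total}\) is a weighted sum, sketched below. The loss equation lists five terms while four weight values are reported, and the mapping between them is not stated, so the term values and weights here are placeholders chosen only to make the arithmetic visible.

```python
def total_loss(terms, weights):
    """Weighted sum of loss terms; both arguments are dicts keyed by term name."""
    missing = set(terms) - set(weights)
    if missing:
        raise KeyError(f"no weight supplied for: {sorted(missing)}")
    return sum(weights[name] * value for name, value in terms.items())

# Placeholder numbers, not the study's fitted terms or weights.
terms = {"data": 4.0, "mass": 0.01, "dlvo": 0.5, "diff": 0.2, "thermo": 0.0}
weights = {"data": 1.0, "mass": 0.5, "dlvo": 0.5, "diff": 0.5, "thermo": 0.5}
loss = total_loss(terms, weights)
```

Keeping the weights in a dictionary makes it straightforward to repeat training over a small grid of candidate weightings and keep the one with the lowest validation loss.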

Mathematical foundations of implemented machine learning algorithms

The core mathematical principles underlying the machine learning algorithms implemented in this study are elaborated below:

Random forest regressor

This ensemble method operates by constructing multiple decision trees during training and outputting the mean prediction of the individual trees. The fundamental equation governing its prediction is:

$${\hat{\text{y}}} = 1/N\mathop \sum \limits_{i = 1}^{N} T_{i} \left( x \right)$$

where \({\hat{\text{y}}}\) represents the final prediction, \(N\) denotes the number of trees in the forest, and \(T_{i} \left( x \right)\) signifies the prediction of the i-th decision tree for input x. This aggregation method reduces variance and minimizes overfitting through bootstrap aggregation.
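The averaging rule can be checked directly against scikit-learn's `RandomForestRegressor`, whose fitted trees are exposed through `estimators_`. The data below are synthetic and the forest size is arbitrary.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.uniform(size=(60, 3))
y = 50 + 30 * X[:, 0] - 10 * X[:, 1] ** 2 + rng.normal(scale=1.0, size=60)

forest = RandomForestRegressor(n_estimators=25, random_state=0).fit(X, y)

# Stack T_i(x) for every tree i, then average over the N trees.
tree_preds = np.stack([tree.predict(X) for tree in forest.estimators_])
manual_mean = tree_preds.mean(axis=0)
```

`manual_mean` matches `forest.predict(X)` up to floating-point rounding, which confirms that the forest output is the plain mean of the individual tree predictions.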

Gradient boosting regressor

Gradient Boosting builds an additive model iteratively, minimizing a differentiable loss function by combining weak learners. The model update at stage m is expressed as:

$$F_{m} \left( x \right) = F_{m - 1} \left( x \right) + \nu \, h_{m} \left( x \right)$$

where \(F_{m} \left( x \right)\) is the ensemble model at iteration m, ν is the learning rate, and \(h_{m} \left( x \right)\) is the weak learner fitted to the negative gradient of the loss function. This sequential correction reduces bias and enhances accuracy.
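The stage-wise update can be reproduced by hand for squared-error loss, where the negative gradient is simply the current residual. The sketch below uses shallow scikit-learn decision trees as weak learners on synthetic data; the learning rate, tree depth, and number of stages are arbitrary choices.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(size=(80, 2))
y = 40 + 25 * np.sin(3 * X[:, 0]) + 10 * X[:, 1] + rng.normal(scale=1.0, size=80)

nu = 0.1                          # learning rate
F = np.full_like(y, y.mean())     # F_0: constant initial model
mse_history = [float(np.mean((y - F) ** 2))]

for m in range(50):
    residual = y - F              # negative gradient of 0.5 * (y - F)^2
    h_m = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, residual)
    F = F + nu * h_m.predict(X)   # F_m = F_{m-1} + nu * h_m
    mse_history.append(float(np.mean((y - F) ** 2)))
```

The training error in `mse_history` falls stage by stage, showing the sequential correction described above; in practice the number of stages would be chosen on held-out data to avoid overfitting.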

Support vector regression

SVR aims to find a function that deviates from the true targets by at most ε, while maintaining maximum flatness. The optimization problem is:

$$\min 1/2\left\| {\text{w}} \right\|^{2} + {\text{C}}\sum \left( {\upxi _{i} +\upxi _{i}^{*} } \right)$$

subject to:

$$y_{i} - \left\langle {w, x_{i} } \right\rangle - b \le \varepsilon +\upxi _{i}$$

$$\left\langle {w, x_{i} } \right\rangle + b - y_{i} \le \varepsilon +\upxi _{i}^{*}$$

$$\upxi _{i} ,\upxi _{i}^{*} \ge 0$$

where w is the weight vector, b is the bias term, C is the regularization parameter, \({\upxi }_{i} , {\upxi }_{i}^{*}\) are slack variables for deviations above and below the ε-tube, and ε defines the margin of tolerance.
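For a fixed linear model, the slack variables and the primal objective can be evaluated directly using the standard ε-insensitive formulation. The sketch below is illustrative only: it evaluates the objective for given `w` and `b` and does not solve the optimization problem, and the toy data and parameter values are invented.

```python
import numpy as np

def svr_primal_objective(w, b, X, y, C=1.0, eps=0.5):
    """Return (objective, xi, xi_star) for a linear SVR with given w and b."""
    pred = X @ w + b
    xi = np.maximum(0.0, y - pred - eps)        # target lies above the eps-tube
    xi_star = np.maximum(0.0, pred - y - eps)   # target lies below the eps-tube
    objective = 0.5 * float(w @ w) + C * float(np.sum(xi + xi_star))
    return objective, xi, xi_star

X = np.array([[0.0], [1.0], [2.0]])
y = np.array([0.2, 2.5, 3.0])
objective, xi, xi_star = svr_primal_objective(np.array([2.0]), 0.0, X, y)
```

Only the third point falls outside the tube here (prediction 4.0 against a target of 3.0 with ε = 0.5), so it alone contributes a nonzero slack.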

XGBoost

XGBoost is an optimized gradient boosting framework incorporating both loss minimization and regularization. The objective function is:

$$Obj = \sum\limits_{i} l\left( {{\text{y}}_{i} , {\hat{\text{y}}}_{i} } \right) + \sum\limits_{k} \Omega \left( {{\text{f}}_{k} } \right)$$

with:

$$\Omega \left( {{\text{f}}_{k} } \right) = \gamma T + 1/2\lambda \left\| {\text{w}} \right\|^{2}$$

where \({\text{l}}({\text{y}}_{i} , {\hat{\text{y}}}_{i} )\) measures the prediction error for sample i, \({\text{f}}_{k}\) is the k-th tree, T denotes the number of leaves in the tree, w is the vector of leaf weights, γ controls the minimum loss reduction required for a split, and λ is the L2 regularization coefficient.
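The regularized objective can be evaluated by hand for a small ensemble, as sketched below. Squared error is assumed for the loss \(l\), each tree is represented only by its list of leaf weights, and the numbers are invented; this does not call the XGBoost library.

```python
import numpy as np

def tree_penalty(leaf_weights, gamma, lam):
    """Omega(f) = gamma * T + 0.5 * lambda * ||w||^2 for one tree."""
    leaf_weights = np.asarray(leaf_weights, dtype=float)
    return gamma * leaf_weights.size + 0.5 * lam * float(leaf_weights @ leaf_weights)

def regularized_objective(y, y_hat, trees, gamma=1.0, lam=1.0):
    """Squared-error loss summed over samples plus the penalty summed over trees."""
    loss = float(np.sum((np.asarray(y) - np.asarray(y_hat)) ** 2))
    return loss + sum(tree_penalty(w, gamma, lam) for w in trees)

# Two toy trees with 2 and 3 leaves respectively.
trees = [[1.0, -1.0], [0.5, 0.5, -2.0]]
obj = regularized_objective([3.0, 5.0], [2.0, 5.5], trees, gamma=1.0, lam=2.0)
```

With these values the loss is 1.25 and the two tree penalties are 4.0 and 7.5, so the penalty dominates; larger γ or λ pushes the fit toward fewer leaves and smaller leaf weights.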

Multilayer perceptron

An MLP is a feedforward neural network that applies nonlinear transformations across multiple layers. For a single layer, the operation is:

$$y = f\left( {Wx + b} \right)$$

where x is the input vector, W is the weight matrix, b is the bias, and f is a nonlinear activation function (e.g., ReLU, sigmoid, tanh). Successive compositions of this transformation across hidden layers enable the network to capture complex, hierarchical patterns in the data.
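A two-layer forward pass makes the composition explicit. The weights, biases, and input in this sketch are arbitrary illustrative numbers, with ReLU in the hidden layer and an identity activation at the output, as is common for regression.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def dense(x, W, b, activation=relu):
    """One layer: y = f(W x + b)."""
    return activation(W @ x + b)

x = np.array([1.0, -2.0])                               # input vector
W1 = np.array([[1.0, 0.5], [-1.0, 1.0], [0.0, 2.0]])    # hidden layer: 2 -> 3
b1 = np.array([0.5, 0.0, 1.0])
W2 = np.array([[1.0, 1.0, 1.0]])                        # output layer: 3 -> 1
b2 = np.array([-0.25])

hidden = dense(x, W1, b1)                               # ReLU zeroes negative pre-activations
output = dense(hidden, W2, b2, activation=lambda z: z)  # linear output for regression
```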

Model interpretation and validation

SHapley Additive exPlanations (SHAP) was used to estimate global and local feature contributions, so that model interpretability accompanied predictive accuracy23. Model robustness was validated with a multi-stage strategy, starting with an 80:20 train–test split24. This was supplemented by fivefold cross-validation to assess the consistency of performance across data subsets25. Leave-one-group-out cross-validation, with groups defined by nanocomposite type, was employed to measure generalization across material categories26. External validation was performed using 12 unique formulations that were entirely excluded from the training process27. This validation strategy was designed to assess model generalizability across both established and new nanocomposite systems.
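The hold-out split and fivefold cross-validation could be set up as in the sketch below. The data are synthetic (only the sample size of 74 echoes the study), the regressor and its settings are arbitrary, and running cross-validation on the training portion only is an assumption, since the text does not say which subset was used.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import KFold, cross_val_score, train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(74, 5))   # synthetic stand-in for the descriptor matrix
y = 60 + 8 * X[:, 0] - 4 * X[:, 1] + rng.normal(scale=2.0, size=74)

# 80:20 hold-out split.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = GradientBoostingRegressor(random_state=0)

# Fivefold cross-validation on the training portion only (assumption).
cv = KFold(n_splits=5, shuffle=True, random_state=0)
cv_r2 = cross_val_score(model, X_train, y_train, cv=cv, scoring="r2")

# Final fit on the training portion, scored once on the untouched hold-out set.
test_r2 = r2_score(y_test, model.fit(X_train, y_train).predict(X_test))
```

Comparing the spread of `cv_r2` with `test_r2` gives a quick indication of whether hold-out performance is consistent with the cross-validated estimate.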


