Graph theoretic and machine learning approaches in molecular property prediction of bladder cancer therapeutics

Regression analysis is a foundational tool in statistics and machine learning used to explore and quantify relationships between variables. Among the most widely used approaches are linear and cubic regression models, each serving distinct purposes depending on the complexity of the data and the nature of the relationships involved.

A linear regression model assumes a straight-line relationship between an independent variable and a dependent variable. The general form is:

$$y = \beta _0 + \beta _1x + \epsilon$$

where $y$ is the outcome (dependent variable), $x$ is the predictor, $\beta _0$ and $\beta _1$ are the intercept and slope coefficients, and $\epsilon$ is the error term. This model is favored for its simplicity, ease of interpretation, and low computational cost. It is best suited for data where the rate of change of $y$ with respect to $x$ remains constant across the range.
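As a minimal sketch of how $\beta _0$ and $\beta _1$ can be estimated, the snippet below applies the standard closed-form ordinary least squares solution for a single predictor. The function name `fit_linear` and the data values are hypothetical and chosen only for illustration; they are not taken from the study.

```python
def fit_linear(x, y):
    """Return (beta0, beta1) that minimize the sum of squared residuals."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of squares of x about its mean, and cross-products of x and y
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    beta1 = sxy / sxx                # slope
    beta0 = mean_y - beta1 * mean_x  # intercept
    return beta0, beta1


# Hypothetical data lying exactly on the line y = 2 + 3x
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [5.0, 8.0, 11.0, 14.0, 17.0]
beta0, beta1 = fit_linear(x, y)
print(beta0, beta1)
```

Because the illustrative points lie exactly on a line, the recovered intercept and slope match the generating values; with real data the residual term $\epsilon$ would be nonzero.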

However, linear regression has limitations when applied to more complex data structures. It lacks the capacity to capture curvature or changing trends in data behavior, often leading to underfitting when non-linear patterns are present.

A cubic regression model enhances flexibility by incorporating polynomial terms up to the third degree:

$$y = \beta _0 + \beta _1x + \beta _2x^2 + \beta _3x^3 + \epsilon$$

This model is capable of capturing more complex, non-linear relationships, including inflection points and changing rates of growth or decline. Cubic regression is particularly useful in fields like pharmacokinetics, economics, and environmental modeling, where variables do not interact in strictly linear ways.
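A cubic fit is still linear in its coefficients, so it can be estimated with the same least squares machinery. The sketch below builds the normal equations and solves them with a small Gaussian elimination routine. The names `solve` and `fit_polynomial` and the data are hypothetical, and this is one possible implementation rather than the procedure used in the study.

```python
def solve(a, b):
    """Solve the linear system a * coef = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    coef = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        s = sum(m[r][c] * coef[c] for c in range(r + 1, n))
        coef[r] = (m[r][n] - s) / m[r][r]
    return coef


def fit_polynomial(x, y, degree=3):
    """Least squares coefficients [beta0, ..., beta_degree] from the normal equations."""
    k = degree + 1
    xtx = [[sum(xi ** (i + j) for xi in x) for j in range(k)] for i in range(k)]
    xty = [sum(yi * xi ** i for xi, yi in zip(x, y)) for i in range(k)]
    return solve(xtx, xty)


# Hypothetical data generated from y = 1 - 2x + 0.5x^2 + x^3 (no noise)
x = [-2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
y = [1 - 2 * xi + 0.5 * xi ** 2 + xi ** 3 for xi in x]
betas = fit_polynomial(x, y)
print([round(b, 6) for b in betas])
```

The normal equations become ill-conditioned when $x$ spans a wide numeric range, so centering or scaling the predictor before fitting is a common precaution in practice.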

Despite its adaptability, cubic regression carries certain drawbacks. It is more susceptible to overfitting, especially when applied to small or noisy datasets. Overfitting reduces a model’s ability to generalize to new data, thus limiting its predictive utility. Moreover, interpreting the influence of each term becomes less intuitive as complexity increases.

Table 4 Statistical parameters and regression models for $M_1(G)$.

Table 4 reports the statistical parameters and regression models for several properties as functions of the thermal index ($TI$) for material $M_1(G)$. The modeled properties are $BP$, $EV$, $FP$, $MR$, $SA$, $MV$, and $P$, each fitted with both a linear and a cubic model. The reported statistics are the correlation coefficient ($R$), coefficient of determination ($R^2$), standard error ($S_E$), F-statistic (F), and p-value. The cubic models perform better than the linear models for every property, as shown by their uniformly higher $R$ and $R^2$ values and lower standard errors. For instance, $MR$ correlates very strongly under both the linear ($R = 0.980$) and cubic ($R = 0.986$) models, with the cubic model slightly more accurate. Likewise, for $MV$, $R^2$ improves from 0.922 to 0.926 and $S_E$ decreases from 12.209 to 11.758 when moving from the linear to the cubic model. All models are statistically significant, with p-values reported as 0.000, indicating that the regression fits shown in Fig. 2, especially the cubic ones, reliably describe the behavior of the $M_1(G)$ properties as functions of $TI$.
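For readers who want to see how statistics of this kind are typically obtained, the sketch below computes $R$, $R^2$, $S_E$, and the F-statistic from observed and fitted values using the usual least squares definitions. It assumes that $S_E$ denotes the standard error of the estimate and that $R$ is the square root of $R^2$; the paper does not state its exact formulas. The function name `fit_statistics` and the numbers are hypothetical and are not drawn from Table 4.

```python
import math


def fit_statistics(y, y_hat, k):
    """Goodness-of-fit statistics for a least squares model with k predictor terms."""
    n = len(y)
    mean_y = sum(y) / n
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))  # residual sum of squares
    sst = sum((yi - mean_y) ** 2 for yi in y)              # total sum of squares
    r2 = 1.0 - sse / sst
    dof = n - k - 1                                        # residual degrees of freedom
    s_e = math.sqrt(sse / dof)                             # standard error of the estimate
    f_stat = ((sst - sse) / k) / (sse / dof)
    # sqrt is valid because r2 >= 0 for a least squares fit that includes an intercept
    return {"R": math.sqrt(r2), "R2": r2, "S_E": s_e, "F": f_stat}


# Hypothetical observations and the fitted values of their least squares line
y = [2.0, 4.0, 5.0, 4.0, 5.0]
y_hat = [2.8, 3.4, 4.0, 4.6, 5.2]
stats = fit_statistics(y, y_hat, k=1)  # k=1 for a linear model, k=3 for a cubic one
print(stats)
```

The p-value is the upper tail probability of the F distribution with $k$ and $n - k - 1$ degrees of freedom evaluated at the F-statistic; computing it requires an F-distribution routine from a statistics library, which is omitted here.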

Table 5 Statistical parameters and regression models for ${M_2(G)}$.

Table 5 summarizes the statistical parameters and regression models for the properties of material $M_2(G)$ as functions of the thermal index, $TI$. The properties considered are $BP$, $EV$, $FP$, $MR$, $SA$, $MV$, and $P$. Both linear and cubic regression models were fitted to each property, and their performance was assessed with statistical measures: the correlation coefficient ($R$), coefficient of determination ($R^2$), standard error ($S_E$), F-statistic (F), and p-value. The cubic models tend to fit better than the linear models, as indicated by higher $R$ and $R^2$ values and lower standard errors. For instance, $MR$ has $R = 0.968$ for the linear model and $R = 0.978$ for the cubic model, with respective $R^2$ values of 0.938 and 0.957. Likewise, for $MV$ the cubic fit lowers the standard error from 9.162 to 7.392. All models are statistically significant, with p-values reported as 0.000, as shown in Fig. 3. The cubic models are particularly well suited to capturing nonlinear trends in the property–TI relationships for $M_2(G)$.

Table 6 Statistical parameters and regression models for H(G).

Table 6 shows the statistical parameters and regression models for the properties of material $H(G)$ as functions of the thermal index ($TI$). The properties analyzed are $BP$, $EV$, $FP$, $MR$, $SA$, $MV$, and $P$. Both linear and cubic models are fitted to each property, with performance evaluated through the correlation coefficient ($R$), coefficient of determination ($R^2$), standard error ($S_E$), F-statistic (F), and p-value. The cubic models tend to perform better than the linear ones, as shown by higher $R$ and $R^2$ values and lower standard errors. For example, $MR$ attains very high correlations under both models, with $R = 0.993$ and $R^2 = 0.985$ for the linear model and $R = 0.997$ and $R^2 = 0.994$ for the cubic model. $MV$ and $P$ also show very high predictive performance, especially under cubic modeling. All models are statistically significant, with p-values reported as 0.000, as shown in Fig. 4. The findings indicate that the cubic models are very effective in capturing the nonlinear relationship between the thermal index and property variation for $H(G)$.

Table 7 Statistical parameters and regression models for F(G).

Table 7 shows the regression models and statistical parameters for the properties of material $F(G)$ with respect to the thermal index $TI$. The properties considered are $BP$, $EV$, $FP$, $MR$, $SA$, $MV$, and $P$. Both linear and cubic regression models were fitted, and model validity was evaluated with the correlation coefficient ($R$), coefficient of determination ($R^2$), standard error ($S_E$), F-statistic (F), and p-value. By and large, the cubic models provide a better fit for all the properties, with higher $R$ and $R^2$ values and lower standard errors. In particular, for $MR$ the linear model gives $R = 0.952$ and $R^2 = 0.907$, while the cubic model raises these to $R = 0.962$ and $R^2 = 0.926$, respectively. Similarly, cubic modeling raises $R^2$ for $P$ from 0.896 to 0.916. All models are statistically significant, with p-values reported as 0.000, as shown in Fig. 5. This highlights the ability of cubic models to describe the nonlinear variation of the properties with the thermal index for $F(G)$.

Table 8 Statistical parameters and regression models for SS(G).

Table 8 presents the statistical parameters and regression models relating the thermal index $TI$ to selected properties of material $SS(G)$. These properties are $BP$, $EV$, $FP$, $MR$, $SA$, $MV$, and $P$, for which both linear and cubic models were fitted. The quality of each model is measured with the correlation coefficient ($R$), coefficient of determination ($R^2$), standard error ($S_E$), F-statistic (F), and p-value. The cubic models perform better than the linear models for all the properties, with higher $R$ and $R^2$ values and lower standard errors. For instance, $MR$ is fitted very well under both models, with the cubic model producing $R = 0.987$ and $R^2 = 0.974$, while the linear model gives $R = 0.985$ and $R^2 = 0.969$. Similar improvements under the cubic models are seen for $FP$, $MV$, and $P$. All models are statistically significant, with p-values reported as 0.000, as shown in Fig. 6. The results indicate the efficacy of cubic models in describing how the properties depend on the thermal index in $SS(G)$.

Table 9 Statistical parameters and regression models for ABC(G).

Table 9 presents the statistical parameters and regression models for the material $ABC(G)$, showing how the properties depend on the thermal index ($TI$). The properties considered are $BP$, $EV$, $FP$, $MR$, $SA$, $MV$, and $P$. For each of them, both linear and cubic models were fitted and assessed with the correlation coefficient ($R$), the coefficient of determination ($R^2$), standard error ($S_E$), F-statistic (F), and p-value. The cubic models tend to fit the data more closely, as reflected in higher $R$ and $R^2$ values and lower standard errors. For example, for $MR$ the cubic regression achieves $R = 0.988$ and $R^2 = 0.975$, compared to the already strong linear model's $R = 0.987$ and $R^2 = 0.975$. Comparable improvements are seen for $FP$, $MV$, and $P$, where the cubic models account for the nonlinear trend. All models have p-values reported as 0.000, indicating statistical significance, as shown in Fig. 7. This further supports the ability of cubic regression models to describe the thermal-index-dependent behavior of the properties of $ABC(G)$.

Table 10 Statistical parameters and regression models for RI(G).

Table 10 shows the regression parameters and models for the material $RI(G)$, demonstrating the effect of the thermal index ($TI$) on the properties $BP$, $EV$, $FP$, $MR$, $SA$, $MV$, and $P$. Each property is modeled with both linear and cubic regression, and model performance is assessed in terms of the correlation coefficient ($R$), coefficient of determination ($R^2$), standard error ($S_E$), F-statistic (F), and p-value. The cubic models again show better predictive performance than the linear models, with higher $R$ and $R^2$ values and smaller standard errors. In particular, $MR$ agrees closely with the cubic model, reaching $R = 0.992$ and $R^2 = 0.983$, marginally above the linear model's $R = 0.991$ and $R^2 = 0.982$. Improvements are also observed for $FP$, $MV$, and $P$, where cubic models reproduce the nonlinear relationships with $TI$ more closely. All models have p-values reported as 0.000, except the cubic fit for $SA$ (p = 0.001), so all are statistically significant. These findings support the efficacy of the cubic models in describing the thermal behavior of the properties of $RI(G)$, as shown in Fig. 8.

Table 11 Statistical parameters and regression models for SC(G).

Table 11 shows the statistical parameters and regression models for material $SC(G)$, which illustrates the thermal index’s impact on various important parameters: $BP$, $EV$, $FP$, $MR$, $SA$, $MV$, and $P$. Each parameter is considered in terms of both linear and cubic regression models, and model performance is evaluated in terms of the correlation coefficient ($R$), coefficient of determination ($R^2$), standard error ($S_E$), F-statistic (F), and p-value.

The cubic models outperform their linear counterparts for all but one property, as indicated by higher values of $R$ and $R^2$ and lower standard errors. For instance, the cubic model for $BP$ yields $R = 0.968$ and $R^2 = 0.937$, an improvement over the linear model with $R = 0.934$ and $R^2 = 0.872$. Similarly, $FP$ is well captured by the cubic model, with $R = 0.942$ and $R^2 = 0.887$, compared with $R = 0.941$ and $R^2 = 0.886$ for the linear model. $MR$ shows very high correlation in both models, with the cubic model marginally ahead of the linear one ($R = 0.992$, $R^2 = 0.983$ compared to $R = 0.991$, $R^2 = 0.981$). The same trend is seen for $MV$ and $P$, where cubic models better capture the nonlinear behavior caused by thermal effects. All the regression models are statistically significant, with p-values reported as 0.000 in all cases except the cubic model of $SA$, which has a p-value of 0.001. These observations support the robustness and efficacy of cubic regression models in describing the thermal behavior of the $SC(G)$ material, as shown in Fig. 9.

Table 12 Statistical parameters and regression models for GA(G).

Table 12 summarizes the regression models and statistical parameters for the material $GA(G)$, indicating the effect of thermal index ($TI$) on each property: $BP$, $EV$, $FP$, $MR$, $SA$, $MV$, and $P$. Both cubic and linear regression models are evaluated for each property, with model performance assessed through the use of the correlation coefficient ($R$), coefficient of determination ($R^2$), standard error ($S_E$), F-statistic (F), and p-value.

As with the other data sets, the cubic models tend to provide better predictive power than the linear models. The improvements are reflected in higher $R$ and $R^2$ values and lower standard errors for all but one property. For instance, the cubic model for $FP$ yields $R = 0.946$ and $R^2 = 0.895$, whereas the linear model yields $R = 0.945$ and $R^2 = 0.893$. $MR$ shows high predictive power with both models: the cubic model yields $R = 0.990$ and $R^2 = 0.981$, marginally above the linear model's $R = 0.987$ and $R^2 = 0.974$. $MV$ is also fitted well, with the cubic model giving $R = 0.975$, $R^2 = 0.951$, and a lower $S_E = 37.093$, capturing the nonlinear dependence on $TI$ more accurately. By contrast, $SA$ shows weaker $R^2$ values for both models, although the cubic model still improves the fit, giving $R^2 = 0.726$ compared with $R^2 = 0.632$ for the linear model. All models are statistically significant, with p-values reported as 0.000, except the cubic model for $SA$, which has a p-value of 0.001. The results support the application of cubic regression models for modeling the thermal response of $GA(G)$, as shown in Fig. 10.

Table 13 Statistical parameters and regression models for HZ(G).

Table 13 shows the statistical parameters and regression models of the material $HZ(G)$, indicating how the thermal index ($TI$) influences important features like $BP$, $EV$, $FP$, $MR$, $SA$, $MV$, and $P$. Both linear and cubic models are utilized, with their performance measured in terms of the correlation coefficient ($R$), coefficient of determination ($R^2$), standard error ($S_E$), F-statistic (F), and p-value.

Cubic models tend to show better predictive accuracy than linear models, with higher $R$ and $R^2$ values and lower standard errors for many of the properties. For instance, the cubic model for $FP$ gives $R = 0.948$ and $R^2 = 0.898$, an improvement over the linear model's $R = 0.934$ and $R^2 = 0.872$. For $MV$, the cubic model gives $R = 0.942$, $R^2 = 0.888$, and a lower $S_E = 33.89$, reflecting improved capture of the nonlinear thermal characteristics. $MR$ also shows high agreement under both models, with the cubic model returning $R = 0.961$ and $R^2 = 0.923$, very slightly higher than the linear model's $R = 0.957$ and $R^2 = 0.916$. $P$ likewise shows high agreement under both models, with the cubic fit giving higher predictive accuracy ($R = 0.960$, $R^2 = 0.922$). All models are statistically significant, with p-values reported as 0.000, further supporting the use of cubic models to express the thermal dependence of the properties of $HZ(G)$. These findings indicate that cubic regression models describe the thermal response of this material more accurately, as shown in Fig. 11.

Table 14 Comparison of actual and predicted drug response values for BP.

Table 15 Comparison of actual and predicted drug response values for EV.

Table 14 compares observed and calculated Boiling Point (BP) values under different experimental conditions. Both cubic and linear regression models were used to predict BP as a function of the independent variable $F$. The actual BP values vary widely across the experiments, reflecting the complex character of such response variables. The cubic regression model consistently fits the data best, with predicted values closest to the actual measurements, particularly at the higher and lower ends of the range. This closer correspondence indicates that the relationship between $F$ and BP is not linear, and hence the cubic model accommodates these fluctuations more effectively. The linear model, in contrast, does reasonably well but under- or over-estimates where the data are curved. The residuals in these regions show where the assumption of a simple linear dependency in BP prediction falls short. Overall, the analysis indicates that, for BP, a higher-order polynomial model, i.e., cubic regression, yields more accurate predictions. This suggests that BP responses depend on several interacting variables, which are best described through non-linear methods.
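A comparison of actual and predicted values like the one in Table 14 can be summarized with residual-based error measures. The sketch below computes residuals, mean absolute error (MAE), and root mean squared error (RMSE) for two sets of predictions. The function name `error_summary` and all numbers are made up for illustration and are not values from Table 14.

```python
import math


def error_summary(actual, predicted):
    """Return (residuals, MAE, RMSE) for paired actual and predicted values."""
    residuals = [a - p for a, p in zip(actual, predicted)]
    mae = sum(abs(r) for r in residuals) / len(residuals)
    rmse = math.sqrt(sum(r * r for r in residuals) / len(residuals))
    return residuals, mae, rmse


# Hypothetical actual values and predictions from a linear and a cubic model
actual = [300.0, 420.0, 510.0, 640.0]
linear_pred = [320.0, 410.0, 500.0, 610.0]
cubic_pred = [305.0, 418.0, 512.0, 632.0]

lin_res, lin_mae, lin_rmse = error_summary(actual, linear_pred)
cub_res, cub_mae, cub_rmse = error_summary(actual, cubic_pred)
print("linear:", lin_res, lin_mae, round(lin_rmse, 3))
print("cubic: ", cub_res, cub_mae, round(cub_rmse, 3))
```

Inspecting the sign and size of the residuals at the ends of the range, rather than only the aggregate error, is what reveals the systematic under- or over-estimation described above.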

Table 15 contains the observed and calculated values of Enthalpy of Vaporization (EV) for the linear and cubic models. In contrast to BP, the EV values follow a consistent and stable trend across the experiments. Both the linear and cubic models agree closely with the actual EV values, and there is little difference between them, indicating that the relationship between $F$ and EV is mostly linear. The linear model makes consistent predictions with little deviation from the actual values, which makes it an efficient and interpretable choice for EV. Although the cubic model adds some flexibility, the performance improvement it offers here is marginal. This indicates that the increased complexity may not be justified, particularly in light of the principle of model parsimony.
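One common way to weigh a marginal gain in fit against added complexity is the adjusted $R^2$, which penalizes each extra predictor term. The sketch below is illustrative only: the sample size $n$ is not stated in this section, so the values of `n` are assumptions, and the two $R^2$ inputs are simply the linear and cubic $MV$ values quoted above for $M_1(G)$ (0.922 and 0.926). The function name `adjusted_r2` is hypothetical.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R^2 for a model with k predictor terms fitted to n observations."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)


# R^2 values quoted in the text for MV under M_1(G); the sample sizes are ASSUMED
r2_linear, r2_cubic = 0.922, 0.926
for n in (20, 100):
    adj_linear = adjusted_r2(r2_linear, n, k=1)  # linear model: one predictor term
    adj_cubic = adjusted_r2(r2_cubic, n, k=3)    # cubic model: three predictor terms
    print(n, round(adj_linear, 4), round(adj_cubic, 4))
```

Under these assumptions the ranking of the two models depends on the sample size: with the smaller assumed `n` the linear model has the higher adjusted value, while with the larger one the cubic model does. A small increase in raw $R^2$ is therefore not sufficient on its own to prefer the more complex model.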

Table 16 compares observed and fitted values of Flash Point (FP) for the linear and cubic models. The observed FP values show moderate variability, suggesting possible non-linearity in the relationship between FP and the independent variable $F$. The cubic regression model's predictions tend to be closer to the true values than those of the linear model. This is especially so where FP takes mid-to-high values, where the linear model tends to oversimplify the trend. The flexibility of the cubic model lets it accommodate slight curvature in the data, leading to lower prediction errors. The linear model, although easier to interpret, underperforms in some experiments, especially at the boundaries of the value range. This highlights the need to consider higher-order models whenever the data show non-linear behavior. In brief, the FP analysis shows that cubic regression yields more precise predictions and reflects the underlying behavior of the response variable more accurately than the linear model.

Table 17 shows the actual and predicted values of Molar Refractivity (MR) for both linear and cubic regression models. The data show high variability of MR among the experiments, suggesting an intricate relationship with $F$. The cubic model describes these variations better than the linear model. Its predicted MR values are more consistent with the actual values, particularly at the extremes, where the linear model deviates. This indicates the presence of non-linear effects that the cubic method handles more effectively. The linear model performs well for mid-range values but falters with larger fluctuations, confirming the need for more flexible modeling in such situations. These observations support the use of cubic regression when modeling parameters with inherently non-linear profiles.

Table 18 contrasts observed and predicted values of drug response with molar volume (MV) as the descriptor across several regression models. The drug responses span a wide range, indicating variability in how pharmacological behavior is influenced by MV. Prediction accuracy varies between models, with models $M_1$ and $M_2$ broadly producing higher correlations with the actual values. These models predict accurately throughout the range of responses, indicating that they are more likely to capture both linear and subtle non-linear relationships with MV. The other models mispredict at times, especially for compounds with unusually high or low response values. Notably, certain deviations between actual and predicted values indicate that MV alone may not adequately capture the intricacies of drug interactions, especially among compounds with more varied chemical characteristics. However, the overall pattern indicates that MV is an important parameter in forecasting drug response, particularly when used in robust models with a flexible functional form. These findings support the use of MV in regression models, but they also point to the potential advantage of hybrid or multi-variable methods for better prediction performance.

Table 19 assesses the variable $P$, presumably a physicochemical or structural attribute, against drug response. The values calculated from the models are in moderate to high agreement with the real data, although performance varies significantly from model to model. Model $M_2$ shows the highest predictive correspondence, especially in edge cases where more adaptive modeling is helpful. This suggests that $P$ has an intricate relationship with drug response, which may involve non-linear behavior or threshold effects. By contrast, linear models fare poorly in describing this complexity, especially for outliers. Despite these difficulties, mid-range predictions are relatively consistent across the majority of models, suggesting some linear behavior in the data. The weaknesses in outlier predictions, though, underline the utility of models capable of non-linear mapping, particularly when predicting biological traits such as $P$. In summary, both MV and $P$ are useful predictors, but the type of regression must be chosen carefully. The use of non-linear or higher-order regression increases accuracy and accommodates the biological variability present in drug response data.

Table 16 Comparison of actual and predicted drug response values for FP.

Table 17 Comparison of actual and predicted drug response values for MR.

Table 18 Comparison of actual and predicted drug response values for MV.

Table 19 Comparison of actual and predicted drug response values for P.

Table 20 Comparison of actual and predicted drug response values for SA.

Table 20 compares observed and modelled values of drug response with Surface Area (SA) as the leading descriptor for various regression models. The data show significant variability in the drug responses, which suggests a non-linear relationship between SA and pharmacological activity. Among these models, the highest accuracy is shown by models $M_1$ and $M_2$, which closely agree with the real values of the drug response over an appreciable range of compounds. This consistent agreement points to the ability of these models to capture both linear and non-linear trends present in the data. On the other hand, the simpler linear models deviate considerably at the limits, which reflects the inability of a linear model to adequately characterize the effect of SA on drug response. The observed inconsistencies, especially among compounds with very high or very low response values, highlight the risk of underfitting in models with reduced flexibility. Such patterns indicate inherent interactions or thresholds that are captured more effectively by more sophisticated regression methods. Additionally, the consistency of predictions for mid-range values across all of the models shows that SA relates to drug response with some degree of linearity. This, though, is not sufficient for high-accuracy prediction, particularly in edge cases, further confirming the need for models that can adapt to localized trends in the data. In general, the evidence supports the value of SA as an effective descriptor in drug response modeling, especially when coupled with higher-level regression methods. The conclusions support the application of higher-order or non-linear models to reduce error and improve predictive accuracy for pharmaceutical use.
