Comparing mechanism-based and machine learning models for predicting the effects of glucose accessibility on tumor cell proliferation

Cell culture and treatment with glucose uptake inhibitor Cytochalasin B

We obtained the triple negative breast cancer cell line, MDA-MB-231³³, from the American Type Culture Collection (ATCC, Manassas, VA) and maintained them in culture according to manufacturer recommendations. We seeded the MDA-MB-231 cells in ninety-six well-plates covering multiple initial confluences (30–90%) in Dulbecco’s modified eagle medium (DMEM without glucose, sodium pyruvate, HEPES, L-glutamine and phenol red, Thermo Fisher Scientific, Waltham, MA) the day prior to image acquisition began. The cells were incubated in the IncuCyte S3 live cell imaging system (Essen BioScience, Ann Arbor, MI) at 37 °C for 4 days, supplied with 5% CO₂ and air. Before image acquisition began, we changed the cell culture medium to DMEM with designated glucose (0.5 mM, 1 mM, 2 mM, 5 mM, and 10 mM, Sigma-Aldrich, St. Louis, MO) and Cytochalasin B (0 μM, 2 μM, or 10 μM, Sigma-Aldrich, St. Louis, MO) concentrations. (Cytochalasin B competitively inhibits the transport of glucose with high efficiency as a glucose uptake inhibitor³⁴.) After the medium change, we added Cytotox Red (Essen BioScience, Ann Arbor, MI) to the medium to monitor cell death. (Cytotox Red enters the cell plasma through the disintegrated membrane of dead cells and then binds to DNA to emit fluorescent signals.) Analysis of the time-lapse fluorescent images allowed the quantification of cell death over time. We prepared four replicates for each initial condition.

With the IncuCyte S3 live cell imaging system, we collected phase-contrast and red fluorescent (excitation wavelength: 585 nm and emission wavelength: 635 nm) images of whole-wells every 3 h for 4 days. While details are provided elsewhere²⁶, the salient features are as follows. A 4 × objective was used to image both channels. The whole-well images were acquired through stitching of multiple images collected by raster scanning across each well. This allows the tracking of both the total confluence (percentage of area covered by cells) and area covered by dead cells (labeled with Cytotox Red) during the entire experiment.

Image processing for cell segmentation

We used Matlab (The Mathworks, Inc., Natick, MA) to perform cell segmentation and generate the data for our modeling, including the time-resolved confluence curves from each well for both live and dead cells. Briefly, we masked the region of interest (ROI) and converted RGB (red, green, blue) images to grayscale for images from both channels. For phase-contrast images we binarized the image based on pixel-wise signal intensity. For fluorescent images we applied a sliding block for edge detection and a Gaussian filter for area smoothing then normalized and binarized the images. Complete details are provided in our previous work²⁶.

Mechanism-based model

We begin with the most parsimonious model selected from our previous study, which used a function of glucose level (details on model selection and derivation are provided in ref. ²⁶) to model the temporal change of tumor cell number and extend it to include the effect of treatment with Cytochalasin B. Our complete model is described by a series of coupled ODEs to calculate the number of live, N(t), and dead cells, D(t), at any given time point, t, as shown below:

$$\frac{dN
(1)

$$\frac{dD
(2)

$$\frac{{dG_{total}
(3)

$$S_{d} \left( {G_{acs}
(4)

$$S_{p} \left( {G_{acs}
(5)

$$G_{acs}
(6)

All model functions, variables, and parameters are described in Table 1. Additionally, the reader is encouraged to consult Figs. 1 and S1 for a brief and detailed, respectively, overview of our approach.

Table 1 The definitions, units, source, and value for the model parameters.

As Cytochalasin B is a glucose uptake inhibitor, we introduced G_acs(t) to describe the glucose level accessible to cells (i.e., the effective glucose concentration within the extracellular environment, due to the impact of the glucose uptake inhibitor) at time t, in comparison to G_total(t), which is the actual glucose level in the system (i.e., the extracellular glucose concentration) at time t. The three terms on the right-hand side of Eq. (1) describe logistic tumor cell growth, tumor cell death induced by glucose depletion, and tumor cell death induced by the bystander effect^26,35,36, respectively. Death can be induced in the remaining live cells with factors^35,36 released by dead cells into the environment. We considered this phenomenon as a manifestation of the bystander effect and included it in the model²⁶ as cell death was consistently underestimated when only considering glucose depletion. Equation (2) describes the accumulation of dead cells from both sources of death (i.e., glucose depletion and the bystander effect). Equation (3) describes the change of total glucose concentration due to the consumption by tumor cells as a function of accessible glucose concentration, which is the glucose level cells could actually utilize. Equations (4) and (5) are state functions, scaling the proliferation rate or glucose depletion induced death rate in real-time as a function of glucose level accessible to cells, where G_min is minimal glucose level required to proliferate. Equation (6) describes the relationship between the accessible glucose concentration and total glucose concentration, where G_in is an inhibition constant dependent on the dose of Cytochalasin B. When there is no treatment, G_acs is equal to G_total and this extended model becomes the same as the baseline model. This formulation assumes Cytochalasin B only effects the ability of tumor cells to access glucose, and does not directly inhibit glycolysis, pyruvate metabolism, or oxidative phosphorylation.

Machine learning models

The prediction of tumor cell growth can be formulated as a supervised learning task; that is, as a regression problem given a set of input features. In this study, we considered the number of both live and dead tumor cells at time zero (i.e., N(t = 0) and D(t = 0)), initial glucose concentration (i.e., G_total(t = 0)), Cytochalasin B dose, and the time of measurement as input features. The number of tumor cells at any given time point for both live and dead cells (i.e., N(t) and D(t)) are the prediction targets.

We employed four commonly used machine learning algorithms: multivariate linear regression, k-nearest-neighbor regression, decision tree regression, and random forest regression. Multivariate linear regression seeks to estimate the result of a variable of interest, using a number of explanatory variables³⁷. It is a parametric approach as it assumes a linear functional form mapping from the input variables to the output variable(s). While parametric approaches are generally easy to fit with a small set of coefficients, they rely on a strong assumption about the form of the linear function, which may not always be correct. Therefore, we also explored non-parametric approaches that do not assume an explicit functional form, thereby allowing greater flexibility. The first non-parametric approach we introduced was the k-nearest-neighbor method^38,39. This algorithm calculates a weighted average (equal to the inverse of the distance) of the k-nearest-neighbor as an estimate for numerical variables. While the non-parametric approaches are more flexible, they can often be difficult to understand and interpret biologically. Therefore, we included a decision tree regression approach⁴⁰, which takes observations about a sample (represented as the branches) to estimates of the target variable (represented as the leaves). Decisions trees are more straightforward to interpret as they follow simple decision rules constructed depending on the data features. Finally, we investigated random forest regression^41,42 which is an ensemble learning method that trains multiple decision-tree models to achieve better performance in prediction compared to results from a single model^43,44,45. This algorithm employs bootstrap aggregating (or bagging⁴⁶) to generate multiple random samples with replacement to train multiple decision trees and then uses the average prediction from each tree to yield a final prediction. Random forest regression can correct for overfitting^47,48,49 resulting from a decision trees analysis⁵⁰, but at the expense of a more difficult to interpret single decision tree.

All machine learning models were implemented in Jupyter Notebooks with the Python library Scikit-learn. To implement the models, the number of cells were first normalized to a value between 0 and 1 (i.e., confluence) by scaling to the carrying capacity (Table 1). Splitting the timeseries data into a training and validation set requires special consideration to avoid data leakage⁵¹ (i.e., when future information is included in the training set, but similar data is not available when the model is used for prediction, resulting in unrealistically high performance in training and potentially poor generalization on unseen data). Our data was split into a training set containing 75% of the samples and a separate validation set containing the remaining 25% (more details can be found in section “Training and validation”). Models were trained only on the training set to estimate model parameters. The trained model was then evaluated on the validation set. As our interest lies in comparing with the mechanism-based model, it is crucial to be consistent on the features we use, properties that we can measure or control in experiments. Therefore, we only used the five features described at the beginning of this section (i.e., N(t = 0), D(t = 0)), G_total(t = 0), Cytochalasin B dose, and the time of measurement) and did not apply further feature selection or dimension reduction. Default hyperparameters as provided by the Scikit-learn library were used for these algorithms and no parameter tuning was performed, allowing us to evaluate their basic performance on our dataset and establish a baseline comparison with the mechanism-based model.

Model calibrations

Throughout the following section, the reader is encouraged to refer to Figs. 1 and S1. Figure 1 provides an overview of the methodology used in the mechanism-based model, while Fig. S1 provides a more detailed schema of the whole calibration and prediction process.

We used two datasets for our model calibration. Dataset A was obtained from our previous work²⁶ and consists of 120 pairs of confluence time courses for live and dead cells from each sample. In Dataset A, the cells were seeded at low, intermediate, and high initial confluences with 10 initial glucose concentrations (0, 0.1, 0.2, 0.5, 0.8, 1, 2, 5, 8, and 10 mM) and were not treated with Cytochalasin B. Dataset B was obtained from new experiments and consist of 180 pairs of confluence time courses for live and dead cells. In Dataset B, the cells were seeded at low, intermediate, and high initial confluences with five initial glucose concentrations (0.5, 1, 2, 5, and 10 mM). Of the 180 wells imaged, 60 wells were not treated, 60 wells were supplied with 2 μM Cytochalasin B, and 60 wells were supplied with 10 μM Cytochalasin B. In both datasets, there were four replicates for each initial condition. Note that for the mechanism-based model, the cell confluences (of both live and dead cells) are multiplied by the carrying capacity θ (i.e., the maximum number of cells that can physically fit in the area, Table 1) and converted into the number of live and dead cells (N(t) and D(t), respectively).

Calibration of the mechanism-based model with Dataset A

The model described in section “Mechanism-based model” (i.e., Eqs. (1)–(6)) was first calibrated to experimental data (i.e., time-resolved measurements of tumor cell number for live and dead cells) from Dataset A, using the initial glucose level and confluence as the initial conditions (Fig. S1, blue arrows, step 1). It was assumed (based on the results from previous work²⁶) that the proliferation rate, k_p, the consumption rate of glucose, v, and the glucose depletion induced death rate, k_d, were characteristics of the cell line, while the bystander effect induced death rate, k_bys, was dependent on initial glucose level. Therefore, we obtained estimates for three global parameters (i.e., a set of parameters shared and applicable across the entire dataset; k_p, k_d, and v) and a local parameter (i.e., a parameter specific for a given sample; k_bys) for each of the 120 time courses in Dataset A. The confidence intervals for all three parameters were also computed.

Establishing relationships between mechanism-based model parameters and initial glucose level

The estimates of k_bys obtained from section “Calibration of the mechanism-based model with Dataset A” were fit to Eq. (7):

$$k_{bys} = k_{bys,0} \exp ( – \alpha G_{0} ) + \beta ,$$

(7)

where k_bys,0 is the maximum k_bys rate when there is sufficient glucose, α represents the dependence on the initial glucose level, G₀ represents the initial glucose level, and β represents a baseline offset (Fig. S1, orange arrows, step 2). This follows the approach developed in our previous work²⁶, assuming k_bys presents an exponential decay depending on the initial glucose concentration. With the estimates of these three parameters, Eq. (7) can be used to estimate k_bys for a new initial condition (i.e., Dataset B) not contained in the training dataset (i.e., Dataset A). With new initial conditions from Dataset B, Eq. (6) was used to represent the initial accessible glucose level G_acs,0, and it is substituted for G₀ in Eq. (7). k_bys is therefore represented as a function of G_in for each sample in Dataset B (Fig. S1, green arrows, step 3).

Calibration of the mechanism-based model with Dataset B to reduce the parameter space for G
_in

The model described in section “Mechanism-based model” was calibrated to experimental data (180 pairs of time-resolved measurements of tumor cell number for live and dead cells) from Dataset B, using the initial glucose level and confluence as the initial conditions (Fig. S1, red arrows, step 4). In this calibration, the three global parameters k_p, k_d, and v, were replaced with the estimates from section “Calibration of the mechanism-based model with Dataset A”, while the local parameter k_bys was obtained from step 3 as described in section “Establishing relationships between mechanism-based model parameters and initial glucose level”. In our modeling, we have the ability to model G_in as a global parameter (i.e., one value for all samples included in the experiments) or a local parameter (i.e., one value for each sample), or somewhere in between those two extremes. Through our simulation experiments, we found having two values, one for each Cytochalasin B concentration, yielded the best model performance. With this approach we can avoid having too much flexibility in the model and risk overfitting, but also make sure the model is capable of capturing the trends apparent in the data. Therefore, in step 3 we only estimate the inhibition constant G_in as a global parameter, assuming k_bys still follows the patterns observed in our previous study²⁶; i.e., k_bys decreases with increasing initial glucose level following an exponential decay. We obtained an estimate of G_in for cells treated with 2 μM and 10 μM of Cytochalasin B, respectively, assuming the inhibition is dose dependent. In this step k_bys was considered as a known parameter (fixed for each sample, assigned with values from Eq. (7)), instead of a free parameter that needs calibration, to reduce the model flexibility and therefore the parameter space of G_in (i.e., to narrow down the potential values of G_in). Confidence intervals of G_in were also obtained.

Calibration of the mechanism-based model with Dataset B to estimate k
_bys and G
_in

The model described in section “Mechanism-based model” was calibrated to experimental data (180 pairs of time-resolved curves of tumor cell number for live and dead cells) from Dataset B, using the initial glucose level and confluence as the initial conditions (Fig. S1, purple arrows, step 5). The three global parameters k_p, k_d, and v, were replaced with their estimates from section “Calibration of the mechanism-based model with Dataset A”. However, as opposed to the approach described in section “Calibration of the mechanism-based model with Dataset B to reduce the parameter space for Gin”, k_bys was treated as a free local parameter and estimated together with the global parameter G_in, with the confidence intervals obtained from section “Calibration of the mechanism-based model with Dataset B to reduce the parameter space for Gin” serving as the bounds for estimating G_in.

Validation of the relation between local k
_bys and accessible glucose concentrations

The estimates of G_in obtained from the last calibration performed according to section “Calibration of the mechanism-based model with Dataset B to estimate kbys and Gin”, along with the initial total glucose level G_total(t = 0) and number of live cells N(t = 0) were used to calculate the initial accessible glucose level G_acs(t = 0) using Eq. (6). The estimates of k_bys as a local free parameter from the same calibration were compared to both G_total(t = 0) and G_acs(t = 0) to determine if they follow an exponential decay as we discovered in our previous work²⁶. If the pattern is again observed, then the estimates of k_bys were fit to Eq. (7) to update the estimates of k_bys,0, α, and β (Fig. S1, brown arrows, step 6).

Evaluation of the performance of the mechanism-based model

We introduced five criteria, namely the coefficient of determination (R²), mean percent error over the time course, percent error at the end of experiment, mean error over the time course, error at the end of experiment (more details on these errors are described in Table S1) to evaluate model performance and compare across different approaches.

Prediction with the mechanism-based model

Given new initial conditions of glucose level, confluence, and dose of Cytochalasin B, we can predict tumor cells with the calibrated model (Fig. S1, cyan arrows, step 7). The initial accessible glucose level G_acs(t = 0) can be calculated using Eq. (6) and then used in Eq. (7) to calculate k_bys. With the initial confluence and initial accessible glucose level, we can solve Eqs. (1)–(6) to run the model forward and obtain time-resolved curves of live and dead tumor cells.

Validation of the accessible glucose level

We sought to validate the assumption that the glucose level accessible to the tumor cells was lower than the actual glucose level in the system, and then provide an estimate of the accessible glucose level for the cells treated with Cytochalasin B. To perform this test, we compared the confluence time courses of the tumor cells with different initial glucose levels but with no exposure to Cytochalasin B. The concordance correlation coefficient (CCC) was employed as a measurement of the agreement between the two sets of time-resolved curves. We hypothesized that cells growing in environments with similar glucose levels would present similar growth curves. Therefore, we compared time-resolved cell number curves of tumor cells with treatment to time-resolved cell number curves of tumor cells without treatment and with similar initial confluences, but with different initial glucose levels. The growth curve without treatment that yielded the highest CCC when compared to the growth curve with treatment provided the best estimate of the accessible glucose level for the treated cells.

Training and validation

Training and test sets

The data set was randomly split into training (75%; n = 225) and validation sets (25%; n = 75). The training set was used to develop the predictive models and the testing set was used as an independent data set for model validation. The splitting for Datasets A and B was performed separately. The training set A_train (75%; n = 90) and the validation set A_valid (25%; n = 30) were divided from Dataset A, while the training set B_train (75%; n = 135) and the validation set B_valid (25%; n = 45) were divided from Dataset B.

Training and validation for the mechanism-based model

The model described in section “Mechanism-based model” (Eqs. (1)–(6)) was calibrated to A_train to obtain estimates for the global parameters k_p, k_d, and v, as well as estimates for the local parameter k_bys via the approach described in section “Calibration of the mechanism-based model with Dataset A”. The estimates of k_bys were fit to Eq. (7) with the associated initial glucose concentration from A_train via the approach described in section “Establishing relationships between mechanism-based model parameters and initial glucose level” to yield estimates of k_bys,0, α, and β. Equations (1)–(6) were first calibrated to B_train with the approach described in section “Calibration of the mechanism-based model with Dataset B to reduce the parameter space for Gin” to obtain a preliminary estimate of G_in with its confidence interval. Equations (1)–(6) were calibrated to B_train again with the approach described in section “Calibration of the mechanism-based model with Dataset B to estimate kbys and Gin” to obtain estimates of both G_in, as a global parameter, and k_bys, as a local parameter, with the confidence interval obtained from first calibration serving as the bounds for estimating G_in. The estimates of k_bys were fit to Eq. (7) to estimate k_bys,0, α, and β with the approach described in section “Validation of the relation between local kbys and accessible glucose concentrations”. The initial conditions of each sample from A_valid and B_valid were used in the forward model to predict the tumor cell growth over time with the approach described in section “Prediction with the mechanism-based model” and compared with the experimental measurements in dataset A_valid and B_valid. The training and validation were repeated 50 times following the steps described above; a new random train-test split was performed each time to generate the data sets needed. Each time, we calculated the R² and errors listed in section “Evaluation of the performance of the mechanism-based model” for model performance evaluation. We report the average model performance over the 50 rounds of training and validation.

Training and validation for machine learning models

The training and validation process were repeated 50 times. In each round, A_train and B_train (generated as described in section “Training and test sets”) were combined as the training set for the machine learning models. Similarly, A_valid and B_valid (generated as described in section “Training and validation for the mechanism-based model”) were combined as the validation set for the machine learning models. In each round, each of the four machine learning models were trained on the combined training set. Predictions were obtained using the initial conditions in the combined validation set, after the machine learning models were trained. The predictions from the initial conditions given in the validation set were compared with the observations in the validation set. In each round, we calculated the coefficient of determination (R²) to for model performance evaluation. We report the average model performance (with 95% confidence interval) over 50 rounds of training and validation.

Learning curves for the machine learning models

For each round of training and validation, the impact from the size of the training set was studied. The four machine learning models were trained on training sets with size increasing from 5 to 100% (with a step size of 5%) of the complete training set (randomly selected) and the model performance on both training and validation sets were evaluated. This process was repeated for 50 rounds.

Statistical analysis

Tumor cell growth time courses were obtained from four experiments for each set of initial conditions, and each point in each time course consisted of a mean ± 95% confidence interval (a one-sample Kolmogorov–Smirnov test confirmed normality). Two‐way ANOVA was used to compare the average number of live cells for each experiment at the end of day 4 between the groups with different initial conditions. A Student’s t-test was used to test for the statistical difference of estimates for G_in between the two Cytochalasin B concentrations. One‐way ANOVA was used to compare the performance (R²) of all five models. The explicit error metrics used to evaluate the mechanism-based model performance can be found in Table S1. The CCC was used to measure the agreement between two time-resolved curves. Bonferroni correction was applied to control the family-wise error rate (“FWER”) when comparing the performance between any two models, where each hypothesis is tested at a significance level of the selected alpha divided by the number of hypothesis tests to guarantee that the probability of having one or more false positives is less than alpha.

Source link

创建个人账户 commented on WestMetric Defends Controversial On-Page SEO Services for the Era of AI: Your article helped me a lot, is there any more re
Registro commented on Security Architect | eFinancialCareers: Thanks for sharing. I read many of your blog posts
Anm"al dig f"or att fa 100 USDT commented on Best ChatGPT Tips and Tricks shared by ChatGPT Experts: Turbo-Charge Your AI Experience: Prompts included | by Michael King | Oct, 2023: Thanks for sharing. I read many of your blog posts
Elizabeth Nash commented on AI platform Hugging Face says hackers have stolen authentication tokens from Spaces: 🌍 Global crypto mining is now at your fingertips h
Binance美国注册 commented on Meta’s Mark Zuckerberg on Threads, the future of AI, and Quest 3: Your article helped me a lot, is there any more re

Comparing mechanism-based and machine learning models for predicting the effects of glucose accessibility on tumor cell proliferation

Cell culture and treatment with glucose uptake inhibitor Cytochalasin B

Image processing for cell segmentation

Mechanism-based model

Machine learning models

Model calibrations

Calibration of the mechanism-based model with Dataset A

Establishing relationships between mechanism-based model parameters and initial glucose level

Calibration of the mechanism-based model with Dataset B to reduce the parameter space for G
_in

Calibration of the mechanism-based model with Dataset B to estimate k
_bys and G
_in

Validation of the relation between local k
_bys and accessible glucose concentrations

Evaluation of the performance of the mechanism-based model

Prediction with the mechanism-based model

Validation of the accessible glucose level

Training and validation

Training and test sets

Training and validation for the mechanism-based model

Training and validation for machine learning models

Learning curves for the machine learning models

Statistical analysis

Leave a Reply

RECENT POSTS

AI is changing NEET preparation, but not replacing the basics

BluSky AI Inc. Announces Appointment of Mort Aaronson to Board of Directors

Companies are restricting employees from using AI because the cost is too high

Cell culture and treatment with glucose uptake inhibitor Cytochalasin B

Image processing for cell segmentation

Mechanism-based model

Machine learning models

Model calibrations

Calibration of the mechanism-based model with Dataset A

Establishing relationships between mechanism-based model parameters and initial glucose level

Calibration of the mechanism-based model with Dataset B to reduce the parameter space for G in

Calibration of the mechanism-based model with Dataset B to estimate k bys and G in

Validation of the relation between local k bys and accessible glucose concentrations

Evaluation of the performance of the mechanism-based model

Prediction with the mechanism-based model

Validation of the accessible glucose level

Training and validation

Training and test sets

Training and validation for the mechanism-based model

Training and validation for machine learning models

Learning curves for the machine learning models

Statistical analysis

Related Posts

Leave a Reply

Calibration of the mechanism-based model with Dataset B to reduce the parameter space for G
_in

Calibration of the mechanism-based model with Dataset B to estimate k
_bys and G
_in

Validation of the relation between local k
_bys and accessible glucose concentrations