Data collection and survey area
This retrospective cohort study examined 554 districts across four regions of Thailand (north, central, northeast, and south). Data were collected from two major sources.
CCA Case Data
Case data were obtained from four population-based cancer registries (PBCRs): Northern (Lampang Cancer Hospital), Central (Lop Buri Cancer Hospital), Northeastern (Khon Kaen Provincial Cancer Registry), and Southern (Surat Thani Cancer Hospital) [23]. All CCA cases were diagnosed between January 1, 2012 and December 31, 2021, and coded according to the International Classification of Diseases for Oncology, 3rd Edition (ICD-O-3), using the codes C22.1 (intrahepatic bile duct), C24.0 (extrahepatic bile duct), C24.8 (overlapping lesion of biliary tract), and C24.9 (biliary tract, NOS), excluding C24.1 (ampulla of Vater) [24, 25]. Key variables included sex, age at diagnosis, date of birth, ICD-O-3 code, address, and basis of diagnosis. Population data from the Office of the National Economic and Social Development Council [26] were used to calculate age-standardized rates (ASRs), based on five-year age groups, from 2012 to 2021 (Table 1).
Spatial variables
First, environmental data (elevation, water source coordinates, and regional boundaries and scale) were obtained from the central geographic information system services of the Department of Water Resources, Ministry of Natural Resources and Environment [27]. Second, climate data (average rainfall, average temperature, and the coordinates of all weather stations) were obtained from the Thai Meteorological Department through its statistical data request system [28]. All spatial variables were aggregated at the district level (Table 1).
Survey area
This study covers four provinces representing the four major regions of Thailand, with the following areas and geographic coordinates (latitude and longitude): (i) Lampang Province (north): 12,533.96 km², 17.2°–19.5°N, 98.9°–100.2°E; (ii) Lop Buri Province (central): 6,208.70 km², 14.6°–15.8°N, 100.3°–101.5°E; (iii) Khon Kaen Province (northeast): 10,885.99 km², 15.6°–17.1°N, 101.6°–103.3°E; (iv) Surat Thani Province (south): 12,891.4 km², 8.3°–10.2°N, 98.5°–100.2°E [23].
Variables and Measurements
ASR, age-standardized rate; CCA, cholangiocarcinoma; IACR, International Association of Cancer Registries.
Statistical analysis
CCA incidence rate
ASRs were calculated by sex and standardized using the Segi world standard population [29]. International Association of Cancer Registries (IACR) guidelines [30] were used to calculate the ASR of CCA cases in each district.
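For illustration, direct age standardization with the Segi world standard can be sketched in a few lines (a minimal Python sketch using the commonly tabulated Segi weights; the case and population counts below are illustrative, not study data):

```python
# Segi (1960) world standard population weights for the 18 five-year age
# groups 0-4, 5-9, ..., 85+ (sums to 100,000); commonly tabulated values.
SEGI_WEIGHTS = [12000, 10000, 9000, 9000, 8000, 8000, 6000, 6000, 6000,
                6000, 5000, 4000, 4000, 3000, 2000, 1000, 500, 500]

def asr_per_100k(cases, population, weights=SEGI_WEIGHTS):
    """Directly standardized rate per 100,000: weighted sum of age-specific rates."""
    if not (len(cases) == len(population) == len(weights)):
        raise ValueError("age-group vectors must align")
    weighted = sum(w * c / p for w, c, p in zip(weights, cases, population))
    return weighted / sum(weights) * 100_000

# Illustrative district: cases concentrated in older age groups.
cases = [0] * 10 + [2, 3, 5, 6, 4, 3, 1, 1]
pop = [5000] * 18
asr = asr_per_100k(cases, pop)
```

With these illustrative counts the directly standardized rate works out to 14.4 per 100,000.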
Machine Learning Models
Four machine learning models were implemented to predict CCA incidence from spatial variables. The data management process used residential address codes as the key identifier for linking CCA case data with the spatial factors. Distribution tests were performed on all variables prior to analysis; any variable with a skewed distribution (left or right) was log-transformed before entering the machine learning models. Each model represents a different approach to predictive modeling, chosen to provide a comprehensive comparison of techniques applied to spatial epidemiology.
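The skewness check and conditional log transformation can be sketched as follows (a minimal Python sketch; the skewness threshold of 1.0 is an assumption for illustration, not a value stated in the study):

```python
import math

def skewness(xs):
    """Fisher-Pearson sample skewness."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m3 = sum((x - mean) ** 3 for x in xs) / n
    return m3 / (m2 ** 1.5) if m2 > 0 else 0.0

def log_transform_if_skewed(xs, threshold=1.0):
    """Apply log1p when the distribution is strongly skewed (threshold is an assumption)."""
    if abs(skewness(xs)) > threshold:
        return [math.log1p(x) for x in xs], True
    return list(xs), False

vals = [1, 1, 2, 2, 3, 3, 4, 50]        # right-skewed illustrative data
transformed, applied = log_transform_if_skewed(vals)
```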
Linear regression
A statistical model that examines the linear relationship between the dependent variable (ASR of CCA) and multiple independent variables (spatial factors). Because it represents a traditional statistical approach and assumes linear relationships between variables, we chose it as the baseline for comparison.
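As a baseline illustration, least squares fitting has a closed form; the sketch below shows it for a single predictor (illustrative data; the study's model used multiple spatial predictors):

```python
def fit_simple_ols(x, y):
    """Closed-form least squares for y = a + b*x (single-predictor baseline)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sxy / sxx                  # slope
    a = my - b * mx                # intercept
    return a, b

# Illustrative: ASR rising with a hypothetical spatial covariate.
a, b = fit_simple_ols([1, 2, 3, 4], [2.1, 4.0, 6.1, 7.8])
```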
Random Forest
An ensemble learning method that constructs multiple decision trees during training and outputs the average of the individual trees' predictions. Random forests are well suited to spatial epidemiology because they capture nonlinear relationships, handle interactions between variables, and are robust to overfitting. The algorithm draws bootstrap samples of observations and random subsets of variables to build diverse decision trees, each of which contributes a vote to the final prediction [31]. The random forest model used the following specifications: number of trees = 500; variables randomly sampled at each split (mtry) = 2; minimum node size = 5; and the Gini criterion as the split criterion.
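The bagging logic can be illustrated with a deliberately simplified sketch: bootstrap-sampled regression stumps whose predictions are averaged (pure Python; the study used the full random forest algorithm in R, and the data here are illustrative):

```python
import random

def fit_stump(X, y):
    """Best single-feature threshold split minimizing squared error (a regression stump)."""
    best = None
    for j in range(len(X[0])):
        for t in sorted({row[j] for row in X}):
            left = [yi for row, yi in zip(X, y) if row[j] <= t]
            right = [yi for row, yi in zip(X, y) if row[j] > t]
            if not left or not right:
                continue
            lm, rm = sum(left) / len(left), sum(right) / len(right)
            sse = sum((v - lm) ** 2 for v in left) + sum((v - rm) ** 2 for v in right)
            if best is None or sse < best[0]:
                best = (sse, j, t, lm, rm)
    if best is None:                       # degenerate bootstrap sample: no valid split
        m = sum(y) / len(y)
        return (0, float("inf"), m, m)
    return best[1:]                        # (feature, threshold, left_mean, right_mean)

def fit_forest(X, y, n_trees=50, seed=0):
    """Bagging: each stump sees a bootstrap sample; predictions are averaged."""
    rng = random.Random(seed)
    n = len(X)
    trees = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]
        trees.append(fit_stump([X[i] for i in idx], [y[i] for i in idx]))
    return trees

def predict_forest(trees, row):
    preds = [(lm if row[j] <= t else rm) for j, t, lm, rm in trees]
    return sum(preds) / len(preds)

X = [[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]]
y = [1.0, 1.0, 1.0, 9.0, 9.0, 9.0]
trees = fit_forest(X, y)
```

Averaging over many bootstrap-trained learners is what stabilizes the prediction; the real algorithm additionally grows deep trees and subsamples variables at each split (mtry).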
Neural Networks
A computational model inspired by the neural structure of the human brain, designed to recognize complex patterns through interconnected nodes (neurons). A neural network processes information through three main components: an input layer (receiving the spatial variables), hidden layers (processing information through weighted connections), and an output layer (generating the CCA incidence prediction). This architecture allows neural networks to model highly complex, nonlinear relationships between spatial factors and disease incidence [32, 33]. The network used a 5→15→10→1 architecture with ReLU activation in the hidden layers and linear activation in the output layer. Training used the Adam optimizer with early stopping, L2 regularization (weight decay = 0.0001), a learning rate of 0.01, a batch size of 32, and 200 epochs.
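The forward pass of the described 5→15→10→1 architecture can be sketched as follows (pure Python with random untrained weights for illustration; training with Adam, early stopping, and L2 regularization is omitted):

```python
import random

def relu(v):
    return [max(0.0, x) for x in v]

def dense(x, W, b):
    """One fully connected layer: W has shape out x in, b has length out."""
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi for row, bi in zip(W, b)]

def init_layer(n_out, n_in, rng):
    return ([[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)],
            [0.0] * n_out)

def forward(x, layers):
    """5 -> 15 -> 10 -> 1: ReLU on hidden layers, linear output (predicted ASR)."""
    h = x
    for i, (W, b) in enumerate(layers):
        h = dense(h, W, b)
        if i < len(layers) - 1:            # activation on hidden layers only
            h = relu(h)
    return h[0]                            # single linear output unit

rng = random.Random(42)
layers = [init_layer(15, 5, rng), init_layer(10, 15, rng), init_layer(1, 10, rng)]
y_hat = forward([0.2, 0.4, 0.1, 0.8, 0.5], layers)
```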
Extreme gradient boosting (XGBoost)
An advanced implementation of gradient boosting that builds models sequentially, with each new model correcting the errors of the previous ones. XGBoost has three key components: (i) a loss function for assessing model accuracy, (ii) weak learners that perform slightly better than random guessing (usually decision trees), and (iii) an additive model that combines the weak learners into a strong prediction system. XGBoost includes regularization techniques to prevent overfitting, making it potentially valuable for spatial prediction with limited data [34]. The XGBoost model used a learning rate of 0.05, a maximum tree depth of 6, and a minimum child weight of 3. The sampling ratio for both rows (subsample) and columns was set to 0.8. For regularization, the alpha and lambda parameters were set to 0.2 and 0.1, respectively. The model was trained for up to 1,000 boosting rounds with an early stopping mechanism to prevent overfitting.
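The sequential residual-fitting idea can be shown in a stripped-down sketch (single-feature stumps, pure Python; this illustrates the boosting principle only, not the full XGBoost implementation with its regularization terms and column subsampling):

```python
def fit_residual_stump(x, residuals):
    """Best threshold on one feature minimizing squared error of the residuals."""
    best = None
    for t in sorted(set(x))[:-1]:          # splitting above the max value is useless
        left = [r for xi, r in zip(x, residuals) if xi <= t]
        right = [r for xi, r in zip(x, residuals) if xi > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    _, t, lm, rm = best
    return t, lm, rm

def boost(x, y, n_rounds=200, lr=0.05):
    """Each new stump is fitted to the residuals of the current ensemble."""
    f0 = sum(y) / len(y)                   # initial prediction: the mean of y
    preds = [f0] * len(y)
    stumps = []
    for _ in range(n_rounds):
        residuals = [yi - pi for yi, pi in zip(y, preds)]
        t, lm, rm = fit_residual_stump(x, residuals)
        stumps.append((t, lm, rm))
        preds = [p + lr * (lm if xi <= t else rm) for p, xi in zip(preds, x)]
    return f0, stumps

def boost_predict(f0, stumps, xi, lr=0.05):
    return f0 + lr * sum(lm if xi <= t else rm for t, lm, rm in stumps)

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.0, 1.0, 1.0, 9.0, 9.0, 9.0]
f0, stumps = boost(x, y)
```

Each round shrinks the residuals by the learning rate, which is why many small steps (here 200 rounds at lr = 0.05) converge smoothly rather than overfitting in one jump.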
Model Training and Verification
The dataset was randomly divided into training (70%) and testing (30%) subsets for model development and evaluation. This ratio was chosen to balance the need for adequate training data against the need for sufficient test data for reliable performance assessment, given the sample size constraints. The 70:30 split is widely used in machine learning applications and provides a good compromise between these competing needs.
We examined alternative split ratios (80:20, 90:10) by evaluating the same models as in the main analysis, but preliminary analysis showed that the 70:30 split provided the best balance between training and validation set sizes. Of the 554 districts in our study, this division allocated 388 districts (4,465 cases) to training and 166 districts (1,914 cases) to testing.
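A region-stratified 70:30 split, as described below for the workflow, can be sketched as follows (a minimal Python sketch; the district identifiers and region labels are illustrative):

```python
import random

def stratified_split(districts, regions, train_frac=0.7, seed=1):
    """70:30 split stratified by region so each region keeps proportional representation."""
    rng = random.Random(seed)
    by_region = {}
    for d, r in zip(districts, regions):
        by_region.setdefault(r, []).append(d)
    train, test = [], []
    for ds in by_region.values():
        rng.shuffle(ds)                     # randomize within each stratum
        k = round(len(ds) * train_frac)
        train.extend(ds[:k])
        test.extend(ds[k:])
    return train, test

districts = list(range(20))
regions = ["north"] * 5 + ["central"] * 5 + ["northeast"] * 5 + ["south"] * 5
train, test = stratified_split(districts, regions)
```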
Table 1 presents the complete research methodology from data collection to model evaluation. The process began with collecting CCA case data from the four regional cancer registries and spatial data from government databases. After preprocessing, including computation of ASR values and standardization of the spatial variables, we implemented a 70:30 random split stratified by region to maintain proportional representation. Each model was trained on the same training data with hyperparameter optimization and evaluated on a common test set using RMSE, R², and visual assessment with scatter plots.
Model evaluation
To ensure a robust evaluation of model performance, we implemented a comprehensive evaluation framework using three complementary approaches.
Root mean square error (RMSE)
RMSE quantifies prediction error in the same units as the dependent variable, giving larger errors greater weight. This matters in health applications, where large errors can have serious consequences. The metric is the square root of the mean squared difference between predicted and actual CCA incidence rates.
$$ \mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\text{predicted}_i - \text{actual}_i\right)^2} $$
Lower RMSE values indicate better model performance, with smaller prediction errors. We chose RMSE over alternative metrics such as mean absolute error (MAE) because RMSE gives greater weight to large errors through its squaring mechanism, which is particularly valuable in health applications where large prediction errors have greater consequences for resource allocation and intervention planning. This sensitivity to outliers helps identify models that perform well on average but generate large errors in particular regions or incidence ranges.
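The difference between the two metrics is easy to demonstrate (a minimal Python sketch with illustrative values, showing RMSE's greater sensitivity to a single large error):

```python
import math

def rmse(actual, predicted):
    """Root mean square error: squaring gives larger errors more weight."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def mae(actual, predicted):
    """Mean absolute error: all errors weighted equally."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

actual = [10.0, 12.0, 8.0, 30.0]
predicted = [11.0, 11.0, 9.0, 20.0]       # one large miss on the last district
```

Here MAE is 3.25, while RMSE is about 5.07: the single 10-unit miss dominates the squared term, which is exactly the sensitivity the text describes.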
R-squared (R²)
The coefficient of determination measures the proportion of variance in the dependent variable (ASR of CCA) explained by the model's independent variables (spatial factors). Its simple 0-to-1 scale quantifies the proportion of variance in CCA incidence explained, facilitating meaningful comparison with previous research.
$$ R^2 = 1 - \frac{\sum_{i}\left(\text{actual}_i - \text{predicted}_i\right)^2}{\sum_{i}\left(\text{actual}_i - \overline{\text{actual}}\right)^2} $$
R² values range from 0 to 1, with values close to 1 indicating that the model explains most of the variance in CCA incidence rates, suggesting better predictive performance. For each model, a 95% confidence interval for R² was calculated using bootstrap resampling with 1,000 iterations, quantifying the uncertainty in the performance estimates and allowing more rigorous statistical comparison between models.
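The percentile bootstrap used for the R² interval can be sketched as follows (pure Python with illustrative data; the seed and resample count are for reproducibility of the sketch only):

```python
import random

def r_squared(actual, predicted):
    mean_a = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_a) ** 2 for a in actual)
    return 1 - ss_res / ss_tot if ss_tot > 0 else 0.0   # guard degenerate resamples

def bootstrap_ci_r2(actual, predicted, n_boot=1000, alpha=0.05, seed=7):
    """Percentile bootstrap CI for R^2: resample (actual, predicted) pairs with replacement."""
    rng = random.Random(seed)
    pairs = list(zip(actual, predicted))
    stats = []
    for _ in range(n_boot):
        sample = [rng.choice(pairs) for _ in pairs]
        stats.append(r_squared([a for a, _ in sample], [p for _, p in sample]))
    stats.sort()
    return stats[int(alpha / 2 * n_boot)], stats[int((1 - alpha / 2) * n_boot) - 1]

actual = [5.0, 7.0, 9.0, 11.0, 13.0, 6.0, 8.0, 10.0, 12.0, 14.0]
predicted = [5.5, 6.8, 9.4, 10.6, 12.9, 6.2, 8.3, 9.7, 12.4, 13.8]
lo, hi = bootstrap_ci_r2(actual, predicted)
```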
Scatter plots
Scatter plots were created to visualize the relationship between predicted and actual CCA incidences for each model. These visual representations serve multiple analytical purposes.
- Identify patterns of prediction accuracy across different incidence levels.
- Identify potential systematic biases (e.g., consistent overestimation in particular regions).
- Detect heteroscedasticity in the prediction errors.
- Identify regional clusters or outliers that may require special attention.
We enhanced these scatter plots with color coding by region, regression lines showing the observed trends, and 45-degree reference lines representing perfect prediction, allowing deeper visual analysis of model performance.
After comprehensively comparing these models, variable importance analysis was performed using the best-performing model (random forest) to identify the major spatial predictors of CCA incidence. This analysis quantifies the average reduction in prediction accuracy when each variable is excluded from the model while all other variables are held constant. The approach involved:
1. Training the optimal random forest model on the complete dataset.
2. Permuting each predictor variable one at a time (effectively destroying its information while preserving the data structure).
3. Measuring the resulting reduction in prediction accuracy.
4. Ranking the variables by their impact on model performance.
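These steps can be sketched as follows (pure Python; the stand-in "fitted model" and the data are illustrative, and the drop in R² is used as the accuracy-reduction measure):

```python
import random

def r_squared(actual, predicted):
    mean_a = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_a) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

def permutation_importance(predict, X, y, n_repeats=20, seed=3):
    """Mean drop in R^2 when each column is shuffled, averaged over n_repeats shuffles."""
    rng = random.Random(seed)
    baseline = r_squared(y, [predict(row) for row in X])
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            col = [row[j] for row in X]
            rng.shuffle(col)                              # destroy column j's information
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, col)]
            drops.append(baseline - r_squared(y, [predict(row) for row in X_perm]))
        importances.append(sum(drops) / n_repeats)
    return importances

# Stand-in fitted model: depends only on the first feature.
model = lambda row: 3.0 * row[0]
X = [[float(i), float(i % 2)] for i in range(10)]
y = [3.0 * i for i in range(10)]
imps = permutation_importance(model, X, y)
```

The informative feature receives a large importance score, while the ignored feature scores zero: shuffling it leaves every prediction unchanged.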
This permutation-based approach directly measures the effect on the model's predictive performance rather than changes in node purity, offering advantages over alternative variable-importance methods and producing more interpretable results that relate directly to the prediction target.
Models were implemented in R using random forest, neural network, XGBoost, and standard statistical packages. All analyses and visualizations were performed with R version 4.2.1 (R Core Team) [35] and RStudio version 1.4.1 [36]. Spatial data processing used the sf and raster packages, and visualization used ggplot2. Statistical validation, including confidence interval calculations, was implemented using resampling methods.
