Integrating machine learning and spatial clustering for malaria case prediction in Brazil’s Legal Amazon | BMC Infectious Diseases

Dataset

The data for this study were obtained from the Malaria Epidemiological Surveillance Information System (SIVEP-Malaria), a database maintained by the Brazilian Ministry of Health. SIVEP-Malaria is the official system for the mandatory reporting of malaria cases in the Amazon Region and provides comprehensive data on malaria cases, treatments, and related variables, enabling systematic monitoring and analysis across Brazil [21].

Access to this system is restricted and requires authentication via a user-specific login and password. The data used in this study were obtained following approval from the Research Ethics Committee of FMT-HVD under approval number CAAE.

Although SIVEP-Malaria encompasses all federal units of Brazil, over 99.98% of all malaria cases in 2023 were concentrated in the Legal Amazon [7], which comprises nine states: Acre, Amapá, Amazonas, Maranhão, Mato Grosso, Pará, Rondônia, Roraima, and Tocantins. However, the malaria burden is not evenly distributed across these states. The northwest region, particularly states such as Amazonas, Roraima, and Pará, experiences the highest incidence, whereas Maranhão and Tocantins report significantly fewer cases [7, 8].

This study focuses on the location where malaria cases were officially notified, regardless of whether they were autochthonous or imported. Although Maranhão and Tocantins have fewer reported cases than the other states in the Legal Amazon, they were included in the analysis to provide a comprehensive regional assessment. Each state was analyzed independently, and municipalities were grouped into distinct clusters based on notification patterns. This approach is intended to reduce potential biases in the predictive models and to capture localized transmission dynamics.

The dataset used in this study includes all confirmed malaria cases reported between 2003 and 2022, aggregated at a weekly granularity. The choice of this extended time frame is essential for training artificial intelligence models, as these models rely on historical data to recognize patterns and learn long-term trends. By incorporating a broad temporal window, the models can better capture seasonal variations, cyclical outbreaks, and long-term shifts in malaria transmission, improving their predictive accuracy.

Figure 1 illustrates the spatial distribution of malaria risk in the Legal Amazon, based on the Annual Parasite Index (API) for the year 2022, an epidemiological indicator that estimates the risk of contracting malaria in a given population. Because the API considers both case counts and population size, it gives a more accurate representation of the intensity of disease transmission than case counts alone. The municipalities identified as being at highest risk were Jacareacanga (Pará), Japurá (Amazonas), Alto Alegre (Roraima), Barcelos (Amazonas) and São Gabriel da Cachoeira (Amazonas). These findings point to critical areas where malaria transmission remains intense and persistent, underscoring the need for predictive models capable of supporting targeted control strategies in high-risk and often underserved regions.

Fig. 1 Annual Parasite Index (API) for the year 2022

Although the epidemiological landscape has evolved over the past two decades, recent surveillance data confirm that malaria transmission remains concentrated in the same high-burden areas [7, 8]. This validates the use of long-term historical data to enhance predictive modeling while ensuring its relevance for current and future malaria control strategies.

Data preprocessing and feature engineering

Figure 2 provides an overview of the entire data preprocessing workflow designed to create a new dataset that can be utilized in computational models for time-series forecasting.

Fig. 2 Overview of the data preprocessing workflow

The initial step involved extracting raw data from the SIVEP-Malaria system, which contains malaria notification records, geographic identifiers, and laboratory-confirmed test results. Subsequently, the annual datasets were consolidated into a single comprehensive database to streamline data analysis and manipulation.

Following data integration, invalid, duplicate, and negative result entries were identified and removed to prevent redundancy and inconsistencies. In Brazil, the Ministry of Health recommends verifying parasite clearance five days after the initiation of treatment, and these follow-up test results are recorded in the SIVEP-Malaria system. However, in this study, such follow-up entries were excluded to avoid data duplication and minimize potential bias related to recurrence or treatment monitoring. Since the analysis focuses specifically on the Amazon region, records from municipalities outside the Legal Amazon were also excluded, enabling a more geographically targeted and epidemiologically relevant assessment.

After data cleaning, feature selection was performed, keeping only variables relevant to predicting malaria cases. Non-epidemiological attributes, such as administrative metadata unrelated to disease occurrence, were discarded. Date values were standardized by converting them to a DateTime format and grouping weekly, ensuring temporal consistency between records. The variables selected from the SIVEP-Malaria system for this study included: Date of notification, Municipality code of notification and Number of confirmed cases.
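As an illustration of the date standardization and weekly grouping described above, the following is a minimal sketch using only the Python standard library. The record layout, the municipality codes (`MUN_A`, `MUN_B`), and the choice of Monday as the first day of the week are assumptions made for the example; they are not taken from the authors' code.

```python
from collections import Counter
from datetime import date, timedelta


def week_start(day: date) -> date:
    """Return the Monday of the week containing `day` (assumed week convention)."""
    return day - timedelta(days=day.weekday())


def weekly_counts(records):
    """Count confirmed cases per (municipality code, week start).

    `records` is an iterable of (ISO date string, municipality code) pairs,
    one per confirmed case. This layout is hypothetical.
    """
    counts = Counter()
    for iso_day, municipality in records:
        day = date.fromisoformat(iso_day)  # standardize the text date
        counts[(municipality, week_start(day))] += 1
    return counts


# Hypothetical notifications: placeholder codes, not real municipality codes.
records = [
    ("2022-01-03", "MUN_A"),
    ("2022-01-05", "MUN_A"),  # same week as the record above
    ("2022-01-10", "MUN_A"),  # following week
    ("2022-01-04", "MUN_B"),
]

weekly = weekly_counts(records)
for (municipality, start), cases in sorted(weekly.items()):
    print(municipality, start.isoformat(), cases)
```

Keying each count by the first day of the week keeps one row per municipality and week, which is the shape a weekly forecasting model needs.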

Subsequent to the initial preprocessing, the dataset was integrated with demographic data provided by the Brazilian Institute of Geography and Statistics (IBGE). This integration step is crucial to enhance the epidemiological analysis, as it allows the calculation of notification rates adjusted for population size. The integration was conducted through a merging operation using standardized municipality codes provided by IBGE, ensuring that each malaria record was accurately linked to its corresponding demographic data.

After merging the datasets, an additional attribute representing the Notification Rate was generated. This rate is used by the K-means clustering algorithm to group municipalities within each state. The Notification Rate is defined as the ratio between confirmed malaria cases and the local population, multiplied by 1,000, as presented in the following equation.

$$Notification\;Rate=\left(\frac{Number\;of\;confirmed\;malaria\;cases}{Total\;population\;of\;the\;municipality}\right)\times1000$$

This normalization enables fair comparisons of malaria incidence across different regions by accounting for population size disparities. The final output consists of a processed dataset covering the period from 2003 to 2022. All codes used in data preprocessing are available at: https://github.com/dotlab-brazil/Malaria-AmazoniaLegal.
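The merge on municipality code and the Notification Rate calculation can be sketched as follows. This is an illustrative example, not the code in the repository above: the dictionaries, the placeholder codes, and the case and population figures are all invented for the demonstration.

```python
def notification_rate(confirmed_cases: int, population: int) -> float:
    """Confirmed cases per 1,000 inhabitants."""
    if population <= 0:
        raise ValueError("population must be positive")
    return 1000 * confirmed_cases / population


# Hypothetical inputs keyed by municipality code (placeholder codes and values).
cases_by_municipality = {"MUN_A": 150, "MUN_B": 12, "MUN_C": 7}
population_by_municipality = {"MUN_A": 30000, "MUN_B": 48000}

rates = {}
unmatched = []
for code, cases in cases_by_municipality.items():
    population = population_by_municipality.get(code)
    if population is None:
        unmatched.append(code)  # no demographic record for this code
        continue
    rates[code] = notification_rate(cases, population)

print(rates)
print("codes without population data:", unmatched)
```

Tracking the unmatched codes separately makes it easy to check that every malaria record was linked to a population figure before the rate is used for clustering.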

The datasets used in this study were obtained from the SIVEP-Malaria system, maintained by the Brazilian Ministry of Health. These data have been fully anonymized and contain aggregated records of confirmed malaria cases reported across municipalities within Brazil’s Legal Amazon region from 2003 to 2022.

The dataset includes the following variables: date of notification, municipality of notification, laboratory test result, and notification count (i.e., the number of confirmed malaria cases per day, by municipality). The processed dataset that supports the findings of this research is publicly available on the Mendeley Data Repository at: https://data.mendeley.com/datasets/9n6b97fsbd/2.

Statistical analysis

K-means clustering

Cluster analysis is a technique used to group samples in a dataset based on shared characteristics [22]. The K-means algorithm is one of the most recognized methods for data clustering, employing unsupervised classification to partition data into a predefined number of clusters, denoted as k. This algorithm operates by evaluating elements based on the Euclidean distance from the cluster centroids [23].

The K-means algorithm begins by randomly selecting initial centroids, which serve as central points for each cluster. Elements are then assigned to clusters based on their proximity to the centroids. This process iterates until convergence. Initially, each element is assigned to the nearest centroid, as described by the following equation:

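$$S_i^{(t)}=\left\{x_p:\left\|x_p-\mu_i^{(t)}\right\|^2\leq\left\|x_p-\mu_j^{(t)}\right\|^2\;\forall j,\;1\leq j\leq k\right\}$$

where:

  • S is the set of elements assigned to a cluster, based on the distance between each element and the centroids.

  • i indicates the cluster number.

  • xp represents an element that is assigned to its closest centroid.

  • µ(t) is the centroid value.

  • j indexes the centroids against which the distance (the dissimilarity measure) is compared.

  • (t) refers to the iteration of the algorithm.

  • k is the number of clusters.

    After the assignment step, the algorithm updates the centroids by calculating the average of the observations in each cluster.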
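The assignment and update steps described above can be sketched in a few lines of Python. This is a minimal one-dimensional illustration on invented Notification Rate values, using squared Euclidean distance and random initial centroids. The function names and the data are assumptions for the example and do not reproduce the authors' implementation.

```python
import random


def assign(points, centroids):
    """Assignment step: index of the nearest centroid for every point."""
    return [
        min(range(len(centroids)), key=lambda i: (x - centroids[i]) ** 2)
        for x in points
    ]


def update(points, labels, centroids):
    """Update step: each centroid becomes the mean of its assigned points."""
    new_centroids = []
    for i, old in enumerate(centroids):
        members = [x for x, label in zip(points, labels) if label == i]
        # An empty cluster keeps its previous centroid.
        new_centroids.append(sum(members) / len(members) if members else old)
    return new_centroids


def kmeans_1d(points, k, seed=0, max_iter=100):
    """Alternate assignment and update until the assignments stop changing."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # random initial centroids
    labels = assign(points, centroids)
    for _ in range(max_iter):
        centroids = update(points, labels, centroids)
        new_labels = assign(points, centroids)
        if new_labels == labels:  # converged
            break
        labels = new_labels
    return centroids, labels


# Hypothetical notification rates for six municipalities in one state.
rates = [0.5, 0.7, 0.6, 12.0, 11.5, 12.5]
centroids, labels = kmeans_1d(rates, k=2, seed=42)
print(sorted(centroids), labels)
```

On this toy input the low-rate and high-rate municipalities end up in separate clusters, regardless of where they would sit on a map.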

    Using the Notification Rate as a standardized measure allows the identification of clusters of municipalities with similar epidemiological patterns, regardless of their geographic location. The K-means algorithm therefore groups municipalities in the same state by statistical similarity in the behavior of malaria cases rather than by geographic proximity. This data-driven segmentation may ultimately support more precise public health interventions.

    For each state, the optimal number of clusters was determined by the elbow method using the Within-Cluster Sum of Squares (WCSS) as the evaluation metric. This method identifies the point beyond which adding further clusters yields only marginal reductions in WCSS, balancing the capture of significant variability against overfitting due to excessive clustering [24].
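To make the elbow criterion concrete, the sketch below computes the WCSS for several values of k on invented one-dimensional data. It uses a compact K-means with a deterministic, evenly spread initialization so that the curve is reproducible. That initialization is an assumption of this example and replaces the random selection described earlier.

```python
def kmeans_1d(points, k, max_iter=100):
    """Compact 1-D K-means returning (centroids, clusters)."""
    ordered = sorted(points)
    n = len(ordered)
    # Assumption: evenly spread starting centroids instead of random ones,
    # so repeated runs give the same WCSS curve.
    centroids = [ordered[(2 * i + 1) * n // (2 * k)] for i in range(k)]
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for x in points:
            nearest = min(range(k), key=lambda i: (x - centroids[i]) ** 2)
            clusters[nearest].append(x)
        updated = [
            sum(c) / len(c) if c else centroids[i] for i, c in enumerate(clusters)
        ]
        if updated == centroids:  # centroids stopped moving
            break
        centroids = updated
    return centroids, clusters


def wcss(centroids, clusters):
    """Within-Cluster Sum of Squares for a fitted clustering."""
    return sum((x - mu) ** 2 for mu, c in zip(centroids, clusters) for x in c)


# Hypothetical notification rates forming three visibly separate groups.
rates = [1.0, 1.2, 0.8, 10.0, 10.4, 9.6, 50.0, 51.0, 49.0]
curve = {k: wcss(*kmeans_1d(rates, k)) for k in range(1, 5)}
for k, value in curve.items():
    print(k, round(value, 2))
```

On this toy input the WCSS falls sharply up to k = 3 and barely changes at k = 4, which is the flattening that the elbow method looks for.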

    Forecasting models

    Time series forecasting involves predicting future values based on previously observed data points, taking into account temporal dependencies and patterns such as trends and seasonality [25]. In epidemiology, time series models can be used to anticipate disease outbreaks, allowing public health authorities to allocate resources more effectively and implement timely interventions.
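One common way to let a model use previously observed values is to reframe the series as pairs of lagged inputs and a future target. The sketch below illustrates that idea. The window length, the function name, and the case counts are assumptions for the example, and the source does not state that this exact framing was used.

```python
def make_windows(series, n_lags, horizon=1):
    """Turn a univariate series into (lagged inputs, target) pairs."""
    inputs, targets = [], []
    for end in range(n_lags, len(series) - horizon + 1):
        inputs.append(series[end - n_lags:end])    # the n_lags most recent values
        targets.append(series[end + horizon - 1])  # value `horizon` steps ahead
    return inputs, targets


# Hypothetical weekly case counts for one cluster of municipalities.
weekly_cases = [12, 15, 9, 20, 18, 25]
X, y = make_windows(weekly_cases, n_lags=3)
for window, target in zip(X, y):
    print(window, "->", target)
```

Each row pairs three consecutive weeks with the week that follows, so any regression model can be trained on `X` and `y` without seeing future values.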

    To assess the performance of different approaches for predicting weekly malaria cases, this study evaluates six computational models: LSTM, GRU, SVR, RF, XGBoost, and ARIMA. The models were strategically selected to compare both traditional forecasting approaches and more recent techniques. LSTM and GRU are modern architectures based on Recurrent Neural Networks (RNNs) designed to capture long-term temporal dependencies in time series data [26, 27]. SVR, RF, and XGBoost are traditional machine learning models known for their robustness [28,29,30,31,32,33,34]. Finally, ARIMA is a widely used statistical model for time series forecasting and serves as a reliable baseline for comparison with more complex approaches [35].

    Deep learning models: long short-term memory and gated recurrent units

    LSTM is a Recurrent Neural Network (RNN) architecture designed to capture long-term dependencies in time series data. LSTMs address the vanishing gradient problem found in traditional RNNs by introducing memory cells regulated by three gates: the input gate, forget gate, and output gate [36]. This mechanism allows LSTMs to retain and selectively forget information over extended periods, making them particularly effective for time series forecasting tasks that require modeling complex temporal dependencies based solely on historical data. Malaria incidence exhibits intrinsic temporal patterns influenced by seasonality, cyclic behavior, and historical trends, even when external factors are not explicitly included in the model [11, 37]. LSTM networks are well-suited to capture these patterns because they are designed to handle non-linear relationships and long-term dependencies in sequential data [12]. Unlike traditional time series models such as ARIMA, which assume linearity and require stationarity, LSTMs can model complex and dynamic fluctuations without the need for prior data transformation [38]. This capability makes LSTM a suitable choice for predicting malaria cases from historical notification data and for identifying trends and future outbreaks.
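The role of the three gates can be seen in a single LSTM time step. The sketch below uses scalar inputs and states to keep the arithmetic visible. The parameter layout and the numeric values are invented for the illustration, and real implementations operate on vectors with learned weights.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x, h_prev, c_prev, p):
    """One scalar LSTM step; `p` maps each gate name to (w_x, w_h, bias)."""

    def pre(name):
        w_x, w_h, b = p[name]
        return w_x * x + w_h * h_prev + b

    i = sigmoid(pre("input"))        # how much new information to write
    f = sigmoid(pre("forget"))       # how much of the old cell state to keep
    o = sigmoid(pre("output"))       # how much of the cell state to expose
    g = math.tanh(pre("candidate"))  # candidate values to write
    c = f * c_prev + i * g           # updated memory cell
    h = o * math.tanh(c)             # updated hidden state
    return h, c


# Hypothetical parameters: input gate almost closed, forget gate almost open.
params = {
    "input": (0.0, 0.0, -10.0),
    "forget": (0.0, 0.0, 10.0),
    "output": (0.0, 0.0, 0.0),
    "candidate": (1.0, 1.0, 0.0),
}
h, c = lstm_step(x=5.0, h_prev=0.1, c_prev=0.8, p=params)
print(c, h)
```

With the forget gate near one and the input gate near zero, the cell state stays close to its previous value of 0.8 even though the new input is large. This is the mechanism that lets information persist over many time steps.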

    GRUs are a simplified variant of LSTMs, designed to reduce computational complexity while maintaining similar performance. GRUs merge the input and forget gates into a single update gate, making them more efficient in processing without sacrificing accuracy [27]. GRUs have been used successfully in infectious disease prediction, including malaria [27, 39, 40].

    Like LSTMs, GRUs are capable of handling long-term dependencies but with fewer parameters, making them suitable for tasks with large datasets or limited computational resources.
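For comparison, a single GRU step can be sketched in the same scalar form. Gate conventions differ between references, and this example assumes the form in which an update gate near zero keeps the previous state. The parameter values are invented for the illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def gru_step(x, h_prev, p):
    """One scalar GRU step; `p` maps each name to (w_x, w_h, bias)."""
    w_x, w_h, b = p["update"]
    z = sigmoid(w_x * x + w_h * h_prev + b)  # update gate
    w_x, w_h, b = p["reset"]
    r = sigmoid(w_x * x + w_h * h_prev + b)  # reset gate
    w_x, w_h, b = p["candidate"]
    h_tilde = math.tanh(w_x * x + w_h * (r * h_prev) + b)  # candidate state
    # Assumed convention: z close to 0 keeps the previous hidden state.
    return (1.0 - z) * h_prev + z * h_tilde


# Hypothetical parameters with the update gate almost closed.
params = {
    "update": (0.0, 0.0, -10.0),
    "reset": (0.0, 0.0, 0.0),
    "candidate": (1.0, 1.0, 0.0),
}
h = gru_step(x=5.0, h_prev=0.3, p=params)
print(h)
```

The GRU needs three parameter sets and no separate memory cell, whereas an LSTM step needs four parameter sets plus a cell state. That difference is where the reduction in parameters comes from.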

    Machine learning models: support vector regression, random forest, and eXtreme gradient boosting

    SVR is a supervised machine learning technique that models non-linear relationships in time series data. SVR extends the Support Vector Machine (SVM) classification method by estimating a real-valued function, which is useful for continuous predictions, such as forecasting the number of malaria cases [41]. SVR’s ability to handle high-dimensional data and incorporate external variables makes it a valuable tool for complex forecasting tasks [28, 29].

    SVR works by constructing a hyperplane that minimizes prediction error and model complexity, relying on a small subset of the training data, known as support vectors [30].
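The reliance on a small subset of the training data follows from the epsilon-insensitive loss used in the standard SVR formulation, which ignores errors smaller than a tolerance epsilon. The sketch below illustrates that loss only. The tolerance and the values are assumptions, and the source does not report the loss settings used in the study.

```python
def epsilon_insensitive_loss(y_true, y_pred, epsilon):
    """Zero inside the tolerance tube, linear outside it."""
    return max(0.0, abs(y_true - y_pred) - epsilon)


# Hypothetical weekly case counts and model outputs.
observed = [10.0, 12.0, 8.0, 20.0]
predicted = [10.3, 12.4, 9.5, 16.0]
epsilon = 0.5

losses = [
    epsilon_insensitive_loss(y, y_hat, epsilon)
    for y, y_hat in zip(observed, predicted)
]
outside_tube = [i for i, loss in enumerate(losses) if loss > 0]
print(losses, outside_tube)
```

Only the observations outside the tube contribute to the loss, so only those points (together with any lying exactly on the tube boundary) shape the fitted function as support vectors.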

    RF is an ensemble learning algorithm that constructs multiple decision trees using random subsets of data and features, combining their predictions for a final output [42]. This approach is robust against overfitting and is well-suited for high-dimensional datasets. In malaria prediction, RF has been used to analyze various factors influencing disease spread and forecast future incidence [31, 32].

    XGBoost is a scalable and efficient implementation of gradient boosting algorithms. It is widely recognized for its superior performance in structured data and has been used extensively in time series forecasting and classification problems [34]. XGBoost operates by iteratively adding weak learners (typically decision trees) to minimize the residual errors of the previous models [33].
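The idea of adding weak learners that fit the residual errors of the previous ones can be sketched with plain gradient boosting under squared error. This is not XGBoost itself, which adds regularization and many engineering refinements. The single-split stumps, the learning rate, and the data are assumptions for the illustration.

```python
def fit_stump(xs, residuals):
    """Best single-split regression stump under squared error."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = sum((r - left_mean) ** 2 for r in left)
        sse += sum((r - right_mean) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, split, low_value, high_value = best
    return lambda x: low_value if x <= split else high_value


def boost(xs, ys, n_rounds=20, learning_rate=0.5):
    """Each round fits a stump to the residuals left by the current ensemble."""
    base = sum(ys) / len(ys)
    stumps = []
    predictions = [base] * len(ys)
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [
            p + learning_rate * stump(x) for x, p in zip(xs, predictions)
        ]
    return lambda x: base + learning_rate * sum(s(x) for s in stumps)


def mse(actual, predicted):
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical data: previous-week cases (xs) and current-week cases (ys).
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [3, 4, 6, 9, 13, 18, 24, 31]
predict = boost(xs, ys)
baseline_mse = mse(ys, [sum(ys) / len(ys)] * len(ys))
boosted_mse = mse(ys, [predict(x) for x in xs])
print(baseline_mse, boosted_mse)
```

Because every stump is fitted to what the current ensemble still gets wrong, the training error shrinks from round to round. The learning rate controls how much of each correction is applied.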

    One of XGBoost’s key advantages is its ability to handle missing data and optimize model complexity using regularization techniques [34]. The algorithm’s computational efficiency and scalability make it attractive for large datasets, allowing rapid iteration and fine-tuning [33].

    Autoregressive Integrated Moving Average (ARIMA)

    ARIMA is a widely used time series analysis and forecasting technique applicable in fields such as economics, finance, healthcare, and meteorology. The model uses past values of a time series to predict future values by fitting a mathematical model to the data. ARIMA has been employed to forecast various phenomena, including stock prices and disease outbreaks [35].

    The ARIMA model consists of three components: autoregression (AR), differencing (I), and moving average (MA). The AR component models the current value of the time series based on its past values. The I component ensures stationarity by differencing the time series, stabilizing its mean and variance. The MA component models the errors as a linear combination of past errors [35].
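The AR and I components can be illustrated with a hand-built ARIMA(1,1,0) forecast: difference the series once, fit a one-lag autoregression to the differences by least squares, and undo the differencing. The MA component is omitted for brevity, and the series and the no-intercept fit are assumptions chosen so the arithmetic is easy to follow.

```python
def difference(series):
    """First-order differencing (the I component with d = 1)."""
    return [b - a for a, b in zip(series, series[1:])]


def fit_ar1(values):
    """Least-squares AR(1) coefficient without an intercept."""
    numerator = sum(cur * prev for prev, cur in zip(values, values[1:]))
    denominator = sum(prev ** 2 for prev in values[:-1])
    return numerator / denominator


# Hypothetical series whose week-to-week changes halve each step.
series = [0.0, 8.0, 12.0, 14.0, 15.0]

diffs = difference(series)         # [8.0, 4.0, 2.0, 1.0]
phi = fit_ar1(diffs)               # AR: regress each change on the previous one
next_diff = phi * diffs[-1]        # forecast of the next change
forecast = series[-1] + next_diff  # undo the differencing
print(phi, forecast)
```

The forecast is built on the differenced scale and then added back to the last observed value, which is how the I component is reversed at prediction time.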

    Evaluation metrics

    The models were trained and tested using temporal and geospatial subsets of the dataset. Each model’s performance was evaluated using two metrics: Root Mean Squared Error (RMSE) and Mean Absolute Error (MAE) [43].

    RMSE measures the square root of the average of the squared differences between actual and predicted values, as defined in Eq. 1:

    $$RMSE=\sqrt{\frac1T\sum\nolimits_{t=1}^T(y_t-{\widehat y}_t)^2}$$

    (1)

    where yt is the actual value, ŷt is the value predicted by the model, and T is the number of samples [44].

    Conversely, the MAE provides a straightforward calculation of the average absolute differences between actual and predicted values, as expressed in Eq. 2, where n is the number of samples and i indexes them:

    $$MAE=\frac1n\sum\nolimits_{i=1}^n\left|y_i-{\hat y}_i\right|$$

    (2)

    MAE is generally less sensitive to outliers than RMSE, as it does not square the errors. This characteristic makes it useful for providing a direct measure of the magnitude of prediction errors without excessively penalizing larger discrepancies.
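Both metrics are short to compute, and a small example shows how a single large error affects them differently. The values below are invented for the illustration.

```python
import math


def rmse(actual, predicted):
    """Root Mean Squared Error (Eq. 1)."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def mae(actual, predicted):
    """Mean Absolute Error (Eq. 2)."""
    absolute = [abs(a - p) for a, p in zip(actual, predicted)]
    return sum(absolute) / len(absolute)


# Hypothetical weekly cases; the last prediction misses by five cases.
actual = [10, 12, 8, 20]
predicted = [11, 12, 6, 15]

print("MAE :", mae(actual, predicted))
print("RMSE:", rmse(actual, predicted))
```

Here the errors are 1, 0, 2 and 5 cases. The MAE is their plain average, while the RMSE is pulled upward by the five-case miss because that error is squared before averaging.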

    When assessing time series forecasting models, it is important to compare RMSE and MAE within the context of the specific problem. Ideally, both metrics should be minimized to achieve accurate and reliable predictions. These metrics offer insights into model accuracy, enabling an assessment of how closely predicted values align with actual values and providing an understanding of the average magnitude of errors. This information is vital for selecting the most appropriate computational model to ensure dependable and precise predictions.

    By comparing these metrics across models, the study aims to identify the most effective predictive approach for malaria cases. This evaluation also considers how geospatial clustering using K-means impacts model performance by accounting for regional transmission patterns.

    Models’ configuration

    To find the best configuration for each model, a holdout validation method was used, where 80% of the historical data was allocated for training and 20% for testing. Each technique was evaluated over 30 iterations to ensure statistically robust results, with the average RMSE and MAE used as performance indicators.
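The holdout procedure can be sketched as follows. The example assumes a chronological split, in which the earliest 80% of weeks are used for training, since shuffling would leak future information in a forecasting task. The source does not state how the split was ordered. The `fake_run` function is a placeholder standing in for training and evaluating a model once.

```python
from statistics import mean


def holdout_split(series, train_fraction=0.8):
    """Chronological holdout: the earliest observations are used for training."""
    cut = int(len(series) * train_fraction)
    return series[:cut], series[cut:]


def average_over_runs(run_once, n_runs=30):
    """Repeat an evaluation with different seeds and average the metric."""
    return mean(run_once(seed) for seed in range(n_runs))


# Hypothetical series of ten weekly observations.
weeks = list(range(1, 11))
train, test = holdout_split(weeks)
print(train, test)


def fake_run(seed):
    # Placeholder: a real run would train a model with this seed and
    # return its RMSE or MAE on the test portion.
    return 2.0 if seed % 2 == 0 else 3.0


print(average_over_runs(fake_run))
```

Averaging over repeated runs matters mainly for models with random initialization, such as the neural networks, whose error varies from one training run to the next.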

    Computational models require the configuration of multiple hyperparameters, which are crucial for their performance. However, manual tuning of these parameters is often impractical due to the vast search space. To optimize hyperparameters for each model, we employed two strategies: Grid Search and Optuna.

    Grid Search is an exhaustive search technique that systematically trains and evaluates models across all possible combinations of hyperparameters within a predefined search space. The combination that delivers the best performance is selected as the optimal configuration [45, 46].
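A minimal sketch of an exhaustive grid search is shown below. The hyperparameter names, the candidate values, and the `toy_validation_error` function are hypothetical. They are not the search space reported by the authors, and a real objective would train a model and return its validation error.

```python
from itertools import product


def grid_search(param_grid, evaluate):
    """Evaluate every combination in the grid and keep the lowest score."""
    names = list(param_grid)
    best_params, best_score = None, float("inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(params)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Hypothetical search space.
param_grid = {"n_estimators": [50, 100, 200], "max_depth": [3, 5, 7]}


def toy_validation_error(params):
    # Placeholder objective with a known minimum, used only for the demo.
    return abs(params["n_estimators"] - 100) / 100 + abs(params["max_depth"] - 5)


best_params, best_score = grid_search(param_grid, toy_validation_error)
print(best_params, best_score)
```

The number of evaluations is the product of the list lengths (nine here), which is why exhaustive search becomes costly as more hyperparameters or candidate values are added.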

    Optuna was used specifically to optimize the deep learning models (LSTM and GRU). It is an open-source framework for hyperparameter optimization that allows users to define the search space dynamically. Optuna searches for the hyperparameter values that minimize or maximize an objective function. It also supports pruning, a mechanism that halts unpromising trials early and thereby reduces the computational cost of the search [47].

    Table 1 summarizes the hyperparameters used during the optimization process for all models. This dual approach to hyperparameter tuning is intended to bring each model close to its best achievable performance while maintaining generalizability and robustness.

    Table 1 Search space for different models and techniques


