This study focused on an initial understanding of the data related to leptospirosis case outcomes through exploratory data analysis. The goal of this analysis was to identify patterns, trends, and outliers before applying specific learning techniques. This approach provided in-depth insight into the dataset and laid a solid foundation for subsequent implementation of the techniques.

The number of confirmed cases that have resulted in recovery or death.
Exploratory analysis precedes the application of the learning model; Figure 2 shows the whole process in detail. Initially, the mortality rate across all cases was 14.66%, so the classes were imbalanced and the SMOTE (Synthetic Minority Over-sampling Technique) method was adopted to balance the dataset. Addressing class imbalance is essential in machine learning because it improves model performance and helps mitigate overfitting. Furthermore, because SMOTE augments the training data with synthetic samples rather than duplicating existing records, it maintains the diversity and representativeness of the original data and introduces realistic variations, which improves the robustness and generalizability of the model and offers further protection against overfitting.
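The interpolation idea behind SMOTE can be illustrated with a minimal sketch. This is not the implementation used in the study; the function name `smote_sketch`, the toy data, and the choice of `k` are assumptions made only for illustration. Each synthetic sample is placed at a random position on the segment between a minority-class point and one of its nearest minority-class neighbours.

```python
import math
import random


def _euclidean(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def smote_sketch(minority, n_synthetic, k=5, seed=0):
    """Create synthetic minority samples by interpolating between neighbours.

    minority: list of feature vectors (lists of floats) of the minority class.
    Returns n_synthetic new vectors; the original points are not modified.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)  # cannot use more neighbours than exist
    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point (excluding itself)
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: _euclidean(base, p))
        neighbour = rng.choice(others[:k])
        # the new point lies at a random position between base and neighbour
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Hypothetical toy data: four minority-class points with two features each.
minority_points = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]]
new_points = smote_sketch(minority_points, n_synthetic=6, k=2)
```

Because every synthetic point is a convex combination of two real minority samples, it stays inside the region already occupied by the minority class instead of repeating an existing record.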
Recovery and mortality rates among leptospirosis patients from 2007 to 2017 varied over time, as shown in Figure 2. In 2007, the recovery rate was 80% and the mortality rate was 20%. The recovery rate then increased steadily, reaching a peak of 87% in 2011, while the mortality rate decreased. Both rates fluctuated in the following years, indicating that the pattern continued to change over the analyzed period.
Defining a Machine Learning (ML) Model
The ML models selected to evaluate the results are Random Forest, AdaBoost, and Decision Tree. They were selected because they are widely used for this purpose and have given good results in similar studies compared to the other models tested. They were implemented in Python 3.7.6 with the scikit-learn, numpy, pandas, and matplotlib libraries.
Model Analysis
The results were compared using the following performance evaluation metrics: precision, recall, F1 score, MCC (Matthews Correlation Coefficient), and ROC (Receiver Operating Characteristic), as well as the confusion matrix, which separates the four possible outcomes of a binary classification model: True Positive (TP), False Negative (FN), False Positive (FP), and True Negative (TN). One of the main goals of the classifier is to maximize the instances of TP (patients predicted by the algorithm to die who actually died) and TN (patients predicted by the algorithm to be cured who actually were cured) in the test confusion matrix of the model used. Conversely, we aimed to minimize FP and FN, since they represent errors in the classification of the possible leptospirosis outcomes.
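As a minimal sketch of how these metrics follow from the four confusion-matrix counts, the function below computes precision, recall, F1 score, accuracy, and MCC with death as the positive class. The function name `binary_metrics` and the counts in the usage lines are hypothetical and are not taken from the tables of this study.

```python
import math


def binary_metrics(tp, fn, fp, tn):
    """Derive common binary-classification metrics from confusion counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    # MCC: correlation between predicted and actual labels, in [-1, 1]
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "accuracy": accuracy,
        "mcc": mcc,
    }


# Hypothetical counts, chosen only to keep the arithmetic easy to follow.
metrics = binary_metrics(tp=8, fn=2, fp=2, tn=8)
```

Unlike accuracy, MCC uses all four counts in both numerator and denominator, which is why it is often preferred when the classes are imbalanced.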
After studying the methods applied to the data at the preprocessing stage, the experiments showed that some supervised machine learning models produce good classifications, depending on the attributes and hyperparameters used. Because the selection of hyperparameters directly affects the performance of the classification model, their definition was supported by a grid search (exhaustively combining the values listed for each algorithm and evaluating all the models resulting from these combinations). Table 5 shows the hyperparameters studied for each algorithm, which control how each model is trained.
For decision trees, the following hyperparameter values were evaluated: maximum tree depth (integer values in the interval {3, 15}), minimum number of samples in a node (integer values in the interval {2, 15}), minimum number of leaves (integer values in the interval {2, 8}), and the impurity metrics “gini” and “entropy”. To identify the optimal configuration of these hyperparameters, an exhaustive search was performed combining all possible values, resulting in 840 models for decision trees, 19,200 for random forests, and 1,200 for AdaBoost.
For the random forest, the following hyperparameter values were investigated: maximum depth {3,10} (integer values in the interval {2, 16}), minimum number of samples in a node (integer values in the interval {2, 16}), number of trees {50, 100, 150}, attribute evaluation criteria “gini” and “entropy”, and maximum number of features randomly drawn for each attribute evaluation (integer values in the interval {2, 11}).
For AdaBoost, the following hyperparameter values were evaluated: maximum depth {1, 2, 3}, learning rate {0.05, 0.10, 0.20, 0.30, 0.40, 0.50, 0.60, 0.70, 0.80, 0.90, 1.00}, number of “weak” estimators {50, 100, 150, 200}, and minimum number of leaf samples {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}.
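The exhaustive combination of hyperparameter values described above can be sketched as follows. This is an illustration under stated assumptions, not the code used in the study: the grid `toy_grid` is deliberately much smaller than the grids listed above, and `toy_score` is a hypothetical stand-in for training a model and measuring its validation accuracy.

```python
from itertools import product


def grid_search_sketch(grid, score_fn):
    """Evaluate every combination of the listed values and keep the best."""
    names = list(grid)
    best_params, best_score = None, float("-inf")
    n_evaluated = 0
    # product(...) yields one tuple per combination of hyperparameter values
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        n_evaluated += 1
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score, n_evaluated


# Hypothetical small grid: 3 * 2 * 2 = 12 candidate models.
toy_grid = {
    "max_depth": [3, 4, 5],
    "min_samples_split": [2, 3],
    "criterion": ["gini", "entropy"],
}


def toy_score(params):
    """Stand-in for validation accuracy; a real run would fit a model here."""
    bonus = 0.05 if params["criterion"] == "entropy" else 0.0
    return (-abs(params["max_depth"] - 4)
            - 0.1 * params["min_samples_split"] + bonus)


best_params, best_score, n_models = grid_search_sketch(toy_grid, toy_score)
```

The number of models grows as the product of the number of values per hyperparameter, which is why the grids above lead to hundreds or thousands of candidates. In scikit-learn, `GridSearchCV` automates this kind of loop together with cross-validation.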
The ML models were evaluated based on their accuracy. The results shown in Table 6 (training-based) and Table 7 (validation-based) show the values obtained for the accuracy of the respective models. In both tables, the first column shows the evaluated algorithm, while the next 13 columns show each of the evaluated input attribute configurations (Table 4), taking into account the importance of the variables according to the ReliefF and CFS selection methods (Table 3).
For the training set (Table 6), Random Forest achieved the best accuracy (90.81), outperforming the other models evaluated. Experiment 8 removes the following attribute set: Jaundice_INDETERMINATE, Vomiting_INDETERMINATE, Headache_INDETERMINATE, Myalgia_INDETERMINATE, Calf_pain_INDETERMINATE, Prostration_INDETERMINATE, Renal_insufficiency_INDETERMINATE, Respiratory_alterations_INDETERMINATE, and Fever_TRUE.
Secondly, the optimal configuration of attributes to be considered was also identified. In experiment 8, the following were highlighted: time from first symptoms to doctor consultation, time from first symptoms to ELISA sample collection, time from doctor consultation to hospitalization, muscle pain TRUE, headache TRUE, weakness TRUE, calf pain TRUE, vomiting TRUE, jaundice TRUE, renal failure TRUE, and respiratory system changes TRUE.
Since the model must be selected according to the indicators obtained on the validation set, the validation results in Table 7 show that, among the evaluated attribute combinations (Table 4), Decision Tree obtained the best accuracy (74.29). Experiment 10 removes the following attribute set: Jaundice_INDETERMINATE, Vomiting_INDETERMINATE, Headache_INDETERMINATE, Myalgia_INDETERMINATE, Calf_pain_INDETERMINATE, Prostration_INDETERMINATE, Renal_insufficiency_INDETERMINATE, Respiratory_alterations_INDETERMINATE, Fever_TRUE, Myalgia_TRUE, and Prostration_TRUE.
Therefore, in addition to searching for the best hyperparameters per algorithm and the best algorithm among those evaluated, the optimal configuration of attributes to be considered was also identified. In experiment 10, the most useful attributes for determining leptospirosis outcomes in the context of the data provided by the SINAN system were: time between first symptoms and doctor consultation, time between first symptoms and ELISA sample collection, time between doctor consultation and hospitalization, headache, calf pain, vomiting, jaundice, renal failure, and respiratory changes.
It should be noted that the search for optimal attributes was based on the use of filter-type variable selection techniques (CFS and ReliefF) described in the “Selection of attributes of interest” subsection under the “Methods” section.
A confusion matrix was also generated for each model obtained. The confusion matrix for the decision tree model (Table 8) reports the number of true positives, TP (patients predicted by the algorithm to die who actually died = 19); false negatives, FN (patients who died but were predicted to be cured = 16); false positives, FP (patients who were cured but were predicted to die = 10); and true negatives, TN (patients predicted by the algorithm to be cured who actually were cured = 25), showing the high efficiency of the model in the case of cure (n = 16). The confusion matrix is important for evaluating the performance of a machine learning model because other metrics (such as precision, accuracy, and recall) are derived from it.
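The way the four cells of a confusion matrix are counted can be sketched as follows, again with death as the positive class. The function name `confusion_counts` and the label lists are hypothetical toy values, not the data behind Table 8.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, FN, FP, and TN for a binary classification problem."""
    tp = fn = fp = tn = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == positive and predicted == positive:
            tp += 1  # predicted death, actual death
        elif actual == positive:
            fn += 1  # predicted cure, actual death
        elif predicted == positive:
            fp += 1  # predicted death, actual cure
        else:
            tn += 1  # predicted cure, actual cure
    return tp, fn, fp, tn


# Hypothetical labels: 1 = death, 0 = cure.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]
tp, fn, fp, tn = confusion_counts(y_true, y_pred)
```

The four counts always sum to the number of evaluated patients, which gives a quick consistency check on any reported matrix.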
Random Forest is an ensemble learning technique that combines multiple decision trees to make more robust and accurate decisions; it mitigates the overfitting that can occur with a single decision tree, often leading to improved performance. With its voting mechanism, Random Forest can evaluate the importance of features more robustly by building multiple independent trees, while AdaBoost adjusts the weights of observations to give more weight to those that are misclassified.
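The reweighting step mentioned for AdaBoost can be illustrated with one round of the classic (discrete) AdaBoost update. This is a minimal sketch that assumes the input weights sum to 1; the function name `adaboost_reweight` and the toy values are hypothetical, and library implementations may use a variant of this rule.

```python
import math


def adaboost_reweight(weights, misclassified):
    """One round of the classic AdaBoost observation-weight update.

    weights: current observation weights (assumed to sum to 1).
    misclassified: booleans, True where the weak learner was wrong.
    Returns the normalized new weights and the learner's vote weight alpha.
    """
    # weighted error of the weak learner, clipped to avoid division by zero
    error = sum(w for w, wrong in zip(weights, misclassified) if wrong)
    error = min(max(error, 1e-10), 1 - 1e-10)
    alpha = 0.5 * math.log((1 - error) / error)
    # increase the weights of misclassified observations, decrease the rest
    updated = [w * math.exp(alpha if wrong else -alpha)
               for w, wrong in zip(weights, misclassified)]
    total = sum(updated)
    return [w / total for w in updated], alpha


# Hypothetical round: four equally weighted observations, one misclassified.
weights = [0.25, 0.25, 0.25, 0.25]
misclassified = [False, False, False, True]
new_weights, alpha = adaboost_reweight(weights, misclassified)
```

After the update, the misclassified observation carries half of the total weight, so the next weak learner is pushed to classify it correctly.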
Although decision trees are susceptible to overfitting, with suitable hyperparameters they can avoid it and perform well on the test set. Furthermore, decision trees are less sensitive to specific hyperparameters, which makes them easier to configure and able to perform better on the test set than random forests and AdaBoost. Therefore, if the most important features of a task are well represented in the first node of a decision tree model, good performance may be achieved simply through the inherent ability of decision trees to evaluate features. Having only one decision tree also offers an advantage on small datasets, as it is less susceptible to overfitting on small sets.
It is worth emphasizing that the purpose of the validation set is to select the best algorithm/model, that is, the one with the lowest bias and variance error on new data.
