Automated workflow for leukemia classification
This study aimed to develop an automated system for the processing and analysis of blood smear images to enhance the diagnostic accuracy of ALL and AML. The proposed methodology was designed to assist pathologists by facilitating the segmentation and classification of ALL and AML cells, thereby supporting faster and more precise clinical decision-making.
WBCs are classified into five main subtypes: monocytes, lymphocytes, basophils, eosinophils, and neutrophils, collectively considered healthy cells. For diagnosing ALL, lymphoid WBCs were the primary focus, while myeloid WBCs were analyzed for AML diagnosis. In blood smear images, lymphoid and myeloid cells exhibit distinguishable nuclei that differ markedly from the surrounding background and other blood cells. The affected cells (lymphoblasts in ALL and myeloblasts in AML) undergo specific morphological changes that can be identified through computational methods.
The methodology employed a hybrid approach that combined feature extraction using pre-trained CNNs with classification performed by both traditional machine learning models and deep learning-based classifiers. Semantic segmentation was performed on blood smear images to isolate lymphoid or myeloid cells from the background and other cellular components. Features were then extracted from the segmented regions using pre-trained CNNs (e.g., VGG-16, InceptionV3, ResNet-50). Finally, classification was conducted using RF, SVM, XGBoost, and an MLP to categorize cells into healthy WBCs, lymphoblasts, or myeloblasts.
The workflow of the proposed system is summarized in the block diagram (Fig. 1) and comprises the following steps:
1. Input Data: Blood smear images were sourced from the ALL-IDB and Munich AML Morphology datasets.
2. Preprocessing: Semantic segmentation and data augmentation were applied to enhance image quality and diversity.
3. Feature Extraction: Pre-trained CNNs were used to extract relevant image features from the segmented regions.
4. Classification: Machine learning classifiers, including RF, SVM, XGBoost, and MLP, were utilized for cell categorization.
5. Output: The predicted cell classifications were generated, along with performance evaluation metrics.
The proposed approach was developed to enable precise detection and classification of ALL and AML cells, providing an effective computational tool for advancing leukemia diagnostics. Subsequent sections detail the dataset sources, preprocessing techniques, feature extraction methods, and classification algorithms employed in this study.

Fig. 1. Workflow of the proposed system, illustrating the process from data preprocessing to the final classification output.
Dataset integration and preparation
The data used to evaluate the proposed approach for ALL classification were sourced from the publicly available Acute Lymphoblastic Leukemia Image Database for Image Processing (ALL-IDB). This database comprises two subsets, ALL-IDB1 and ALL-IDB2, containing microscopic images of blood samples annotated by qualified oncologists. The images were captured using a Canon PowerShot G5 camera attached to an optical laboratory microscope, with magnifications ranging from 300x to 500x. All images were provided in JPG format with a 24-bit color depth [10]. For this study, the ALL-IDB2 dataset was selected because it includes pre-segmented cells extracted from complete microscopic images, simplifying the processing pipeline. Despite differences in image size, ALL-IDB2 maintains similar grayscale properties to ALL-IDB1, ensuring consistency within the dataset.
For AML, the data were drawn from the Munich AML Morphology Dataset, which contains expert-labeled single-cell images from peripheral blood smears of 100 AML-diagnosed patients and 100 non-malignant cases. These images were collected at the Munich University Hospital between 2014 and 2017 using an M8 digital microscope/scanner (Precipoint GmbH, Freising, Germany) at 100x optical magnification with oil immersion. Experienced professionals categorized both pathological and non-pathological leukocytes based on morphological guidelines derived from clinical practice [11].
To improve the robustness and generalization capabilities of the proposed system, the study integrated the ALL-IDB2 and Munich AML Morphology datasets. This unified dataset contained a total of 390 images, with 130 images for each of the three classes: healthy cells, lymphoblasts, and myeloblasts. The balanced dataset ensures equitable representation of each class, which is critical for minimizing bias and enhancing classification accuracy. Representative images of the three cell types are shown in Fig. 2, illustrating their distinct morphological features.
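As a minimal sketch of how such a balanced three-class set could be assembled, the snippet below draws 130 items per class from per-class pools of image paths. The source does not state how the 130 images per class were chosen, so the seeded random sampling, the `build_balanced_dataset` helper, and the placeholder file names are all assumptions made for illustration.

```python
import random

CLASSES = ("healthy", "lymphoblast", "myeloblast")

def build_balanced_dataset(pools, per_class=130, seed=0):
    """Draw the same number of items from every class pool.

    pools: dict mapping class name -> list of image paths (hypothetical).
    Returns a shuffled list of (path, label_index) pairs.
    """
    rng = random.Random(seed)
    samples = []
    for label, name in enumerate(CLASSES):
        pool = pools[name]
        if len(pool) < per_class:
            raise ValueError(f"class {name!r} has only {len(pool)} images")
        # Sampling without replacement keeps every selected image unique.
        for path in rng.sample(pool, per_class):
            samples.append((path, label))
    rng.shuffle(samples)
    return samples

# Placeholder pools; real paths would come from ALL-IDB2 and the Munich AML set.
pools = {name: [f"{name}_{i:04d}.jpg" for i in range(200)] for name in CLASSES}
dataset = build_balanced_dataset(pools)
counts = [sum(1 for _, y in dataset if y == k) for k in range(len(CLASSES))]
```

With three classes and 130 items each, `dataset` holds 390 labeled entries, matching the class balance described above.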

Fig. 2. Morphological characteristics of (a) healthy WBCs (130 images), (b) lymphoblasts (130 images), and (c) myeloblasts (130 images) are shown. These images are used for model training and evaluation.
Sample pre-processing
To address the limited number of microscopic blood sample images available in the datasets, data augmentation techniques were applied to artificially expand the training set. Augmentations included rotations of 60° and 90°, horizontal flips, vertical flips, and random shifts within the range of (1.0, 1.0). These transformations were chosen to simulate variations in cell orientation and positioning that occur naturally during sample preparation or microscopic imaging. By increasing dataset diversity, these augmentations reduced the risk of overfitting and improved the model’s generalization capabilities. Examples of these augmentation techniques applied to an original training image are shown in Fig. 3.
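The flip, 90° rotation, and shift operations can be illustrated on a tiny image stored as nested lists. This is only a sketch of the geometric transforms, not the augmentation code used in the study: the helper names are hypothetical, the shift is expressed in whole pixels with zero fill (an assumption), and the 60° rotation is omitted because it requires interpolation.

```python
def hflip(img):
    """Horizontal flip: reverse the pixel order within every row."""
    return [row[::-1] for row in img]

def vflip(img):
    """Vertical flip: reverse the order of the rows."""
    return [row[:] for row in img[::-1]]

def rot90(img):
    """Rotate 90 degrees counter-clockwise (transpose, then reverse rows)."""
    return [list(col) for col in zip(*img)][::-1]

def shift(img, dy, dx, fill=0):
    """Translate by (dy, dx) pixels; uncovered positions receive `fill`."""
    h, w = len(img), len(img[0])
    out = [[fill] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            nr, nc = r + dy, c + dx
            if 0 <= nr < h and 0 <= nc < w:
                out[nr][nc] = img[r][c]
    return out

def augment(img):
    """Return the original image plus one variant per transform."""
    return {
        "original": img,
        "hflip": hflip(img),
        "vflip": vflip(img),
        "rot90": rot90(img),
        "shift_right": shift(img, 0, 1),
    }

img = [[1, 2],
       [3, 4]]
variants = augment(img)
```

Each transform preserves the cell content while changing its orientation or position, which is the kind of variation the augmentation step is meant to simulate.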
Before augmentation, all input images were resized to 256 × 256 pixels and normalized to ensure uniformity and compatibility with the U-Net model’s input requirements. This preprocessing step standardized the dataset and improved training efficiency.
Semantic segmentation was employed to isolate WBCs from background artifacts and other cellular components. Background removal plays a critical role in bioimage classification tasks by reducing noise, enhancing focus on relevant features, and mitigating bias caused by variable or cluttered backgrounds [19].
The segmentation was performed using a U-Net architecture [20]. The U-Net model consisted of convolutional layers with 3 × 3 filters for feature extraction, dropout layers with a rate of 0.5 to prevent overfitting, max-pooling layers for down-sampling, and transpose convolutional layers for up-sampling [21]. Concatenation layers combined features from different levels of the network, enabling the integration of both low- and high-level features [22]. The output of the model was a single-channel segmentation map representing the isolated WBCs. The augmented dataset served as input for the segmentation step, ensuring diverse and representative samples for training the U-Net model. Figure 4 demonstrates the segmentation process, where the U-Net architecture successfully isolated WBCs from the background.
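To make the role of the single-channel segmentation map concrete, the sketch below applies such a map to an RGB image to remove the background. It assumes the map holds per-pixel foreground probabilities and that a threshold of 0.5 separates cell from background; the threshold, the black fill color, and the `apply_mask` helper are illustrative assumptions, not details given in the source.

```python
def apply_mask(image, prob_map, threshold=0.5, background=(0, 0, 0)):
    """Keep pixels whose foreground probability exceeds the threshold.

    image:    rows of (R, G, B) tuples.
    prob_map: rows of floats in [0, 1], same height and width as image.
    """
    out = []
    for img_row, p_row in zip(image, prob_map):
        out.append([px if p > threshold else background
                    for px, p in zip(img_row, p_row)])
    return out

# Toy 2 x 2 image: the right-hand column is the (darker) cell nucleus.
image = [[(200, 180, 190), (120, 40, 130)],
         [(210, 190, 200), (110, 35, 125)]]
prob_map = [[0.05, 0.97],
            [0.10, 0.88]]
segmented = apply_mask(image, prob_map)
```

Only the pixels flagged as foreground survive, so the downstream feature extractor sees the cell without the surrounding smear.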

Fig. 3. Augmented data used in training.

Fig. 4. Semantic segmentation of white blood cells. (a) Original raw image. (b) Segmented image with background removal.
Pre-trained networks for feature selection
Several pre-trained CNNs, including VGG-16 [23], InceptionV3 [24], and ResNet-50 [25], were employed in this study for feature extraction. These networks, trained on large datasets such as ImageNet, are well-suited for image classification tasks due to their robust architectures and widespread use. In this study, the feature extraction process can be represented mathematically in Eq. (1) as:
$$Z=\phi (X;\theta ),Z \in {{\mathbb{R}}^{n \times d}}$$
(1)
where:
- \(X=\left\{ {{x_1},{x_2}, \ldots ,{x_n}} \right\}\) is the dataset, with \({x_i} \in {{\mathbb{R}}^{h \times w \times c}}\) representing an image of height h, width w, and c channels (e.g., c = 3 for RGB),
- n is the number of input images,
- \(\phi (X;\theta )\) is the feature extraction function, where θ are the parameters of the pre-trained CNN,
- Z is the feature matrix, with each row representing a d-dimensional feature vector extracted from the corresponding input image.
VGG-16 is a series network in which each layer takes its input only from the preceding layer. In contrast, InceptionV3 uses a Directed Acyclic Graph (DAG) structure, which allows branching and merging pathways between layers. ResNet-50, a residual network, uses skip connections to mitigate the vanishing gradient problem and maintain accuracy in deep networks. The extracted feature matrix Z provides a compact representation of the input data for downstream tasks such as classification. Here, d corresponds to the number of features extracted from the final or penultimate layer of the network.
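The shape contract of Eq. (1) can be illustrated without a deep learning framework. The sketch below replaces the pre-trained CNN with a deliberately trivial stand-in, per-channel mean pooling, so that d equals the number of channels c. This stand-in is an assumption chosen only to show how n images become an n × d matrix Z; it is not the extractor used in the study.

```python
def phi(image):
    """Stand-in feature extractor: mean of each channel over all pixels.

    image: rows of pixels, each pixel a tuple of c channel values.
    Returns a feature vector of length d = c.
    """
    h, w = len(image), len(image[0])
    c = len(image[0][0])
    sums = [0.0] * c
    for row in image:
        for px in row:
            for k in range(c):
                sums[k] += px[k]
    return [s / (h * w) for s in sums]

def extract_features(X):
    """Apply phi to every image; row i of Z belongs to image x_i."""
    return [phi(x) for x in X]

# Two toy 2 x 2 RGB images (n = 2, h = w = 2, c = 3).
X = [
    [[(0, 0, 0), (2, 4, 6)], [(2, 4, 6), (0, 0, 0)]],
    [[(1, 1, 1), (1, 1, 1)], [(3, 3, 3), (3, 3, 3)]],
]
Z = extract_features(X)
n, d = len(Z), len(Z[0])
```

With a real pre-trained network, `phi` would be the forward pass up to the chosen layer and d would be that layer's output size, but the row-per-image layout of Z stays the same.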
Classifiers for leukemia image classification
The study employs a range of classifiers, including traditional machine learning models like RF, SVM, and XGBoost, as well as the deep learning-based MLP, to classify leukemia images. These classifiers were selected due to their complementary strengths in handling diverse data characteristics. RF is known for its robustness and ability to generalize effectively on small datasets [26], SVM excels in high-dimensional feature spaces [27], XGBoost provides scalability and effective regularization to mitigate overfitting [28], and MLP captures complex non-linear relationships in the feature space [29].
Feature vectors extracted from the fully connected layers of the pre-trained CNNs, collected in the feature matrix Z, served as inputs to these classifiers, and the classifier-specific parameters were denoted as ψ, consistent with Eqs. (2) to (5). The classifiers’ performance was evaluated using precision, recall, F1-score, and accuracy to comprehensively assess their effectiveness in distinguishing between healthy cells, lymphoblasts, and myeloblasts.
Deep learning-based multi-layer perceptron (MLP)
The MLP classifier was implemented to model non-linear relationships in the feature space. Its architecture consisted of an input layer of dimension d, a single hidden layer with 128 neurons, and an output layer with three neurons corresponding to the three classes: healthy, lymphoblasts, and myeloblasts. The prediction function for MLP can be expressed in Eq. (2) as:
$$f\left( {Z;\psi } \right)={\text{softmax}}\left( {{W_h} \cdot \sigma \left( {{W_i} \cdot Z+{b_i}} \right)+{b_h}} \right)$$
(2)
where:
- \({W_i}\) and \({W_h}\): Weight matrices for the input and hidden layers, respectively,
- \({b_i}\) and \({b_h}\): Bias terms,
- \(\sigma \left( \cdot \right)\): Rectified Linear Unit (ReLU) activation function.
The final probabilities were computed using the softmax function. Optimization was performed using the Adam optimizer with a learning rate of 0.001. Categorical cross-entropy was used as the loss function, and dropout layers with a rate of 0.5 were incorporated to mitigate overfitting. The model was trained for 50 epochs with a batch size of 32, and hyperparameters were selected based on validation performance [30].
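The forward pass of Eq. (2) can be sketched for a single feature vector in plain Python. The weights below are hand-picked toy values, and the hidden layer has 3 units rather than the 128 described above, purely to keep the example readable; training, dropout, and the Adam optimizer are not shown.

```python
import math

def relu(v):
    """Element-wise rectified linear unit."""
    return [max(0.0, x) for x in v]

def affine(W, v, b):
    """Compute W·v + b, with W stored as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) + bias
            for row, bias in zip(W, b)]

def softmax(v):
    """Convert scores to probabilities; the max is subtracted for stability."""
    m = max(v)
    exps = [math.exp(x - m) for x in v]
    total = sum(exps)
    return [e / total for e in exps]

def mlp_predict(z, W_i, b_i, W_h, b_h):
    """Eq. (2) for one sample: softmax(W_h · ReLU(W_i · z + b_i) + b_h)."""
    hidden = relu(affine(W_i, z, b_i))
    return softmax(affine(W_h, hidden, b_h))

# Toy sizes: d = 2 input features, 3 hidden units, 3 output classes.
z = [1.0, -2.0]
W_i = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b_i = [0.0, 0.0, 0.5]
W_h = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
b_h = [0.0, 0.0, 0.0]
probs = mlp_predict(z, W_i, b_i, W_h, b_h)
predicted_class = probs.index(max(probs))
```

The three entries of `probs` sum to one and can be read as class probabilities for healthy cells, lymphoblasts, and myeloblasts, in whatever order the labels were encoded.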
Random forest (RF)
RF was selected for its robustness and ability to generalize effectively across diverse datasets. It consists of an ensemble of T decision trees, with predictions aggregated through majority voting. The final prediction function is defined in Eq. (3) as:
$$f\left( {Z;\psi } \right)=\frac{1}{T}\sum\limits_{{t=1}}^{T} {{h_t}(Z)}$$
(3)
where \({h_t}(Z)\) is the prediction of the t-th decision tree and T is the total number of trees.
For this study, T was set to 100, and the maximum tree depth was left unrestricted to allow full tree growth. The minimum number of samples required to split a node was set to 2, and the minimum number of samples required for a leaf node was set to 1 [28]. Hyperparameter optimization, including T, was conducted using the RandomizedSearchCV method to ensure optimal performance [28, 31].
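Eq. (3) writes the forest output as an average, while the text describes majority voting. The two agree when each tree's output is treated as a one-hot vote: averaging the votes gives the share of trees per class, and the class with the largest share is the majority class. The sketch below shows this with five hypothetical tree outputs; treating \({h_t}(Z)\) as a one-hot vote is an interpretive assumption, not something the source states.

```python
from collections import Counter

def majority_vote(tree_predictions):
    """Return the class label predicted by the largest number of trees."""
    return Counter(tree_predictions).most_common(1)[0][0]

def averaged_vote(tree_predictions, n_classes=3):
    """Eq. (3) with one-hot tree outputs: mean vote share per class."""
    T = len(tree_predictions)
    shares = [0.0] * n_classes
    for label in tree_predictions:
        shares[label] += 1.0 / T
    return shares

# Hypothetical class labels (0, 1, or 2) returned by T = 5 trees for one cell.
votes = [1, 1, 2, 0, 1]
shares = averaged_vote(votes)
predicted = shares.index(max(shares))
```

Here three of five trees vote for class 1, so both the averaged shares and the direct majority vote select class 1.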
Support vector machines (SVM)
Support Vector Machines (SVMs) were employed for their effectiveness in high-dimensional spaces and their ability to handle non-linear relationships using kernel functions. The SVM classifier separates data points by constructing a hyperplane defined in Eq. (4) as:
$$f\left( {Z;\psi } \right)=sign\left( {Z \cdot w+b} \right)$$
(4)
where w is the weight vector normal to the separating hyperplane and b is the bias term.
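As a minimal sketch of the decision rule in Eq. (4), the function below evaluates the sign of \(Z \cdot w+b\) for one feature vector. It covers only the linear, two-class case with hand-picked values of w and b; the kernel and the multi-class strategy used for the three cell types are not specified in the text and are therefore not modeled here.

```python
def svm_decision(z, w, b):
    """Return +1 or -1 depending on which side of the hyperplane z lies."""
    score = sum(zi * wi for zi, wi in zip(z, w)) + b
    return 1 if score >= 0 else -1

# Hypothetical hyperplane in a 2-dimensional feature space.
w = [1.0, -1.0]
b = -0.5

label_a = svm_decision([2.0, 0.0], w, b)   # score = 1.5
label_b = svm_decision([0.0, 2.0], w, b)   # score = -2.5
```

Points with a positive score fall on one side of the hyperplane and points with a negative score on the other, which is all that Eq. (4) encodes.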
Extreme gradient boosting (XGBoost)
XGBoost was chosen for its scalability and ability to handle large datasets effectively while preventing overfitting through regularization and subsampling techniques. The XGBoost model aggregates predictions from K trees, defined in Eq. (5) as:
$$f\left( {Z;\psi } \right)=\sum\limits_{{k=1}}^{K} {\eta \cdot {f_k}\left( {{\rm Z};T} \right)}$$
(5)
where:
- \({f_k}\left( {{\rm Z};T} \right)\): Prediction from the k-th tree,
- \(\eta\): Learning rate,
- K: Total number of boosting rounds.
In this study, K was set to 100 (n_estimators) and the learning rate (η) was set to 0.3 [28]. The maximum tree depth was set to 6, and the subsample parameter, which controls the fraction of samples used to train each tree, was set to 1.0. Additionally, L2 regularization was applied to reduce the risk of overfitting [28]. Hyperparameter tuning was performed using RandomizedSearchCV to achieve optimal results [31].
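The additive form of Eq. (5) can be illustrated with a few hand-written decision stumps standing in for fitted trees. In actual gradient boosting each tree is fitted to the gradient of the loss from the previous rounds; that fitting step is not shown, and the stumps, thresholds, and leaf values below are hypothetical.

```python
def make_stump(feature, threshold, left, right):
    """Build a one-split tree: `left` if z[feature] < threshold, else `right`."""
    def stump(z):
        return left if z[feature] < threshold else right
    return stump

def boosted_score(z, trees, eta=0.3):
    """Eq. (5): sum of the eta-scaled outputs of all K trees."""
    return sum(eta * tree(z) for tree in trees)

# K = 3 stand-in trees (the study used K = 100).
trees = [
    make_stump(0, 0.5, -1.0, 1.0),
    make_stump(1, 2.0, 0.5, -0.5),
    make_stump(0, 1.5, -0.2, 0.4),
]
score = boosted_score([1.0, 3.0], trees)
```

For the input `[1.0, 3.0]` the three stumps return 1.0, -0.5, and -0.2, so the boosted score is 0.3 × 0.3 = 0.09. A smaller η shrinks each tree's contribution, which is why the learning rate acts as a form of regularization.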
Model evaluation and performance metrics
The classifiers’ performance was evaluated using precision, recall, F1-score, and accuracy. These metrics provided a comprehensive assessment of the classifiers’ ability to distinguish between healthy cells, lymphoblasts, and myeloblasts [32].
A confusion matrix was utilized to quantify the performance of the models, detailing the number of True Positives (TP), False Negatives (FN), True Negatives (TN), and False Positives (FP) for each class [32]. These values were then used to compute the following metrics:
1. Precision:
Precision measures the proportion of correctly classified positive samples to the total number of predicted positive samples. It is defined in Eq. (6) as:
$${\text{Precision}}=\frac{{TP}}{{TP+FP}}$$
(6)
2. Recall:
Recall, also known as sensitivity, calculates the proportion of correctly classified positive samples to the total number of actual positive samples. It is defined in Eq. (7) as:
$${\text{Recall}}=\frac{{TP}}{{TP+FN}}$$
(7)
3. F1-Score:
The F1-score is the harmonic mean of precision and recall, providing a balanced measure of the model’s performance, particularly in the presence of class imbalance. It is defined in Eq. (8) as:
$${\text{F1}}=\frac{{2TP}}{{2TP+FP+FN}}$$
(8)
4. Accuracy:
Accuracy represents the overall correctness of the model and is defined in Eq. (9) as:
$${\text{ACC}}=\frac{{TP+TN}}{{TP+TN+FP+FN}}$$
(9)
These metrics provided insights into the strengths and weaknesses of each classifier. Precision and recall were particularly critical for evaluating the classifiers’ ability to identify lymphoblasts and myeloblasts, as false negatives in these categories could have significant clinical implications [32]. The confusion matrix was computed for each classifier to analyze their performance across individual classes.
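Eqs. (6) to (9) can be evaluated per class directly from a multi-class confusion matrix by treating each class in turn as the positive class (one-vs-rest). The sketch below does this for a three-class matrix; the counts are invented for illustration and are not results from this study, and the `per_class_metrics` helper is hypothetical.

```python
def per_class_metrics(cm):
    """Compute one-vs-rest metrics for each class of a confusion matrix.

    cm[i][j] is the number of samples of true class i predicted as class j.
    """
    n = len(cm)
    total = sum(sum(row) for row in cm)
    results = []
    for k in range(n):
        tp = cm[k][k]
        fn = sum(cm[k]) - tp                        # class k missed
        fp = sum(cm[i][k] for i in range(n)) - tp   # others labeled as k
        tn = total - tp - fn - fp
        results.append({
            "tp": tp, "fp": fp, "fn": fn, "tn": tn,
            "precision": tp / (tp + fp) if tp + fp else 0.0,          # Eq. (6)
            "recall": tp / (tp + fn) if tp + fn else 0.0,             # Eq. (7)
            "f1": 2 * tp / (2 * tp + fp + fn) if tp + fp + fn else 0.0,  # Eq. (8)
            "accuracy": (tp + tn) / total,                            # Eq. (9)
        })
    return results

# Illustrative counts only; rows/columns: healthy, lymphoblast, myeloblast.
cm = [[24, 1, 1],
      [2, 23, 1],
      [0, 2, 24]]
metrics = per_class_metrics(cm)
overall_accuracy = sum(cm[k][k] for k in range(len(cm))) / sum(map(sum, cm))
```

Reading the lymphoblast row, three true lymphoblasts were assigned to other classes (FN = 3), which is the type of error the text singles out as clinically important.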
