The study grew out of the group's previous work published in Scientific Reports, which used CNNs to quantify the number of cells present in microscopic images [14]. Our regression algorithm showed good performance and accuracy in two out of three cell lines tested, indicating that not all cells can be quantified equally well with this technique. Therefore, in this paper, we present the development of a model that identifies which cell lines are present in each image, based on a classification algorithm. CNNs are widely used for image data. They are built from convolutional layers, which apply filters to detect specific features within an image region. These features are combined and processed in subsequent layers, such as pooling layers and fully connected layers, to perform tasks such as classification, object detection, and segmentation. Although the model has a “simple” structure, it solved the problem without requiring complex modifications.
Image database
The images used were acquired in a project that used automated High Content Screening (HCS) microscopy and were analyzed with the integrated Harmony software (version 3.5). Only phase contrast images were selected. Images of the A549, HUH7_denv, 3T3, VERO6, THP1, SH-SY5Y, A172, and HUH7_mayv cell lines were used. A light contrast adjustment (highlighting nuclear markings) and a background correction (setting the image background) were performed in Harmony.
Processing environment
We used Google Colab as the integrated development environment (IDE) because of its large memory (currently 12.72 GB of RAM and 107.77 GB of disk space). For processing, we imported several libraries from the Python v9 programming language. All data, proprietary materials, documentation, and code used in the analysis are available at: Ferreira, EKGD & Silveira, Guilherme F. 2023. “Article: 1.0.0”. Zenodo. https://doi.org/10.5281/zenodo.8415315, accessible via the link https://zenodo.org/badge/latestdoi/701446984.
Segmentation and growth of image banks
Data augmentation techniques were used to increase the number of images in the database. The image orientation was changed (0°, 90°, 180°, or 270°) and, as a scaling technique, each image was reduced to 75%, 50%, and 25% of its original size (Fig. 2). Images were resized to 200 × 200 pixels to enable analysis by the algorithm. All these images were stored in one database.
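The rotation and scaling steps can be illustrated with a minimal sketch in plain Python, treating an image as a list of pixel rows. The function names, the nearest-neighbour resampling, and the choice to apply rotation and scaling independently of each other are assumptions of this sketch; the text does not state which resampling method was used or whether the two transformations were combined.

```python
def rotate90(image):
    """Rotate a 2-D image (list of rows) by 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]


def rotate(image, degrees):
    """Rotate clockwise by a multiple of 90 degrees (0, 90, 180 or 270)."""
    if degrees % 90 != 0:
        raise ValueError("degrees must be a multiple of 90")
    rotated = [list(row) for row in image]
    for _ in range((degrees // 90) % 4):
        rotated = rotate90(rotated)
    return rotated


def downscale(image, factor):
    """Resize to `factor` of the original size by nearest-neighbour sampling."""
    height, width = len(image), len(image[0])
    new_height = max(1, round(height * factor))
    new_width = max(1, round(width * factor))
    return [
        [image[int(r * height / new_height)][int(c * width / new_width)]
         for c in range(new_width)]
        for r in range(new_height)
    ]


def augment(image):
    """Return four rotated copies followed by three reduced copies."""
    variants = [rotate(image, degrees) for degrees in (0, 90, 180, 270)]
    # Assumption: scaling is applied to the unrotated image only.
    variants += [downscale(image, factor) for factor in (0.75, 0.50, 0.25)]
    return variants
```

Under these assumptions each source image yields seven variants, which would then be resized to the common 200 × 200 input size before being stored.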
Kernel application before the model
Images of some cell lines were very similar to one another, and the model sometimes had difficulty distinguishing them. To get around this, we applied filters to highlight the most relevant features of these images. This was performed only for the SH-SY5Y, HUH7_mayv, HUH7_denv, and A549 lines (Fig. 4). Several kernels were tested, and the best results were obtained with the sharpen kernel, which emphasizes the edges of the image. Similar to the edge detection kernel but with a central value of 5, it uses a 3 × 3 matrix to add contrast to edges and emphasize bright and dark areas [28].

Kernel addition in image preprocessing. (a) Sharpen kernel applied to the image. (b) Images of cell lines SH-SY5Y, HUH7_mayv, HUH7_denv, and A549 after kernel application.
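To make the sharpening step concrete, the sketch below applies a 3 × 3 kernel with a central value of 5 to a grayscale image in plain Python. The exact kernel entries are an assumption: the text gives only the size and the central value, and the matrix used here is the commonly used sharpen kernel. Edge replication at the borders and clipping to the 0–255 range are also choices of this sketch.

```python
# Assumed sharpen kernel: centre 5, four direct neighbours -1, corners 0.
# The weights sum to 1, so uniform regions keep their intensity.
SHARPEN_KERNEL = [
    [0, -1, 0],
    [-1, 5, -1],
    [0, -1, 0],
]


def apply_kernel(image, kernel=SHARPEN_KERNEL):
    """Filter a 2-D grayscale image (list of rows) with a 3x3 kernel."""
    height, width = len(image), len(image[0])
    output = []
    for r in range(height):
        row = []
        for c in range(width):
            total = 0
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    # Replicate the nearest edge pixel outside the image.
                    rr = min(max(r + dr, 0), height - 1)
                    cc = min(max(c + dc, 0), width - 1)
                    total += kernel[dr + 1][dc + 1] * image[rr][cc]
            # Clip to the valid 8-bit intensity range.
            row.append(min(max(total, 0), 255))
        output.append(row)
    return output
```

A bright pixel surrounded by darker ones becomes brighter while its direct neighbours become darker, which is the edge-emphasizing effect described above.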
Model validation
For CNN validation, 10% of the images were randomly set aside, and the remaining 90% were used for training and testing. Of these remaining images, approximately 70% were used to train the CNN and 30% to test it. Table 2 shows the number of images in each image bank.
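The two-stage split can be sketched as follows. This is a minimal illustration in plain Python; the function name, the fixed seed, and the rounding of the subset sizes are assumptions of the sketch rather than details taken from the study.

```python
import random


def split_dataset(items, validation_fraction=0.10, train_fraction=0.70, seed=0):
    """Return (train, test, validation) lists.

    First a random validation subset is set aside; the remaining items
    are then divided into training and test subsets.
    """
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility

    n_validation = round(len(shuffled) * validation_fraction)
    validation = shuffled[:n_validation]
    remaining = shuffled[n_validation:]

    n_train = round(len(remaining) * train_fraction)
    return remaining[:n_train], remaining[n_train:], validation
```

For 1,000 images this gives 100 validation images, 630 training images, and 270 test images, so the training set is 63% of the full database, not 70%.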
Classification model
Images were saved and identified by the name of their cell line. To create the classes, the name of each cell line was replaced with an integer value, giving a categorical class label ranging from 0 to 7.
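The class encoding can be sketched as a simple lookup table. The order in which the cell lines are numbered is an assumption (here, the order in which they are listed in the image database description), and the one-hot helper is added only to show what a categorical label looks like.

```python
# Assumed ordering: the text does not state which integer each line received.
CELL_LINES = [
    "A549", "HUH7_denv", "3T3", "VERO6",
    "THP1", "SH-SY5Y", "A172", "HUH7_mayv",
]
CLASS_INDEX = {name: index for index, name in enumerate(CELL_LINES)}


def encode_label(name):
    """Map a cell line name to its integer class (0 to 7)."""
    return CLASS_INDEX[name]


def one_hot(index, n_classes=len(CELL_LINES)):
    """Return the categorical (one-hot) vector for an integer class."""
    return [1 if i == index else 0 for i in range(n_classes)]
```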
Model evaluation based on accuracy metrics
Four possible outcomes were considered to evaluate the accuracy of the classification model. These were true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN).
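Confusion matrix
The confusion matrix tabulates the true class of each image against the class predicted by the model, so correct classifications lie on its diagonal. The number of correct classifications relative to the total number of observations is given by the expression below, where \(TP_{i}\) corresponds to the number of true positives in class i and N is the total number of observations.
$$\frac{TP_{1}+TP_{2}+\dots +TP_{n}}{N}$$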
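Accuracy
Accuracy is the number of correct classifications of the model over the total number of observations, that is, the sum of the true positives of all classes divided by N.
Precision
Precision is the ratio of true positives to all observations that the model assigned to a class. \(FP_{i}\) corresponds to the number of false positives in class i.
$$Precision=\frac{TP_{i}}{TP_{i}+FP_{i}}$$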
Recall
Recall is the ratio of true positives to the total positive observations in a class. \(FN_{i}\) corresponds to the number of false negatives in class i.
$$Recall=\frac{TP_{i}}{TP_{i}+FN_{i}}$$
F1 score
The F1 score is the harmonic mean of precision and recall. It balances the two metrics, which is useful when the classes are imbalanced.
$$F1=\frac{2\times Precision\times Recall}{Precision+Recall}$$
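The classification metrics in this section can all be read off a confusion matrix. The sketch below is a minimal plain-Python illustration under stated assumptions: rows are true classes and columns are predicted classes, precision is taken in its standard form TP/(TP + FP), and a metric with an empty denominator is reported as 0.0. The function names are illustrative only.

```python
def confusion_matrix(true_labels, predicted_labels, n_classes):
    """Count observations: rows are true classes, columns predicted ones."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for true, predicted in zip(true_labels, predicted_labels):
        matrix[true][predicted] += 1
    return matrix


def accuracy(matrix):
    """Sum of the diagonal (all true positives) over all observations."""
    total = sum(sum(row) for row in matrix)
    return sum(matrix[i][i] for i in range(len(matrix))) / total


def class_metrics(matrix, i):
    """Return TP, FP, TN, FN, precision, recall and F1 for class i."""
    total = sum(sum(row) for row in matrix)
    tp = matrix[i][i]
    fn = sum(matrix[i]) - tp                  # class i, predicted otherwise
    fp = sum(row[i] for row in matrix) - tp   # other class, predicted as i
    tn = total - tp - fn - fp
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return {"TP": tp, "FP": fp, "TN": tn, "FN": fn,
            "precision": precision, "recall": recall, "f1": f1}
```

For example, with true labels `[0, 0, 0, 1, 1, 2]` and predictions `[0, 0, 1, 1, 1, 0]`, four of six images are classified correctly, and class 1 has a recall of 1 but a precision of 2/3 because one class 0 image was assigned to it.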
ROC curve
The receiver operating characteristic (ROC) curve is a graphical representation of the performance of a classification model in terms of the true positive rate (TPR) and the false positive rate (FPR). The ROC curve is constructed by plotting the TPR as a function of the FPR at different classification thresholds.
$$TPR=\frac{TP_{i}}{TP_{i}+FN_{i}}$$
$$FPR=\frac{FP_{i}}{FP_{i}+TN_{i}}$$
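The construction of the curve can be sketched for a single class treated as "positive" against all others. This one-vs-rest reading of the subscript i is an assumption, as is the convention that an image is predicted positive when its score is greater than or equal to the threshold. The function name is hypothetical.

```python
def roc_points(scores, is_positive, thresholds):
    """Return one (FPR, TPR) pair per threshold.

    scores      -- predicted probability of the class of interest
    is_positive -- True where the image really belongs to that class
    """
    positives = sum(1 for flag in is_positive if flag)
    negatives = len(is_positive) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("both positive and negative examples are required")

    points = []
    for threshold in thresholds:
        tp = sum(1 for s, flag in zip(scores, is_positive)
                 if s >= threshold and flag)
        fp = sum(1 for s, flag in zip(scores, is_positive)
                 if s >= threshold and not flag)
        # TPR = TP / (TP + FN) and FPR = FP / (FP + TN)
        points.append((fp / negatives, tp / positives))
    return points
```

Lowering the threshold moves the point from (0, 0) towards (1, 1); a model that separates the classes well reaches a high TPR while the FPR is still low.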
Regression model
As the target, we recorded the number of cells corresponding to each image, as reported by the HCS. This count was used as the observed value for supervised training of the model. The images were then split in the same proportions, so that the predicted values could be tested against the observed ones.
Model evaluation based on accuracy metrics
The mean absolute error (MAE), mean squared error (MSE), and R² score (R2Score) were used to evaluate the accuracy and error of the model. However, only MSE was used during model training.
MSE is the mean, over the n images, of the squared difference between the observed value \(y_{i}\) and the predicted value \(\widehat{y}_{i}\):
$$MSE=\frac{1}{n}\sum_{i=1}^{n}{\left(y_{i}-\widehat{y}_{i}\right)}^{2}$$
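The three regression metrics can be computed as in the following plain-Python sketch. MSE follows the formula above; MAE and the R² score are given in their standard forms, which the text names but does not write out, so those two definitions are assumptions about what was used.

```python
def mean_squared_error(observed, predicted):
    """Mean of the squared differences between observed and predicted."""
    n = len(observed)
    return sum((y - y_hat) ** 2 for y, y_hat in zip(observed, predicted)) / n


def mean_absolute_error(observed, predicted):
    """Mean of the absolute differences between observed and predicted."""
    n = len(observed)
    return sum(abs(y - y_hat) for y, y_hat in zip(observed, predicted)) / n


def r2_score(observed, predicted):
    """1 minus the residual sum of squares over the total sum of squares."""
    mean_observed = sum(observed) / len(observed)
    ss_residual = sum((y - y_hat) ** 2 for y, y_hat in zip(observed, predicted))
    ss_total = sum((y - mean_observed) ** 2 for y in observed)
    return 1 - ss_residual / ss_total
```

As a worked example, for observed cell counts of 100, 150, and 200 and predicted counts of 110, 140, and 200, the MSE is 200/3 ≈ 66.7, the MAE is 20/3 ≈ 6.7, and the R² score is 1 − 200/5000 = 0.96.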
CNN
The first layer (Conv2D) had kernel_size = 3 and used the Rectified Linear Unit (ReLU) activation function. Other activation functions (LeakyReLU, Tanh, and Sigmoid) were also tested, but ReLU showed the best performance. The same parameters were used through a sequence of MaxPooling2D layers, ending in a softmax output over the 8 classes. The same settings were used for the regression model, except that the last layer of the network was modified to a single output neuron with a ReLU activation function, representing the number of cells in the image. The model.summary() method was used to summarize the model information (Table 3).
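The difference between the two output layers can be illustrated without a deep learning library. The sketch below is plain Python and covers only the final activations, not the network itself. The function names and the example values are hypothetical.

```python
import math


def softmax(logits):
    """Turn the raw outputs of the last layer into class probabilities."""
    largest = max(logits)  # subtracted for numerical stability
    exponentials = [math.exp(value - largest) for value in logits]
    total = sum(exponentials)
    return [value / total for value in exponentials]


def predict_class(logits):
    """Classification output: index of the most probable class."""
    probabilities = softmax(logits)
    return max(range(len(probabilities)), key=probabilities.__getitem__)


def relu(value):
    """Regression output: negative values are set to zero."""
    return max(0.0, value)
```

With eight raw outputs, `softmax` returns eight probabilities that sum to 1 and `predict_class` selects the cell line. With a single output neuron, `relu` keeps the predicted cell count non-negative.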
