Bayesian optimized CNN ensemble for efficient potato blight detection using fuzzy image enhancement

This study presents the implementation of a Convolutional Neural Network (CNN) model developed with TensorFlow 2 to classify potato blight. The data set, sourced from the Kaggle Plant Village Dataset⁴⁶, includes 1000 images each of early and late blight but only 152 images of healthy potato leaves, a significant class imbalance. The following subsections discuss the methods applied for data balancing and augmentation in this work.

Data preparation: fuzzy enhancement and data augmentation

For experimentation, 80% of the data set was used for training (of which 10% was held out for validation) and 20% for testing. To enable more efficient processing, the images were rescaled to 128 × 128 pixels. We implemented a fuzzy-based contrast enhancement technique⁴⁷, shown in Algorithm 1, to improve the visual quality of the potato leaf images. This method applies fuzzy logic to adjust pixel intensities adaptively, enhancing detail and contrast. The algorithm normalizes the pixel values to the range [0, 1], processes them with a fuzzy enhancement formula, and then scales them back to the original intensity range. The enhancement amplifies contrast around the midpoint, making subtle variations in image detail more pronounced while keeping the values within valid intensity limits. Figure 1 shows the fuzzy enhancement process and sample images after processing. Before fuzzy enhancement, we also applied data augmentation to balance the data set by increasing the number of healthy potato leaf images sixfold. Figure 2 shows the class-wise distribution before and after data augmentation.
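The normalize, enhance, and rescale steps described above can be sketched as follows. Algorithm 1 is not reproduced in this section, so the specific enhancement formula used here, the classical fuzzy intensification (INT) operator, is an assumption, as are the function name `fuzzy_enhance` and the 8-bit intensity range.

```python
import numpy as np

def fuzzy_enhance(image: np.ndarray) -> np.ndarray:
    """Contrast-enhance an 8-bit image with a fuzzy intensification operator."""
    max_val = 255.0  # assumed 8-bit intensity range
    # Fuzzification: map intensities to membership values in [0, 1].
    mu = image.astype(np.float64) / max_val
    # Intensification (assumed INT operator): push values away from the 0.5 midpoint.
    enhanced = np.where(mu <= 0.5, 2.0 * mu**2, 1.0 - 2.0 * (1.0 - mu) ** 2)
    # Defuzzification: clip to [0, 1] and scale back to the original range.
    return (np.clip(enhanced, 0.0, 1.0) * max_val).round().astype(np.uint8)

# Dark pixels become darker and bright pixels brighter; the extremes are unchanged.
pixels = np.array([[0, 64, 128, 192, 255]], dtype=np.uint8)
result = fuzzy_enhance(pixels)
```

For a colour image the same function can be applied to each channel, or to a luminance channel only; which of these the study used is not stated here.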

CNN

Convolutional Neural Networks (CNNs) are a popular deep learning approach, particularly effective in image-based applications⁴⁸. CNNs consist of several layers, including input, convolution, activation, pooling, and fully-connected layers. The convolution layers extract features from the input data, with ReLU as the activation function. The pooling layers reduce parameters and mitigate overfitting, while the fully connected layers make predictions, with classification finalized using a softmax classifier.

Input layer

The fuzzy-enhanced potato blight images enter the neural network through this layer, which specifies the height, width, and number of color channels of each image.

$$\begin{aligned} I \in {\mathbb {R}}^{H \times W \times D} \end{aligned}$$

(1)

where $H$ is the height of the input image, $W$ is the width of the input image, and $D$ is the number of channels (e.g., 1 for grayscale, 3 for RGB). The equations that follow denote the channel count by $C$.

Preprocessing layer

This layer applies a rescaling operation (1/255) that normalizes the pixel values of the images to the range [0, 1], which facilitates pattern recognition in the model:

$$\begin{aligned} I_{\text {norm}}(x, y, c) = \frac{I(x, y, c)}{255}, \quad \forall (x, y) \in [1, H] \times [1, W], \quad \forall c \in [1, C] \end{aligned}$$

(2)

where I(x, y, c) represents the original pixel intensity at position (x, y) for channel c, and $I_{\text {norm}}(x, y, c)$ is the normalized pixel value.

Convolutional layer

This layer extracts features after preprocessing by performing operations such as convolutions on the input using a set of learnable filters (kernels), which aid in the detection of patterns in images such as edges, textures, or shapes. Then the ReLU activation function is applied to introduce nonlinearity to the model, and the Max pooling operation is used to reduce the spatial dimensions of the feature map. Each filter (kernel) is represented as:

$$\begin{aligned} K \in {\mathbb {R}}^{k_H \times k_W \times C} \end{aligned}$$

(3)

where $k_H \times k_W$ is the kernel size.

The convolution operation at position (x, y) in the output feature map is defined as:

$$\begin{aligned} O(x, y, f) = \sum _{i=1}^{k_H} \sum _{j=1}^{k_W} \sum _{c=1}^{C} I(x+i-1, y+j-1, c) \cdot K(i, j, c, f) + b_f \end{aligned}$$

(4)

where $O(x, y, f)$ is the output at position $(x, y)$ for filter $f$, $K(i, j, c, f)$ is the kernel value at position $(i, j)$ for channel $c$ of filter $f$, and $b_f$ is the bias term for filter $f$.

After the convolution operation, an activation function $\sigma (\cdot )$ is applied:

$$\begin{aligned} A(x, y, f) = \sigma (O(x, y, f)) \end{aligned}$$

(5)

where $\sigma$ is the nonlinear ReLU function:

$$\begin{aligned} \sigma (x) = \max (0, x) \end{aligned}$$

(6)
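Equations (4) to (6) can be illustrated with a direct NumPy implementation. This is a minimal sketch that assumes stride 1 and no padding, neither of which is stated in the text, and it uses zero-based array indices where Eq. (4) is one-based. The function name `conv2d_relu` is hypothetical.

```python
import numpy as np

def conv2d_relu(image, kernels, biases):
    """Stride-1 convolution without padding, followed by ReLU.

    image:   (H, W, C) input tensor I
    kernels: (k_H, k_W, C, F) bank of F filters K
    biases:  (F,) bias terms b_f
    """
    H, W, _ = image.shape
    k_h, k_w, _, n_filters = kernels.shape
    out = np.zeros((H - k_h + 1, W - k_w + 1, n_filters))
    for x in range(out.shape[0]):
        for y in range(out.shape[1]):
            patch = image[x:x + k_h, y:y + k_w, :]  # (k_H, k_W, C) window
            # Eq. (4): sum over i, j, c of I * K, plus the bias, for every filter f.
            out[x, y, :] = np.tensordot(patch, kernels, axes=3) + biases
    # Eqs. (5)-(6): ReLU activation.
    return np.maximum(out, 0.0)

image = np.ones((4, 4, 1))
kernels = np.stack([np.ones((3, 3, 1)), -np.ones((3, 3, 1))], axis=-1)
activations = conv2d_relu(image, kernels, biases=np.zeros(2))
```

The first filter sums nine ones in every window, while the second produces negative sums that ReLU clips to zero.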

Flatten layer

The Flatten operation reshapes the multidimensional feature maps $I$ produced by the convolutional layers into a one-dimensional vector:

$$\begin{aligned} F \in {\mathbb {R}}^{H \cdot W \cdot C} \end{aligned}$$

(7)

Each element in F is mapped from the original tensor using:

$$\begin{aligned} F(n) = I\left( \left\lfloor \frac{n}{W \cdot C} \right\rfloor , \left\lfloor \frac{(n \mod (W \cdot C))}{C} \right\rfloor , (n \mod C) \right) \end{aligned}$$

(8)

for $n = 0, 1, \dots , (H \cdot W \cdot C - 1)$.

This operation ensures that the spatial dimensions are serialized into a continuous vector for input into fully connected layers.
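The index mapping in Eq. (8) uses zero-based indices and corresponds to row-major (C-order) flattening. The short check below, with arbitrary illustrative dimensions, compares it against NumPy's own reshape; `flatten_index` is a hypothetical helper name.

```python
import numpy as np

def flatten_index(n, W, C):
    """Eq. (8): map flat index n to zero-based (row, column, channel)."""
    return n // (W * C), (n % (W * C)) // C, n % C

H, W, C = 2, 3, 4  # illustrative sizes only
tensor = np.arange(H * W * C).reshape(H, W, C)
flat = np.array([tensor[flatten_index(n, W, C)] for n in range(H * W * C)])
matches_reshape = np.array_equal(flat, tensor.reshape(-1))
```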

Fully connected layer

Also called a dense layer, this layer classifies potato leaves as blighted or healthy in our work. Dropout regularization is used to inhibit neuronal co-adaptation and reduce overfitting.

$$\begin{aligned} Z = W X + b \end{aligned}$$

(9)

where $X \in {\mathbb {R}}^{N}$ is the input vector, $W \in {\mathbb {R}}^{M \times N}$ is the weight matrix connecting $N$ inputs to $M$ neurons, $b \in {\mathbb {R}}^{M}$ is the bias vector, and $Z \in {\mathbb {R}}^{M}$ is the pre-activation output.

Applying a non-linear activation function $\phi (\cdot )$:

$$\begin{aligned} Y = \phi (Z) \end{aligned}$$

(10)

where $Y \in {\mathbb {R}}^{M}$ is the final output of the fully connected layer.
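A minimal sketch of Eqs. (9) and (10), taking softmax as the activation $\phi$ of the output layer, as in the CNN overview. The layer sizes and random weights are illustrative assumptions, and dropout is omitted because it is active only during training.

```python
import numpy as np

def dense(x, weights, bias):
    """Eq. (9): pre-activation Z = W X + b."""
    return weights @ x + bias

def softmax(z):
    """Numerically stable softmax, used here as the activation phi in Eq. (10)."""
    shifted = np.exp(z - np.max(z))
    return shifted / shifted.sum()

rng = np.random.default_rng(0)
x = rng.normal(size=8)             # N = 8 flattened features (illustrative)
weights = rng.normal(size=(3, 8))  # M = 3 outputs: early blight, late blight, healthy
bias = np.zeros(3)
probs = softmax(dense(x, weights, bias))
```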

Optimizers used

In this paper, we use CNN models trained with the following optimizers, both for comparison and for constructing the ensemble model for potato blight detection.

1. Adam optimizer: It integrates the advantages of two other extensions of stochastic gradient descent, AdaGrad and RMSprop. It incorporates momentum to address sparse gradients and enhance convergence speed while also computing adaptive learning rates for each parameter.

2. RMSprop optimizer: It adapts the learning rate for each parameter, helping stabilize the training process. This approach is less affected by the initial learning rate and is effective for non-stationary objectives. It is particularly beneficial for recurrent neural networks (RNNs), especially in tackling the vanishing-gradient problem.

3. SGD optimizer: It is simple and efficient, especially when combined with momentum. However, careful tuning of the learning rate and other hyperparameters is essential. Compared to adaptive optimizers, a basic gradient descent method may require more adjustments to achieve optimal performance.

4. Adamax optimizer: A variant of Adam based on the infinity norm, it is designed to manage situations where gradient magnitudes may become excessively large. This approach is especially beneficial when working with sparse or noisy gradients.
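To make the differences between the four optimizers concrete, the sketch below writes each update rule in its standard textbook form and applies it to a toy quadratic loss. The learning rates and decay constants are illustrative defaults, not the hyperparameters used in this study, and framework implementations such as those in TensorFlow differ in small details (for example, where the epsilon term is placed).

```python
import numpy as np

def sgd_momentum(theta, g, state, t, lr=0.1, momentum=0.9):
    # The velocity accumulates past gradients.
    state["v"] = momentum * state.get("v", 0.0) - lr * g
    return theta + state["v"]

def rmsprop(theta, g, state, t, lr=0.01, rho=0.9, eps=1e-7):
    # Running average of squared gradients scales each parameter's step.
    state["s"] = rho * state.get("s", 0.0) + (1 - rho) * g**2
    return theta - lr * g / (np.sqrt(state["s"]) + eps)

def adam(theta, g, state, t, lr=0.01, b1=0.9, b2=0.999, eps=1e-7):
    # First and second moment estimates with bias correction.
    state["m"] = b1 * state.get("m", 0.0) + (1 - b1) * g
    state["v"] = b2 * state.get("v", 0.0) + (1 - b2) * g**2
    m_hat = state["m"] / (1 - b1**t)
    v_hat = state["v"] / (1 - b2**t)
    return theta - lr * m_hat / (np.sqrt(v_hat) + eps)

def adamax(theta, g, state, t, lr=0.01, b1=0.9, b2=0.999, eps=1e-7):
    # Like Adam, but the second moment is replaced by an infinity-norm running maximum.
    state["m"] = b1 * state.get("m", 0.0) + (1 - b1) * g
    state["u"] = np.maximum(b2 * state.get("u", 0.0), np.abs(g))
    # eps is added here only to avoid division by zero.
    return theta - (lr / (1 - b1**t)) * state["m"] / (state["u"] + eps)

def minimise(update, steps=200):
    """Run one optimizer on the toy loss L(theta) = sum(theta^2)."""
    theta, state = np.array([1.0, -2.0]), {}
    for t in range(1, steps + 1):
        grad = 2.0 * theta  # gradient of the toy loss
        theta = update(theta, grad, state, t)
    return theta

final = {fn.__name__: minimise(fn) for fn in (sgd_momentum, rmsprop, adam, adamax)}
```

All four rules reduce the loss from its starting value of 5, but at different rates, which is the kind of diversity among base models that an ensemble can exploit.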

Bayesian optimization

Bayesian optimization is a powerful strategy for optimizing objective functions that are expensive to evaluate and lack analytical expressions. It builds a probabilistic model, typically a Gaussian process, to approximate the unknown function and systematically explore the search space⁴⁹. By leveraging prior knowledge and incorporating uncertainty in the predictions, Bayesian optimization focuses on evaluating the most promising areas of the space, balancing exploration and exploitation. This approach is particularly useful in applications such as hyperparameter tuning in machine learning, where each evaluation can be computationally costly. Its iterative process is designed to identify near-optimal solutions efficiently with few evaluations. In this work, Bayesian optimization is used to optimize the weights of the models in an ensemble. Our goal is to find the set of weights $w = \{w_1, w_2, \dots, w_N\}$ that maximizes the accuracy of the ensemble model.

Phase 1: Defining the objective function

Let $A(w)$ denote the accuracy of the ensemble model with weights $w$, where $w_i$ represents the weight assigned to model $i$. The objective is to maximize $A(w)$.

$$\begin{aligned} \text {Maximize} \quad A(w) = \text {Accuracy of Ensemble}(w) \end{aligned}$$

(11)

The ensemble model’s accuracy $A(w)$ depends on the weighted combination of predictions from the $N$ individual models. This can be represented as follows.

$$\begin{aligned} A(w) = \frac{1}{N} \sum _{i=1}^{N} w_i \cdot A_i \end{aligned}$$

(12)

where $w_i$ is the weight assigned to model $i$, $A_i$ is the accuracy of model $i$ on the validation set, and $N$ is the number of models in the ensemble.

Phase 2: Probabilistic model (surrogate model)

To approximate the unknown accuracy function $A(w)$ and handle the computational cost of evaluating it, we use a Gaussian Process (GP) model. This model gives a probabilistic distribution of the possible values of $A(w)$, with predictions of mean and variance.

The Gaussian process is defined as follows.

$$\begin{aligned} A(w) \sim {\mathcal{G}\mathcal{P}}(m(w), k(w, w')) \end{aligned}$$

(13)

where $m(w)$ is the mean function and $k(w, w')$ is the covariance function (kernel) for the weights $w$ and $w'$.

For a new weight vector $w^*$, the predicted mean and variance are as follows:

$$\begin{aligned} \mu (w^*) = {\mathbb {E}}[A(w^*)] \quad \text {and} \quad \sigma ^2(w^*) = \text {Var}[A(w^*)] \end{aligned}$$

(14)
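For reference, the predicted mean and variance have a closed form in standard Gaussian process regression. The notation below is introduced here for illustration and is not taken from the original: $w^{(1)}, \dots , w^{(t)}$ are the weight vectors evaluated so far, $\mathbf{a}$ is the vector of their observed accuracies, $\mathbf{m}$ is the vector of prior means $m(w^{(i)})$, $\mathbf{K}$ is the matrix with entries $k(w^{(i)}, w^{(j)})$, $\mathbf{k}_*$ is the vector with entries $k(w^{(i)}, w^*)$, $\sigma_n^2$ is an assumed observation-noise variance, and $\mathbf{I}$ is the identity matrix.

$$\begin{aligned} \mu (w^*) &= m(w^*) + \mathbf{k}_*^{\top } \left( \mathbf{K} + \sigma_n^2 \mathbf{I} \right) ^{-1} (\mathbf{a} - \mathbf{m}) \\ \sigma ^2(w^*) &= k(w^*, w^*) - \mathbf{k}_*^{\top } \left( \mathbf{K} + \sigma_n^2 \mathbf{I} \right) ^{-1} \mathbf{k}_* \end{aligned}$$

The variance shrinks near weight vectors that have already been evaluated and grows away from them, which is what lets the acquisition function trade off exploration against exploitation.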

Phase 3: Acquisition function

An acquisition function guides the search for the optimal weights. A common acquisition function for maximizing $A(w)$ is expected improvement (EI), which quantifies the expected improvement over the current best observed accuracy $A_{\text {best}}$.

$$\begin{aligned} \text {EI}(w^*) = {\mathbb {E}}[\max (A(w^*) - A_{\text {best}}, 0)] \end{aligned}$$

(15)

Where $A_{\text {best}}$ is the best accuracy observed so far. The acquisition function balances the exploration of uncertain regions of the weight space and the exploitation of regions with high predicted accuracy.
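When the surrogate's prediction at $w^*$ is Gaussian with mean $\mu$ and standard deviation $\sigma$, the expectation in Eq. (15) has the well-known closed form $(\mu - A_{\text{best}})\,\Phi(z) + \sigma\,\varphi(z)$ with $z = (\mu - A_{\text{best}})/\sigma$, where $\Phi$ and $\varphi$ are the standard normal CDF and PDF. The sketch below implements that formula using only the standard library; the function name and the numeric inputs are illustrative.

```python
import math

def expected_improvement(mu, sigma, best):
    """Closed-form EI for a Gaussian prediction N(mu, sigma^2) when maximising."""
    if sigma <= 0.0:
        # No uncertainty: the improvement is deterministic.
        return max(mu - best, 0.0)
    z = (mu - best) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    # Exploitation term (mean above best) plus exploration term (uncertainty).
    return (mu - best) * cdf + sigma * pdf

ei_certain = expected_improvement(mu=0.97, sigma=0.0, best=0.95)
ei_uncertain = expected_improvement(mu=0.95, sigma=0.02, best=0.95)
ei_unlikely = expected_improvement(mu=0.90, sigma=0.01, best=0.95)
```

A point whose predicted mean only equals the current best still has positive EI when its uncertainty is non-zero, which is how uncertain regions of the weight space get explored.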

Phase 4: Iterative optimization process

The optimization process proceeds iteratively:

1. Select the next evaluation point: Based on the acquisition function, select the next set of weights $w^*$:

$$\begin{aligned} w^* = \arg \max _{w \in W} \, \text {EI}(w) \end{aligned}$$

(16)

2. Evaluate the objective function: Evaluate the ensemble accuracy $A(w^*)$ at the selected weight set $w^*$, using the weighted ensemble of models.

3. Update the surrogate model: Update the Gaussian Process with the new evaluation $(w^*, A(w^*))$. This improves the GP’s predictions for future evaluations:

$$\begin{aligned} {\mathcal{G}\mathcal{P}}(w^*, A(w^*)) \rightarrow {\mathcal{G}\mathcal{P}}(w, A(w)) \end{aligned}$$

(17)

4. Repeat: Repeat the process until a stopping criterion is met (e.g., convergence or maximum number of evaluations).

Phase 5: Convergence

The optimization process converges when no further significant improvement in accuracy is observed. The final result is the optimal weight set $w^*$ and the corresponding maximum accuracy $A(w^*)$.
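The five phases can be put together in a compact, self-contained sketch. Everything specific here is an assumption made for illustration: the validation predictions are synthetic, the surrogate is a small hand-written GP with an RBF kernel and fixed length scale, EI is maximized over random candidates on the weight simplex rather than with a continuous optimizer, and the loop stops after a fixed budget of 20 evaluations. The section does not say which Bayesian optimization library or settings were used in the study.

```python
import math
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for validation labels and per-model class scores.
n_models, n_samples, n_classes = 4, 300, 3
labels = rng.integers(0, n_classes, size=n_samples)
one_hot = np.eye(n_classes)[labels]
noise_levels = [0.6, 0.8, 1.0, 1.2]  # hypothetical: model 0 is the most reliable
scores = np.stack([one_hot + s * rng.normal(size=one_hot.shape) for s in noise_levels])

def ensemble_accuracy(w):
    """Objective A(w): accuracy of the weighted average of the model outputs."""
    blended = np.tensordot(w, scores, axes=1)  # (n_samples, n_classes)
    return float(np.mean(blended.argmax(axis=1) == labels))

def rbf(a, b, length=0.3):
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / length**2)

def gp_posterior(W_obs, y_obs, W_new, noise=1e-4):
    """Posterior mean and std of a GP fitted to standardised observations."""
    mean, scale = y_obs.mean(), y_obs.std() + 1e-9
    K = rbf(W_obs, W_obs) + noise * np.eye(len(W_obs))
    k_star = rbf(W_obs, W_new)
    mu = mean + scale * (k_star.T @ np.linalg.solve(K, (y_obs - mean) / scale))
    var = 1.0 - np.sum(k_star * np.linalg.solve(K, k_star), axis=0)
    return mu, scale * np.sqrt(np.clip(var, 1e-12, None))

def expected_improvement(mu, sigma, best):
    z = (mu - best) / sigma
    cdf = 0.5 * (1.0 + np.array([math.erf(v / math.sqrt(2.0)) for v in z]))
    pdf = np.exp(-0.5 * z**2) / math.sqrt(2.0 * math.pi)
    return (mu - best) * cdf + sigma * pdf

# Initial design: random points on the simplex (weights >= 0 that sum to 1).
W_obs = rng.dirichlet(np.ones(n_models), size=5)
y_obs = np.array([ensemble_accuracy(w) for w in W_obs])
initial_best = y_obs.max()

for _ in range(20):  # fixed evaluation budget as the stopping criterion
    candidates = rng.dirichlet(np.ones(n_models), size=500)
    mu, sigma = gp_posterior(W_obs, y_obs, candidates)
    ei = expected_improvement(mu, sigma, y_obs.max())
    w_next = candidates[np.argmax(ei)]                   # Eq. (16)
    W_obs = np.vstack([W_obs, w_next])                   # Eq. (17): extend the GP's data
    y_obs = np.append(y_obs, ensemble_accuracy(w_next))  # evaluate A(w*)

best_weights = W_obs[np.argmax(y_obs)]
best_accuracy = y_obs.max()
```

Sampling candidates from a Dirichlet distribution keeps every proposal non-negative and normalized, so the sum-to-one constraint on the ensemble weights holds by construction.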

Ensemble of CNN models

Ensemble learning seeks to improve predictive accuracy by merging the outputs of various models trained on the same dataset. The fundamental concept is to combine base models to construct a more reliable composite model. This method is effective in reducing model variance and error, often outperforming individual models. When applied to deep CNN structures, ensemble techniques blend the feature-extraction strengths of each model, resulting in superior generalization capabilities⁵⁰. Popular ensemble methods include bagging, stacking, voting, and averaging predictions, with the average ensemble being particularly common for classification problems. Instead of the conventional average ensemble, our strategy uses a weighted average ensemble in which the optimal weights are derived using the Bayesian optimization process. As illustrated in Fig. 3, this strategy integrates the results of multiple models by assigning each a different weight. The following steps are involved in this process:

1. Base CNN models using different optimizers are trained on the potato blight dataset.

2. Each model produces prediction scores for the test data.

3. Weights derived from Bayesian optimization are allocated to the models to enhance accuracy.

4. The final prediction is calculated by combining the individual model predictions in a weighted manner.

Mathematically, the final prediction of the ensemble ${\hat{y}}$ is given by:

$$\begin{aligned} {\hat{y}} = \sum _{i=1}^{N} w_i \cdot {\hat{y}}_i \end{aligned}$$

(18)

where $N$ is the number of models, $w_i$ is the weight assigned to the $i$-th model, subject to $\sum _{i=1}^{N} w_i = 1$, and ${\hat{y}}_i$ is the prediction of the $i$-th model.

This approach ensures that models with higher reliability contribute more to the final decision, improving overall accuracy.
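Eq. (18) amounts to a weighted sum of the per-model class probabilities followed by an argmax. The probabilities and weights in this sketch are made-up values for a single image and four models; they are not results from the study.

```python
import numpy as np

def weighted_ensemble(predictions, weights):
    """Eq. (18): y_hat = sum_i w_i * y_hat_i, with weights normalised to sum to 1."""
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()
    return np.tensordot(weights, np.asarray(predictions), axes=1)

# Hypothetical class probabilities (early blight, late blight, healthy), one row per model.
predictions = np.array([
    [0.60, 0.30, 0.10],
    [0.20, 0.70, 0.10],
    [0.50, 0.40, 0.10],
    [0.40, 0.50, 0.10],
])
weights = [0.4, 0.2, 0.3, 0.1]  # illustrative, not the optimised weights
y_hat = weighted_ensemble(predictions, weights)
predicted_class = int(np.argmax(y_hat))
```

With equal weights the first two classes would tie at 0.425 and 0.475 in favour of late blight; the unequal weights shift the decision toward the models given more trust.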

Performance measurement metrics

To assess deep learning models, various performance metrics are employed. The confusion matrix serves as a fundamental tool for computing accuracy, precision, recall, and the F1 score. The mathematical expressions for these metrics are presented in Equations (19) through (22):

$$\begin{aligned} & \text {Accuracy} = \frac{TP + TN}{TP + FN + FP + TN} \end{aligned}$$

(19)

$$\begin{aligned} & \text {Precision} = \frac{TP}{TP + FP} \end{aligned}$$

(20)

$$\begin{aligned} & \text {Recall} = \frac{TP}{TP + FN} \end{aligned}$$

(21)

$$\begin{aligned} & \text {F1-score} = \frac{2 \times \text {Precision} \times \text {Recall}}{\text {Precision} + \text {Recall}} \end{aligned}$$

(22)

where $TP$ stands for True Positive, $TN$ stands for True Negative, $FP$ stands for False Positive and $FN$ stands for False Negative.
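For a multi-class problem such as this one, the four counts are usually taken one class at a time (one-vs-rest) from the confusion matrix. The sketch below does this for a made-up three-class confusion matrix; the counts are illustrative and are not results from the study.

```python
import numpy as np

def classification_metrics(confusion, cls):
    """One-vs-rest accuracy, precision, recall and F1 for class `cls`.

    confusion[i, j] counts samples of true class i predicted as class j.
    """
    confusion = np.asarray(confusion)
    tp = confusion[cls, cls]
    fp = confusion[:, cls].sum() - tp   # predicted as cls but actually another class
    fn = confusion[cls, :].sum() - tp   # actually cls but predicted as another class
    tn = confusion.sum() - tp - fp - fn
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical confusion matrix (rows: true class, columns: predicted class).
confusion = [[90, 8, 2],
             [5, 92, 3],
             [1, 4, 95]]
early_blight = classification_metrics(confusion, cls=0)
```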
