The remarkable performance of CNNs in image classification and object recognition prompted researchers to study their potential for semantic segmentation. Since then, numerous architectures have been published, with promising results across a variety of fields. Several CNN architectures have emerged, each with its own properties. Despite these differences, they share a common goal: to improve accuracy while reducing model complexity. Certain architectures perform well across a wide range of applications; GoogleNet [39], VGG16 [40], ResNet50 [41], and AlexNet [42] are popular choices for plant disease classification. For this project, we chose the PSPNet semantic segmentation architecture from the available options. The proposed architecture is divided into three stages: a semantic segmentation phase trained on the segmentation dataset, a feature extraction phase, and a classification phase trained on the symptoms dataset. The symptoms segmented by the first model are then processed and classified independently by the second model. This approach allows multiple symptoms to be identified and their severity levels to be assessed precisely. The workflow of the maize leaf disease identification method is shown in Fig. 2.

Workflow of maize leaf disease identification methods.
Data collection and pre-processing
This work uses maize disease images from the PlantVillage dataset, which are characterised by uniform backgrounds and show various maize diseases. The dataset is first separated into two classes, diseased and healthy, and the collected images are classified and labelled. The annotated images were used to train the PSPNet model, while the class-labelled images were used to classify the various maize diseases. All images were collected in JPEG and PNG formats. The dataset contains images of several diseases affecting maize plants, allowing machine-learning models for identifying and diagnosing these diseases to be developed and tested. It includes 1532 images of Common Rust, 1430 of Southern Rust, 1139 of Grey Leaf Spot (GLS), 574 of Maydis Leaf Blight (MLB), and 456 of Turcicum Leaf Blight (TLB), together with 1587 images of healthy maize leaves. Each leaf image is resized to 224 × 224 × 3 and then used to test the performance of the proposed model. Figure 3 shows samples of the pre-processed images.

Samples of pre-processed maize leaf diseases.
In this research, Gaussian blurring is used for noise reduction. It convolves the input image with a Gaussian kernel, computing a weighted average of pixel values in each local neighbourhood. This reduces the impact of random noise and slight visual fluctuations, resulting in cleaner input data for the neural network. The blur is applied as a convolution with a two-dimensional Gaussian kernel, as shown in Fig. 4. Gaussian blurring can be calculated as follows:
$${G}_{blurred}\left(u,v\right)=\sum_{i=-\infty }^{\infty }\sum_{j=-\infty }^{\infty }B\left(u+i,v+j\right)\cdot A\left(i,j\right)$$
(1)
where \({G}_{blurred}\left(u,v\right)\) is the pixel value in the blurred image at coordinates (u, v), B(u + i, v + j) is the pixel value in the original input image at coordinates (u + i, v + j), and A(i, j) is the Gaussian kernel value at coordinates (i, j).

Image pre-processing with a Gaussian filter.
A(i,j) is the Gaussian kernel defined as:
$$A\left(i,j\right)=\frac{1}{2\pi {\sigma }^{2}}\,{e}^{-\frac{{i}^{2}+{j}^{2}}{2{\sigma }^{2}}}$$
(2)
where \(\sigma\) is the standard deviation of the Gaussian distribution.
By smoothing out minor changes in pixel values, Gaussian blurring reduces noise and yields cleaner data. Gaussian blurring can also be applied to intermediate feature maps in deep neural networks to improve the interpretability of learnt features: smoothing these feature maps can make the network's representations easier to visualise and understand.
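As a rough illustration of Eqs. (1) and (2), the sketch below samples the Gaussian kernel on a finite window and applies it to a single-channel image. The window size, the value of σ, the reflect padding at the borders, and the renormalisation of the truncated kernel are all assumptions made for the example; the paper does not state which settings were used.

```python
import numpy as np

def gaussian_kernel(size: int = 5, sigma: float = 1.0) -> np.ndarray:
    """Sample A(i, j) from Eq. (2) on an odd size x size grid and normalise it."""
    half = size // 2
    i, j = np.mgrid[-half:half + 1, -half:half + 1]
    kernel = np.exp(-(i**2 + j**2) / (2.0 * sigma**2)) / (2.0 * np.pi * sigma**2)
    # The infinite sums of Eq. (1) are truncated, so rescale the weights to sum to 1.
    return kernel / kernel.sum()

def gaussian_blur(image: np.ndarray, size: int = 5, sigma: float = 1.0) -> np.ndarray:
    """Apply Eq. (1) to a 2-D (grayscale) image using the truncated kernel."""
    kernel = gaussian_kernel(size, sigma)
    half = size // 2
    rows, cols = image.shape
    # Reflect padding (an assumption) supplies B(u+i, v+j) near the borders.
    padded = np.pad(image.astype(float), half, mode="reflect")
    blurred = np.zeros((rows, cols), dtype=float)
    for di in range(size):          # di = i + half
        for dj in range(size):      # dj = j + half
            blurred += kernel[di, dj] * padded[di:di + rows, dj:dj + cols]
    return blurred

# Usage on a synthetic noisy image (not data from the paper).
rng = np.random.default_rng(0)
noisy = 0.5 + 0.1 * rng.standard_normal((32, 32))
smooth = gaussian_blur(noisy, size=5, sigma=1.0)
print(noisy.std(), smooth.std())  # the blurred image fluctuates less
```

For an RGB leaf image, the same function could be applied to each channel separately.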
Challenges of data preprocessing
- The internet-sourced dataset includes irrelevant or misleading content, such as watermarks in the form of logos or text, which can cast doubt on the authenticity of the photos.
- Noise can drastically reduce the efficiency and efficacy of visual analysis. It is caused by electronic equipment and particular lighting conditions. Prediction may fail if a leaf image contains multiple types of noise, such as Gaussian noise, impulse noise, and salt-and-pepper noise [13].
- Determining the optimum image size is a difficult part of this study. Although each piece of visual data in a large image may contribute useful information, the combined volume of that information can be enormous [29].
Data augmentation
Data augmentation is an essential approach in machine learning (ML) and deep learning (DL) for increasing the size and diversity of training datasets. It applies transformations to the original data to generate new samples that differ slightly from the originals while preserving the underlying patterns and attributes. For image data, these transformations include rotations, flips, scaling, cropping, and changes in brightness or contrast. As the dataset expands, the model is exposed to a greater range of variation, which makes it more robust and better able to generalise to unseen data. The method is particularly effective when the original dataset is small, since it helps reduce overfitting while improving the overall performance and reliability of machine learning models.
1. Flipping: The image is reversed along its horizontal or vertical axis, producing a mirror image of the original.
2. Zooming: Zoom is applied at random, with the degree of zoom chosen independently for each image and limited to 10%.
3. Cropping: Cropping windows are selected at random at each iteration, which increases the diversity of the dataset.
4. Rotation: Images are rotated by a random angle within a predetermined range, limited in this research to −90 to +90 degrees.
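The four transformations listed above can be sketched with plain NumPy as follows. This is only an illustrative sketch: the function names, the nearest-neighbour resampling, the centre-crop form of zooming, the 90% crop window, and the black fill after rotation are assumptions, not details reported in the paper.

```python
import numpy as np

def resize_nearest(img, out_h, out_w):
    """Nearest-neighbour resize, used to bring crops back to the input size."""
    rows = np.arange(out_h) * img.shape[0] // out_h
    cols = np.arange(out_w) * img.shape[1] // out_w
    return img[rows][:, cols]

def random_flip(img, rng):
    """Mirror the image horizontally or vertically, chosen at random."""
    return img[:, ::-1] if rng.random() < 0.5 else img[::-1, :]

def random_zoom(img, rng, max_zoom=0.10):
    """Zoom in by a random factor of at most 10%: crop the centre, resize back."""
    h, w = img.shape[:2]
    zoom = rng.uniform(0.0, max_zoom)
    ch, cw = int(round(h / (1 + zoom))), int(round(w / (1 + zoom)))
    top, left = (h - ch) // 2, (w - cw) // 2
    return resize_nearest(img[top:top + ch, left:left + cw], h, w)

def random_crop(img, rng, crop_frac=0.9):
    """Take a randomly placed window (crop_frac is hypothetical) and resize back."""
    h, w = img.shape[:2]
    ch, cw = int(h * crop_frac), int(w * crop_frac)
    top = rng.integers(0, h - ch + 1)
    left = rng.integers(0, w - cw + 1)
    return resize_nearest(img[top:top + ch, left:left + cw], h, w)

def random_rotate(img, rng, max_angle=90.0):
    """Rotate about the centre by a random angle in [-max_angle, +max_angle] degrees."""
    h, w = img.shape[:2]
    theta = np.deg2rad(rng.uniform(-max_angle, max_angle))
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    ys, xs = np.mgrid[0:h, 0:w]
    # Inverse mapping: for every output pixel, look up the source pixel.
    src_y = np.rint(cy + (ys - cy) * np.cos(theta) - (xs - cx) * np.sin(theta)).astype(int)
    src_x = np.rint(cx + (ys - cy) * np.sin(theta) + (xs - cx) * np.cos(theta)).astype(int)
    valid = (src_y >= 0) & (src_y < h) & (src_x >= 0) & (src_x < w)
    out = np.zeros_like(img)  # pixels with no source stay black
    out[valid] = img[src_y[valid], src_x[valid]]
    return out

# Usage on a synthetic 224 x 224 x 3 image (not data from the paper).
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(224, 224, 3), dtype=np.uint8)
augmented = [fn(image, rng) for fn in (random_flip, random_zoom, random_crop, random_rotate)]
```

Each function returns an image of the same size as its input, so the augmented samples can be mixed freely with the originals.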
An unequal number of samples per class affects recognition accuracy. Therefore, four general methods are used to augment the classes with few samples: random rotation, flipping, zooming, and cropping. To avoid severe deformation of the transformed images, the displacement of the key points in the perspective transformation is limited to less than 10% of the image's side length. The number of samples in the small classes is increased fourfold. After data augmentation, the dataset grows to 8,443 images. The dataset was divided into two parts: 80% for training and 20% for testing the model's performance. A validation split of 20% was then taken from the training data. The training subset is fed to the model so that it learns the fine details of the images. The validation subset is held out from the training data and used to monitor the model's performance: it is passed through the model after every training epoch and the results are analysed. After training, the test subset is used to evaluate the model's overall performance on data it has never seen.
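The split described above can be sketched as follows. The function name, the random shuffling, the seed, and the rounding rule are assumptions made for the example, and 8,443 is used as the dataset size only for illustration; the paper does not say how the split was implemented.

```python
import numpy as np

def split_indices(n_samples, test_frac=0.20, val_frac=0.20, seed=0):
    """Shuffle sample indices, hold out a test set, then carve validation out of training."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)
    n_test = int(round(n_samples * test_frac))
    test_idx, train_full = order[:n_test], order[n_test:]
    # The validation fraction is taken from the training portion, not from the whole set.
    n_val = int(round(len(train_full) * val_frac))
    val_idx, train_idx = train_full[:n_val], train_full[n_val:]
    return train_idx, val_idx, test_idx

train_idx, val_idx, test_idx = split_indices(8443)
print(len(train_idx), len(val_idx), len(test_idx))
```

Because the validation set is 20% of the 80% training portion, the effective proportions are roughly 64% training, 16% validation, and 20% test.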
Outline of PRFSVM model
The PRFSVM model combines three fundamental components for image analysis, viz. PSPNet, ResNet50, and Fuzzy SVM. It is intended to address several aspects of image analysis: image segmentation, classification, and uncertainty handling through Fuzzy SVM. PRFSVM is therefore more than a combination of these three components; it is meant to provide a comprehensive solution for a wide range of image analysis tasks.
Fundamental architecture of PSPNet
In this study, we employ PSPNet, a well-established semantic segmentation architecture from the literature. The Pyramid Scene Parsing Network (PSPNet) was introduced by Zhao et al. [43] and is designed for segmenting complex scenes in which global context information is critical for identifying related objects. PSPNet uses a multi-step strategy to segment leaf diseases [44,45]. It takes a maize leaf image as input and uses a number of convolutional layers to extract detailed features from the image. These features capture information ranging from fine surface details of the leaf to broader context. Crucially, PSPNet has a pyramid pooling module that collects features at different spatial scales. This enables the network to capture both local and global context within the image, which is critical for distinguishing disease-affected regions from healthy ones. After this contextual enrichment, the network performs semantic segmentation, assigning each pixel a label that indicates whether it belongs to a healthy or diseased area. Subsequent post-processing may improve the accuracy of the segmentation results. The end result is a segmented image in which colours or labels distinguish the classes, such as healthy and diseased parts of the maize leaf. This makes PSPNet effective for segmenting leaf diseases, which aids agricultural disease diagnosis and management. Figure 5 shows the architecture of PSPNet.

Here’s a simplified illustration of the PSPNet architecture, with a focus on the pyramid pooling module:
To retrieve the output of the pyramid pooling module PM, run the input image ‘I’ through the PSPNet model ‘PN’.
$${\text{PM}}\left( {\text{I}} \right) \, = {\text{ PN}}\left( {\text{I}} \right)$$
(3)
The pyramid pooling module ‘PM’ gathers context information at several scales by operating on the output of the preceding layers. PSPNet typically divides the feature map into regions at various scales (for example, 1 × 1, 2 × 2, 3 × 3, and 6 × 6) and uses average pooling to generate a context vector for each region. The pooled context maps are then upsampled to the resolution of the feature map and concatenated, so that the resulting feature map combines context information from several scales.
The pyramid pooling module’s concatenated feature map is then fed through further convolutional layers and softmax activation to obtain the final semantic segmentation map.
$${\text{PD}} = {\text{softmax}}\left( {{\text{conv}}\left( {{\text{PM}}\left( {\text{I}} \right)} \right)} \right)$$
(4)
where ‘PD’ is the output tensor indicating the probability distribution of each pixel’s membership in several semantic classes.
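The pyramid pooling step and Eq. (4) can be sketched in NumPy as below. This is a simplified sketch under stated assumptions: the feature map and weights are random stand-ins, nearest-neighbour upsampling replaces the interpolation used in practice, the per-branch channel-reduction convolutions of the published PSPNet are omitted, and a single 1 × 1 convolution (a per-pixel matrix product) stands in for the final convolutional layers.

```python
import numpy as np

def adaptive_avg_pool(feat, bins):
    """Average-pool an (H, W, C) feature map into a (bins, bins, C) grid."""
    h, w, c = feat.shape
    pooled = np.zeros((bins, bins, c))
    for r in range(bins):
        r0, r1 = (r * h) // bins, ((r + 1) * h + bins - 1) // bins
        for s in range(bins):
            c0, c1 = (s * w) // bins, ((s + 1) * w + bins - 1) // bins
            pooled[r, s] = feat[r0:r1, c0:c1].mean(axis=(0, 1))
    return pooled

def upsample_nearest(grid, out_h, out_w):
    """Nearest-neighbour upsampling of a pooled grid back to (out_h, out_w)."""
    rows = np.arange(out_h) * grid.shape[0] // out_h
    cols = np.arange(out_w) * grid.shape[1] // out_w
    return grid[rows][:, cols]

def pyramid_pooling(feat, bin_sizes=(1, 2, 3, 6)):
    """PM: concatenate the input features with upsampled context at each scale."""
    h, w, _ = feat.shape
    branches = [feat]
    for bins in bin_sizes:
        branches.append(upsample_nearest(adaptive_avg_pool(feat, bins), h, w))
    return np.concatenate(branches, axis=-1)

def softmax(logits, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(logits - logits.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def segment(feat, weights):
    """Eq. (4): PD = softmax(conv(PM(I))), with a 1x1 conv as the conv step."""
    pm = pyramid_pooling(feat)
    logits = pm @ weights            # 1x1 convolution == per-pixel matrix product
    return softmax(logits, axis=-1)

# Usage with random stand-ins for backbone features and classifier weights.
rng = np.random.default_rng(0)
feat = rng.random((12, 12, 8))
weights = rng.standard_normal((40, 2))   # 8 * (1 + 4) input channels, 2 classes
pm = pyramid_pooling(feat)
pd_map = segment(feat, weights)          # per-pixel class probabilities
```

Taking the arg-max of `pd_map` along the last axis would give a per-pixel label, for example healthy versus diseased.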
Feature pyramid network
ResNet50, short for Residual Network with 50 layers, is a deep convolutional neural network architecture that has considerably advanced computer vision and image recognition. Introduced in 2015 by Kaiming He et al. [46], ResNet50 is notable for its ability to address the vanishing gradient problem, a long-standing challenge in deep neural networks. The pair of operations at the beginning of the architecture, “7 × 7 conv 64, stride 2” followed by “3 × 3 max-pooling, stride 2”, is a key step in image feature extraction. The operation “7 × 7 conv 64, stride 2” is a 2D convolutional layer with a 7 × 7 kernel, 64 filters, and a stride of 2. This first convolutional layer serves a dual purpose: it applies filters to the input image to capture crucial visual elements, and it reduces the spatial dimensions of the feature maps. The “3 × 3 max-pooling, stride 2” operation then refines the extracted features further. It down-samples the feature maps with a 3 × 3 pooling window and a stride of 2, retaining the most prominent features while discarding extraneous information. This gradual reduction in spatial dimensions is an important part of ResNet50’s design, as it allows the network to learn increasingly abstract features in succeeding layers. Figure 6 shows the layer architecture of ResNet50 and Table 1 lists its layer parameters.

Layering architecture of ResNet50.
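The effect of the first two operations on the spatial size can be checked with the usual output-size formula for convolution and pooling layers. The padding values below (3 for the 7 × 7 convolution and 1 for the 3 × 3 max-pooling) are those of the standard ResNet implementation and are an assumption here, since the text only gives the kernel sizes and strides.

```python
def conv_output_size(size: int, kernel: int, stride: int, padding: int) -> int:
    """Spatial output size of a convolution or pooling layer along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1

input_size = 224  # side length of the resized leaf images
after_conv = conv_output_size(input_size, kernel=7, stride=2, padding=3)
after_pool = conv_output_size(after_conv, kernel=3, stride=2, padding=1)
print(input_size, after_conv, after_pool)  # 224 112 56
```

Under these assumptions, a 224 × 224 input is reduced to 112 × 112 by the convolution and to 56 × 56 by the max-pooling, before any residual block is applied.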
Here’s a simplified representation of the ResNet-50 feature extraction process:
Definitions:
- Image input: I with dimensions (Ht, Wd, Ch), i.e. height, width, and channel count.
- ResNet-50 model: RM, consisting of many layers.
- Feature extraction layer: FEL, a layer of the ResNet-50 model.
The feature extraction procedure is as follows:
1. Forward pass: the input image I is passed through the ResNet-50 model RM.
2. Feature extraction: the output FEL(I) at layer FEL has dimensions (Ht', Wd', Ch'), where Ht' < Ht and Wd' < Wd; it holds the high-level features learned by the network for the given input image.
3. Feature representation:
- After obtaining FEL(I), flatten the feature map into a 1D vector:
$${\text{FEL}}_{{\text{flat}}}\left( {\text{I}} \right) = {\text{flatten}}\left( {{\text{FEL}}\left( {\text{I}} \right)} \right)$$
(6)
- To reduce the spatial dimensions while retaining key properties, apply a pooling operation such as average pooling or max pooling, as in Eq. (7):
$${\text{FEL}}_{{\text{pooled}}}\left( {\text{I}} \right) = {\text{pooling}}\left( {{\text{FEL}}\left( {\text{I}} \right)} \right)$$
(7)
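Eqs. (6) and (7) can be illustrated with a random array standing in for the feature map FEL(I). The 7 × 7 × 2048 shape is an assumption: it is the size of the last convolutional stage of a standard ResNet-50 for a 224 × 224 input, and the paper does not say which layer serves as FEL or which pooling variant was used.

```python
import numpy as np

def flatten_features(fel: np.ndarray) -> np.ndarray:
    """Eq. (6): flatten an (H, W, C) feature map into a 1-D vector."""
    return fel.reshape(-1)

def global_average_pool(fel: np.ndarray) -> np.ndarray:
    """Eq. (7) with average pooling over the full spatial extent."""
    return fel.mean(axis=(0, 1))

def global_max_pool(fel: np.ndarray) -> np.ndarray:
    """Eq. (7) with max pooling over the full spatial extent."""
    return fel.max(axis=(0, 1))

# Random stand-in for FEL(I); the shape is an assumption (see the note above).
fel = np.random.default_rng(0).random((7, 7, 2048))
fel_flat = flatten_features(fel)
fel_avg = global_average_pool(fel)
fel_max = global_max_pool(fel)
print(fel_flat.shape, fel_avg.shape, fel_max.shape)  # (100352,) (2048,) (2048,)
```

Pooling yields a far shorter vector than flattening (2048 versus 100352 values in this example), which keeps the input to the downstream classifier small.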
Multi-class classification network model
Fuzzy Support Vector Machines (Fuzzy SVM) are employed to determine the severity of maize leaf diseases, especially when severity labels are not strictly binary but span a range of levels. Traditional SVMs are poorly suited to nuanced and imprecise severity evaluations. Fuzzy SVM adds fuzzy logic to the SVM framework, allowing uncertainty in the severity labels to be represented. This is useful for maize leaf diseases, where severity may range from mild to severe from case to case. Fuzzy SVM assigns membership values for each severity level, capturing the degree to which a sample belongs to the various classes or severity categories.
Fuzzy SVM uses fuzzy membership functions to indicate the degree to which data points belong to distinct classes [47]. The Gaussian membership function, which is frequently used in Fuzzy SVM, assigns each data point a membership value in each class and is given by Eq. (8):
$${u}_{ij}\left(x\right)=\exp\left(-\frac{{\left\Vert x-{\bar{x}}_{ij}\right\Vert }^{2}}{2{\sigma }_{i}^{2}}\right)$$
(8)
In Eq. (8), \({u}_{ij}\left(x\right)\) denotes the degree to which data point x belongs to class i. It is computed from the Euclidean distance between x and the mean \({\bar{x}}_{ij}\) of the Gaussian function for class i, scaled by the spread parameter \({\sigma }_{i}\).
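A minimal sketch of the Gaussian membership function of Eq. (8) follows. The two class centres, the feature values, and the spread are hypothetical numbers chosen only to show the behaviour of the function; they are not taken from the paper.

```python
import numpy as np

def gaussian_membership(x, centre, sigma):
    """Eq. (8): membership decays with the squared Euclidean distance to the class mean."""
    x = np.asarray(x, dtype=float)
    centre = np.asarray(centre, dtype=float)
    sq_dist = np.sum((x - centre) ** 2, axis=-1)
    return np.exp(-sq_dist / (2.0 * sigma ** 2))

# Hypothetical class means in a 2-D feature space.
centres = {"mild": np.array([0.2, 0.1]), "severe": np.array([0.9, 0.8])}
x = np.array([0.25, 0.15])  # hypothetical feature vector of one sample
memberships = {name: float(gaussian_membership(x, c, sigma=0.3))
               for name, c in centres.items()}
print(memberships)
```

A sample lying exactly on a class mean gets membership 1, and the value falls towards 0 as the sample moves away, at a rate set by the spread parameter.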
Fuzzy SVM seeks a hyperplane that maximizes the margin between classes while taking into account fuzzy memberships and slack variables (\({\xi }_{ij}\)):
$${y}_{i}\left(w{x}_{i}+c\right)\ge 1-{\xi }_{ij}$$
(9)
where \({\xi }_{ij}\ge 0\).
In this constraint, w denotes the hyperplane’s weight vector, c the bias term, and \({y}_{i}\) the class label of sample \({x}_{i}\). The fuzzy slack variables (\({\xi }_{ij}\)) allow certain data points to fall inside the margin or even be misclassified, reflecting the fuzzy character of the data.
Fuzzy SVM’s decision function uses the fuzzy memberships and the dual variables (\({\alpha }_{ij}\)) obtained during optimisation to classify new data points, as given in Eq. (10).
$$f\left(x\right)=\sum_{i=1}^{G}\sum_{j=1}^{H}{\alpha }_{ij}{y}_{i}{u}_{ij}\left(x\right)\left(wx+c\right)$$
(10)
Here, \({\alpha }_{ij}\) denotes the dual variables determined during optimisation, and \({y}_{i}\) denotes the class labels. The decision function calculates a weighted sum of fuzzy memberships for each class. The dual variables (\({\alpha }_{ij}\)) indicate the relevance of each data point and its membership in class i. The resulting value is used to classify new data points based on their fuzzy memberships.
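To make the role of the slack variables and memberships concrete, the sketch below computes the smallest slack that satisfies the constraint of Eq. (9) for a given hyperplane, and then a membership-weighted penalty. The weighted objective (half the squared norm of w plus C times the sum of membership times slack) is the formulation commonly used for fuzzy SVMs and is assumed here, because the text does not write the objective out. The code also uses one slack per sample rather than the double index of Eq. (9), and the data, hyperplane, and value of C are made up for the example.

```python
import numpy as np

def slack_variables(X, y, w, c):
    """Smallest xi_i >= 0 such that y_i (w . x_i + c) >= 1 - xi_i, as in Eq. (9)."""
    margins = y * (X @ w + c)
    return np.maximum(0.0, 1.0 - margins)

def fuzzy_svm_objective(X, y, memberships, w, c, C=1.0):
    """Assumed objective: 0.5 * ||w||^2 + C * sum_i membership_i * xi_i."""
    xi = slack_variables(X, y, w, c)
    return 0.5 * float(w @ w) + C * float(np.sum(memberships * xi))

# Made-up 2-D samples; the last one lies on the wrong side of the hyperplane.
X = np.array([[2.0, 0.0], [0.5, 0.0], [-2.0, 0.0], [0.5, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w, c = np.array([1.0, 0.0]), 0.0

xi = slack_variables(X, y, w, c)
crisp = fuzzy_svm_objective(X, y, np.ones(4), w, c)
fuzzy = fuzzy_svm_objective(X, y, np.array([1.0, 1.0, 1.0, 0.1]), w, c)
print(xi, crisp, fuzzy)
```

Giving the misclassified sample a low membership shrinks its contribution to the penalty, which is how a fuzzy SVM can reduce the influence of noisy or outlying samples on the hyperplane.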
Advantages of FSVM
- Fuzzy membership lessens the impact of noise and outliers, and different fuzzy membership functions affect different classification classes differently [13].
- The aim of Support Vector Machines (SVM) is to find the best hyperplane for partitioning the feature space while maximizing the classification margin [47].
- FSVM can estimate a weight for every training sample. To mitigate the effects of imbalanced datasets, FSVM discounts low-weight samples, such as noise samples, when building the classification hyperplane [47].
Ethical approval and consent to participate
No ethical approval is required, and the authors consent to participate in the paper.
Consent for publication
The authors give their consent for publication.