An efficient IoT-based crop damage prediction framework in smart agricultural systems

This section presents the performance evaluation and assessment of the proposed method.

Experimental environment

The experiments were implemented using machine learning and ensemble algorithms written in Python, running on Windows 10 with an Intel(R) Core(TM) i9-7940X CPU @ 3.10 GHz and 64.0 GB of RAM.

Dataset

The proposed ML and EL models predict the outcome of the harvest season, i.e., whether the crop is healthy (alive), damaged by pesticides, or damaged by other causes, using the agriculture data acquired from data sources17. Comprehensive information about this dataset, such as the number of attributes and the training/testing split based on 10-fold cross-validation, is given in Table 3.

Table 3 Comprehensive information of the dataset used.

Missing values in the agriculture dataset are handled using four imputation techniques: imputation with the simple imputer (median), imputation with kNN, imputation with Linear Regression, and imputation with Extreme Gradient Boosted Decision Trees (XGBoost). Each imputed dataset then feeds the machine learning and ensemble algorithms that predict the outcome of the harvest season. Next, Bayesian optimization is applied to the highest-accuracy classifier to determine its hyperparameters, and finally the voting and weighted classifiers are applied.
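The four imputation strategies named above can be sketched with scikit-learn as follows; the toy matrix and the choice of `IterativeImputer` to realize regression-based imputation are illustrative assumptions, not the paper's exact implementation (an XGBoost regressor could be plugged into `IterativeImputer` in the same way as the linear model).

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import SimpleImputer, KNNImputer, IterativeImputer
from sklearn.linear_model import LinearRegression

# Toy matrix with missing entries (illustrative values only)
X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [4.0, np.nan],
              [5.0, 6.0]])

# Three of the strategies named in the text; an XGBoost regressor could
# replace LinearRegression inside IterativeImputer for the fourth.
imputers = {
    "simple_median": SimpleImputer(strategy="median"),
    "knn": KNNImputer(n_neighbors=2),
    "linear_regression": IterativeImputer(estimator=LinearRegression()),
}

filled = {name: imp.fit_transform(X) for name, imp in imputers.items()}
```

Each imputer returns a complete matrix that can then be passed to the downstream classifiers.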

Performance metrics

The predictive capacity of the ML and EL techniques is assessed using both the Mean Square Error (MSE) and the coefficient of determination (R2). The MSE evaluates the average squared distance between the predicted and true values and is defined as18,19:

$$\text{MSE}=\frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_{i}-y_{i}\right)^{2}$$

(1)

where \({\hat{y}}_{i}\) and \({y}_{i}\) are the ith predicted output and the ith true output, respectively. The MSE evaluates the accuracy of the predicted values from each model, where a lower MSE indicates higher accuracy.
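Eq. (1) computed directly with NumPy; the prediction and ground-truth vectors are illustrative values only.

```python
import numpy as np

y = np.array([1.0, 0.0, 2.0, 1.0])      # true outputs (illustrative)
y_hat = np.array([0.5, 0.0, 2.5, 1.0])  # predicted outputs (illustrative)

# Eq. (1): mean of squared residuals
mse = np.mean((y_hat - y) ** 2)
```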

The coefficient of determination quantifies how well the predicted values track the observed values. In this research, the Pearson correlation coefficient18,20 is employed to specify the accuracy of the predicted results. For the collected data, it can be calculated as follows:

$$R^{2}=\left(\frac{\sum_{i=1}^{n}\left(y_{p}(i)-\bar{y}_{p}\right)\left(y_{t}(i)-\bar{y}_{t}\right)}{\sqrt{\sum_{i=1}^{n}\left(y_{p}(i)-\bar{y}_{p}\right)^{2}}\,\sqrt{\sum_{i=1}^{n}\left(y_{t}(i)-\bar{y}_{t}\right)^{2}}}\right)^{2}$$

(2)

where \({y}_{p}(i)\) and \({y}_{t}(i)\) are the ith predicted output and the ith true output, respectively, and \(\bar{y}_{p}\), \(\bar{y}_{t}\) are their means. In this paper, both the MSE and R2 of the training data are used to select the best degree of complexity and performance for each ML and EL model.
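The squared Pearson correlation of Eq. (2) can be computed term by term as below; the two vectors are illustrative values only.

```python
import numpy as np

y_p = np.array([1.1, 1.9, 3.2, 3.9])  # predicted outputs (illustrative)
y_t = np.array([1.0, 2.0, 3.0, 4.0])  # true outputs (illustrative)

# Eq. (2): squared Pearson correlation between predictions and truth
num = np.sum((y_p - y_p.mean()) * (y_t - y_t.mean()))
den = (np.sqrt(np.sum((y_p - y_p.mean()) ** 2))
       * np.sqrt(np.sum((y_t - y_t.mean()) ** 2)))
r_squared = (num / den) ** 2
```

The manual computation agrees with `np.corrcoef`, which is a convenient cross-check in practice.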

The performance of the ML and EL models in the multiclass classification and prediction task is calculated using several statistical measures. Evaluation metrics such as accuracy, precision, F-score, and recall are used to compare the proposed classifier against existing ones.

The evaluation is based on true negatives (TN), true positives (TP), false positives (FP), and false negatives (FN). The accuracy of the classification model on a given test set is the proportion of test examples that are correctly categorized by the classifier. Precision measures the correctness of the positively labeled examples, while recall measures completeness, i.e., how many examples of the positive class are labeled correctly. Accuracy, precision, sensitivity (recall), and F-score are computed as per Eqs. (3)–(6), respectively21,22,23,24,25.

$$\text{Accuracy}=\frac{TP+TN}{TP+FP+FN+TN}$$

(3)

$$\text{Precision}=\frac{TP}{TP+FP}$$

(4)

$$\text{Recall}=\frac{TP}{TP+FN}$$

(5)

$$\text{F\_Score}=\frac{2\times(R\times P)}{P+R}$$

(6)
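Eqs. (3)–(6) follow directly from the four confusion counts; the counts below are illustrative, not taken from the paper's experiments.

```python
# Illustrative confusion counts
TP, TN, FP, FN = 80, 90, 10, 20

accuracy = (TP + TN) / (TP + FP + FN + TN)                  # Eq. (3)
precision = TP / (TP + FP)                                  # Eq. (4)
recall = TP / (TP + FN)                                     # Eq. (5)
f_score = 2 * (recall * precision) / (precision + recall)   # Eq. (6)
```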

Results analysis

Before conducting the experimental analysis, it was essential to properly prepare the data for machine learning and ensemble modeling. The real-world farming data collected from sensors exhibited non-uniform distribution, making it unsuitable for direct use in training and testing. To address this, input features were normalized, numerical columns were encoded, and categorical columns were processed using dummy encoding. These preprocessing steps ensured that the data was suitable for multiclass classification tasks.

Initially, we applied a train-test split (75% training, 25% testing) while handling missing values using mode imputation. We then used 10-fold cross-validation to further evaluate model performance. The results of these evaluations, including performance metrics for missing data imputation, are presented in Tables 4 and 5. Table 6 summarizes the outcomes of voting and weighted ensemble classifiers, while Table 7 shows the results of using a simple imputer for handling missing values. Figure 2 presents a comparative analysis of various machine learning algorithms under both train/test and cross-validation settings for crop damage prediction.
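The evaluation protocol above (a 75/25 hold-out split plus 10-fold cross-validation) can be sketched with scikit-learn; the synthetic three-class data and the random-forest classifier are stand-ins for the agriculture dataset and the paper's model zoo, not the actual experimental setup.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, cross_val_score

# Synthetic three-class data standing in for the agriculture dataset
X, y = make_classification(n_samples=400, n_features=10, n_informative=6,
                           n_classes=3, random_state=0)

# 75% training / 25% testing hold-out split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

clf = RandomForestClassifier(random_state=0).fit(X_train, y_train)
holdout_acc = clf.score(X_test, y_test)

# 10-fold cross-validation on the full data
cv_scores = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=10)
```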

Table 4 Performance of ML models for missing data mode imputation using train and test split.
Table 5 Performance of the models for missing data mode imputation using cross-validation.
Fig. 2

Comparative analysis between various ML algorithms for train/test split and cross validation (CV) for crop damage prediction.

Table 6 The performance of the voting and weight classified with different methods.
Table 7 The results of missing values imputation using simple imputer (SI).

The evaluation considers how each technique integrates missing data depending on the missingness pattern and its severity (from small to moderate to large amounts of affected data). Experiments on datasets with 10% of the values removed artificially illustrate the effect of missing patterns in the training and testing sets. This missingness level was chosen because it is the case most often encountered in agricultural datasets, typically caused by disconnections in data communication. The best configuration for each imputation model is found by trial and error, adapting the parameters of the supervised ML and EL models over their tuning ranges.
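One simple way to inject the ~10% artificial missingness described above is to randomly mask entries of a complete matrix; the matrix size and masking scheme here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))        # complete matrix (illustrative)

mask = rng.random(X.shape) < 0.10    # mark roughly 10% of the entries
X_missing = X.copy()
X_missing[mask] = np.nan             # masked entries become missing values
```

Keeping the original `X` allows the imputed values to be scored against ground truth with MSE and R2.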

For KNN, the tuning parameter for imputing the missing values ranges from 5 to 500 nearest neighbors. Because KNN estimates each missing value from the neighbors of the affected instance, it is prone to overfitting and noise sensitivity when K is too low, and prone to bias when K is so large that it draws on data points far from the true neighborhood. Since this imputation technique searches the full dataset to find the K nearest neighbors, it can be expensive and slow, especially for a huge dataset. The results of imputing missing data using the KNN imputer are shown in Table 8, and the results for LR in Table 9. Among the supervised ML and EL models considered, KNN is commonly chosen for data imputation because it replaces a missing value using the values of related cases (K-similar attributes) from the whole dataset: missing entries are estimated by identifying a set of K nearest neighbors and then averaging the non-missing values of those neighbors. The neighbors are determined by the Euclidean distance in Eq. (7), computed as the square root of the sum of squared differences between the estimated value \({\hat{y}}_{k}\) and the observed value \({y}_{k}\). Depending on the dataset size and the ratio of missing values, the imputation process needs careful tuning to avoid overfitting and sensitivity to noisy data points11,26,27,28,29,30.

$$E\left(A,B\right)=\sqrt{\sum_{k=1}^{N}\left(\hat{y}_{k}-y_{k}\right)^{2}}$$

(7)
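The effect of the neighbor count can be explored with scikit-learn's `KNNImputer`, which uses a NaN-aware Euclidean distance in the spirit of Eq. (7); the data and the K values tried here are illustrative.

```python
import numpy as np
from sklearn.impute import KNNImputer

rng = np.random.default_rng(1)
X = rng.normal(size=(60, 4))
X[rng.random(X.shape) < 0.1] = np.nan   # ~10% missing (illustrative)

# Small K risks noise sensitivity; large K risks bias from distant points.
results = {k: KNNImputer(n_neighbors=k).fit_transform(X)
           for k in (5, 15, 45)}
```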

Table 8 The results of missing values imputation using KNN imputer.

On the other hand, XGBoost relies on a gradient-boosting approach: it combines an ensemble of weak models to reach the final predictions. Because each model in the XGBoost ensemble concentrates on different regions of the data points, the training process yields a well-regularized XGBoost model. Hence, using a cross-validation procedure and testing on the held-out test set, we can obtain a generalized model that performs well on all splits of the data. From Table 9, it can be noted that XGBoost captures the underlying distribution and delivers lower errors, and its performance is substantially enhanced after data imputation. It follows that the choice of the best ML and EL approach depends on the type of missing-data imputation technique used. Considering that XGBoost shows the highest performance, for brevity this paper primarily details the model performance for XGBoost (Table 10).

Table 9 The results of missing values imputation using linear regression (LR) imputer.
Table 10 The results of missing values imputation using Xgboost imputer.

To compare the performance of different machine learning techniques used for imputing missing values, both Mean Squared Error (MSE) and R-squared (R2) metrics were employed. While a high R2 value (close to 1) indicates that the model fits the known data well, it does not always guarantee accurate imputation. This is evident in the case of the simple imputer, which achieved an R2 of 0.96 but was associated with a relatively high MSE of 0.1682, highlighting inconsistencies in prediction accuracy. Therefore, reliable imputation should be assessed using both high R2 and low MSE values. As shown in Table 11, the XGBoost model outperformed other methods by achieving the lowest MSE and highest R2, indicating superior imputation accuracy. Visual comparisons of the results across different imputation methods are provided in Figs. 3, 4, 5 and 6.

Table 11 Performance comparison for missing value handling.
Fig. 3

Accuracy results in different imputation algorithms.

Fig. 4

Precision results for different imputation algorithms.

Fig. 5

Recall results for different imputation algorithms.

Fig. 6

F1_Score results for different imputation algorithms.

In conclusion, this study establishes the effectiveness of supervised machine learning (ML) mechanisms and ensemble learning (EL) techniques for imputing missing data in the context of smart farming applications, particularly in predicting crop damage outcomes during the harvest season. Various ML models, including XGBoost, Linear Regression, KNN, and simple imputer, were explored. The dataset used in this study not only pertains to the harvest season outcomes (healthy, damaged by pesticides, or damaged by other reasons) but also presents a significant challenge with a substantial amount of missing data.

To address the complexities associated with missing data, multiple strategies were employed to bridge the data gaps, enabling further agricultural event processing and subsequent analysis. The dataset exhibited variations in predictive tracking capability and performance metrics, evaluated through MSE, R2, accuracy, precision, recall, and F-score, depending on the imputation models and hyperparameter settings. Notably, the Extreme Gradient Boosted Decision Trees (XGBoost) technique, with optimal hyperparameters, demonstrated superior performance in imputing missing data, achieving a low MSE (0.0213) and a high R2 (0.99) for up to a 10% missingness ratio.

The suitability of the selected data imputation method was found to be influenced by factors such as the data pattern, missingness mechanism, data type, and the ratio of missing values—all of which impact performance estimation. This study highlights that machine learning and ensemble learning techniques outperformed traditional methods, with ensemble models showing particularly strong results compared to individual machine learning algorithms and linear regression.

Lastly, Table 12 presents a benchmarking comparison between the proposed model and recent studies. It highlights that while methods like DNN, IML, and statistical models show strengths in accuracy and interpretability, they often face limitations such as high computational cost, sensitivity to outliers, or complexity. The proposed system stands out for combining interpretable ML and ensemble learning (EL) with Bayesian Optimization, achieving high imputation accuracy. It effectively handles missing data through a hybrid ML-EL imputation approach, outperforming traditional techniques.

Table 12 Comparison between the proposed model with the recent work.


