Optimizing potato yield predictions in Uttar Pradesh, India: a comparative analysis of machine learning models

The study focused on seven districts in Uttar Pradesh, India, viz., Agra, Aligarh, Etawah, Farrukhabad, Firozabad, Hathras, and Kannauj (Fig. 1). These districts were selected because of their significant contribution to the state’s potato production. They represent a diverse agro-climatic landscape, allowing a comprehensive analysis of the impact of weather variability on potato yields. For instance, Agra features semi-arid conditions, while Aligarh and Etawah experience more subtropical climates. The variability in weather patterns across Farrukhabad, Firozabad, Hathras, and Kannauj likewise shows how different weather scenarios affect potato cultivation.

Time series data of potato yield (in tons/ha) spanning 16 years (2005–06 to 2020–21) were acquired from the Directorate of Economics and Statistics, Ministry of Agriculture and Farmers Welfare¹⁹. Since potato is a seasonal crop cultivated once a year in the study regions, yield data are available only on an annual basis, giving 16 observations per district. This limited number of observations per district precludes the use of classical time series forecasting techniques. Instead, machine learning models, which are capable of learning complex nonlinear relationships from relatively small datasets, were employed to analyze the relationship between weather indices and potato yield. The yield series was detrended prior to model analysis. Detrending is a widely used approach to account for technological advancements: by removing the long-term trend, it ensures that only short-term, weather-induced variations in yield are analyzed, rather than results confounded by technological improvements. One commonly used approach involves fitting a predetermined function of time, such as a simple linear regression model or a second-order polynomial regression model, to the yield series. In this study, a simple linear regression model was used to detrend the potato yield data, an approach that numerous researchers have used to remove trends from crop yield data and analyze the effects of climate variability^14,16,20.
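The linear detrending step described above can be sketched as follows. This is a minimal illustration in Python, not the study's own implementation: the function name `detrend_linear` and the yield values are hypothetical, and it assumes that the detrended series is simply the residual from an ordinary least squares fit of yield on year.

```python
def detrend_linear(years, yields):
    """Fit yield = intercept + slope * year by least squares; return the fit and residuals."""
    n = len(years)
    mean_t = sum(years) / n
    mean_y = sum(yields) / n
    sxx = sum((t - mean_t) ** 2 for t in years)
    sxy = sum((t - mean_t) * (y - mean_y) for t, y in zip(years, yields))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_t
    # Residuals are what remains after the long-term (technology) trend is removed.
    residuals = [y - (intercept + slope * t) for t, y in zip(years, yields)]
    return intercept, slope, residuals

# Hypothetical yields (t/ha) for illustration only; these are not data from the study.
years = list(range(2005, 2021))  # 16 seasons, labelled by their starting year
yields = [20.0 + 0.5 * k + (1.0 if k % 2 == 0 else -1.0) for k in range(16)]
intercept, slope, detrended = detrend_linear(years, yields)
```

By construction, least squares residuals sum to zero and are uncorrelated with time, so any remaining year-to-year variation can be attributed to factors other than the linear trend.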

Further, daily data on weather parameters, including maximum temperature (°C), minimum temperature (°C), rainfall (mm), relative humidity (%), wind speed (m/s) and solar radiation (MJ/m²/day), were collected for the potato cultivation period for the selected locations from the NASA POWER web portal (https://power.larc.nasa.gov/data-access-viewer/). The POWER data products used in the study have a spatial resolution of 0.5° latitude by 0.5° longitude and provide global coverage. The dataset contained no missing values. The weather data correspond to the period during which the potato crop stands in the field in the study region, from the first week of October (sowing) to the last week of January (harvesting). Weekly averages were computed from the daily data, and weather indices were subsequently calculated from them. The output (dependent) variable to be predicted is potato yield (tons/ha); the weather indices calculated from the six weather variables act as the independent variables. This study used 70% of the data for model training and the remaining 30% for model testing. The study assumed that, over an area as large as a district, differences in farming practices either stay the same or cancel each other out. Thus, weather variation was considered the primary factor affecting crop yields¹².
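The aggregation of daily weather records into weekly averages can be illustrated with a short Python sketch. The function name `weekly_means`, the seven-day blocking from the first day of the series, and the temperature values are assumptions made for illustration; the study does not describe how week boundaries were defined.

```python
def weekly_means(daily_values, days_per_week=7):
    """Average consecutive blocks of daily values.

    A trailing partial week, if any, is averaged over the days it contains.
    """
    weeks = []
    for start in range(0, len(daily_values), days_per_week):
        block = daily_values[start:start + days_per_week]
        weeks.append(sum(block) / len(block))
    return weeks

# Hypothetical daily maximum temperatures (deg C) for two weeks; not data from the study.
daily_tmax = [30.0, 31.0, 29.0, 30.0, 32.0, 28.0, 30.0,
              27.0, 26.0, 28.0, 27.0, 25.0, 29.0, 27.0]
print(weekly_means(daily_tmax))  # [30.0, 27.0]
```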

Effect of weather variables on potato yield

Potatoes are highly sensitive to meteorological conditions, and climatic factors play a crucial role in influencing their growth and development. The primary climatic elements impacting potato yields include air temperature, rainfall, and light²¹. Air temperatures have a profound effect on various stages of the potato lifecycle, including germination, emergence, canopy development, tuber bulking and the duration of the growth period. The potato, being a crop with shallow roots that thrives in cooler seasons, benefits from lower nighttime temperatures²². Tables 1 and 2 provide, respectively, a summary of the average weather variables throughout the potato growing period and the summary statistics of yield across the study locations. The average maximum and minimum temperatures at all these locations fall within a favorable range, conducive to optimal potato production. Temperatures exceeding the optimal range negatively impact root and stolon development, delay tuber initiation, and reduce starch accumulation, thereby lowering yields²³. Inappropriate night temperatures hinder tuber formation, increase respiration, and accelerate the utilization of assimilates such as starch²⁴. Numerous earlier studies tend to prioritize temperature over other variables, consistently indicating that elevated temperatures have a detrimental impact on potato crops^25,26.

Table 1 Sample statistics of the weekly weather parameters during potato cultivation period of different locations.

Table 2 Summary statistics (Yield (t/ha)) of all the study locations.

Being a crop with shallow roots, potatoes are highly susceptible to water deficits. The optimal water requirement for potatoes ranges between 400 and 800 mm, depending on weather conditions and agricultural management practices. Inadequate water availability, as indicated by studies^27,28, can detrimentally impact potato growth, development, yield, and tuber quality. Insufficient water leads to a reduction in the number of leaves, leaf area, light energy availability and utilization, as well as tuber yield and quality²⁹. In arid and semi-arid regions, where low precipitation and higher evaporation rates prevail, water deficits often constrain crop yields³⁰. Precipitation emerges as the primary climatic factor influencing potato yield, particularly in rain-fed cultivation areas with relatively low and unevenly distributed local rainfall²². Conversely, excessive rainfall can also be detrimental to potatoes, impeding proper respiration and oxygen exchange and leading to tissue decay within the tubers³¹. Table 1 presents the average seasonal rainfall for each study location, which ranges from 42 to 47 mm and is insufficient for successful potato cultivation. Irrigation in these districts compensates for the insufficient rainfall.

Sunshine, referred to as solar radiation, plays a crucial role in the germination and growth of potatoes. The absence of sunlight negatively affects potato development, leading to stunted growth^26,32,33,34. The energy absorbed by potato plant leaves from sunlight serves as a source for photosynthesis, contributing to an increase in surface temperature³⁵. The leaf’s capacity to absorb solar radiation is influenced by its area index or surface area. Enhanced solar radiation positively impacts potato crops, and the quantity and quality of sunlight intercepted by the canopy vary with seasonal changes and daily cycles due to Earth’s tilted axis affecting sunlight absorption³⁶.

Model training and testing for each district

To account for the distinct climatic, environmental, and agricultural conditions across the seven districts studied, each district was treated as a unique case. Separate machine learning models were trained and validated for each district using its respective dataset. This approach ensured that the models could capture location-specific patterns and relationships between weather variables and potato yields.

Computation of weather indices

The study by³⁷ highlights the complex relationship between weather variables and crop yield, acknowledging that not all weather variables influence crop yield equally. This variation in influence is attributed to the fluctuating magnitudes of these variables at different stages of crop growth³⁸. To capture this intra-seasonal variability and its impact on crop yield, the study used computed weather indices as predictors. These indices served a dual purpose: they provided a more nuanced understanding of the weather’s influence on crop yield, and they streamlined the analysis by reducing the number of predictors.

To compute the weighted weather indices, we used a two-step process. First, we calculated the correlation coefficients between the yield variable (potato yield) and each weather variable for every week of the crop growth cycle. Second, we computed the sum-products by multiplying the weekly values of each weather variable with their corresponding correlation coefficients derived in the first step. The unweighted weather index captures the cumulative effect of different weather variables over the crop growth period, while the weighted weather index highlights the influence of each weather variable based on its specific contribution to yield during each week. These indices effectively summarize the impact of weather on crop yield across different growth stages, reducing the number of predictors while retaining the most relevant information for the prediction process.

By computing and utilizing these weather indices, the study aimed to achieve a more accurate and efficient representation of the dynamic and complex interactions between weather patterns and crop yields. These indices were computed by using the following formula:

$${Z}_{ij}={\sum }_{w=1}^{n}{X}_{iw}, \quad {Z}_{ii{^{\prime}}j}={\sum }_{w=1}^{n}{X}_{iw}{X}_{i{^{\prime}}w} \quad (j=0)$$

$${Z}_{ij}={\sum }_{w=1}^{n}{r}_{iw}^{j}{X}_{iw}, \quad {Z}_{ii{^{\prime}}j}={\sum }_{w=1}^{n}{r}_{ii{^{\prime}}w}^{j}{X}_{iw}{X}_{i{^{\prime}}w} \quad (j=1)$$

where ${X}_{iw}$ is the value of the i^th weather variable in the w^th week, ${r}_{iw}^{j}$ is the correlation coefficient of detrended yield with the i^th weather variable in the w^th week, ${r}_{ii{^{\prime}}w}^{j}$ is the correlation coefficient of detrended yield with the product of the i^th and i′^th weather variables in the w^th week, and n is the week of forecast. Indices with j = 0 are unweighted and those with j = 1 are weighted. These indices were calculated using Microsoft Excel³⁹.
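The index computation can be sketched in Python as follows. The study reports doing this step in Microsoft Excel, so the code is only an illustration under stated assumptions: the function names, the data layout (one row per year, one column per week) and all numbers are hypothetical, and the correlations are taken across years for each week.

```python
from math import sqrt

def pearson(a, b):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)

def weather_indices(x, detrended_yield):
    """x[t][w] is the weekly value of one weather series in year t, week w.

    Returns the weekly correlations, the unweighted index (j = 0) and the
    weighted index (j = 1), the latter two with one value per year.
    """
    n_weeks = len(x[0])
    r = [pearson([row[w] for row in x], detrended_yield) for w in range(n_weeks)]
    unweighted = [sum(row) for row in x]
    weighted = [sum(rw * v for rw, v in zip(r, row)) for row in x]
    return r, unweighted, weighted

# Hypothetical data: 4 years x 3 weeks; these are not values from the study.
tmax = [[30.0, 28.0, 25.0], [31.0, 27.0, 24.0], [29.0, 29.0, 26.0], [32.0, 26.0, 23.0]]
rh = [[60.0, 65.0, 70.0], [62.0, 66.0, 72.0], [58.0, 64.0, 69.0], [63.0, 67.0, 74.0]]
y = [0.5, -0.2, 0.8, -1.1]  # detrended yield, one value per year

r_weeks, z0, z1 = weather_indices(tmax, y)

# Interaction indices apply the same computation to the week-wise product of two variables.
product = [[a * b for a, b in zip(row_a, row_b)] for row_a, row_b in zip(tmax, rh)]
r_inter, z0_inter, z1_inter = weather_indices(product, y)
```

With six weather variables, this yields one unweighted and one weighted index per variable and per pair of variables, which is far fewer predictors than the full set of weekly values.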

Machine learning models

Machine learning has emerged as a powerful decision-support tool for crop yield estimation, with multiple studies highlighting its effectiveness^40,41,42. Machine learning technology has the potential to help farmers lower their farming losses by providing them with comprehensive crop advice and insights. In the present study, the machine learning models investigated are Elastic Net (ELNET), Random Forest, Artificial Neural Network (ANN), Extreme Gradient Boosting (XGBoost) and Support Vector Regression (SVR). The data analysis was performed using RStudio version 2024.12.0, an integrated development environment for R⁴³. Table 3 provides a summary of the best tuned hyperparameters used in all models. Figure 2 shows the steps involved in training and testing these models.

Table 3 Best tuned parameters of all employed models.

Random forest

Random Forest is a versatile machine learning method that has gained significant attention in various fields, including agriculture^6,10,44,45,46. This ensemble learning technique has been extensively studied and proven to be a valuable tool for predicting crop yield and providing essential insights for farmers. The Random Forest algorithm, as described by⁴⁷, is a non-parametric statistical method that utilizes an ensemble of decision trees, making it suitable for both regression and classification problems.

The strength of Random Forest lies in its ability to handle large datasets and complex relationships within the data, outperforming traditional models like logistic regression, as highlighted by⁴⁸. In the realm of agriculture, Random Forest has been widely adopted for diverse applications. For instance, Everingham et al.⁴⁹ demonstrated the accurate prediction of sugarcane yield using a Random Forest algorithm, showcasing its effectiveness in forecasting agricultural outcomes. Similarly, Bahri et al.⁵⁰ utilized Random Forest for credit scoring models in agriculture, emphasizing its adaptability and utility in financial assessments for farmers.
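The ensemble idea behind Random Forest can be illustrated with a deliberately simplified Python sketch. It is not the implementation used in the study: each "tree" here is a depth-one regression tree fitted on a bootstrap sample using one randomly chosen feature, which retains only the two core ingredients (bootstrap aggregation and random feature selection). All names and data are hypothetical.

```python
import random

def fit_stump(rows, targets, feature):
    """Fit a depth-one regression tree on a single feature (squared-error split)."""
    values = sorted(set(row[feature] for row in rows))
    best = None
    for lo, hi in zip(values, values[1:]):
        thr = (lo + hi) / 2
        left = [t for row, t in zip(rows, targets) if row[feature] <= thr]
        right = [t for row, t in zip(rows, targets) if row[feature] > thr]
        mean_l, mean_r = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((t - mean_l) ** 2 for t in left)
               + sum((t - mean_r) ** 2 for t in right))
        if best is None or sse < best[0]:
            best = (sse, thr, mean_l, mean_r)
    if best is None:
        # The feature is constant in this bootstrap sample: always predict the mean.
        mean = sum(targets) / len(targets)
        return feature, float("inf"), mean, mean
    return feature, best[1], best[2], best[3]

def fit_forest(rows, targets, n_trees=50, seed=0):
    """Train n_trees stumps, each on a bootstrap sample and a random feature."""
    rng = random.Random(seed)
    n, p = len(rows), len(rows[0])
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # sample rows with replacement
        feature = rng.randrange(p)                  # random feature for this tree
        forest.append(fit_stump([rows[i] for i in idx],
                                [targets[i] for i in idx], feature))
    return forest

def predict(forest, row):
    """Average the predictions of all trees."""
    preds = [mean_l if row[f] <= thr else mean_r for f, thr, mean_l, mean_r in forest]
    return sum(preds) / len(preds)

# Hypothetical data: the target depends on the first feature only.
rows = [[1.0, 5.0], [2.0, 3.0], [3.0, 8.0], [4.0, 1.0], [5.0, 7.0], [6.0, 2.0]]
targets = [1.0, 1.0, 1.0, 5.0, 5.0, 5.0]
forest = fit_forest(rows, targets)
```

Averaging many trees fitted to perturbed versions of the data reduces the variance of individual trees, which is the property that makes the method attractive for small, noisy yield datasets.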

Extreme gradient boosting (XGBoost)

Extreme Gradient Boosting (XGBoost) is an ensemble learning algorithm that has gained significant attention in various fields due to its superior performance and robustness⁵¹. It is a machine learning technique that falls under the category of boosting algorithms, which sequentially adds predictors and corrects previous models using the gradient descent algorithm⁵². XGBoost is known for its high flexibility, strong predictability, generalization ability, scalability, and model training efficiency⁵¹. It is an ensemble algorithm based on trees or linear classifiers, and it integrates multiple tree models, providing stronger interpretability⁵³.
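The sequential correction performed by boosting can be sketched in a few lines of Python. This is plain gradient boosting for squared error with depth-one trees on a single feature, offered as an assumption-laden illustration; XGBoost itself adds regularization terms, second-order gradient information and many engineering optimizations that are not shown. Function names and data are hypothetical.

```python
def fit_stump(xs, residuals):
    """Best threshold split on one feature under squared error.

    Assumes xs contains at least two distinct values.
    """
    best = None
    values = sorted(set(xs))
    for lo, hi in zip(values, values[1:]):
        thr = (lo + hi) / 2
        left = [r for x, r in zip(xs, residuals) if x <= thr]
        right = [r for x, r in zip(xs, residuals) if x > thr]
        mean_l, mean_r = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((r - mean_l) ** 2 for r in left)
               + sum((r - mean_r) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, thr, mean_l, mean_r)
    return best[1], best[2], best[3]

def boost(xs, ys, n_rounds=20, learning_rate=0.3):
    """Start from the mean, then repeatedly fit a stump to the current residuals."""
    base = sum(ys) / len(ys)
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_rounds):
        # For squared error, the negative gradient is the residual y - prediction.
        residuals = [y - p for y, p in zip(ys, preds)]
        thr, mean_l, mean_r = fit_stump(xs, residuals)
        stumps.append((thr, mean_l, mean_r))
        preds = [p + learning_rate * (mean_l if x <= thr else mean_r)
                 for x, p in zip(xs, preds)]
    return base, learning_rate, stumps

def predict(model, x):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(ml if x <= thr else mr for thr, ml, mr in stumps)

# Hypothetical data for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.0, 2.5, 3.0, 6.0, 6.5, 7.0]
model = boost(xs, ys)
mean_y = sum(ys) / len(ys)
baseline_mse = sum((y - mean_y) ** 2 for y in ys) / len(ys)
boosted_mse = sum((y - predict(model, x)) ** 2 for x, y in zip(xs, ys)) / len(ys)
```

The learning rate shrinks each correction so that no single tree dominates, which is one of the hyperparameters typically tuned in boosting models.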

Artificial neural network (ANN)

ANNs are a class of machine learning models inspired by the human brain’s neural structure^54,55. Their widespread adoption is attributed to their ability to address complex problems, recognize patterns, and learn from data, making them valuable tools for predictive modeling. In this study, an ANN with a single hidden layer was implemented using the ‘nnet’ package. The architecture is a Multilayer Perceptron (MLP), a type of feed-forward neural network comprising input, hidden, and output layers. A key challenge with ANNs is finding the optimal number of hidden nodes to avoid overfitting. This study addressed the issue by using cross-validation to tune two hyperparameters, i.e. size (the number of hidden nodes) and decay (the weight-decay penalty)⁵⁶. Tuning was carried out with the ‘train’ function from the ‘caret’ package, with the candidate values supplied through its ‘tuneGrid’ argument.
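The roles of the two tuned hyperparameters can be made concrete with a small Python sketch of a single-hidden-layer network. It is an illustration only and does not reproduce the ‘nnet’ fitting routine: it assumes logistic hidden units, a linear output unit, and a fitting criterion equal to the sum of squared errors plus decay times the sum of squared weights. All names and numbers are hypothetical.

```python
from math import exp

def forward(x, w_hidden, b_hidden, w_out, b_out):
    """Single hidden layer: logistic hidden units followed by a linear output."""
    hidden = [1.0 / (1.0 + exp(-(b + sum(w * xi for w, xi in zip(ws, x)))))
              for ws, b in zip(w_hidden, b_hidden)]
    return b_out + sum(w * h for w, h in zip(w_out, hidden))

def penalized_loss(X, y, params, decay):
    """Sum of squared errors plus a weight-decay penalty on all weights."""
    w_hidden, b_hidden, w_out, b_out = params
    sse = sum((yi - forward(xi, *params)) ** 2 for xi, yi in zip(X, y))
    weights = [w for ws in w_hidden for w in ws] + list(b_hidden) + list(w_out) + [b_out]
    return sse + decay * sum(w * w for w in weights)

# Hypothetical network with size = 2 hidden nodes and two inputs.
params = ([[0.0, 0.0], [0.0, 0.0]],  # hidden-layer weights (one list per hidden node)
          [0.0, 0.0],                # hidden-layer biases
          [1.0, 1.0],                # output weights
          0.0)                       # output bias
X, y = [[1.0, 2.0]], [3.0]
loss = penalized_loss(X, y, params, decay=0.1)
```

Here `size` corresponds to the number of entries in the hidden layer, while a larger `decay` penalizes large weights more strongly and so discourages overfitting.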

Support vector regression (SVR)

The Support Vector Regression (SVR) algorithm, proposed by⁵⁷, extends the Support Vector Machine (SVM)⁵⁸ to the prediction of continuous outcomes. SVR operates by establishing the best-fit hyperplane in a high-dimensional space, effectively capturing the relationship between input variables and a continuous target variable⁵⁹. This hyperplane is determined by the support vectors, i.e. the data points that lie on or outside the epsilon margin around it. The SVR algorithm aims to minimize the error between the predicted and actual values while tolerating deviations smaller than the epsilon margin⁶⁰.
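The tolerance defined by the epsilon margin corresponds to the epsilon-insensitive loss, which a short Python sketch can illustrate. The function name and the numbers are hypothetical, and the sketch shows only the loss, not the full SVR optimization.

```python
def epsilon_insensitive_loss(actual, predicted, epsilon):
    """Errors smaller than epsilon cost nothing; larger errors cost the excess."""
    return [max(0.0, abs(a - p) - epsilon) for a, p in zip(actual, predicted)]

# Hypothetical yields (t/ha) and predictions, with a tolerance of 1 t/ha.
actual = [20.0, 22.0, 25.0]
predicted = [20.5, 24.0, 25.0]
print(epsilon_insensitive_loss(actual, predicted, epsilon=1.0))  # [0.0, 1.0, 0.0]
```

Only the second observation falls outside the margin and contributes to the loss, which is why a fitted SVR model depends on a subset of the training points.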

Elastic net (ELNET)

Elastic Net, introduced by⁵, overcomes the limitations inherent in both ridge and LASSO (Least Absolute Shrinkage and Selection Operator) regression methods. While LASSO regression excels when variables are less correlated, and ridge regression performs better when variables are highly correlated, both methods may struggle in models with many variables where the correlation structure is not well understood. Elastic Net addresses this issue by combining the penalties of both LASSO (l1 norm) and ridge (l2 norm) regressions. This hybrid approach enables Elastic Net to provide more accurate predictions by considering both types of regularization penalties.
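For reference, one common way of writing the combined penalty is shown below. This is the parameterization used by the glmnet software and is given here as an assumption, since the study does not state which formulation or package was used; the symbols N (number of observations), ${y}_{k}$, ${x}_{k}$, ${\beta }_{0}$, $\beta$, $\lambda$ and $\alpha$ are introduced only for this illustration.

$$\hat{\beta }=\underset{{\beta }_{0},\beta }{\mathrm{arg\,min}}\left\{\frac{1}{2N}{\sum }_{k=1}^{N}{\left({y}_{k}-{\beta }_{0}-{x}_{k}^{T}\beta \right)}^{2}+\lambda \left[\alpha {\Vert \beta \Vert }_{1}+\frac{1-\alpha }{2}{\Vert \beta \Vert }_{2}^{2}\right]\right\}$$

Under this form, $\lambda$ controls the overall strength of regularization and $\alpha$ sets the mix between the two penalties: $\alpha = 1$ gives LASSO, $\alpha = 0$ gives ridge regression, and intermediate values give the Elastic Net.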

Model evaluation metrics

The performance of the models was assessed using several key metrics, i.e. R², root mean square error (RMSE), normalized root mean square error (nRMSE), mean bias error (MBE) and Nash–Sutcliffe efficiency (NSE). An R² value approaching 1, along with RMSE (t/ha) and MBE (t/ha) values close to 0, signifies superior model performance. MBE values can be either positive or negative; because MBE is defined here as actual minus modeled yield, positive values indicate underestimation and negative values indicate overestimation. Additionally, model performance is categorized as excellent, good, fair, or poor when the nRMSE value falls within 0–10%, 10–20%, 20–30%, or above 30%, respectively, a categorization also used by^61,62. Further, NSE ranges from −∞ to 1. An NSE of 1 indicates perfect model performance, i.e. the model’s predictions exactly match the observed data. Values of 0 < NSE < 1 indicate good model performance, and the closer the NSE is to 1, the more accurate the model is. NSE < 0 indicates poor model performance; in this case, the observed mean is a better predictor than the model. The equations for these metrics are presented below:

$${R}^{2}={\left[\frac{n{\sum }_{i=1}^{n}{X}_{A,i}{X}_{M,i}-\left({\sum }_{i=1}^{n}{X}_{A,i}\right)\left({\sum }_{i=1}^{n}{X}_{M,i}\right)}{\sqrt{\left\{n{\sum }_{i=1}^{n}{X}_{A,i}^{2}-{\left({\sum }_{i=1}^{n}{X}_{A,i}\right)}^{2}\right\}\left\{n{\sum }_{i=1}^{n}{X}_{M,i}^{2}-{\left({\sum }_{i=1}^{n}{X}_{M,i}\right)}^{2}\right\}}}\right]}^{2}$$

$$RMSE=\sqrt{\frac{1}{n}{\sum }_{i=1}^{n}{\left({X}_{A,i}-{X}_{M,i}\right)}^{2}}$$

$$nRMSE=\sqrt{\frac{1}{n}{\sum }_{i=1}^{n}{\left({X}_{A,i}-{X}_{M,i}\right)}^{2}}\times \frac{100}{{\bar{X}}_{A}}$$

$$MBE=\frac{{\sum }_{i=1}^{n}\left({X}_{A,i}-{X}_{M,i}\right)}{n}$$

$$NSE=1-\frac{{\sum }_{i=1}^{n}{\left({X}_{A,i}-{X}_{M,i}\right)}^{2}}{{\sum }_{i=1}^{n}{\left({X}_{A,i}-{\bar{X}}_{A}\right)}^{2}}$$

where ${X}_{A,i}$ and ${X}_{M,i}$ are the actual and modeled values of potato yield for the i^th observation, the overbar denotes the mean of the actual values, and n is the total number of observations^63,64,65.
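The five metrics can be computed together as in the following Python sketch. The function name `evaluate` and the yield values are hypothetical, R² is computed as the squared Pearson correlation between actual and modeled values, and MBE is taken as actual minus modeled, following the equation above.

```python
from math import sqrt

def evaluate(actual, modeled):
    """Return R2, RMSE, nRMSE (%), MBE and NSE for paired actual/modeled values."""
    n = len(actual)
    mean_a = sum(actual) / n
    sum_a, sum_m = sum(actual), sum(modeled)
    sum_am = sum(a * m for a, m in zip(actual, modeled))
    sum_a2 = sum(a * a for a in actual)
    sum_m2 = sum(m * m for m in modeled)
    # Pearson correlation between actual and modeled values.
    r = (n * sum_am - sum_a * sum_m) / sqrt(
        (n * sum_a2 - sum_a ** 2) * (n * sum_m2 - sum_m ** 2))
    sq_err = sum((a - m) ** 2 for a, m in zip(actual, modeled))
    rmse = sqrt(sq_err / n)
    return {
        "R2": r ** 2,
        "RMSE": rmse,
        "nRMSE": 100 * rmse / mean_a,
        "MBE": sum(a - m for a, m in zip(actual, modeled)) / n,  # actual minus modeled
        "NSE": 1 - sq_err / sum((a - mean_a) ** 2 for a in actual),
    }

# Hypothetical yields (t/ha) for illustration only; not results from the study.
actual = [20.0, 22.0, 24.0, 26.0]
modeled = [21.0, 21.0, 25.0, 25.0]
metrics = evaluate(actual, modeled)
```

For these illustrative values, RMSE is 1 t/ha, MBE is 0 because over- and underestimates cancel, and nRMSE is about 4.3%, which would fall in the "excellent" category described above.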
