Powdery mildew resistance prediction in Barley (Hordeum Vulgare L) with emphasis on machine learning approaches

Plant materials, experiment condition, and measurement of phenotypic data

The road map of this project is shown schematically in Fig. 10. A population of barley plants was created by crossing two cultivars, Badia (female) and Kavir (male). This population was then used to develop 103 F8-F9 lines using the SSD method. The varieties Badia (head heat sensitive) and Kavir (terminal heat tolerant) were licensed under ICARDA (International Center for Agricultural Research in Dry Areas) and SPII (Seed and Plant Improvement Institute). The F8-F9 lines were grown at the Gonbad Kavous University Research Farm in 2018/2019 and 2019/2020 using an α-lactic model on three planting dates (19 November, 19 January and 19 March) at three replicates.

The F8-F9 lines were grown at the Gonbad Kavous University Research Farm in 2018/2019 and 2019/2020 using an α-lactic model on three planting dates (19 November, 19 January and 19 March) at three replicates. Planting dates were set so the genotypes were exposed to various temperature and rainfall changes during planting and ripening. In this way, multiple data was obtained, and this issue became suitable for estimation and prediction. Compared to datasets in previous powdery mildew resistance studies, the dataset used in this research offers significant improvements in terms of size, genetic diversity, and environmental variations. The dataset is composed of a relatively large F8-F9 population of 130 barley lines, derived from a cross between genetically diverse cultivars (‘Badia’ and ‘Kavir’), thus encompassing considerable genetic diversity relevant to powdery mildew resistance. Furthermore, experiments were conducted across three distinct planting dates in two consecutive years (2018/2019 and 2019/2020), explicitly incorporating environmental variations. This multi-environment approach ensures the dataset is well-suited for analyzing genotype-by-environment interactions and developing robust prediction models applicable across diverse environmental conditions. In this research agronomically features were recorded. Theses features were containing plant height(PHI), awn length(AWL), shoot length(SHL), grain weight per spike(GWP), total biomass(BIO), number of grains(GRN), flag leaf area(FLA), flag leaf weight(FLW), Internode length(INTN), peduncle length(PEDL), grain shape(GRSH), awn weight(AWW), leaf weight under flag leaf(FLUW), Internode weight(INTW) and peduncle weight(PEDW).

Planting dates were set so the genotypes were exposed to various temperature and rainfall changes during planting and ripening. In this way, multiple data was obtained, and this issue became suitable for estimation and prediction.

In this study, the late sowing was determined based on the general meteorological data of the Gonbad Kavous region (Figs. 11, 12). Each line was planted in two rows, with a row spacing of 20 cm and a 270 seeds/m² plant density. Nutrient requirements were determined based on a soil test (Table 8). All crop management was done according to international practices and methods.

Table 8 Physical and chemical properties of used soil.

The susceptible cultivar ‘Afzal’ was planted between and around the outside of the field trial as disease spreader. Fertilizing and other crop care followed international protocols. Infected Barley leaves by B. graminis fsp. tritici-Gonbad isolates were collected during the respective trial seasons from barley fields at the Golestan Agricultural and Natural Resources Research and Education Center (AREEO). In greenhouse (20 ± 2 °C, 70–90% relative humidity under 16 h photoperiod), leaves of one-week-old seedlings of susceptible cultivar, Afzal, were subjected to inoculation by shaking the conidia of infected leaves over the healthy leaves of Afzal seedlings. After 8–10 days, various pots of susceptible cultivars with fully developed pathogen colonies were placed uniformly between lines in the field. Inoculation was performed in the fourth stage of Zadoks decimal code at the booting stage (Fig. 13).

The disease evaluation was conducted weekly after the first occurrence of disease symptoms in the susceptible cultivar using a double-digit scale (00–99)²³. Area Under Disease Progress Curve (AUDPC) was calculated as²⁴:

$$\text{AUDPC}={\sum }_{i=1}^{n}\frac{({x}_{i+1}+{x}_{i})({t}_{i+1}-{t}_{i})}{2}$$

(1)

where x_i: disease severity at t_i, xi + 1: disease severity at t_i+1, t_i+1-t_i: time intervals (days) between two records of the disease, n: number of recording times.

The measurement of genotypic data

To isolate genomic DNA, 3–4 fresh leaf samples were taken from random plants, and then the genomic DNA of the leaf samples was extracted using the CTAB method²⁵. Spectrophotometry and horizontal agarose gel electrophoresis on a 0.8% agarose gel were used to ensure sufficient quantity and quality of genomic DNA. The SSR markers were used as anchors in constructing a genetic map²⁶. In addition, Inter Simple Sequence Repeat (ISSR)²⁷, Random Amplified Polymorphic DNA (RAPD)²⁸, interprimer binding site (iPBS)²⁹, inter retrotransposon amplified polymorphism (IRAP)³⁰, Start Codon Target (SCoT), CAAT Box-Derived Polymorphism (CBDP)³¹, ISSR-iPBS combined and iPBS- iPBS combined were used. SSR markers were selected to have at least one marker on each chromosome arm (http://wheat.pw.usda.gov/GG3/). 10 µL of polymerase chain reaction (PCR) reaction mixture for SSR markers contained 2.5 µL DNA, 0.48 µL MgCl₂ (50 mM), 0.6 µL dNTP (10 mM), 0.75 µL pmol primer (10 µl primer), 0.75 µl revision primer (10 pmol), 1 µl PCR buffer, 0.12 µl Taq DNA polymerase enzyme (5 U/ml) and 3.8 µl sterile distilled water. The PCR reaction mix for the dominant markers was identical, except that 1.5 µl primers were used. PCR was performed using a Thermal Cycler (Bio-Rad, USA) under the contact heat program³². Vertical polyacrylamide gel electrophoresis on a 6% polyacrylamide gel was used to separate the layers, and gel staining was performed using the silver nitrate method³³. Finally 719 PolyMorphic bands were used as genotypic features.

Data preprocessing

Prior to the application of feature selection algorithms and machine learning models, the dataset underwent rigorous preprocessing and quality control procedures to ensure data integrity and consistency. These steps are Data Cleaning, Missing Value Handling, Data Transformation and Data Type Consistency.

Data cleaning: A comprehensive inspection of both phenotypic and genotypic data was performed to identify and rectify any inconsistencies, errors, or outliers. Erroneous data points, such as negative values for physical measurements or biologically impossible values, were removed from the dataset. This manual inspection and correction process ensured the removal of obvious errors that could negatively impact model training.

Missing value handling: The extent of missing values was assessed separately for phenotypic and genotypic datasets. For phenotypic data, missing values were minimal, representing less than 2% of the total data points. Given the random distribution of these missing values and their low overall proportion, we employed mean imputation. Specifically, missing values for a given feature were replaced with the mean value of that feature calculated across all samples within the same planting date and year. This approach was chosen to minimize bias, as the missingness was not concentrated in any particular group or feature. For genotypic data, represented as binary scores (presence/absence of a marker), missing values were coded as ‘0’, signifying the absence of the marker.

Data transformation: Phenotypic data, comprising continuous variables (e.g., plant height, leaf weight, awn length), were standardized using z-score normalization. This transformation ensures that all phenotypic features have a mean of 0 and a standard deviation of 1, preventing features with larger scales from disproportionately influencing the machine learning models. The formula for z-score normalization is: z = (x − μ)/σ, where x is the original value, μ is the mean of the feature, and σ is the standard deviation of the feature. Genotypic data, being binary, did not require normalization.

Data type consistency: We meticulously verified that all data types were consistent with their roles in the subsequent analyses. Phenotypic and disease severity data were confirmed as numeric and genotypic data were confirmed as binary. This step ensured compatibility with the chosen machine learning algorithms.

Feature selection

Feature selection algorithms reduce the dimensions of the input data. Due to constraints coupled with selected or ignored features and subset sizes, feature selection algorithms try to find a subset of features that predicts the ML output well. Increasing prediction performance and providing faster and more cost-effective prediction performances are the main advantages of feature selection. Even if all features are applicable and contain information on the output variable, using all features can reduce the overall prediction performance. Although feature selection techniques are established methodologies in machine learning and have been previously used in plant science, their systematic and comparative application within the context of disease resistance prediction in barley, particularly using combined genotypic and phenotypic data, represents a distinct aspect of this study. In this research, we employed three distinct feature selection algorithms – RReliefF, Minimum Redundancy Maximum Relevance (MRMR), and F-test – to comprehensively identify the most informative features for predicting powdery mildew resistance in barley.The RReliefF algorithm is an extension of the Relief algorithm, designed for feature selection in ML tasks. It is beneficial for dealing with datasets that have noisy, incomplete, or multi-class data. The algorithm works by assigning weights to features based on their capability to discriminate between instances that are near to each other. Features that are more relevant for predicting the output variable receive higher weights. At the same time, irrelevant features receive lower weights, which can then be used to select a subset of features for model training³⁴.

The Minimum Redundancy Maximum Relevance (MRMR) feature selection algorithm is used to identify a subset of highly relevant features. The selected features predict the output variable while being minimally redundant among themselves. The MRMR algorithm operates on mutual information, quantifying the information shared between features and the target variable and between the features themselves³⁵.

The F-Test feature selection algorithm is a statistical method for selecting features with significant predictive power regarding the output variable. It is based on the F-statistic, which is derived from ANOVA tests. The F-test evaluates the degree of variance between groups compared to the variance within groups for each feature individually. The F-test assesses the null hypothesis that the means of the different groups are equal, implying that the feature does not discriminate well between the groups. A low F-value suggests that the feature does not contribute much to the group differences.

In contrast, a high F-value indicates that the feature is a good discriminator and, thus, potentially useful for prediction. However, the F-Test may overlook features only relevant in combination with others because it is a univariate method. Despite this limitation, it remains a popular choice for the initial screening of features to reduce dimensionality before applying more complex feature selection methods or training ML models.

ML models

The DT, RF, NET, and GPR models are deployed in this paper. This Section presents a brief description of these models. The ML models are introduced more conceptually than mathematically. The mathematical explanations of the models can be found in textbooks^36,37. These machine learning models are widely recognized and used in various fields, including plant science, but our study’s novelty lies in their specific and comprehensive application to predict powdery mildew resistance in barley using a novel combination of feature selection methods and integrated genotypic and phenotypic datasets. Furthermore, we employed Bayesian optimization to fine-tune the hyperparameters of these models, enhancing their predictive performance specifically for the research context.

DTs are a non-parametric supervised learning method for classification and regression tasks. Mathematically, a DT is a flowchart-like structure where each node denotes a “test” on a feature, each branch is the result of the test, and each node symbolizes a class label. The paths from root to leaf represent prediction rules. In a DT model, the data is split according to a specific criterion selected to maximize the homogeneity of the resulting subsets. The most common criterion for splitting used for trees, where the target variable is continuous, is variance reduction. The goal is to find the split that reduces the variance of the target variable within the subsets. The variance reduction for a split is given by

$$VarReduction=Var\left(S\right)-{\sum }_{v\in Values\left(A\right)}\frac{\left|{S}_{v}\right|}{\left|S\right|}Var\left({S}_{v}\right)$$

(2)

where Var(S) is the variance of the target variable in subset S. The DT algorithm recursively parses the data until it meets a stopping criterion. The resulting model provides an interpretable representation of the decision-making process. To mitigate overfitting, techniques such as pruning (removing parts of the tree that do not give power to classify instances), setting a maximum depth for the tree, or requiring a minimum number of samples to split a node can be used³⁸.

The RF model is an ensemble learning method that operates by constructing many DTs at the training stage and outputting the class of the mean prediction of the individual trees in regression problems. Mathematically, an RF aggregates the predictions of several DTs to make more accurate and stable predictions than a single DT. An RF comprises many individual DTs that operate as an ensemble. The fundamental concept behind RF is a simple but powerful one. Many relatively uncorrelated models (trees) operating as a committee will outperform any individual constituent models. The low correlation between models is the key. So, the forest can perform well in a wide range of data sets, including those with high-dimensional feature spaces and complex data structures. In RFs, the tree construction process incorporates feature bagging (bootstrap aggregating), which involves randomly selecting a subset of features for consideration at each split point. This process de-correlates the trees by giving them different perspectives of the data to learn from, and hence, it ensures that the ensemble model does not overfit the training data. Mathematically, the prediction of an RF for a regression problem can be expressed as the average of the predictions of the N individual trees:

$$RF\left(x\right)=\frac{1}{N}{\sum }_{i=1}^{N}{T}_{i}\left(x\right)$$

(3)

where RF(x) is the prediction of the RF for input x, N is the number of trees, and ${T}_{i}\left(x\right)$ is the prediction of the ith tree. The RFs are robust against overfitting as they average out biases, can handle missing values, and are relatively unaffected by outliers, making them a powerful tool for a wide range of machine-learning tasks³⁹.

The NETs represent a category of ML models that draw inspiration from the biological structure and operational mechanisms of the human brain. In the context of regression problems, NETs aim to predict continuous outcomes by learning complex mappings from inputs to outputs. The fundamental building block of a NET is the neuron, or node, which receives inputs, applies a weighted sum, and then typically passes the result through a non-linear activation function. A simple NET for regression can be represented as follows: The input layer consists of neurons corresponding to the dataset’s features. If there are n features, there will be n input neurons. One or more hidden layers can exist, each comprising many neurons. The output of each neuron in a hidden layer is a function of the weighted sum of its inputs plus a bias term:

$${h}_{ij}=f\left({\sum }_{k=1}^{n}{w}_{ijk}\cdot {x}_{k}+{b}_{ij}\right)$$

(4)

where ${h}_{ij}$ is the output of the jth neuron in the ith hidden layer, f is the activation function, ${w}_{ijk}$ are the weights, ${x}_{k}$ are the input features and ${b}_{ij}$ is the bias term³⁸. The output layer produces the final prediction for the regression problem. In a single-output network, there is only one neuron in this layer, and its output is the predicted value:

$$y=f\left({\sum }_{j=1}^{m}{w}_{oj}\cdot {h}_{j}+{b}_{o}\right)$$

(5)

where y is the predicted output, ${w}_{oj}$ are the weights from the last hidden layer to the output neuron, ${h}_{j}$ are the outputs of the last hidden layer’s neurons, ${b}_{o}$ is the output neuron’s bias, and m is the number of neurons in the last hidden layer. The activation function f introduces non-linearity into the model, allowing the network to capture complex relationships. Common choices for f include the sigmoid, tanh, and ReLU functions. The training process for a NET entails the iterative adjustment of weights and biases. This is performed to minimize a predefined loss function, which quantifies the discrepancy between the model’s predicted outputs and the corresponding ground-truth values. Gradient descent, often with backpropagation, is a common optimization technique used to perform this adjustment iteratively. NETs can be extended to include multiple hidden layers (deep learning), various types of layers (like convolutional or recurrent layers), and different architectures tailored to specific types of regression problems. The flexibility and capacity of NETs to model non-linear relationships make them powerful tools for regression analysis⁴⁰.

The GPR is a non-parametric, Bayesian approach to regression that is particularly powerful for modeling unknown functions and making predictions with an associated measure of uncertainty. The foundation of GPR lies in the assumption that the function values being predicted can be described by a Gaussian process—a collection of random variables, any finite number of which have a joint Gaussian distribution. Mathematically, a Gaussian process is fully specified by its mean function m(x) and covariance function $k\left(x.{x}{\prime}\right)$, also known as the kernel. The mean function represents the average prediction for the function at point x, and the covariance function encodes assumptions about the function’s smoothness and how points in the input space relate to each other. Given a set of observations $y$ at locations {X}, GPR infers the distribution over functions consistent with these observations. The predictive distribution for a new input ${x}_{*}$ is also Gaussian, with the mean and variance:

$$[\upmu \left({x}_{*}\right)={k}_{*}^{T}{\left(K+{\upsigma }_{n}^{2}I\right)}^{-1}y]$$

(6)

$$[{\upsigma }^{2}\left({x}_{*}\right)=k\left({x}_{*}.{x}_{*}\right)-{k}_{*}^{T}{\left(K+{\upsigma }_{n}^{2}I\right)}^{-1}{k}_{*}]$$

(7)

where ${k}_{*}$ is the vector of covariances between the new point and the training points, K is the covariance matrix of the training points, and the ${\sigma }_{n}^{2}$ is the noise variance. The GPR has several advantages, including uncertainty estimates for predictions and flexibility in choosing the kernel to encode prior beliefs about the function’s properties. However, it also faces computational challenges, particularly for large datasets, due to the inversion of the covariance matrix⁴¹.

although the individual machine learning models (Decision Tree, Random Forest, Neural Network, Gaussian Process Regression), feature selection algorithms (RReliefF, MRMR, F-test), and the Bayesian optimization approach for hyperparameter tuning employed in this study are established techniques, their specific integration and application within the context of powdery mildew resistance prediction in barley constitutes a novel methodological framework. Unlike previous studies that may have utilized subsets of these methods in isolation, our study systematically combines and compares a comprehensive suite of feature selection algorithms and machine learning models, optimized through Bayesian methods, to identify the most effective approach for predicting powdery mildew resistance from combined phenotypic and genotypic data. This holistic methodological approach, focused on comparative evaluation and optimized integration, differentiates this study and contributes to the advancement of machine learning applications in plant disease resistance research.

Evaluation of the ML models

To statistically validate the performance differences between the models, we utilized Friedman’s test. Friedman’s test is a non-parametric statistical test suitable for comparing the performance of multiple models⁴². In this study, it was employed for pairwise comparisons of the ML models’ performance across different evaluation metrics. This study used 70% of the data for training and 30% for testing. Data validation was done in the training phase using five-fold cross-validation. Using RMSE, MAE, MSE, and R² criteria, the model’s performance is evaluated in each training and testing stage.

$$MAE=\frac{1}{n}\sum_{i=1}^{n}\left|\frac{{y}_{i}-\overline{{y }_{i}}}{{y}_{i}}\right|$$

(8)

$$RMSE=\sqrt{\frac{1}{n}\sum_{i=1}^{n}{\left({y}_{i}-\overline{{y }_{i}}\right)}^{2}}$$

(9)

$${R}^{2}=1-\frac{\sum_{i=1}^{n}{\left({y}_{i}-\overline{{y }_{i}}\right)}^{2}}{\sum_{i=1}^{n}{\left({y}_{1}-{y}_{ave}\right)}^{2}}$$

(10)

In these equations, ${y}_{i} and \overline{{y }_{i}}$ are predicted value and actual value, ${y}_{ave}$ is the average of data set values, and n is the number of observations.

Source link