Study area
This research was conducted in Fars Province, located in southern Iran (27°02′–31°42′ N, 50°42′–55°36′ E) (Fig. 1), covering an area of approximately 122,000 km². The province exhibits diverse topographic and ecological conditions, ranging from arid lowlands to temperate highlands, making it an ecologically heterogeneous region well-suited for medicinal plant research. Elevations vary significantly from approximately 450 m in the southwestern lowlands to over 4,000 m in the Zagros Mountain range, creating substantial spatial gradients that influence vegetation patterns, soil composition, and the distribution of medicinal plants27,28. N. persica primarily thrives in montane and foothill ecosystems, where moderate water availability, well-drained soils, and microclimatic stability promote its growth and secondary metabolite production29. The dominant soil types in the study area were calcareous soils with sandy-loam to clay-loam textures, moderate organic matter content, and slightly alkaline pH, factors known to influence secondary metabolite biosynthesis. Hydrological features, including seasonal rivers, ephemeral streams, and groundwater resources, further contribute to habitat heterogeneity and ecological dynamics30,31. Given its environmental variability and ecological significance, Fars Province serves as an ideal case study for assessing how climatic, edaphic, and topographic factors drive habitat suitability and secondary metabolite accumulation in N. persica.

Study area in Fars province, Southwest Iran (ArcGIS Desktop 10.8 (ESRI, Redlands, CA, USA). Available at: https://www.esri.com/en-us/arcgis/products/arcgis-desktop/overview).
Methodology
This study, conducted in Fars Province, Iran, identified N. persica at 62 locations. Secondary metabolites were extracted using a Clevenger apparatus and analyzed via gas chromatography (GC) and gas chromatography-mass spectrometry (GC-MS). A total of 18 environmental variables—encompassing climatic, topographic, and edaphic factors—were processed and converted into raster layers with 30-meter resolution using GIS (ArcGIS Desktop 10.8). Four machine learning models were used to predict nepetalactone concentration: random forest (RF), support vector machine (SVM), gradient boosting machine (GBM), and a hybrid ensemble model (RF–SVM–GBM). The most influential environmental drivers were identified using partial least squares (PLS) regression and generalized linear models (GLM). Model performance was assessed using four evaluation metrics: root mean square error (RMSE), mean absolute error (MAE), coefficient of determination (R²), and concordance correlation coefficient (CCC). The final habitat suitability map highlighted optimal ecological zones for high nepetalactone accumulation across Fars Province. Figure 2 illustrates the overall methodology of the study.

Overview of the research methodology.
Habitat identification and data collection
Field sampling was conducted in Fars Province, Iran, to identify and document the natural habitat of N. persica. Preliminary distribution data were obtained from the provincial Agricultural Jihad Organization, guiding targeted field surveys. A stratified sampling approach was used to cover diverse environmental conditions, ensuring comprehensive habitat representation. At each confirmed location, geographic coordinates were recorded using GPS, and both plant and soil samples were collected (Fig. 3). Aerial parts of the plant, including leaves and flowers, were harvested at their peak vegetative stage, while the soil samples were taken from the same points. The samples were transported to the laboratory for chemical analysis. This method ensured precise documentation of N. persica distribution while maintaining scientific rigor in sample collection and habitat characterization.
Plant sampling was conducted during May and June 2024, strategically aligned with the reproductive phase of N. persica, a stage well-documented for the peak accumulation of secondary metabolites, including nepetalactone32,33. Considering the topographic and climatic heterogeneity of Fars Province, sampling commenced in the warmer lowland areas during early May and gradually progressed towards higher elevation, cooler regions throughout June. This approach ensured that all samples were collected during a comparable developmental stage of the species, thereby minimizing phenological variability in metabolite profiles.
A total of 62 sampling sites were selected based on the confirmed occurrence records of N. persica, derived from field surveys and distributional data provided by the Agricultural Jihad Organization of Fars Province. It is important to note that the species exhibits a naturally fragmented and ecologically constrained distribution, which inherently limits the number and spatial density of viable sampling locations. Despite these limitations, previous studies have demonstrated that sample sizes ranging from 50 to 100, when properly distributed, can provide statistically robust insights into the spatial patterns of plant secondary metabolite production, even in ecologically complex regions34,35.

Field sampling and habitat identification of N. persica in Fars Province, including GPS data recording, plant collection, and soil sampling (Photo by: Dr. Emran Dastres).
Isolation and analysis of essential oils
The aerial parts of N. persica were shade-dried at ambient temperature (25 ± 2 °C). Fifty grams of the dried material were subjected to hydrodistillation with 500 mL of distilled water using a Clevenger-type apparatus36. The isolation was conducted for 4 h under standardized conditions (100 °C, atmospheric pressure) to maximize concentration while minimizing artifact formation. The obtained essential oils were analyzed by gas chromatography (GC) coupled with flame ionization detection (GC-FID) for quantification and gas chromatography-mass spectrometry (GC-MS) for compound identification37. GC-MS spectra were cross-referenced with the NIST 2020 mass spectral library and authentic standards where available. Special emphasis was placed on nepetalactone due to its established bioactivity, though the full volatile profile was characterized to assess compositional diversity. This dual analytical approach provided comprehensive qualitative and quantitative data on the essential oil constituents, ensuring reliable metabolite identification.
Preparing environmental factor maps
To assess the potential distribution of suitable habitats for N. persica across the study area, we evaluated three environmental variables: climatic, topographic, and edaphic factors. From these categories, 18 variables were meticulously selected based on their ecological significance to the species’ growth and the reliability of the data sources. The selection process was guided by ecological principles and the feasibility of obtaining consistent data, ensuring that the variables accurately reflected the environmental conditions impacting the species’ habitat preferences38. This thorough approach was vital for improving the precision and reliability of the predictive model.
Edaphic factors
In this research, 62 soil samples were collected from various N. persica habitats, with depths ranging from 0 to 30 cm. These samples were air-dried at room temperature and sifted through a 2-mm mesh to prepare them for laboratory analysis. Laboratory tests were performed to assess soil characteristics, including organic matter (OM), electrical conductivity (EC), pH, and the proportions of sand, silt, clay, nitrogen (N), phosphorus (P), and potassium (K). Soil texture was analyzed using the hydrometer method39while pH was determined with an electronic pH meter40. Nitrogen content was measured using the Kjeldahl method41potassium was extracted using ammonium acetate42phosphorus was evaluated via the Olsen method43and organic matter was quantified using the Walkley-Black technique44. Electrical conductivity was determined by creating a soil-water suspension and analyzing it with an EC meter45.
The laboratory-derived soil properties were spatially interpolated using the Inverse Distance Weighting (IDW) method to generate high-resolution thematic maps for the study area46. A total of nine raster maps were produced, representing key soil variables such as organic matter (OM), electrical conductivity (EC), pH, and the proportions of sand, silt, clay, nitrogen (N), phosphorus (P), and potassium (K) (Fig. 4A–I). These maps provided a detailed spatial representation of soil characteristics, enhancing the environmental predictor dataset for subsequent habitat suitability modeling of N. persica.
Given the naturally fragmented distribution of N. persica populations, coupled with the rugged, mountainous terrain of the study area, the spatial distribution of samples was inherently irregular and limited in density. As a result, the application of geostatistical methods such as Kriging, which requires a sufficiently dense, homogeneous, and regularly spaced dataset to construct a reliable variogram, was not feasible.
Therefore, IDW was selected as the sole interpolation technique, in line with established recommendations for ecological studies under constrained sampling conditions47,48,49. IDW provides deterministic, distance-weighted estimations without relying on strict assumptions regarding spatial autocorrelation, making it particularly appropriate for heterogeneous landscapes.
To evaluate the accuracy of the interpolation outputs, a leave-one-out cross-validation procedure was implemented for each soil variable. The resulting Root Mean Square Error (RMSE) values ranged from ± 0.12 to ± 0.28 standardized units across different soil parameters, indicating acceptable interpolation accuracy for ecological modeling purposes in mountainous environments.
Climatic factors
To generate high-resolution spatial maps of temperature and precipitation across the study area, we utilized Sentinel-2 multispectral satellite imagery, complemented by ground-based meteorological data from the Iranian Meteorological Organization (IRIMO). The Sentinel-2 system provides 13 spectral bands, including visible, near-infrared, and shortwave infrared (SWIR) wavelengths, which are essential for estimating key environmental variables50.
All satellite imagery and ground observations used in this study correspond to a five-year reference period (2019–2023), ensuring that the derived climate variables reflect multi-year average conditions rather than short-term anomalies.
In total, 480 cloud-free Sentinel-2 scenes were acquired and processed using the Google Earth Engine (GEE) platform and Sentinel Hub services. Atmospheric correction was applied using the Sen2Cor algorithm to minimize the effects of atmospheric scattering and cloud contamination51. All imagery was resampled to a consistent spatial resolution of 30 m and clipped to the study area boundary for further analysis.
Although Sentinel-2 does not carry dedicated thermal infrared sensors, several studies have demonstrated that Land Surface Temperature (LST) can be effectively estimated using SWIR bands (Bands 11 and 12) and ancillary topographic data52.
We applied a regionally calibrated empirical model based on the split-window algorithm expressed as:
$${\text{LST}}\,=\,{\text{f }}\left( {{\text{B11}},{\text{ B12}},{\text{ Elevation}},{\text{ Time of Acquisition}}} \right)$$
(1)
Where:
B11 and B12 represent Sentinel-2 shortwave infrared bands, sensitive to surface moisture and heat emission; Elevation was derived from a Digital Elevation Model (DEM); Time of Acquisition accounts for solar angle effects on surface heating.
The resulting LST maps represent the multi-year mean annual surface temperature for the period 2019–2023. The accuracy of LST estimation was assessed based on prior validation studies in similar mountainous terrains, with typical uncertainties ranging from ± 1.5 °C to ± 2 °C.
Precipitation, although not directly measured by Sentinel-2, was estimated using a vegetation-based proxy model, calibrated with local ground observations.
The Normalized Difference Vegetation Index (NDVI), derived from Bands 4 (Red) and 8 (Near-Infrared), serves as an indicator of vegetation productivity, which correlates with water availability. Additionally, the SWIR band (B11) was used as a proxy for atmospheric moisture. The precipitation model is expressed as:
$$\:P=a\times\:NDVI+b\times\:B11+C$$
(2)
Where:
P is the estimated mean annual precipitation (mm/year); a, b, and c are empirical coefficients derived from calibration with ground-based rainfall data.
Calibration of the precipitation model utilized observational records from 32 synoptic meteorological stations distributed across Fars Province. These stations ensure broad spatial coverage, capturing variations across different elevations and climatic zones.
The precipitation model achieved a coefficient of determination (R²) of 0.76 and a Root Mean Square Error (RMSE) of ± 47 mm/year, indicating acceptable predictive accuracy for large-scale ecological modeling in arid and semi-arid environments. The estimated precipitation maps correspond to the multi-year mean annual precipitation for the period 2019–2023.
All final temperature and precipitation maps were produced in raster format with a 30-meter spatial resolution and integrated into the GIS database for further spatial analysis (see Fig. 4J, K).
Topographic factors
Topographic features play a crucial role in environmental and ecological modeling as they influence climate conditions, soil properties, vegetation distribution, and hydrological processes53. In this research, Alaska satellite images were utilized to extract key topographic parameters, including elevation, slope, slope direction (aspect), and surface curvature. The Alaska satellite images provided high-resolution Digital Elevation Models (DEMs), which enabled the precise extraction of topographic factors54.
Elevation represented the height above sea level and was directly extracted from the DEM raster layer. Slope measured the steepness or inclination of the terrain and was calculated as the rate of change in elevation. Aspect defined the direction a slope was facing, measured in degrees from north (0°) to west (360°). Curvature describes the concavity or convexity of the land surface, which influences water flow and soil erosion.
To account for anthropogenic impacts on habitat dynamics, road and river network vector data were integrated into the analysis. Using ArcGIS’s Euclidean distance algorithm, these linear features were transformed into continuous raster surfaces, quantifying proximity to the nearest road and river. The derived layers were resampled to a consistent 30-meter spatial resolution to align with other factors. All input layers underwent z-score standardization to ensure comparability and facilitate their incorporation into machine learning-based habitat suitability models. Following spatial harmonization (coordinate system alignment and clipping to the study area boundary), the anthropogenic and topographic predictors were incorporated into the analytical framework (Fig. 4L and R).
18 raster layers represent key variables categorized into three groups: Edaphic (A: Clay percent; B: Silt percent; C: Sand percent; D: Electrical conductivity; E: pH; F: Phosphorus; G: Potassium; H: Nitrogen; I: Organic matter); Climatic (J: Mean annual rainfall; K: Mean annual temperature); Topographic (L: Slope aspect; M: Elevation; N: Plan curvature; O: Profile curvature; P: Slope degree; Q: Distance from rivers; R: Distance from roads). (ArcGIS Desktop 10.8 (ESRI, Redlands, CA, USA). Available at: https://www.esri.com/en-us/arcgis/products/arcgis-desktop/overview).
Habitat suitability prediction
Modeling using the random forest algorithm
The random forest (RF) algorithm, an ensemble learning method introduced by Breiman55 was employed to model the habitat suitability of N. persica based on multiple environmental predictors. RF constructs an ensemble of decision trees, each trained on a bootstrap sample of the dataset, and combines their outputs to improve predictive performance21. The algorithm is particularly effective for handling high-dimensional and nonlinear relationships between predictors, making it well-suited for ecological and environmental modeling56.
RF was implemented using the random Forest package (https://cran.r-project.org/web/packages/randomForest/index.html) in R (version 4.3.2) with 500 trees. The optimal number of predictor variables randomly selected at each split was determined through cross-validation to minimize the Out-Of-Bag (OOB) error.
RF constructs an ensemble of decision trees T1, T2, …, Tn, where each tree Ti is trained on a bootstrap sample. The final prediction for an input x is determined by aggregating the outputs of individual trees:
$$\:{\hat{y}}=\frac{1}{n}\sum\:_{i=1}^{n}{T}_{i}\left(x\right)$$
(3)
where n represents the number of trees. For classification tasks, majority voting is used, whereas for regression tasks, the average of tree predictions is taken. The importance of each predictor variable Xj is quantified based on its contribution to model accuracy:
$$\:VI\left({X}_{j}\right)=\frac{1}{n}\sum\:_{i=1}^{n}({A}_{i}-{A}_{i}^{*})$$
(4)
Where \(\:{A}_{i}\:\)is the accuracy of tree iii before permutation and \(\:{A}_{i}^{*}\) is the accuracy after permuting \(\:{X}_{j}\).
Modeling using the gradient boosting machine
Gradient Boosting Machines (GBM) is an ensemble learning method that builds predictive models sequentially by optimizing weak learners, typically decision trees, to minimize a predefined loss function57. Unlike Random Forest, which builds independent trees, GBM constructs trees iteratively, where each tree corrects the errors of the previous ones, improving model performance58. This approach makes GBM highly effective for complex, nonlinear ecological and environmental modeling tasks25.
GBM was implemented using the xgboost package (https://cran.r-project.org/web/packages/xgboost/index.html) in R (version 4.3.2), with hyperparameters tuned via cross-validation. The learning rate, maximum depth of trees, and the number of boosting iterations were optimized to enhance model accuracy while preventing overfitting.
In GBM, the prediction function is formulated as a sum of weak learners:
$$\:{F}_{m}\left(x\right)={F}_{m-1}\left(x\right)+{\gamma\:h}_{m}\left(x\right)$$
(5)
where \(\:{F}_{m}\left(x\right)\:\)is the model at the \(\:m-th\) iteration, is the weak learner, and \(\:\gamma\:\) is the learning rate. The model minimizes a loss function \(\:{h}_{m}\left(x\right)\) using gradient descent:
$$\:{F}_{m}\left(x\right)=argmin\sum\:_{i=1}^{n}L(yi.{F}_{m-1}{(x}_{i})+h{(x}_{i}))$$
(6)
Variable importance in GBM is determined using the gain metric, representing the contribution of each predictor to reducing the loss function at each split in the decision trees.
Modeling using the support vector machines
Support vector machines (SVM) is a supervised learning algorithm that constructs a hyperplane in a high-dimensional space to separate data into different classes or predict continuous values in regression tasks59. SVM is particularly effective for handling small datasets with high-dimensional feature spaces, making it suitable for ecological modeling24.
SVM was implemented using the e1071 package (https://cran.r-project.org/web/packages/e1071/index.html) in R (version 4.3.2), with the radial basis function (RBF) kernel to capture complex nonlinear relationships between environmental predictors and habitat suitability. The optimal values for the regularization parameter \(\:C\) and kernel width \(\:\gamma\:\) were determined via cross-validation.
For classification tasks, SVM finds an optimal hyperplane that maximizes the margin between classes:
$${\text{max}}\frac{2}{{\left\| {\mathbf{w}} \right\|~}}.~subject~to~yi\left( {{\mathbf{w}}.xi + b} \right) \ge 1.~\forall i$$
(7)
where w is the weight vector, \(\:b\) is the bias term, and \(\:yi\) represents class labels.
For regression tasks (support vector regression, SVR), the model minimizes the following objective function:
$$min\frac{1}{2}~\left\| {\mathbf{w}} \right\|^{2} + C\mathop \sum \limits_{{i = 1}}^{n} \max \left( {0.~\left| {yi – \left( {{\mathbf{w}}.~xi + b} \right)} \right| – ~ \in } \right)$$
(8)
where \(\:C\) controls the trade-off between model complexity and error tolerance \(\:\in\:\). The importance of predictor variables in SVM was assessed using sensitivity analysis, evaluating how variations in each feature affect model predictions.
Before model training, all predictor variables were screened for multi-collinearity using two complementary diagnostic tests: The Variance Inflation Factor (VIF) and Tolerance values. Variables with a VIF exceeding 5 or a Tolerance below 0.2 were considered indicative of significant multi-collinearity60. Such variables were either excluded from the final modeling dataset or combined with other correlated predictors to reduce redundancy. This pre-processing step ensured the stability and interpretability of the models while minimizing distortions in variable importance rankings.
Hybrid ensemble modeling
Hybrid ensemble models integrate multiple machine learning algorithms—such as random forest (RF), support vector machines (SVM), and Gradient Boosting Machines (GBM)—to enhance predictive accuracy and generalization. These models leverage the complementary strengths of individual algorithms while mitigating their limitations by combining outputs through techniques such as weighted averaging or stacking61.
In this research, hybrid modeling was implemented using the SuperLearner package (https://cran.r-project.org/web/packages/SuperLearner/index.html) in R (version 4.3.2). Two ensemble strategies were applied:
Weighted averaging
The final prediction (\(\:{\hat{y}}\)) was computed as a weighted sum of the individual model predictions:
$$\:{\hat{y}}={w}_{1}*{\hat{y}}\_RF+{w}_{2}*{\hat{y}}\_SVM+{w}_{3}*{\hat{y}}\_GBM$$
(9)
Where, \(\:{\hat{y}}\_RF\), \(\:{\hat{y}}\_SVM\), and \(\:{\hat{y}}\_GBM\) are predictions from RF, SVM, and GBM models, respectively; \(\:{w}_{1}\), \(\:{w}_{2}\), and \(\:{w}_{3}\) are non-negative weights summing to 1, optimized through grid search to minimize the Root Mean Square Error (RMSE) on a held-out validation set (20% of the data).
The optimal weights obtained were: w₁ = 0.46 (RF), w₂ = 0.29 (SVM), and w₃ = 0.25 (GBM).
Stacking (meta-learner approach)
In stacking, a meta-learner combines base model predictions to refine final outputs:
$$\:{{\hat{y}}}_{i}=f({{\hat{y}}}_{i}^{RF}+{{\hat{y}}}_{i}^{SVM}+{{\hat{y}}}_{i}^{GBM})$$
(10)
Where \(\:f\) represents the meta-learner function. A linear regression model was used as the meta-learner due to its interpretability and low susceptibility to overfitting62.
To prevent information leakage and overfitting, we employed a nested cross-validation scheme: An outer 5-fold cross-validation split the dataset into training and testing subsets; Within each training fold, an inner 5-fold cross-validation optimized both base models and meta-learner parameters.
This approach ensures that the meta-learner is trained exclusively on out-of-fold predictions, thus maintaining model integrity and unbiased performance estimation.
Machine learning model optimization and hyperparameter tuning
To ensure model transparency and reproducibility, all machine learning algorithms underwent systematic hyperparameter optimization using a grid search combined with 5-fold cross-validation63. The optimized parameter values for each algorithm are summarized in (Table 1). Specifically, random forest (RF), support vector machine (SVM), and gradient boosting machines (GBM) were individually tuned to maximize predictive performance based on minimizing the Root Mean Square Error (RMSE). Subsequently, an ensemble hybrid model (RF-SVM-GBM) was constructed by integrating the outputs of the three models using a weighted averaging approach, where model weights were assigned proportionally to their coefficient of determination (R²) values on the validation dataset.
Pinpointing the dominant factor
Partial least squares (PLS)
The Partial least squares (PLS) is a multivariate statistical method used to model relationships between independent variables (predictors) and dependent variables (responses)64. Unlike traditional regression techniques, PLS is particularly effective when predictor variables are highly collinear, making it suitable for ecological studies where environmental factors are often correlated65. PLS was applied using the pls package (https://cran.r-project.org/web/packages/pls/index.html) in R (version 4.3.2) to identify the most influential environmental factors affecting the presence of N. persica. The optimal number of latent components was determined using cross-validation to minimize prediction error. The importance of each variable was assessed using Variable Importance in Projection (VIP) scores, where variables with VIP > 1 were considered significant contributors.
PLS reduces dimensionality by extracting latent components (score vectors) that maximize covariance between predictors \(\:\stackrel{-}{X}\) and \(\:\stackrel{-}{Y}\) response. The method follows these steps:
$$\:X={TP}^{T}+E\:\:\:$$
(11)
Where in \(\:X\) is the predictor matrix, \(\:Y\) is the response matrix, \(\:T\) represents score vectors, \(\:P\) and \(\:C\) are loading matrices, \(\:E\) and \(\:F\) are residual matrices.
PLS finds the projection that maximizes the covariance between \(\:\stackrel{-}{T}\)and \(\:\stackrel{-}{Y}\). The regression coefficient is computed as:
$$\:\widehat{\beta}={\left({X}^{T}X\right)}^{-1}{X}^{T}X$$
(13)
\(\:\widehat{\beta}\) represents the estimated regression coefficients. The importance of each predictor \(\:{X}_{j}\) is evaluated using:
$$\:{VIP}_{j}=\sqrt{p\sum\:_{h=1}^{H}\left(\frac{{S}_{h}{W}_{jh}^{2}}{{\sum\:}_{h=1}^{H}{S}_{h}}\right)}$$
(14)
Where \(\:{S}_{h}\) is the explained variance for component \(\:h\), \(\:{W}_{jh}\)is the weight of predictor\(\:{X}_{j}\) in component \(\:h\), \(\:p\) is the total number of predictors.
To determine the optimal number of latent components, we implemented a leave-one-out cross-validation strategy, minimizing the Root Mean Square Error of Prediction (RMSEP). Model performance was assessed using multiple evaluation metrics, including the coefficient of determination (R²), predictive ability (Q²), and RMSEP values. The importance of each predictor was quantified using Variable Importance in Projection (VIP) scores, with variables exceeding a VIP threshold of 1 considered significant contributors. Additionally, a Scree Plot was generated to visualize the variance explained by each PLS component, enhancing the interpretability of the dimensionality reduction process.
Generalized linear model (GLM)
The Generalized Linear Model (GLM) extends the capabilities of traditional linear regression by allowing the response variable to follow distributions beyond the normal, providing flexibility in ecological and environmental modeling where data may deviate from ideal statistical assumptions66,67.
GLM was applied to evaluate the influence of environmental variables on the continuous concentration of nepetalactone in N. persica. The model was implemented using the glm package (https://cran.r-project.org/web/packages/glm.predict/index.html) in R (version 4.3.2).
Given the continuous nature of the response variable, a Gaussian distribution with an identity link function was used. This approach is mathematically equivalent to standard linear regression but was adopted within the GLM framework to accommodate potential deviations from constant variance or normality assumptions in the residuals. The GLM formulation follows:
$$\:g\left(E\left[Y\right]\right)={\beta}_{0}+{\beta}_{1}{X}_{1}+{\beta}_{2}{X}_{2}+\dots\:+{\beta}_{p}{X}_{p}$$
(15)
Where \(\:g\left(0\right)\) is the link function (for continuous responses), \(\:E\left[Y\right]\) is the expected nepetalactone concentration, \(\:{X}_{1}\:to\:{X}_{p}\) represent the environmental predictors, \(\:{\beta}_{0}\) is the intercept, \(\:{\beta}_{i}\) are the regression coefficients.
Model parameters were estimated using maximum likelihood estimation (MLE). The statistical significance of each predictor was evaluated using the Wald test, computed as:
$$\:W=\frac{{\widehat{\beta}\:}_{j}}{SE\left({\widehat{\beta}\:}_{j}\right)}$$
(16)
Where \(\:{\widehat{\beta}\:}_{j}\) is the estimated coefficient for predictor \(\:{X}_{p}\), and SE(\(\:{\widehat{\beta}\:}_{j}\)) is its standard error. Predictors with statistically significant coefficients were interpreted as key environmental drivers of nepetalactone biosynthesis.
This GLM analysis complemented the PLS results by providing an independent statistical assessment of the relationship between environmental variables and metabolite concentration, consistent with established approaches in ecological research68.
