Exogenous variable driven deep learning models for improved price forecasting of TOP crops in India



Autoregressive integrated moving average with exogenous inputs (ARIMAX)

ARIMAX, an extension of the well-known ARIMA model, integrates external predictors to enhance time series forecasting. ARIMA models are effective in capturing temporal patterns within data, denoted as ARIMA \(\left(p,d, q\right)\) where ‘p’ signifies autoregressive order, ‘d’ represents differencing order, and ‘q’ indicates moving average order. ARIMAX introduces exogenous variables, external factors influencing the time series data, broadening its applicability. ARIMAX models are estimated by fitting the autoregressive, differencing, and moving average components to the historical time series data and incorporating the exogenous variables. The inclusion of exogenous inputs allows the model to account for external factors that can impact the time series, enhancing the accuracy of the forecasts23. ARIMAX models are particularly useful when the time series data exhibit a clear temporal pattern, and there are additional variables that can contribute valuable information for prediction.

The ARIMAX \(\left(p,d, q\right)\) model can be expressed mathematically as:

$${Y}_{t}=c+{\phi }_{1}{Y}_{t-1}+{\phi }_{2}{Y}_{t-2}+\dots +{\phi }_{p}{Y}_{t-p}+{\theta }_{1}{\varepsilon }_{t-1}+{\theta }_{2}{\varepsilon }_{t-2}+\dots +{\theta }_{q}{\varepsilon }_{t-q}+{X}_{t}\beta +{\varepsilon }_{t}$$

(1)

Here, \({Y}_{t}\) represents the observed value at time \(t\), \(c\) is a constant term, \({\phi }_{1}, {\phi }_{2},\dots , {\phi }_{p}\) are autoregressive coefficients, \({\varepsilon }_{t-1}, {\varepsilon }_{t-2}, \dots ,{\varepsilon }_{t-q}\) are error terms from past time steps, \({X}_{t}\) represents the exogenous input variables at time \(t\), \(\beta\) represents the coefficients for the exogenous variables, and \({\varepsilon }_{t}\) is the error term at time \(t\).
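To make the estimation concrete, the following minimal Python sketch fits an ARIMAX model using statsmodels' SARIMAX class, which reduces to ARIMAX\((p,d,q)\) when no seasonal order is supplied. The file name, column names, and the (1, 1, 1) order are illustrative assumptions rather than values from this study.

```python
# Minimal ARIMAX sketch using statsmodels' SARIMAX class, which reduces to
# ARIMAX(p, d, q) when no seasonal order is supplied. Series/column names
# and the (1, 1, 1) order are illustrative assumptions, not values from the study.
import pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX

df = pd.read_csv("prices.csv", parse_dates=["date"], index_col="date")  # hypothetical file
y = df["modal_price"]                      # target series Y_t
X = df[["rainfall", "arrivals"]]           # exogenous regressors X_t

# Fit ARIMAX(1, 1, 1): AR, differencing, and MA terms plus exogenous coefficients beta
model = SARIMAX(y, exog=X, order=(1, 1, 1))
fit = model.fit(disp=False)

# Forecast the next 12 steps; future exogenous values must be supplied
X_future = X.iloc[-12:]                    # placeholder for known/assumed future X
forecast = fit.forecast(steps=12, exog=X_future)
print(forecast.head())
```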

Multiple linear regression (MLR)

MLR is a fundamental statistical method used for modeling the relationship between a dependent variable \(\left(y\right)\) and two or more independent variables \(\left({x}_{1}, {x}_{2},\dots , {x}_{n}\right)\). The model assumes a linear relationship between the predictors and the response variable, represented mathematically as:

$$y={\beta }_{0}+{\beta }_{1}{x}_{1}+{\beta }_{2}{x}_{2}+\dots +{\beta }_{n}{x}_{n}+\varepsilon$$

(2)

Here, \({\beta }_{0}\) is the intercept, \({\beta }_{1},{\beta }_{2}, \dots , {\beta }_{n}\) are the coefficients representing the influence of each independent variable \({x}_{1}, {x}_{2}, \dots ,{x}_{n}\), and \(\varepsilon\) represents the error term accounting for unexplained variability in the data. The goal of Multiple Linear Regression is to estimate the coefficients \(\left({\beta }_{0}, {\beta }_{1},{\beta }_{2}, \dots , {\beta }_{n}\right)\) that minimize the sum of squared differences between the observed \(\left(y\right)\) and predicted \(\left(\widehat{y}\right)\) values.

The regression coefficients are estimated using the method of least squares, where the objective is to minimize the residual sum of squares (RSS), defined as:

$$RSS=\sum_{i=1}^{N}{\left({y}_{i}-{\widehat{y}}_{i}\right)}^{2}$$

(3)

where \(N\) is the number of observations. The regression model calculates the predicted values \({\widehat{y}}_{i}\) by multiplying each independent variable \(\left({x}_{1}, {x}_{2}, \dots ,{x}_{n}\right)\) by its corresponding coefficient \(\left({\beta }_{1},{\beta }_{2}, \dots , {\beta }_{n}\right)\) and adding the intercept \(\left({\beta }_{0}\right)\); the error term \(\varepsilon\) accounts for the remaining unexplained variation. MLR is widely used in various fields to understand the relationships between multiple variables, making it a valuable tool for predictive modeling and data analysis7,41.
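A minimal illustration of least-squares estimation with scikit-learn on synthetic data follows; the coefficients and noise level are arbitrary assumptions chosen only to show how the RSS of Eq. (3) is minimized.

```python
# Minimal multiple linear regression sketch: ordinary least squares minimizes the
# residual sum of squares (RSS) in Eq. (3). Feature values are illustrative assumptions.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                    # x1, x2, x3
y = 2.0 + 1.5 * X[:, 0] - 0.8 * X[:, 1] + 0.3 * X[:, 2] + rng.normal(scale=0.1, size=100)

mlr = LinearRegression().fit(X, y)               # estimates beta_0 ... beta_n by least squares
y_hat = mlr.predict(X)
rss = np.sum((y - y_hat) ** 2)                   # Eq. (3)
print(mlr.intercept_, mlr.coef_, rss)
```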

Artificial neural networks (ANN)

ANNs represent a category of ML models inspired by the interconnected neurons in the human brain. Particularly proficient in regression tasks, ANNs excel at capturing intricate patterns within data. Comprising layers of interconnected nodes, ANNs include an input layer, one or more hidden layers, and an output layer41. Every connection between nodes possesses a specific weight, and each node processes the weighted sum of its inputs through an activation function.

For regression, the output layer typically consists of a single node, representing the predicted continuous value \(\left(\widehat{y}\right)\). During training, ANNs adjust their weights through a process called backpropagation42,43. This involves computing the error between the predicted output and the actual target values \(\left(y\right)\) and then updating the weights to minimize this error. The objective function minimized during training is often the Mean Squared Error (MSE), which measures the average squared difference between predicted and actual values:

$$MSE=\frac{1}{n}\sum_{i=1}^{n}{\left({y}_{i}-{\widehat{y}}_{i}\right)}^{2}$$

(4)

Here, \(n\) represents the number of data points, \({\widehat{y}}_{i}\) is the predicted value for the \(i\)-th instance, and \({y}_{i}\) is the actual target value.

A fundamental operation in an ANN involves computing the weighted sum of inputs \(\left({z}_{i}\right)\) and subsequently applying an activation function \(\left({a}_{i}\right)\). Activation functions such as the sigmoid function, hyperbolic tangent (tanh), and rectified linear unit (ReLU) are commonly used. These functions introduce non-linearities, allowing ANNs to grasp intricate relationships within the data, enhancing their ability to learn complex patterns. The output \(\left({\widehat{y}}_{i}\right)\) of the \(i\)-th node in the network is computed as:

$${\widehat{y}}_{i}={a}_{i}\left({z}_{i}\right)$$

(5)

ANNs are capable of learning intricate patterns from data, making them suitable for various regression tasks. By adjusting the weights and biases through the training process, ANNs can approximate complex functions, allowing them to model and predict continuous outcomes accurately44,45. Their ability to capture non-linear relationships makes them a valuable tool in regression analysis within diverse fields.
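The following sketch trains a small feed-forward network for regression with scikit-learn's MLPRegressor; the hidden-layer sizes, activation, and synthetic target are illustrative assumptions rather than the configuration tuned in this study.

```python
# Minimal feed-forward ANN regression sketch: hidden layers with ReLU activations,
# trained by backpropagation to minimize the MSE of Eq. (4). Layer sizes are assumptions.
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(500, 4))
y = np.sin(X[:, 0]) + X[:, 1] ** 2 + 0.1 * rng.normal(size=500)   # non-linear target

ann = MLPRegressor(hidden_layer_sizes=(32, 16), activation="relu",
                   max_iter=2000, random_state=0)
ann.fit(X, y)                                   # weights updated via backpropagation
print(mean_squared_error(y, ann.predict(X)))    # training MSE, Eq. (4)
```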

Support vector machines (SVM)

Support Vector Regression (SVR), the regression counterpart of SVM, is a potent ML algorithm extensively employed for regression tasks. In SVM regression, the primary aim is to identify a hyperplane that optimally fits the data while maximizing the margin, defined as the distance between the hyperplane and the nearest data points, referred to as support vectors. The objective is to minimize prediction errors while accommodating a specified margin of tolerance46,47.

Mathematically, SVM regression aims to find a function \(f\left(x\right)\) that predicts the target values \(\left(y\right)\) based on input features \(\left(x\right)\). The objective function for SVM regression is defined as follows:

$$Minimize \frac{1}{2}{\Vert w\Vert }^{2}+C\sum_{i=1}^{n}{\left(max\left(0, \left|{y}_{i}-f\left({x}_{i}\right)\right|-\epsilon \right)\right)}^{2}$$

(6)

Here, \(w\) represents the weights, \(C\) is the regularization parameter that controls the trade-off between minimizing the error and maximizing the margin, \(\epsilon\) is the margin of tolerance, and \(\left({x}_{i},{y}_{i}\right)\) are the input–output pairs in the training dataset. The function \(f\left(x\right)\) is determined by the dot product between the input features and the weights, i.e., \(f\left(x\right)=\langle w,x\rangle +b\), where \(b\) is the bias term.

SVR identifies the optimal hyperplane by solving a constrained optimization problem, ensuring that the errors are minimized while maintaining a balance between fitting the data and achieving a wide margin. The support vectors, which are the data points closest to the hyperplane, influence the final model. SVM regression is effective in capturing non-linear relationships through kernel functions, allowing the algorithm to map the input features into a higher-dimensional space where a linear hyperplane can be more effectively applied48. This ability to handle non-linear data patterns makes SVM regression a versatile technique for various regression tasks in different fields of research and analysis.
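A brief SVR sketch with an RBF kernel illustrates the roles of \(C\) and \(\epsilon\) described above; the hyperparameter values and synthetic data are assumptions for demonstration only.

```python
# Minimal SVR sketch: epsilon-insensitive regression with an RBF kernel, so that a
# linear hyperplane in the induced feature space captures non-linear patterns.
# Hyperparameter values are illustrative assumptions.
import numpy as np
from sklearn.svm import SVR
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.uniform(0, 5, size=(300, 2))
y = np.exp(-X[:, 0]) + 0.5 * X[:, 1] + 0.05 * rng.normal(size=300)

# C trades off margin width against training error; epsilon sets the tolerance tube
svr = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=10.0, epsilon=0.01))
svr.fit(X, y)
print(svr.predict(X[:5]))
print(svr[-1].support_vectors_.shape)           # data points that define the model
```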

Random forest (RF)

RF is a robust regression technique widely used in data analysis. Unlike traditional regression methods, RF combines the predictive power of multiple Decision Tree (DT) algorithms to create an ensemble model. In RF regression, when provided with an input vector \(\left(x\right)\) containing various evidential features for a specific training area, RF constructs a set of \(K\) regression trees and averages their results. The RF regression predictor \(\widehat{f}\left(x\right)\) for an input vector \(x\) is calculated by averaging the predictions of the individual trees as follows:

$$\widehat{f}\left(x\right)=\frac{1}{K}\sum_{k=1}^{K}{T}_{k}\left(x\right)$$

(7)

Here, \({T}_{k}\left(x\right)\) represents the \(k\)-th regression tree grown by RF. To enhance diversity among these trees and prevent correlation, RF employs a technique called bagging. In bagging, training data subsets are created by randomly resampling the original dataset with replacement. This process involves selecting data points from the input sample to generate subsets \(\left\{h\left(x,{\Theta }_{k}\right),k=1, \dots , K\right\}\), where \(\left\{{\Theta }_{k}\right\}\) are independent random vectors with the same distribution. Some data points may be repeated while others are left out, which increases stability and prediction accuracy, especially in the face of slight variations in the input data.

A notable characteristic of RF lies in its ability to select the optimal feature/split point from a randomly chosen subset of features for each tree, reducing inter-tree correlation and minimizing generalization errors. RF trees grow without pruning, ensuring computational efficiency, and utilize out-of-bag elements to evaluate performance without external test data. As the number of trees increases, the generalization error converges, preventing overfitting. Moreover, RF offers insights into the importance of different features, aiding accurate predictions in regression tasks49,50.
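The averaging of Eq. (7), bagging, and out-of-bag evaluation can be sketched with scikit-learn as follows; the number of trees and the synthetic data are illustrative assumptions.

```python
# Minimal random forest regression sketch: K bootstrapped trees whose predictions are
# averaged as in Eq. (7); out-of-bag samples give an internal accuracy estimate.
# The number of trees and feature count are illustrative assumptions.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 5))
y = X[:, 0] * X[:, 1] + np.abs(X[:, 2]) + 0.1 * rng.normal(size=400)

rf = RandomForestRegressor(n_estimators=500,       # K trees grown without pruning
                           max_features="sqrt",    # random feature subset at each split
                           oob_score=True, random_state=0)
rf.fit(X, y)
print(rf.oob_score_)              # performance estimated from out-of-bag elements
print(rf.feature_importances_)    # relative importance of each feature
```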

Extreme gradient boosting (XGBoost)

XGBoost, an advanced ML algorithm, stands out for its efficiency in regression tasks. Unlike traditional techniques, XGBoost employs a gradient boosting framework that sequentially builds multiple decision trees to refine predictions. In regression, XGBoost minimizes the objective function, which is the sum of a loss function and a regularization term, to find the optimal prediction model. The objective function for XGBoost regression is defined as follows:

$$Objective= \sum_{i=1}^{n}\frac{1}{2}{\left({y}_{i}-{\widehat{y}}_{i}\right)}^{2}+\lambda \,\Omega \left(f\right)$$

(8)

Here, \({y}_{i}\) represents the actual target value, \({\widehat{y}}_{i}\) is the predicted value, and \(n\) is the number of data points. The term \(\Omega \left(f\right)\) represents the regularization function, and \(\lambda\) controls the regularization strength.

XGBoost’s power lies in its iterative approach. It starts with an initial prediction \({\widehat{y}}_{i}^{\left(0\right)}\) and then updates it at each iteration by adding the prediction from a new decision tree:

$${\widehat{y}}_{i}^{\left(t\right)}={\widehat{y}}_{i}^{\left(t-1\right)}+{f}_{t}\left({x}_{i}\right)$$

(9)

Here, \(t\) denotes the current iteration, \({f}_{t}\left({x}_{i}\right)\) is the prediction from the \(t\)-th tree for input \({x}_{i}\), and \({\widehat{y}}_{i}^{\left(t\right)}\) is the updated prediction.

To build accurate trees, XGBoost optimizes the structure by selecting the best split points based on the gradient of the loss function. It calculates the first-order and second-order gradients for each instance and uses these values to find the optimal splits. Additionally, XGBoost incorporates a regularization term, controlling the complexity of individual trees, preventing overfitting, and enhancing generalization. By combining the predictions from multiple trees and continuously refining them, XGBoost provides highly accurate regression models, making it a powerful choice for various data analysis tasks51.
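A minimal XGBoost regression sketch showing the iterative additive updates of Eq. (9) and the regularization strength \(\lambda\) follows; the hyperparameter values are illustrative assumptions, not the settings tuned in this study.

```python
# Minimal XGBoost regression sketch: trees are added iteratively as in Eq. (9), with an
# L2 penalty (reg_lambda) playing the role of the regularization term in Eq. (8).
# Hyperparameter values are illustrative assumptions.
import numpy as np
from xgboost import XGBRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 5))
y = X[:, 0] ** 2 + X[:, 1] - X[:, 3] + 0.1 * rng.normal(size=400)

xgb = XGBRegressor(n_estimators=300,     # number of boosting iterations t
                   learning_rate=0.05,   # shrinks each tree's contribution f_t(x)
                   max_depth=4,
                   reg_lambda=1.0)       # regularization strength lambda
xgb.fit(X, y)
print(xgb.predict(X[:5]))
```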

Neural basis expansion analysis for interpretable time series forecasting with exogenous variable (NBEATSX)

The NBEATSX framework decomposes the target signal by means of local nonlinear projections of the target data onto basis functions within specific blocks29. The model’s architecture, as depicted in Fig. 1, consists of multiple blocks, each comprising a Fully Connected Neural Network (FCNN). These FCNNs are responsible for learning expansion factors for both forecast and backcast components. The final prediction is formed by aggregating the forecasts, and the backcast model refines inputs for subsequent blocks. These blocks are arranged into stacks, with each stack specializing in different variants of basis functions. For instance, stack A may emphasize the seasonality aspect, stack B may focus on the trend, and stack C may concentrate on exogenous factors. This arrangement ensures that the output from stack A captures seasonality, while the outputs from stacks B and C represent the trend and exogenous elements, respectively, contributing to the interpretability of the ANN.

Figure 1: Architecture of the NBEATSX model.

Let the \(1\times T\) vector \({\varvec{y}}\) be the variable to be forecasted and the \(T\times M\) matrix \({\varvec{X}}\) be the considered exogenous variables, where \(T\) is the total number of considered time steps and \(M\) is the total number of exogenous variables. Consider the scenario in which NBEATSx is used to forecast \(H\) time steps of \({\varvec{y}}\) at time \(L\), where \(L=T-H\). The inputs for NBEATSx are then \({{\varvec{y}}}^{\text{back}}\), \({{\varvec{X}}}^{\text{back}}\), and \({{\varvec{X}}}^{fut}\), where \({{\varvec{y}}}^{\text{back}}\) is a \(1\times L\) vector composed of the \(L\) lagged values of \({\varvec{y}}\), \({{\varvec{X}}}^{\text{back}}\) is the \(T\times M\) matrix composed of the \(L\) lagged values of the \(J\) exogenous variables and zeros elsewhere, and \({{\varvec{X}}}^{fut}\) is the \(T\times M\) matrix composed of the \(T\) values of the \(K\) exogenous variables known at time \(L\) and zeros elsewhere. Note that \(J+K=M\) and \({{\varvec{X}}}^{\text{back}}+{{\varvec{X}}}^{fut}={\varvec{X}}\); the \({\varvec{y}}\) values to be forecasted are denoted \({{\varvec{y}}}^{\text{for}}\), and their estimates are denoted \(\widehat{{\varvec{y}}}^{\text{for}}\).

The inputs of the first block consist of \({{\varvec{y}}}^{\text{back}}\) and \({\varvec{X}}\), while the inputs of each subsequent block include the residual connections with the backcast output of the previous block. Considering the \(b\)-th block of the \(s\)-th stack, the following transformations hold:

$${{\varvec{h}}}_{s,b}={\mathit{FCNN}}_{s,b}\left({{\varvec{y}}}_{s,b-1}^{\text{back }},{{\varvec{X}}}_{s,b-1}\right)$$

(10)

$${{\varvec{\theta}}}_{s,b}^{\text{back }}={\mathit{LINEAR}}^{\text{back }}\left({{\varvec{h}}}_{s,b}\right) \text{ and }{{\varvec{\theta}}}_{s,b}^{\text{for }}={\mathit{LINEAR}}^{\text{for }}\left({{\varvec{h}}}_{s,b}\right)$$

(11)

where, \({{\varvec{h}}}_{s,b}\in {\mathbb{R}}^{{N}_{h}}\) are learned hidden units, and \({{\varvec{\theta}}}_{s,b}^{\text{back }}\in {\mathbb{R}}^{{N}_{s}}\) and \({{\varvec{\theta}}}_{s,b}^{\text{for }}\in {\mathbb{R}}^{{N}_{s}}\) are respectively backcast and forecast expansion coefficients linearly estimated from \({{\varvec{h}}}_{s,b}\). Afterwards, the following basis expansion operation and doubly residual stacking are performed:

$$\widehat{{\varvec{y}}}_{s,b}^{{\text{back }}} = {\varvec{V}}_{s,b}^{{\text{back }}} {\varvec{\theta}}_{s,b}^{{\text{back }}} {\text{ and }}\widehat{{\varvec{y}}}_{s,b}^{{\text{for }}} = {\varvec{V}}_{s,b}^{{\text{for }}} {\varvec{\theta}}_{s,b}^{{\text{for }}}$$

(12)

$${\varvec{y}}_{s,b + 1}^{{\text{back }}} = {\varvec{y}}_{s,b}^{{\text{back }}} - \widehat{{\varvec{y}}}_{s,b}^{{\text{back }}} {\text{ and }}\widehat{{\varvec{y}}}_{s}^{{\text{for }}} = \mathop \sum \limits_{b = 1}^{B} \widehat{{\varvec{y}}}_{s,b}^{{\text{for }}}$$

(13)

where, \({{\varvec{V}}}_{s,b}^{\text{back }}\in {\mathbb{R}}^{L\times {N}_{s}}\) and \({{\varvec{V}}}_{s,b}^{\text{for }}\in {\mathbb{R}}^{L\times {N}_{s}}\) are the block’s basis, with the possible types of basis being trend basis, \({\varvec{T}}\), seasonal basis, \({\varvec{S}}\), identity basis, \({\varvec{I}}\), and exogenous basis, \({\varvec{X}}\). The doubly residual stacking helps with the optimization procedure and forecast precision as it prepares the inputs of the subsequent layer and allows the \(s\)-th stack to sequentially decompose the modeled signal.

The trend basis is \({\varvec{T}}=\left[1,\mathbf{t},\dots ,{\mathbf{t}}^{{N}_{\text{pol}}}\right]\in {\mathbb{R}}^{H\times \left({N}_{\text{pol}}+1\right)}\), where \({N}_{\text{pol}}\) is the maximum polynomial degree chosen as a hyperparameter and \(\mathbf{t}=\left[0,1,2,\dots ,H-1\right]/H\). The seasonal basis is \({\varvec{S}}=\left[1,\text{cos}\left(2\pi \frac{\mathbf{t}}{{N}_{hr}}\right),\dots ,\text{cos}\left(2\pi \left[\frac{H}{2}-1\right]\frac{\mathbf{t}}{{N}_{hr}}\right),\text{sin}\left(2\pi \frac{\mathbf{t}}{{N}_{hr}}\right),\dots ,\text{sin}\left(2\pi \left[\frac{H}{2}-1\right]\frac{\mathbf{t}}{{N}_{hr}}\right)\right]\in {\mathbb{R}}^{H\times (H-1)}\), where the hyperparameter \({N}_{hr}\) controls the harmonic oscillations. The identity basis is \({\varvec{I}}={I}_{H\times H}\), the \(H\times H\) identity matrix. Finally, the exogenous basis is \({\varvec{X}}=\left[{{\varvec{X}}}_{1},\dots ,{{\varvec{X}}}_{M}\right]\in {\mathbb{R}}^{H\times M}\); when using \({\varvec{X}}\), the basis expansion operation can be thought of as a time-varying local regression. The final forecast \(\widehat{{\varvec{y}}}^{\text{for}}\) is obtained by adding all stack predictions, \(\sum_{s=1}^{S}\widehat{{\varvec{y}}}_{s}^{\text{for}}\), as shown in the yellow rectangle of Fig. 1.
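To clarify Eqs. (10)–(13), the following PyTorch sketch implements one NBEATSx-style block with an identity basis and the doubly residual stacking across blocks; the layer sizes and the flattened concatenation of exogenous inputs are simplifying assumptions rather than the exact architecture used here.

```python
# Illustrative PyTorch sketch of a single NBEATSx-style block and the doubly residual
# stacking of Eqs. (10)-(13), using the identity basis I for clarity. Layer sizes and
# the way exogenous inputs are concatenated are simplifying assumptions.
import torch
import torch.nn as nn

class NBeatsXBlock(nn.Module):
    def __init__(self, backcast_len, horizon, exog_dim, hidden=128):
        super().__init__()
        in_dim = backcast_len + exog_dim            # y_back concatenated with flattened X
        self.fcnn = nn.Sequential(                  # Eq. (10): h = FCNN(y_back, X)
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU())
        self.theta_back = nn.Linear(hidden, backcast_len)   # Eq. (11), identity basis
        self.theta_for = nn.Linear(hidden, horizon)         # Eq. (11), identity basis

    def forward(self, y_back, x_flat):
        h = self.fcnn(torch.cat([y_back, x_flat], dim=-1))
        return self.theta_back(h), self.theta_for(h)        # Eq. (12) with V = I

def stack_forecast(blocks, y_back, x_flat):
    """Doubly residual stacking, Eq. (13): subtract backcasts, sum forecasts."""
    forecast = 0.0
    for block in blocks:
        backcast, block_forecast = block(y_back, x_flat)
        y_back = y_back - backcast        # residual input for the next block
        forecast = forecast + block_forecast
    return forecast

blocks = nn.ModuleList([NBeatsXBlock(backcast_len=24, horizon=12, exog_dim=48)
                        for _ in range(3)])
y_back = torch.randn(8, 24)               # batch of lagged targets
x_flat = torch.randn(8, 48)               # flattened exogenous window (assumption)
print(stack_forecast(blocks, y_back, x_flat).shape)   # torch.Size([8, 12])
```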

Transformer with exogenous variable (TransformerX)

The proposed model harnesses the transformative capabilities of the Transformer architecture, a DL approach that has garnered significant acclaim in the field of natural language processing. Its widespread recognition stems from its ability to effectively handle sequential data tasks, owing to its capacity for parallel computations and efficient attention mechanisms. Departing from traditional recurrent models such as Recurrent Neural Networks (RNNs), LSTM and Gated Recurrent Units (GRUs), the Transformer architecture takes a novel approach by solely relying on attention mechanisms to capture long-range dependencies within the input data. This departure from sequential constraints enables the model to excel in sequence-to-sequence tasks, including machine translation, text summarization, and time series forecasting. By leveraging the full context of the input sequence, the Transformer model can dynamically allocate attention to the most relevant information, thereby enhancing its capability to model complex relationships and intricate patterns within the data.

Overview of the transformer architecture

The Transformer model follows an encoder-decoder structure (shown in Fig. 2), where the encoder processes the input sequence and generates a continuous representation, while the decoder utilizes this representation to produce the output sequence. The distinguishing feature of the Transformer model is its self-attention mechanism, which enables the model to weigh the importance of different parts of the input sequence when computing the representation for a specific part of the sequence52.

Figure 2: Transformer architecture.

Key components of the Transformer:

  1.

    Input embedding and positional encoding The input sequence is first transformed into high-dimensional vector representations through an embedding layer. Since the Transformer lacks inherent sequence order information, positional encodings are added to the embeddings to preserve the sequential nature of the data.

    $$Embedding\left({X}_{i}\right)={X}_{i}E$$

    (14)

where,

\({X}_{i}:\) Represents the \({i}^{th}\) item in the input sequence.

\(E:\) The embedding matrix, typically learned during training

$$Positional \,Encoding \,\left(pos, 2i\right)=sin\left(\frac{pos}{{10000}^{2i/d}}\right)$$

(15)

$$Positional \,Encoding \,\left(pos, 2i+1\right)=cos\left(\frac{pos}{{10000}^{2i/d}}\right)$$

(16)

where,

\(pos:\) The position of the item in the sequence.

\(i:\) The dimension index of the positional encoding.

\(d:\) The dimensionality of the embeddings

$${X}^{\prime}=Embedding\left(X\right)+Positional \,Encoding$$

(17)

  2.

    Multi-head self-attention The self-attention mechanism is the core component of the Transformer model. It allows the model to attend to different parts of the input sequence by computing attention weights based on the similarity between the query (current position) and key (other positions) vectors. Multiple attention heads are employed in parallel, each capturing different aspects of the input data53. In matrix form, the self-attention operation in Transformers can be expressed as follows (a NumPy sketch of the positional encoding and this attention operation appears after this list):

    $$Z=softmax\left(\frac{Q{K}^{T}}{\sqrt{{d}_{k}}}\right)V$$

    (18)

where \(Q\) and \(K\in {\mathbb{R}}^{{s}_{1}\times n}\), \(V\in {\mathbb{R}}^{s\times n}\), \(Z\in {\mathbb{R}}^{s\times n}\), \({d}_{k}\) is the dimensionality of the key vectors, and the superscript \(T\) denotes the transpose.

  3.

    Encoder and decoder blocks Both the encoder and decoder consist of multiple stacked layers. Each encoder layer includes a multi-head self-attention sublayer and a feedforward neural network sublayer. The decoder layers additionally have an encoder-decoder attention sublayer that attends to the output of the encoder54.

  4.

    Residual connections and layer normalization To improve training stability and convergence, residual connections and layer normalization are applied within each encoder and decoder layer55:

    $$Output=Sublayer\left(x\right)+x$$

    (19)

    $$LayerNorm\left(x\right)=\gamma \left(\frac{x-\mu }{\sigma }\right)+\beta$$

    (20)

where \(\mu\) and \(\sigma\) are the mean and standard deviation of the features, and \(\gamma\) and \(\beta\) are learnable parameters.

  5.

    Feedforward neural network After the self-attention and encoder-decoder attention sublayers, a feedforward neural network is employed to further process the input representations.

    $$FFN\left(x\right)=max\left(0,x{W}_{1}+{b}_{1}\right){W}_{2}+{b}_{2}$$

    (21)

where,

\({W}_{1}\) and \({W}_{2}\) are weight matrices and \({b}_{1}\) and \({b}_{2}\) are bias vectors.

  6.

    Output generation For tasks like machine translation, the decoder output is passed through a linear layer and a softmax activation function to generate the final output sequence56,57.

    $$Y=DecoderOutput*W+b$$

    (22)

where,

\(DecoderOutput:\) The output from the final decoder block.

\(W:\) Weight matrix of the linear layer.

\(b:\) Bias vector

$$Softmax\left({z}_{i}\right)=\frac{{e}^{{z}_{i}}}{{\sum }_{j}{e}^{{z}_{j}}}$$

(23)

where,

\({z}_{i}:\) The \({i}^{th}\) element of the output vector from the linear layer.

\({e}^{{z}_{i}}:\) The exponential function applied to \({z}_{i}\)

\({\sum }_{j}{e}^{{z}_{j}}:\) The sum of the exponentials of all elements in the output vector.
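To make the components above concrete, the following NumPy sketch implements the sinusoidal positional encoding of Eqs. (15)–(16), the softmax of Eq. (23), and the scaled dot-product self-attention of Eq. (18); the sequence length and model dimension are illustrative assumptions.

```python
# Illustrative NumPy sketch of sinusoidal positional encoding (Eqs. 15-16) and scaled
# dot-product self-attention (Eq. 18). Dimensions are small illustrative assumptions.
import numpy as np

def positional_encoding(seq_len, d_model):
    """PE(pos, 2i) = sin(pos / 10000^(2i/d)), PE(pos, 2i+1) = cos(pos / 10000^(2i/d))."""
    pos = np.arange(seq_len)[:, None]
    i = np.arange(0, d_model, 2)[None, :]
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))   # Eq. (23), numerically stable
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Z = softmax(Q K^T / sqrt(d_k)) V, Eq. (18)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = K.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d_k)) @ V

rng = np.random.default_rng(0)
seq_len, d_model = 10, 16
X = rng.normal(size=(seq_len, d_model)) + positional_encoding(seq_len, d_model)  # Eq. (17)
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)   # (10, 16)
```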

The Transformer model has proven to be highly effective in various natural language processing tasks and has inspired numerous variants and adaptations for other domains, such as computer vision and time series forecasting.

Transformer with exogenous variable

  a.

    Transformer Encoder:

  i.

    Input Embedding: The model begins by embedding the input sequences of historical prices and exogenous variables. These embeddings are then passed through the encoder layers.

  ii.

    Multi-Head Attention: Multi-Head Attention serves as the cornerstone of the Transformer architecture, enabling the model to assess the importance of distinct input sequence components. This mechanism encompasses three distinct linear transformations to generate query, key, and value vectors, which are subsequently partitioned into multiple heads, facilitating simultaneous attention to diverse positions within the input.

  iii.

    Position-wise Feed-Forward Network: Following the attention mechanism, the output undergoes processing through a position-wise feed-forward neural network, applied independently at each position. This crucial step enables the model to capture intricate patterns within the data.

  iv.

    Layer Normalization and Dropout: After each sub-layer, layer normalization is applied, followed by dropout for regularization, preventing overfitting during training.

  b.

    Transformer Decoder:

  i.

    Multi-Head Attention (Two Mechanisms):

  1.

    First Attention Mechanism: Focuses on the exogenous variables, enabling the model to incorporate additional context from external factors such as precipitation.

  2.

    Second Attention Mechanism: Focuses on the encoder context, aligning the input sequence with the target prediction and capturing the relevant historical information.

  ii.

    Position-wise Feed-Forward Network: Similar to the encoder, the decoder uses a position-wise feed-forward network for further processing.

  iii.

    Layer Normalization and Dropout: Similar to the encoder, layer normalization and dropout are applied after each sub-layer for regularization and stability.

  c.

    Output Layer:

The final layer of the model is a dense layer with linear activation. It produces the predicted output for the next time step, representing the forecasted prices based on the historical prices and exogenous variables.
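The encoder-decoder arrangement described above can be sketched in PyTorch as follows. This is an illustrative approximation that uses torch.nn.Transformer, omits positional encodings for brevity, and assumes the exogenous variables are simply concatenated with prices at the encoder input; it is not the exact TransformerX implementation. In practice such a model would be trained by minimizing the MSE between predicted and observed prices.

```python
# Illustrative PyTorch sketch of a TransformerX-style forecaster: the encoder reads the
# embedded history of prices plus exogenous variables, the decoder attends to the encoder
# context through encoder-decoder attention, and a final dense layer with linear activation
# emits the forecasted prices. Layer sizes and the input layout are simplifying assumptions.
import torch
import torch.nn as nn

class TransformerX(nn.Module):
    def __init__(self, n_exog, d_model=64, nhead=4, num_layers=2, dropout=0.1):
        super().__init__()
        self.encoder_embed = nn.Linear(1 + n_exog, d_model)   # price + exogenous inputs
        self.decoder_embed = nn.Linear(n_exog, d_model)       # exogenous context for decoder
        self.transformer = nn.Transformer(d_model=d_model, nhead=nhead,
                                          num_encoder_layers=num_layers,
                                          num_decoder_layers=num_layers,
                                          dropout=dropout, batch_first=True)
        self.head = nn.Linear(d_model, 1)       # dense output layer, linear activation

    def forward(self, history, exog_future):
        # history: (batch, L, 1 + n_exog) lagged prices with exogenous variables
        # exog_future: (batch, H, n_exog) exogenous values for the forecast window
        # (positional encodings omitted for brevity)
        src = self.encoder_embed(history)
        tgt = self.decoder_embed(exog_future)
        out = self.transformer(src, tgt)        # self-attention + encoder-decoder attention
        return self.head(out)                   # (batch, H, 1) forecasted prices

model = TransformerX(n_exog=2)
history = torch.randn(8, 24, 3)                 # 24 lags of price plus 2 exogenous series
exog_future = torch.randn(8, 12, 2)             # exogenous values known for 12 future steps
print(model(history, exog_future).shape)        # torch.Size([8, 12, 1])
```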

Ethics approval, consent to participate and consent for publication

The manuscript does not report on or involve the use of any animal or human data; this section is therefore not applicable.


