Novel multiagent reinforcement learning framework using twin delayed deep deterministic policy gradient for adaptive PID control in boiler turbine systems

K. J. Astrom and R. D. Bell developed a third-order non-linear dynamic model of a boiler from first principles that accurately simulates the plant’s behaviour. The Bell and Astrom boiler is a natural circulation water tube boiler in which the chemical energy of coal is converted into heat energy (steam), the heat energy is converted into mechanical energy (turbine shaft movement), and the mechanical energy is finally converted into electrical energy. The boiler turbine system (BTS) continues to be a significant contributor to global electricity production and energy-intensive industrial processes. In control engineering, the boiler model is mainly used for controller design and tuning, control system validation, training and education, optimization and energy efficiency, and fault detection and diagnosis. Prospective applications of the BTS involve energy efficiency, renewable energy integration and sustainability. Control of boilers is complicated by the dynamics and nature of the boiler model. The Bell and Astrom boiler is a simplified representation of the BTS that captures its dynamic behaviour. The complexities associated with it are non-linear dynamics, control constraints combined with multivariable interactions, and the non-minimum phase behaviour associated with the shrink-swell effect.

The mathematical model of non-linear BTS

The model for the BTS used in this work is shown in Fig. 3 and the governing equations of the systems are as follows:

$$\begin{aligned} & \dot{x}_{1-\text {BTS}}
(1)

$$\begin{aligned} & \dot{x}_{2-\text {BTS}}
(2)

$$\begin{aligned} & \dot{x}_{3-\text {BTS}}
(3)

$$\begin{aligned} & y_{1-\text {BTS}}
(4)

$$\begin{aligned} & y_{2-\text {BTS}}
(5)

$$\begin{aligned} & y_{3-\text {BTS}}
(6)

$$\begin{aligned} & q_{\text {ev(BTS)}} = \frac{(1 - 0.001538 x_{3-\text {BTS}} )(0.8 x_{1-\text {BTS}} - 25.6)}{x_{3-\text {BTS}} (1.0394 - 0.0012304 x_{1-\text {BTS}} )} \end{aligned}$$

(7)

$$\begin{aligned} & s_{q\text {(BTS)}} = (0.854 u_{2-\text {BTS}} - 0.147) x_{1-\text {BTS}} + 45.51 u_{1-\text {BTS}} - 2.154 u_{3-\text {BTS}} - 2.096 \end{aligned}$$

(8)

where \(x_{1\text {-BTS}},\ y_{1\text {-BTS}}
(9)
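As an illustration, the two auxiliary expressions in Eqs. (7) and (8) can be evaluated directly from the states and inputs. The following is a minimal sketch that transcribes only those two equations with the coefficients printed above; the function names and the numerical values in the usage lines are illustrative assumptions, not part of the original model description.

```python
def q_ev(x1: float, x3: float) -> float:
    """Auxiliary quantity of Eq. (7), a function of the states x1 and x3."""
    numerator = (1 - 0.001538 * x3) * (0.8 * x1 - 25.6)
    denominator = x3 * (1.0394 - 0.0012304 * x1)
    return numerator / denominator


def s_q(x1: float, u1: float, u2: float, u3: float) -> float:
    """Auxiliary quantity of Eq. (8), a function of x1 and the three inputs."""
    return (0.854 * u2 - 0.147) * x1 + 45.51 * u1 - 2.154 * u3 - 2.096


# Hypothetical state and input values, chosen only to exercise the functions.
print(q_ev(x1=108.0, x3=428.0))
print(s_q(x1=108.0, u1=0.34, u2=0.69, u3=0.433))
```

Keeping these expressions in separate functions makes it easy to reuse them when the output equations are implemented in a simulation.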

RL algorithm: background

Reward-based RL has become a useful paradigm for addressing complicated control problems by allowing agents to learn optimal actions from environmental rewards. In RL, trial-and-error learning reinforces the actions taken in each environmental state, and their quality is judged by the cumulative reward⁷⁷. By learning an appropriate state-action mapping, the agent maximizes the cumulative discounted reward. Bellman’s principle of optimality ensures that the agent’s policy evolves to maximize outcomes from any state, independent of initial conditions. RL agents learn from experience rather than from repeated examples, adjusting their strategies according to past actions. Due to their simplicity and efficacy, PID controllers are commonly employed in control systems. In complicated or nonlinear systems, however, PID controllers require accurate parameter tuning. RL-based techniques such as TD3 and DDPG improve PID controllers by using their capacity to learn optimal control policies in dynamic contexts. The agent in the actor-critic RL algorithms TD3 and DDPG has two main components: an actor network, which selects actions, and a critic network, which evaluates them.

These algorithms output actions directly from the actor network, and the actions can represent physical control quantities such as PID parameters. The critic network evaluates these actions by estimating future rewards. At each time step, the agent observes the current state (s), selects an action (a) and receives a reward (r). The environment then moves to the next state (s’) and the agent refines its policy iteratively. Integrating RL algorithms such as TD3 and DDPG with PID control provides a strong foundation for adaptive and optimal control of complex systems where typical PID tuning methods may fail. This work applies RL-based approaches to the tuning of PID controllers and shows their potential for solving dynamic and nonlinear problems.
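The cumulative discounted reward that the agent maximizes can be made concrete with a small calculation. The sketch below is only illustrative: the function name, the reward sequence and the value of the discount factor are assumptions chosen for the example.

```python
def discounted_return(rewards, gamma=0.99):
    """Return r_0 + gamma * r_1 + gamma^2 * r_2 + ... for one episode."""
    total = 0.0
    # Accumulate backwards so each step applies one more factor of gamma.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


# Three hypothetical per-step rewards with a discount factor of 0.5.
print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))  # 1 + 0.5 + 0.25 = 1.75
```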

DDPG: exploration and exploitation

Adaptive tuning via DDPG is particularly advantageous in complex, model-free scenarios where classical methods fail to capture the dynamics inherent to the system. For enhancing PID controller adaptability and stability, the integration of DDPG, an RL algorithm, presents a promising approach. DDPG, suited for continuous action spaces, allows for the simultaneous learning of a policy and a Q-function. Its actor and critic networks coordinate in a stepwise manner: the actor network proposes an action based on the current state, while the critic network estimates the Q-value of that action. The critic is trained to minimize the error between the predicted Q-value and the target Q-value, and the actor updates its policy once the critic has evaluated the actor’s actions. By applying actor-critic methods, DDPG fine-tunes the PID parameters $k_p$ (proportional gain), $k_i$ (integral gain) and $k_d$ (derivative gain) through a policy network (actor) (Eq. (15)) that suggests control actions and a Q-value network (critic) (Eq. (14)) that evaluates these actions. Through trial and error, the algorithm optimizes the PID parameters to reduce tracking error and maintain stability without dependency on predefined models. In the context of DDPG for adjusting PID controller parameters, exploration is crucial, especially during the initial stages of learning. It prevents the algorithm from prematurely converging to suboptimal policies by encouraging the evaluation of a wider range of PID parameter settings. This is typically achieved by adding a noise process, such as Ornstein-Uhlenbeck or Gaussian noise, to the actor policy’s output. Such stochasticity in action selection allows the agent to discover and learn from various operational consequences, which is vital for identifying the optimal PID settings across diverse and uncertain system dynamics. On the other hand, exploitation is about using the best strategy that the agent has learned so far.
As the agent gradually learns the optimal actions, the balance shifts towards exploitation, leveraging the accumulated knowledge encapsulated in the actor network to select the most effective actions. Ultimately, the aim is to diminish the exploration noise over time, stabilizing the selection of actions and converging towards an optimal policy that adaptively tunes the PID controller, thus reflecting a learned balance between exploration and exploitation. DDPG effectively handles continuous action spaces and enables model-free control of complicated, nonlinear systems, and its deterministic policy yields repeatable control actions. Its major drawbacks are sensitivity to hyperparameters, slow convergence, and instability in high-dimensional problems. The number of layers in the actor and critic networks and the activation functions used for this algorithm are shown in Fig. 4. The pseudocode for the DDPG algorithm⁶⁵ is given below.
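The shift from exploration to exploitation described above can be illustrated with Gaussian action noise whose standard deviation decays over time. This is a minimal sketch under stated assumptions: the class name, the decay schedule and all numerical values are hypothetical and are not taken from the training setup of this work.

```python
import random


class DecayingGaussianNoise:
    """Gaussian exploration noise whose standard deviation shrinks each step."""

    def __init__(self, sigma=0.3, decay=0.999, sigma_min=0.01, seed=None):
        self.sigma = sigma          # current standard deviation (assumed value)
        self.decay = decay          # multiplicative decay per sample (assumed value)
        self.sigma_min = sigma_min  # floor that keeps a little exploration
        self._rng = random.Random(seed)

    def sample(self):
        eps = self._rng.gauss(0.0, self.sigma)
        self.sigma = max(self.sigma_min, self.sigma * self.decay)
        return eps


def explore(action, noise, low, high):
    """Add exploration noise to a deterministic action and clip it to its bounds."""
    return min(high, max(low, action + noise.sample()))


noise = DecayingGaussianNoise(sigma=0.3, decay=0.9, sigma_min=0.01, seed=0)
for step in range(3):
    # The actor output is a placeholder value standing in for one PID gain.
    print(step, explore(1.0, noise, low=0.0, high=2.0), noise.sigma)
```

Early samples spread widely around the actor output, while later samples stay close to it, which mirrors the gradual move towards exploitation.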

Pseudocode of DDPG

Input and initialization:

Input: Initial policy parameter $\theta$.

Initial Q-function parameters $\phi$.

Replay buffer D holds past experiences $(s, a, r, s', d)$.

Step 1: Initialize target parameters:

Set target (targ) parameters to match primary parameters:

$$\begin{aligned} \theta _{\text {targ}} \leftarrow \theta , \quad \phi _{\text {targ}} \leftarrow \phi \end{aligned}$$

(10)

Step 2: Repeat steps 3 to 14 until convergence or max episodes reached

Step 3: Observe current state and choose action

Observe the current state $s$.

Choose an action a using policy $\mu _{\theta }(s)$ and exploration noise $\epsilon$

$$\begin{aligned} a = \text {clip}(\mu _{\theta }(s) + \epsilon , a_{\text {Low}}, a_{\text {High}}) \end{aligned}$$

(11)

where $\epsilon$ is Gaussian noise.

Step 4: Execute action

Execute a in the environment.

Step 5: Observe transition

Record the next state $s'$, reward r, and terminal flag d.

Step 6: Store experience

Add experience tuple $(s, a, r, s', d)$ to replay buffer D.

Step 7: Reset environment if $s'$ is terminal.

Step 8: Check update condition

If it is time to update, perform the following steps.

Step 9: Perform the updates for a predefined number of iterations

Step 10: Randomly sample a batch of transitions

$$\begin{aligned} B = \{(s, a, r, s', d)\} \subset D \end{aligned}$$

(12)

Step 11: Compute the target value for each transition

$$\begin{aligned} y(r, s', d) = r + \gamma (1 - d) Q_{\phi _{\text {targ}}}(s', \mu _{\theta _{\text {targ}}}(s')) \end{aligned}$$

(13)

where $\gamma$ is the discount factor.

Step 12: Update the Q-function by one step of gradient descent using

$$\begin{aligned} \nabla _{\phi } \frac{1}{|B|} \sum _{(s,a,r,s',d) \in B} (Q_{\phi }(s, a) - y(r, s', d))^2 \end{aligned}$$

(14)

Step 13: Update the policy by one step of gradient ascent using

$$\begin{aligned} \nabla _{\theta } \frac{1}{|B|} \sum _{s \in B} Q_{\phi }(s, \mu _{\theta }(s)) \end{aligned}$$

(15)

Step 14: Soft update of target networks

$$\begin{aligned} & \phi _{\text {targ}} \leftarrow \rho \phi _{\text {targ}} + (1 - \rho ) \phi \end{aligned}$$

(16)

$$\begin{aligned} & \theta _{\text {targ}} \leftarrow \rho \theta _{\text {targ}} + (1 - \rho ) \theta \end{aligned}$$

(17)

where $\rho$ regulates the update rate.
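The bookkeeping parts of the pseudocode above (the replay buffer, the target value of Eq. (13) and the soft updates of Eqs. (16) and (17)) can be sketched without any deep learning library. This is a simplified illustration, not the implementation used in this work: network parameters are represented as plain lists of floats, and the class name, function names and default values are assumptions.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size store of (s, a, r, s_next, d) transitions."""

    def __init__(self, capacity=10000):
        self.data = deque(maxlen=capacity)  # oldest transitions are dropped first

    def add(self, s, a, r, s_next, d):
        self.data.append((s, a, r, s_next, d))

    def sample(self, batch_size):
        # Uniform random mini-batch B drawn from D, as in Eq. (12).
        return random.sample(list(self.data), min(batch_size, len(self.data)))


def ddpg_target(r, d, q_targ_next, gamma=0.99):
    """Eq. (13): bootstrap from the target critic unless the state is terminal."""
    return r + gamma * (1.0 - d) * q_targ_next


def soft_update(targ_params, main_params, rho=0.995):
    """Eqs. (16)-(17): move the target parameters slowly towards the main ones."""
    return [rho * t + (1.0 - rho) * m for t, m in zip(targ_params, main_params)]


buffer = ReplayBuffer(capacity=100)
buffer.add(s=[0.0, 0.1], a=[1.0], r=-0.01, s_next=[0.0, 0.05], d=0)
print(buffer.sample(1))
print(ddpg_target(r=-0.01, d=0, q_targ_next=-0.5))
print(soft_update([0.0, 1.0], [1.0, 1.0], rho=0.9))
```

A value of $\rho$ close to 1 keeps the target networks changing slowly, which is what stabilizes the bootstrapped target.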

ICMA-TD3 and SCMA-TD3: exploration and exploitation

In this study, the application of the TD3 algorithm for the adaptive tuning of a PID controller is explored. The Twin Delayed Deep Deterministic Policy Gradient (TD3) algorithm, known for addressing the overestimation of Q-values inherent in its predecessor DDPG, is used to optimize the PID parameters dynamically. TD3 uses twin critic networks and delayed policy updates to stabilize learning and reduce overestimation. By incorporating a pair of critic networks to estimate the Q-function in Eq. (23), and by employing delayed policy updates together with target policy smoothing in Eq. (21) and soft target updates in Eq. (25) and Eq. (26), the TD3 algorithm ensures a robust and stable adaptation of the PID controller to varying conditions. As a result, the PID controller continuously refines its gains based on the feedback received from the controlled system, aiming to achieve and maintain the desired performance without requiring a priori knowledge of the system dynamics. In the context of the TD3 algorithm applied to PID controller tuning, exploration refers to the process by which the agent investigates various PID parameters to discover how they affect the performance of the controlled system. Exploitation, on the other hand, involves using the knowledge gained from exploration to choose the PID parameters predicted to offer the best performance. TD3 balances exploration and exploitation by adding a noise process to the policy’s action output during exploration, ensuring that a sufficient variety of PID parameters is tested, and by subsequently exploiting the learned policy to fine-tune the parameters for optimal performance. Two variations of the TD3 algorithm are introduced: SCMA-TD3 and ICMA-TD3. The underlying logic of TD3 remains the same; the only difference lies in the structure of the critic network. In ICMA-TD3, two individual critics with a shallow network are used for the PID parameters to avoid complexity and computational cost.
In SCMA-TD3, a shared critic with a deeper structure is used for PID parameter tuning. The individual critic structures allow ICMA-TD3 to capture the dynamics better and improve the PID parameters identified for the complex BTS. The number of layers in the actor and critic networks and the activation functions used for these two algorithms are shown in Fig. 4, which highlights the difference between the architectures. The pseudocode for the TD3 algorithm is given below⁶⁵.

Pseudocode of TD3

Step 1: Input and initialization:

Input: Initial policy parameter $\theta$.

Initial Q-function parameters $\phi_1, \phi_2$.

Replay buffer D holds past experiences $(s, a, r, s', d)$.

Step 2: Initialize target parameters

Set target parameters to match main parameters:

$$\theta_{\text{targ}} \leftarrow \theta, \quad \phi_{\text{targ},1} \leftarrow \phi_1, \quad \phi_{\text{targ},2} \leftarrow \phi_2$$

(18)

Step 3: Repeat steps 4 to 16 until convergence.

Step 4: Observe state and choose action

Observe the current state $s$.

Choose an action a using policy $\mu _{\theta }(s)$ and exploration noise $\epsilon$

$$a = \text{clip}(\mu_{\theta}(s) + \epsilon, a_{\text{Low}}, a_{\text{High}})$$

(19)

Step 5: Execute action

Perform action a in the environment.

Step 6: Observe transition

Record the next state $s'$, reward r, and terminal flag d.

Step 7: Store experience

Add experience tuple $(s, a, r, s', d)$ to replay buffer D.

Step 8: Reset environment

If $s'$ is terminal, reset the environment.

Step 9: Check update condition

If an update is due according to the predetermined update frequency, perform the following steps.

Step 10: Perform updates for a predefined number of iterations (j)

Step 11: Sample a batch of transitions

$$B = \{(s, a, r, s', d)\} \subset D$$

(20)

Step 12: Compute target action for each transition

$$a'(s') = \text{clip}(\mu_{\theta_{\text{targ}}}(s') + \text{clip}(\epsilon, -c, c), a_{\text{Low}}, a_{\text{High}})$$

(21)

Step 13: Compute target value

$$y(r, s', d) = r + \gamma (1 - d) \min_{i=1,2} Q_{\phi_{\text{targ},i}}(s', a'(s'))$$

(22)

where $\gamma$ is the discount factor.

Step 14: Update Q-functions

Perform one step of gradient descent on the Q-function loss:

$$\nabla_{\phi_i} \frac{1}{|B|} \sum_{(s,a,r,s',d) \in B} (Q_{\phi_i}(s, a) - y(r, s', d))^2, \quad \text{for } i=1,2$$

(23)

Step 15: Update policy (delayed, i.e. less frequently than the Q-functions) by one step of gradient ascent using

$$\nabla_{\theta} \frac{1}{|B|} \sum_{s \in B} Q_{\phi_1}(s, \mu_{\theta}(s))$$

(24)

Step 16: Soft update of target networks

$$\phi_{\text{targ},i} \leftarrow \rho \phi_{\text{targ},i} + (1 - \rho) \phi_i, \quad \text{for } i=1,2$$

(25)

$$\theta_{\text{targ}} \leftarrow \rho \theta_{\text{targ}} + (1 - \rho) \theta$$

(26)

where $\rho$ regulates the update rate.
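The three ingredients that distinguish TD3 from DDPG in the pseudocode above can be sketched for scalar actions as follows. This is an illustrative sketch only: the function names, the smoothing noise scale `sigma`, the clipping bound `c` and the `policy_delay` value are assumptions, and the critic outputs are passed in as plain numbers instead of being computed by networks.

```python
import random


def clip(x, low, high):
    return max(low, min(high, x))


def td3_target_action(mu_targ_next, a_low, a_high, sigma=0.2, c=0.5, rng=random):
    """Eq. (21): target policy smoothing with clipped Gaussian noise."""
    eps = clip(rng.gauss(0.0, sigma), -c, c)
    return clip(mu_targ_next + eps, a_low, a_high)


def td3_target(r, d, q1_targ_next, q2_targ_next, gamma=0.99):
    """Eq. (22): use the smaller of the two target critics to limit overestimation."""
    return r + gamma * (1.0 - d) * min(q1_targ_next, q2_targ_next)


def should_update_policy(j, policy_delay=2):
    """Delayed policy update: the actor is updated only every policy_delay iterations."""
    return j % policy_delay == 0


a_next = td3_target_action(mu_targ_next=0.8, a_low=0.0, a_high=1.0)
print(a_next)
print(td3_target(r=-0.02, d=0, q1_targ_next=-1.0, q2_targ_next=-1.3))
print([j for j in range(6) if should_update_policy(j)])
```

Taking the minimum of the two critics biases the target downwards, which counteracts the overestimation that a single critic tends to accumulate.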

In SCMA-TD3, a single shared critic network is used for all agents, which centralizes the value estimation. The shared critic network has more layers than the TD3 critic network. This greater depth allows the shared network to represent and process more complicated inter-agent interactions and shared environmental dynamics, which is necessary for agent coordination. By maintaining a single critic network, SCMA-TD3 lowers the computing overhead and ensures that the actions of all agents are evaluated consistently from a single environmental perspective. ICMA-TD3 assigns each agent its own critic network, decentralizing the process. These individual critic networks have fewer layers than the SCMA-TD3 shared critic. This architecture lets ICMA-TD3 focus on each agent’s localized learning and evaluate behaviours according to that agent’s interaction with the environment. The decentralized structure increases computing complexity because of the agent-specific critic networks, but it allows greater flexibility and adaptability to the dynamics of each agent. Both implementations use the strengths of the TD3 algorithm (delayed policy updates, target smoothing, and noise regularization) but differ in critic network architecture. SCMA-TD3’s deeper shared critic emphasizes coordination and inter-agent robustness, while ICMA-TD3’s individual critic networks emphasize autonomous learning and network simplicity. In multi-agent RL settings, shared and individual critic designs affect performance, scalability, and computational efficiency, and this methodological comparison illustrates the trade-offs between these characteristics.
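The structural difference between the two variants can be summarized by how critics are assigned to agents. The sketch below only illustrates this wiring: the `Critic` class is a placeholder rather than a neural network, the agent names are invented, and the layer counts are hypothetical values chosen to reflect "deeper shared critic" versus "shallower individual critics".

```python
class Critic:
    """Placeholder for a critic network; only its name and depth are stored."""

    def __init__(self, name, hidden_layers):
        self.name = name
        self.hidden_layers = hidden_layers


def build_critics(agent_names, shared, shared_layers=3, individual_layers=1):
    """Return a mapping from agent name to the critic that evaluates it."""
    if shared:
        # SCMA-style wiring: every agent is evaluated by the same, deeper critic.
        critic = Critic("shared", shared_layers)
        return {name: critic for name in agent_names}
    # ICMA-style wiring: every agent owns a separate, shallower critic.
    return {name: Critic(name, individual_layers) for name in agent_names}


agents = ["loop_1", "loop_2", "loop_3"]  # one agent per PID loop (names assumed)
scma = build_critics(agents, shared=True)
icma = build_critics(agents, shared=False)
print(len({id(c) for c in scma.values()}))  # 1 critic object in total
print(len({id(c) for c in icma.values()}))  # 3 critic objects in total
```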

Proposed RL-based BTS control

The analysis of the RL-based control strategy focuses on a complex multivariable BTS with three inputs and three outputs. The main goal is to optimize the PID gains in order to achieve efficient regulation of the process. The RL configuration is designed to replicate real-life industrial situations in which processes display complex interactions and interdependencies among their variables. A PID controller, characterized by its proportional (P), integral (I), and derivative (D) gains, is assigned to each BTS variable, and the gains are tuned using the RL algorithms. The control scheme of the BTS using RL-based PID is shown in Fig. 5.

$$\begin{aligned} PID = k_p e + k_i \int e \, dt + k_d \frac{de}{dt} \end{aligned}$$

(27)
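A PID law with gains $k_p$, $k_i$ and $k_d$ acting on the error $e$ can be discretized in a few lines. The following is a minimal sketch under stated assumptions: it uses rectangular integration and a backward-difference derivative with a fixed sample time `dt`, sets the derivative to zero on the first step, and omits derivative filtering and anti-windup; none of these choices are taken from the original work.

```python
class PID:
    """Discrete-time PID controller with fixed sample time dt."""

    def __init__(self, kp, ki, kd, dt):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral = 0.0
        self.prev_error = None

    def step(self, error):
        # Rectangular (forward Euler) approximation of the integral of e.
        self.integral += error * self.dt
        # Backward difference for de/dt; zero on the very first sample.
        if self.prev_error is None:
            derivative = 0.0
        else:
            derivative = (error - self.prev_error) / self.dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


# Hypothetical gains, standing in for values proposed by an RL agent.
pid = PID(kp=2.0, ki=0.5, kd=0.1, dt=1.0)
for e in [1.0, 0.5, 0.25]:
    print(pid.step(e))
```

In an RL-tuned setting, the three gains would be supplied by the actor instead of being fixed by hand.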

The difficulty lies in coordinating the tuning process across several controllers to guarantee the overall stability and performance of the system. To carry out the tuning procedure, three separate RL agents are used, each assigned to a specific PID controller. Each agent is provided with a set of state observations, including error measurements and system performance indices relevant to its PID controller. By adjusting their individual PID gains, the agents aim to learn policies that minimize a predetermined LQG cost function and meet the desired performance requirements. Despite the emergence of advanced control methods such as fuzzy logic, adaptive mechanisms, and model-based techniques, PID controllers remain dominant because of their simple design and demonstrated ability to provide reliable performance under many operating conditions. The effectiveness of the PID controllers is assessed using metrics such as the rate at which the output approaches the desired value, the extent to which it exceeds the desired value (overshoot), and the error that persists in the steady state. These measurements offer a thorough understanding of the effectiveness of the SCMA-TD3, ICMA-TD3 and DDPG algorithms in acquiring suitable PID settings in a multivariable configuration.

RL framework

The Simulink configuration used for both training and evaluating the RL controller is shown in Fig. 6. The multi-agent structure receives feedback from the environment through the observations vector.

Environment design

To effectively teach an agent to follow control signal trajectories, several design elements must be considered when creating the environment. They can be categorized as agent-related or environment-related. Agent-related factors include the composition of the observations vector and reward strategy. Environment-related elements include training techniques, signals, initial conditions, and criteria for terminating episodes.

Training strategy

The RL agents are trained to follow the benchmark trajectory accurately using constant signals of random magnitude. Each agent is further required to learn to start from a randomly initialized value. This combination constitutes an effective and versatile training approach for teaching the agents to track control signal trajectories. The multivariable BTS (MV-BTS) is considered for this application, and three agents are created for the RL algorithm so that the dynamics and interactions of the system are captured while following the benchmark trajectory. Each agent is therefore responsible for one loop, together with its interactions with the other loops of this highly interacting BTS.

Observation vector and rewards strategy

The observation vector

$$\begin{aligned} \begin{bmatrix} \int e \, dt \\ e \end{bmatrix}^{T} \end{aligned}$$

(28)

where $e$ is the error, is used in training the RL controllers to tune the PID parameters. The reward function for the RL agent is the negative of the LQG cost function, which is given by

$$\begin{aligned} \text {Reward} = - \left( ( \text {ref signal} - \text {output} )^2 + 0.01 u^2 \right) \end{aligned}$$

(29)

The RL agent maximizes this reward, thus minimizing the LQG cost. The quadratic cost function of LQG penalizes large errors more heavily than small ones, and the quadratic term in the control signal $u$ discourages excessive control effort, which enhances stability. For linear systems with Gaussian noise, LQG control gives a theoretically elegant solution that is optimal for the quadratic cost function. Although linear cost functions are simple, they often lack these desirable features of quadratic cost functions. Its quadratic cost makes LQG control a promising foundation for reliable, efficient, and customizable linear system control.
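The observation vector of Eq. (28) and the quadratic reward built from the tracking error and the control signal with weight 0.01 can be computed per time step as sketched below. The class and function names, the rectangular integration and the sample time are assumptions made for illustration.

```python
class ErrorObserver:
    """Builds the observation [integral of e, e] for one control loop."""

    def __init__(self, dt):
        self.dt = dt
        self.integral = 0.0

    def observe(self, ref, output):
        e = ref - output
        self.integral += e * self.dt  # rectangular integration (assumed)
        return [self.integral, e]


def reward(ref, output, u):
    """Negative quadratic cost: squared tracking error plus 0.01 * u^2."""
    return -((ref - output) ** 2 + 0.01 * u ** 2)


obs = ErrorObserver(dt=0.1)
print(obs.observe(ref=1.0, output=0.2))
print(reward(ref=1.0, output=0.2, u=3.0))
```

Because the reward is never positive, the best achievable value is zero, reached only with perfect tracking and zero control effort.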