Adaptive urban traffic signal control based on enhanced deep reinforcement learning



Dueling Networks and Double Q-Learning

The dueling network37 and double Q-learning38 algorithms are optimized versions of DQN: they update the parameters of a neural network to fit a function Q(s,a;w) that approximates the optimal action-value function Q*(s_t, a_t). The dueling network improves the neural network structure as shown in Figure 3, splitting one fully connected layer in a fully connected network into two parts and decomposing the predicted optimal action-value function Q(s,a;w) into a state-value function V(s;w) and an advantage function D(s,a;w), so that the value of each action can be estimated more precisely. The Q value is then calculated as:

$$Q(s,a;w) = V(s;w) + D(s,a;w) - \frac{1}{|D|}\sum\limits_{a} {D(s,a;w)}$$

(8)

Here, w represents the parameters of the neural network.
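
As a concrete sketch of this decomposition, the following PyTorch module (the framework choice and layer dimensions are assumptions, not taken from the paper) implements a dueling head that recombines V(s;w) and D(s,a;w) according to Eq. (8):

```python
import torch
import torch.nn as nn

class DuelingHead(nn.Module):
    """Dueling head: one fully connected layer is split into a state-value
    stream V(s;w) and an advantage stream D(s,a;w), recombined per Eq. (8).
    Layer sizes are illustrative placeholders."""
    def __init__(self, feature_dim: int, n_actions: int):
        super().__init__()
        self.value = nn.Linear(feature_dim, 1)               # V(s;w)
        self.advantage = nn.Linear(feature_dim, n_actions)   # D(s,a;w)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        v = self.value(features)        # shape (batch, 1)
        d = self.advantage(features)    # shape (batch, n_actions)
        # Q = V + D - mean_a D, matching Eq. (8)
        return v + d - d.mean(dim=1, keepdim=True)
```

Subtracting the mean advantage makes the decomposition identifiable: without it, a constant could be shifted freely between V and D without changing Q.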

Figure 3. Improved network structure and model training process.

To update the parameters of the neural network, a target Q-function Q_target(s,a) is defined; it guides the updates and represents the target Q-value when taking action a in state s. The loss function can then be defined as:

$$J(w) = \frac{1}{m}\sum\limits_{i=1}^{m} {\left[ {Q_{\text{target}} (s,a) - Q(s,a;w)} \right]^{2} }$$

(9)

where m represents the number of samples drawn from the experience pool. What is unique about double Q-learning is that the main network is used to select the optimal action, while a separate target network evaluates the target Q-value for that action. The target network has the same structure as the main network but different parameters w⁻; the target is calculated using the following formula:

$$Q_{\text{target}} (s,a) = r + \gamma Q\left( {s^{\prime},\mathop{\arg\max}\limits_{a^{\prime}} Q(s^{\prime},a^{\prime};w);w^{-} } \right)$$

(10)

where w⁻ represents the parameters of the target network, and s′ and a′ are the state and action at the next time step, respectively. The gradient descent algorithm uses the loss value to update the parameters of the main network:

$$w \leftarrow w - \alpha \cdot \nabla_{w} J(w)$$

(11)

The parameters w⁻ of the target network are then updated as a weighted average:

$$w_{new}^{-} = \tau w_{new} + (1 - \tau )w_{now}^{-}$$

(12)

where τ ∈ [0,1] is a hyperparameter that must be tuned manually.
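
Putting Eqs. (9)-(12) together, one double-Q update step might look like the following sketch (PyTorch is assumed; `main_net` and `target_net` are hypothetical networks with identical structure, and terminal-state masking is omitted for brevity):

```python
import torch
import torch.nn.functional as F

def double_q_update(main_net, target_net, optimizer, batch, gamma, tau):
    """One update step combining Eqs. (9)-(12). `batch` holds tensors
    s, a, r, s_next sampled from the experience pool (names assumed)."""
    s, a, r, s_next = batch
    with torch.no_grad():
        # Eq. (10): the main network selects the best next action ...
        best_a = main_net(s_next).argmax(dim=1, keepdim=True)
        # ... and the target network (parameters w^-) evaluates it
        q_target = r + gamma * target_net(s_next).gather(1, best_a).squeeze(1)
    # Eq. (9): mean squared error over the sampled minibatch
    q_pred = main_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    loss = F.mse_loss(q_pred, q_target)
    # Eq. (11): gradient descent step on the main network parameters w
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    # Eq. (12): weighted-average (soft) update of the target network
    for w, w_tgt in zip(main_net.parameters(), target_net.parameters()):
        w_tgt.data.copy_(tau * w.data + (1 - tau) * w_tgt.data)
```

Decoupling action selection (main network) from action evaluation (target network) is what reduces the overestimation bias of plain DQN.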

Prioritized Experience Replay

To address the slow convergence of model training, the prioritized experience replay (PER) mechanism39 is introduced into the training update process. The DQN algorithm uses uniform sampling, in which each sample has an equal probability of being selected. However, the importance of individual samples clearly differs: most samples represent normal traffic conditions, while samples that lead to traffic jams deserve more attention. PER therefore assigns a weight to each sample and performs non-uniform sampling based on these weights. A sample for which the predicted optimal action-value function Q(s,a;w) deviates strongly from Q*(s_t, a_t), i.e., for which |Q(s,a;w) − Q*(s_t, a_t)| is large, should be assigned a higher weight. Since Q*(s_t, a_t) cannot be obtained directly, the target Q-function is used in its place. The sample weight δ_j is defined as:

$$\delta_{j} = \left| {Q(s,a;w)_{j} - Q_{\text{target}} (s,a)_{j} } \right|$$

(13)

If the samples are sorted in descending order of weight, the probability that sample j is selected is defined as:

$$P_{j} = \frac{1}{rank(j)}$$

(14)

where rank(j) is the position of sample j in this ordering; a smaller index indicates a larger weight δ_j, meaning the discrepancy between the prediction for this sample and the target is large. Such a sample is therefore given higher priority during training and is more likely to be selected.
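
The following NumPy sketch illustrates the rank-based sampling rule of Eq. (14), assuming the weights δ_j of Eq. (13) have already been computed for every sample in the pool; since Eq. (14) defines only relative priorities, the sketch normalizes them into a probability distribution:

```python
import numpy as np

def rank_based_sample(deltas: np.ndarray, batch_size: int) -> np.ndarray:
    """Sample indices with probability proportional to 1/rank(j),
    where rank 1 belongs to the sample with the largest weight delta_j."""
    order = np.argsort(-deltas)                    # descending by weight
    ranks = np.empty_like(order)
    ranks[order] = np.arange(1, len(deltas) + 1)   # rank(j) as in Eq. (14)
    p = 1.0 / ranks
    p /= p.sum()                                   # normalize to probabilities
    return np.random.choice(len(deltas), size=batch_size, p=p)
```

Rank-based priorities are less sensitive to outlier TD errors than priorities proportional to δ_j itself, which helps stabilize the non-uniform sampling.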

Introducing noise

To improve the adaptability of the model to various traffic flow scenarios, noise is introduced into the fully connected network. Specifically, the original neural network parameters w are replaced with µ + ε ⊙ ξ, where µ, ε, and ξ have the same shape as w. Here µ and ε represent the mean and standard deviation, respectively, of the neural network parameters trained from the samples, and ξ is random noise whose elements are sampled independently from the standard normal distribution N(0,1). The symbol ⊙ denotes element-wise multiplication. The function Q(s,a;w) is then updated to:

$$Q\left( {s,a;w} \right) = Q\left( {s,a,\xi ;\mu ,\varepsilon } \right)$$

(15)

So the loss function is updated to:

$$J(\mu ,\varepsilon ) = \frac{1}{m}\sum\limits_{i=1}^{m} {\left[ {Q_{\text{target}} (s,a) - Q(s,a,\xi ;\mu ,\varepsilon )} \right]^{2} }$$

(16)

The target network function is updated as follows:

$$Q_{\text{target}} (s,a) = r + \gamma Q\left( {s^{\prime},\mathop{\arg\max}\limits_{a^{\prime}} Q(s^{\prime},a^{\prime};\mu ,\varepsilon );\mu^{-} ,\varepsilon^{-} } \right)$$

(17)

Gradient descent updates the parameters µ and ε of the main network:

$$\mu \leftarrow \mu - \alpha \cdot \nabla_{\mu } J(\mu ,\varepsilon )\;\;\;\varepsilon \leftarrow \varepsilon - \alpha \cdot \nabla_{\varepsilon } J(\mu ,\varepsilon )$$

(18)

The parameters µ⁻ and ε⁻ of the target network are updated according to the following formula:

$$\mu_{new}^{-} = \tau \mu_{new} + (1 - \tau )\mu_{now}^{-} \;\;\;\varepsilon_{new}^{-} = \tau \varepsilon_{new} + (1 - \tau )\varepsilon_{now}^{-}$$

(19)

Replacing the original parameters w with µ + ε ⊙ ξ can significantly improve the robustness of the model. If ε and ξ were not introduced during training, the learned parameters would be exactly µ, and the model could estimate the optimal action-value function accurately only when its parameters equal µ; once µ is perturbed, the model output can be highly biased. Forcing the neural network to minimize the loss function J(µ, ε) while training with noisy parameters makes the model more resistant to interference: even when the parameters deviate from µ, it can still estimate the optimal action-value function quite accurately.
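
As an illustration, a noisy fully connected layer implementing w = µ + ε ⊙ ξ from Eq. (15) could be sketched in PyTorch as follows; the initialization constants are assumptions, and fresh noise ξ is drawn on every forward pass so that gradient descent shapes both µ and ε as in Eq. (18):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NoisyLinear(nn.Module):
    """Fully connected layer whose weights are mu + eps * xi, with xi
    resampled elementwise from N(0,1) on every forward pass (Eq. 15)."""
    def __init__(self, in_f: int, out_f: int, sigma0: float = 0.017):
        super().__init__()
        bound = 1.0 / in_f ** 0.5  # assumed uniform init range for mu
        self.mu = nn.Parameter(torch.empty(out_f, in_f).uniform_(-bound, bound))
        self.eps = nn.Parameter(torch.full((out_f, in_f), sigma0))
        self.bias_mu = nn.Parameter(torch.zeros(out_f))
        self.bias_eps = nn.Parameter(torch.full((out_f,), sigma0))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        xi_w = torch.randn_like(self.mu)       # xi ~ N(0,1), weight noise
        xi_b = torch.randn_like(self.bias_mu)  # xi ~ N(0,1), bias noise
        weight = self.mu + self.eps * xi_w     # mu + eps ⊙ xi, Eq. (15)
        bias = self.bias_mu + self.bias_eps * xi_b
        return F.linear(x, weight, bias)
```

Because the noise enters the loss J(µ, ε) of Eq. (16), backpropagation updates µ and ε directly, and the layer learns how much perturbation it can tolerate.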

Training the model

To find the policy π that maximizes the expected reward, the agent must learn the optimal action-value function Q*(s_t, a_t). The complexity of this function requires a neural network to capture the nonlinear mapping between the state space and the action space. The network structure is divided into two parts: a convolutional network and a fully connected network. The model training process is shown in Figure 3. The convolutional network extracts the vehicles' state information from the traffic environment and flattens it into position information S1 and speed information S2; these are concatenated with the phase information S3 to form a feature vector, which is fed into the fully connected network. The network outputs a Q value for each action, and the agent chooses the action with the largest Q value to control the traffic signal. When the traffic environment changes, the current state s, the action a, the next state s′, and the received reward r are stored in memory as a tuple (s, a, r, s′). When the number of samples in memory reaches a certain threshold, PER is used to select a batch of samples for training the neural network parameters. Through repeated training, the main network parameters µ and ε gradually fit a function that approaches the optimal action-value function Q*(s_t, a_t).
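
The whole procedure can be summarized in the following high-level sketch; `env`, `agent`, and `memory` (and their methods) are hypothetical placeholders for the simulation environment, the agent, and the PER buffer described above:

```python
def train(env, agent, memory, num_episodes, warmup_threshold, batch_size):
    """High-level training loop; all object names are illustrative
    placeholders for the components described in this section."""
    for _ in range(num_episodes):
        s = env.reset()                    # initial state (S1, S2, S3)
        done = False
        while not done:
            a = agent.act(s)               # action with the largest Q value
            s_next, r, done = env.step(a)  # apply signal phase, get reward
            memory.store((s, a, r, s_next))           # tuple (s, a, r, s')
            if len(memory) >= warmup_threshold:
                batch = memory.per_sample(batch_size)  # PER, Eq. (14)
                agent.update(batch)  # loss (16), target (17), Eqs. (18)-(19)
            s = s_next
```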


