Introduction to approximate solving methods for reinforcement learning



A series on reinforcement learning (RL) following Sutton and Barto’s famous book “Reinforcement Learning: An Introduction” [1].

In my last post, I completed a detailed analysis of the first part of the book mentioned above, which introduces the basic solution techniques underlying many RL methods: dynamic programming (DP), Monte Carlo methods (MC), and temporal-difference learning (TD). What distinguishes Parts I and II of Sutton’s book, and what justifies the distinction, is the size of the problem: Part I covered tabular solutions, and now we will delve deeper into this fascinating topic and introduce function approximation.

To be specific, in Part I we assumed that the state space of the problem under investigation was small enough that it, and the solution found, could be represented by a simple table (imagine a table that assigns a certain “goodness”, or value, to each state). In Part II, we remove this assumption so that we can tackle much larger problems.

And, as we have observed firsthand, this modified setup is very much needed. In a previous post, I was able to learn to play Tic-Tac-Toe, but already failed at Connect Four, because the number of states there is on the order of 10²⁰. Alternatively, consider the RL problem of learning a task from camera images: the number of possible camera images is larger than the number of atoms in the known universe [1].

These numbers should convince everyone that approximate solution methods are absolutely necessary. Besides allowing us to address such problems at all, they also provide generalization. The tabular approach treated two similar but distinct states completely separately; approximate solution methods, on the other hand, hope that the function approximation detects such similar states and generalizes across them.

Let’s get started. In the next few sections we will:

  • explain function approximation,
  • derive solution methods for such problems, and
  • discuss various choices of approximate functions.

Overview of function approximation

In contrast to the tabular solutions of Part I, which used tables to represent value functions and the like, we now use parameterized functions:

v̂(s, w) ≈ v_π(s)

with the weight vector w ∈ ℝᵈ.

v̂ can be anything, such as a linear function of the input features or a deep neural network. Later in this article, we will discuss the various possibilities in detail.

The number of weights is usually much smaller than the number of states, which is what produces generalization: when you adjust some weights, you don’t just update one entry in a table — the update (potentially) affects the estimates of many other states as well.
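To make this concrete, here is a minimal sketch of a parameterized value function. The feature map and states are toy assumptions for illustration, not from the book:

```python
import numpy as np

def features(state):
    """Map a state (an integer here, for simplicity) to a feature vector."""
    return np.array([1.0, state, state ** 2])  # bias, linear, quadratic terms

def v_hat(state, w):
    """Approximate state value as a parameterized function of the features."""
    return features(state) @ w

# Weight vector: 3 numbers instead of one table entry per state.
w = np.zeros(3)

# Nudging the weights after observing state 2 changes the estimate of
# *every* state at once, unlike a tabular update, which touches one entry.
w += 0.1 * features(2)
print(v_hat(2, w), v_hat(3, w))  # both estimates have moved away from 0
```

Note how a single (illustrative) weight update shifts the estimates of states 2 and 3 simultaneously — this is the generalization the text describes.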

Let’s summarize the update targets of the methods we saw in previous posts, written as state → target pairs.

The MC method uses the observed return as the estimate for a state:

S_t → G_t

TD(0) bootstraps on the next state’s value estimate:

S_t → R_{t+1} + γ v̂(S_{t+1}, w_t)

DP uses the expected one-step target:

s → E_π[R_{t+1} + γ v̂(S_{t+1}, w_t) | S_t = s]

From now on, we will interpret updates of the form s → u as input/output pairs of the function we want to approximate, and we will use techniques from machine learning, specifically supervised learning, to learn it. Tasks that require estimating a numerical output u are known as function approximation or regression.

In principle, any method suited to such a task could be used. We’ll come back to this a little later, but it should be mentioned that the chosen method has to satisfy certain requirements. First, it must be able to learn incrementally, from a dataset that grows over time, since in RL, unlike in traditional supervised learning tasks, experience typically accumulates as the agent acts. Additionally, the chosen method must be able to handle non-stationary targets. This is discussed in the next subsection.

The prediction objective

In Part I of Sutton’s book, there was no need for a prediction objective at all: as it turned out, we were always able to converge to the exact value function, which completely describes the value of each state. For the reasons mentioned above, this is no longer possible, and we need to define the objective we want to optimize, i.e. a cost function.

We use the mean squared value error:

VE(w) = Σ_s µ(s) [v_π(s) − v̂(s, w)]²

Let’s try to understand this. It is a weighted average of the squared difference between predicted and true values, which makes intuitive sense and is common in supervised learning. Note that this requires defining a distribution µ that specifies how much we care about each state.

Often µ is simply the fraction of time spent in a state — the so-called on-policy distribution — and this is what we focus on in this part.
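As a toy illustration of the objective above — with made-up numbers for µ, for the (in practice unknown) true values v_π, and for the current estimates — the error can be computed as:

```python
import numpy as np

# VE(w) = sum_s mu(s) * (v_pi(s) - v_hat(s, w))**2 over three toy states.
mu = np.array([0.5, 0.3, 0.2])       # how much we care about each state
v_pi = np.array([1.0, 0.0, -1.0])    # "true" values (illustrative numbers)
v_hat = np.array([0.8, 0.1, -0.5])   # current approximate values

ve = np.sum(mu * (v_pi - v_hat) ** 2)
print(ve)
```

States with larger µ contribute more to the error, which is exactly how the objective encodes “how much we care” about each state.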

However, note that it is not actually clear whether this is the right objective. In RL, we ultimately care about finding good policies, and a method may optimize the above objective very well yet still fail at the task at hand, for example when the resulting policy spends too much time in undesirable states. Still, as discussed, we need some such objective, and in the absence of a clearly better alternative, we simply optimize this one.

Next, we’ll show you how to minimize this objective.

Minimizing the prediction objective

The tool of choice for this task is stochastic gradient descent (SGD). Unlike Sutton, I don’t want to go into too much detail here and will focus on the RL part, so I refer interested readers to [1] or other tutorials on SGD / deep learning.

As a quick recap, SGD uses individual samples (or mini-batches) to compute the gradient of an objective and updates the weights in small steps in the direction that minimizes it.

The resulting per-sample update is:

w_{t+1} = w_t + α [v_π(S_t) − v̂(S_t, w_t)] ∇v̂(S_t, w_t)

Now comes the interesting part: in general we do not know the true value v_π(S_t), but only a (possibly noisy) approximation of it, which we denote U_t.

It can be shown that if U_t is an unbiased estimate of v_π(S_t), the solution found by SGD converges to a local optimum. This is useful: we can now, for example, simply use the MC return G_t as U_t to obtain our first gradient-based RL method.

Pseudocode: gradient Monte Carlo prediction, from [1].
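A minimal sketch of this gradient Monte Carlo update, using a hand-made episode and toy one-hot features (the environment, features, and numbers are all illustrative assumptions):

```python
import numpy as np

# Gradient Monte Carlo prediction:
#   w <- w + alpha * (G_t - v_hat(S_t, w)) * grad v_hat(S_t, w)

def features(state):
    return np.eye(3)[state]  # one-hot features for 3 toy states

def v_hat(state, w):
    return features(state) @ w

def grad_v_hat(state, w):
    return features(state)  # for a linear v_hat, the gradient is the feature vector

alpha = 0.1
w = np.zeros(3)

# One hand-made "episode": (state, return-from-that-state) pairs,
# as they would be collected after an episode terminates.
episode = [(0, 1.5), (1, 0.5), (2, -0.5)]

for state, G in episode:
    w += alpha * (G - v_hat(state, w)) * grad_v_hat(state, w)

print(w)  # each visited state's estimate moved toward its observed return
```

Since the return G_t is an unbiased sample of v_π(S_t), this is exactly the setting in which the convergence result above applies.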

It is also possible to use other estimates for U_t, in particular bootstrapped ones, i.e. targets built from previous estimates. Doing so loses the convergence guarantees, but empirically it often works well. Such methods are called semi-gradient methods because they only account for the effect of the weight change on the estimate, not for its effect on the target.

Based on this, we can introduce TD(0) using function approximation.

Pseudocode: semi-gradient TD(0), from [1].
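A sketch of semi-gradient TD(0) in the same toy setting as before (one-hot features, made-up transitions). Note that the gradient is taken only through v̂(S, w), not through the bootstrap target:

```python
import numpy as np

# Semi-gradient TD(0):
#   w <- w + alpha * (R + gamma * v_hat(S', w) - v_hat(S, w)) * grad v_hat(S, w)

def features(state):
    return np.eye(3)[state]  # one-hot features for 3 toy states

def v_hat(state, w):
    return features(state) @ w

alpha, gamma = 0.1, 0.9
w = np.zeros(3)

# A toy stream of transitions (S, R, S'), processed online as they arrive.
transitions = [(0, 1.0, 1), (1, 0.0, 2), (2, -1.0, 0)]

for s, r, s_next in transitions:
    td_error = r + gamma * v_hat(s_next, w) - v_hat(s, w)
    # Gradient only w.r.t. v_hat(s, w); the target is treated as a constant.
    w += alpha * td_error * features(s)

print(w)
```

Unlike the Monte Carlo version, this update can be applied after every single transition, without waiting for the episode to end.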

A natural extension of this, paralleling the n-step methods of the tabular case, is n-step semi-gradient TD.

Pseudocode: n-step semi-gradient TD, from [1].

Function approximation methods

In the remainder of Chapter 9, Sutton discusses various ways to represent the approximate functions. Most of the chapter covers linear function approximation and feature design for it; artificial neural networks are introduced for nonlinear function approximation. Since this blog primarily deals with (deep) neural networks rather than simple linear models, and since the astute reader is likely already familiar with the basics of deep learning and neural networks, we will only briefly discuss these topics.

Linear function approximation

Still, let’s briefly discuss linear approximations. In this case, the state-value function is approximated by a dot product:

v̂(s, w) = wᵀ x(s) = Σ_i w_i x_i(s)

Here, each state s is described by a feature vector x(s) — and, as you can see, the approximation is a linear combination of the features, weighted by the weights.

Because this representation is so simple, there are some elegant formulas (including closed-form solutions) and some convergence guarantees.
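As an illustration of why linearity helps: given a batch of (state, target) pairs, the best weights in the least-squares sense can be found in closed form. The feature vectors and targets below are toy numbers:

```python
import numpy as np

# Feature vectors x(s) for three sampled states, stacked as rows.
X = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
u = np.array([2.0, 3.0, 5.0])  # targets observed for those states

# Closed-form least-squares fit of v_hat(s, w) = w . x(s).
w, *_ = np.linalg.lstsq(X, u, rcond=None)
print(w)
```

No iterative gradient descent is needed here — this kind of direct solution is only available because v̂ is linear in w.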

Feature construction for linear methods

A limitation of the simple linear function approximation introduced above is that each feature enters individually; features cannot interact. Sutton cites the pole-balancing (cart-pole) task as an example: a high angular velocity can be good or bad depending on the situation. When the pole is well balanced, quick or jerky movements should be avoided; however, if the pole is about to topple over, a high velocity may be exactly what is needed.

Therefore, there is a separate field of research on designing efficient feature representations (although some would argue that with the rise of deep learning, this is becoming less important).

One such representation is polynomial features. As an introductory example, imagine that the state vector consists of two components, s₁ and s₂. We can then define the feature vector

x(s) = (1, s₁, s₂, s₁s₂)ᵀ

Then, using this representation, we can again apply linear function approximation: we use four weights for the four newly constructed features and keep a function that is linear in the weights overall.

More generally, features based on polynomials of order n over k state components can be written as

x_i(s) = Π_{j=1..k} s_j^{c_{i,j}}

where each exponent c_{i,j} is an integer in {0, …, n}.
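The general construction above can be sketched as follows; the function name and the two-dimensional example state are illustrative:

```python
import itertools
import numpy as np

def poly_features(s, n):
    """All products of powers of the state components, exponents in 0..n.

    For a k-dimensional state this yields (n + 1)**k features
    x_i(s) = prod_j s_j**c[j], one per exponent tuple c.
    """
    k = len(s)
    return np.array([
        np.prod([s[j] ** c[j] for j in range(k)])
        for c in itertools.product(range(n + 1), repeat=k)
    ])

# For n = 1 and s = (s1, s2) this yields the four features
# 1, s2, s1, and s1*s2 -- the introductory example from the text.
x = poly_features(np.array([2.0, 3.0]), n=1)
print(x)
```

A linear function over these features can then represent interactions such as s₁s₂, which the plain linear approximation could not.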

Other commonly used constructions include Fourier bases, coarse coding, tile coding, and radial basis functions, but as mentioned earlier, we will not discuss them in further depth at this point.

conclusion

In this post, we took an important step beyond previous posts, toward deploying RL algorithms “in the wild.” Previous posts focused on introducing important, albeit tabular, RL methods; we found that these quickly reach their limits when applied to larger problems, and that approximate solutions are needed.

In this post, we introduced the basics. In addition to allowing us to tackle large-scale real-world problems, these techniques also introduce the generalization that is strongly needed for successful RL algorithms.

We started by introducing a suitable prediction goal and a way to optimize it.

We then introduced the first gradient and semi-gradient RL algorithms for prediction purposes, i.e., to learn the value function of a given policy.

Finally, we discussed various methods for constructing approximation functions.

As always, thank you for reading! If you are interested, stay tuned for the next post, which will cover the corresponding control problem in more detail.

Other posts in this series

References

[1] Sutton, R. S., & Barto, A. G., Reinforcement Learning: An Introduction (2nd ed.): http://incompleteideas.net/book/RLbook2020.pdf

[2] PettingZoo: https://pettingzoo.farama.org/


