TD learning with a fixed feature-specific temporal basis
The original TD learning algorithms assumed that agents can be in a set of discrete labeled states (\(s\)) that are stored in memory. The goal of TD is to learn a value function such that each state becomes associated with a unique value (\(V(s)\)) that estimates future discounted rewards. Learning is driven by the difference between values at two successive states, and hence such algorithms are called temporal difference algorithms. Mathematically, this is captured by the update rule \(V(s) \leftarrow V(s) + \alpha\left(r(s') + \gamma V(s') - V(s)\right)\), where \(s'\) is the next state, \(r(s')\) is the reward in the next state, \(\gamma\) is an optional discount factor and \(\alpha\) is the learning rate.
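As a concrete illustration, the sketch below implements this tabular TD(0) update on a small chain of states. The chain, reward placement, and parameter values are hypothetical choices for the example, not taken from any specific experiment.

```python
# Minimal sketch of the tabular TD(0) update described above.
# State names, rewards, and parameter values are illustrative only.
alpha = 0.1   # learning rate
gamma = 0.9   # discount factor

def td_update(V, s, s_next, r_next):
    """One TD(0) update: V(s) += alpha * (r(s') + gamma*V(s') - V(s))."""
    rpe = r_next + gamma * V[s_next] - V[s]   # reward prediction error (RPE)
    V[s] += alpha * rpe
    return rpe

# Example: a three-state chain s0 -> s1 -> s2 with reward delivered at s2.
V = {"s0": 0.0, "s1": 0.0, "s2": 0.0}
for episode in range(200):
    td_update(V, "s0", "s1", r_next=0.0)
    td_update(V, "s1", "s2", r_next=1.0)
# After training, V["s1"] approaches 1.0 and V["s0"] approaches gamma * V["s1"].
```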
The term in the brackets on the right-hand side of the equation is called the RPE. It is the difference between the value estimated at the current state and the target formed by the actual reward at the next state plus the estimated discounted value at the next state. If the RPE is zero for every state, the value function no longer changes, and learning reaches a stable state. In experiments that linked RPE to the firing patterns of dopaminergic neurons in the VTA, a transient conditioned stimulus (CS) is presented to a naïve animal followed by a delayed reward (also called the unconditioned stimulus, or US; Fig. 1a). It was found that VTA neurons initially respond at the time of reward, but once the association between stimulus and reward is learned, dopaminergic neurons stop firing at the time of the reward and start firing at the time of the stimulus (Fig. 1b). This response pattern is what one would expect from TD learning if VTA neurons represent RPE\(^{5}\).
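This canonical shift of the RPE from the US to the CS can be reproduced with a minimal simulation of the trace conditioning task, assuming one state per time step between cue and reward (the complete serial compound of Fig. 1c). The trial length, learning rate, and discount factor below are arbitrary illustrative values.

```python
# Hedged sketch: trace conditioning simulated with tabular TD, using one
# state per time step between CS and US (a complete serial compound).
# Trial structure and parameters are illustrative, not from the paper.
import numpy as np

n_steps = 10                 # CS at step 0; US (reward = 1) delivered entering step n_steps
alpha, gamma = 0.2, 0.95
V = np.zeros(n_steps + 1)    # V[0..n_steps-1]: microstates; V[n_steps]: post-reward state

for trial in range(300):
    rpe = np.zeros(n_steps + 1)
    # RPE at CS onset: the CS is assumed unpredicted, so the pre-CS value stays 0.
    rpe[0] = 0.0 + gamma * V[0] - 0.0
    for t in range(n_steps):
        r_next = 1.0 if t == n_steps - 1 else 0.0    # reward arrives with the US
        rpe[t + 1] = r_next + gamma * V[t + 1] - V[t]
        V[t] += alpha * rpe[t + 1]
    if trial in (0, 299):
        print(f"trial {trial:3d}: RPE at CS = {rpe[0]:.2f}, RPE at US = {rpe[n_steps]:.2f}")

# Early in training the RPE is concentrated at the US; after learning it
# appears at the CS, matching the dopamine response pattern in Fig. 1b.
```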

Fig. 1: a Diagram of a simple trace conditioning task. A conditioned stimulus (CS) such as a visual grating is paired, after a delay ΔT, with an unconditioned stimulus (US) such as a water reward. b According to the canonical view, dopaminergic (DA) neurons in the ventral tegmental area (VTA) respond only to the US before training, and only to the CS after training. c To represent the delay period, temporal difference (TD) models generally assume neural “microstates” which span the time between cue and reward. In the simplest case of the complete serial compound (left), the microstimuli do not overlap, and each one uniquely represents a different interval. In general, though (e.g., microstimuli, right), these microstates can overlap with each other and decay over time. d A weighted sum of these microstates determines the learned value function \(V\).
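To make panel d concrete, the following is a minimal sketch of TD learning with a fixed temporal basis, in which the value function is a weighted sum of overlapping, decaying microstimuli triggered by the CS. The Gaussian-bump basis and all parameters are illustrative assumptions, not the specific basis used in any particular model.

```python
# Hedged sketch of Fig. 1c,d: value as a weighted sum of overlapping,
# decaying "microstimulus" basis functions; all choices are illustrative.
import numpy as np

n_steps, n_basis = 20, 10
alpha, gamma = 0.05, 0.98
t = np.arange(n_steps)
centers = np.linspace(0, n_steps - 1, n_basis)
decay = np.exp(-t / n_steps)                       # microstimuli shrink with elapsed time
X = decay[:, None] * np.exp(-0.5 * ((t[:, None] - centers[None, :]) / 1.5) ** 2)
# X[t, i] = activation of microstimulus i at time t after CS onset

w = np.zeros(n_basis)                              # learned weights on the basis functions
reward = np.zeros(n_steps + 1)
reward[n_steps] = 1.0                              # US delivered after the delay

for trial in range(500):
    for step in range(n_steps):
        v_now = X[step] @ w                        # V(t) = weighted sum of microstates
        v_next = X[step + 1] @ w if step + 1 < n_steps else 0.0
        rpe = reward[step + 1] + gamma * v_next - v_now
        w += alpha * rpe * X[step]                 # TD update applied to the weights

V = X @ w                                          # learned value ramps toward the US
```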