TD learning with a fixed feature-specific temporal basis
The original TD learning algorithms assumed that agents can be in a set of discrete labeled states (\(s\)) that are stored in memory. The goal of TD is to learn a value function such that each state becomes associated with a unique value (\(V(s)\)) that estimates future discounted rewards. Learning is driven by the difference between values at two successive states, and hence such algorithms are called temporal difference algorithms. Mathematically, this is captured by the update rule \(V(s) \leftarrow V(s) + \alpha\left(r(s') + \gamma V(s') - V(s)\right)\), where \(s'\) is the next state, \(r(s')\) is the reward in the next state, \(\gamma\) is an optional discount factor and \(\alpha\) is the learning rate.
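As a concrete illustration, the sketch below implements this tabular TD(0) update on a small chain of states. The chain, reward placement, and parameter values are hypothetical choices for the example, not taken from any specific experiment.

```python
# Minimal sketch of the tabular TD(0) update described above.
# State names, rewards, and parameter values are illustrative only.
alpha = 0.1   # learning rate
gamma = 0.9   # discount factor

def td_update(V, s, s_next, r_next):
    """One TD(0) update: V(s) += alpha * (r(s') + gamma*V(s') - V(s))."""
    rpe = r_next + gamma * V[s_next] - V[s]   # reward prediction error (RPE)
    V[s] += alpha * rpe
    return rpe

# Example: a three-state chain s0 -> s1 -> s2 with reward delivered at s2.
V = {"s0": 0.0, "s1": 0.0, "s2": 0.0}
for episode in range(200):
    td_update(V, "s0", "s1", r_next=0.0)
    td_update(V, "s1", "s2", r_next=1.0)
# After training, V["s1"] approaches 1.0 and V["s0"] approaches gamma * V["s1"].
```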
The term in the brackets on the right-hand side of the equation is called the RPE. It is the difference between the value estimated at the current state and the target formed by the actual reward at the next state plus the estimated discounted value at the next state. If the RPE is zero for every state, the value function no longer changes, and learning reaches a stable state. In experiments that linked RPE to the firing patterns of dopaminergic neurons in the VTA, a transient conditioned stimulus (CS) is presented to a naïve animal followed by a delayed reward (also called the unconditioned stimulus, or US; Fig. 1a). It was found that VTA neurons initially respond at the time of reward, but once the association between stimulus and reward is learned, dopaminergic neurons stop firing at the time of the reward and start firing at the time of the stimulus (Fig. 1b). This response pattern is what one would expect from TD learning if VTA neurons represent RPE\(^{5}\).
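This canonical shift of the RPE from the US to the CS can be reproduced with a minimal simulation of the trace conditioning task, assuming one state per time step between cue and reward (the complete serial compound of Fig. 1c). The trial length, learning rate, and discount factor below are arbitrary illustrative values.

```python
# Hedged sketch: trace conditioning simulated with tabular TD, using one
# state per time step between CS and US (a complete serial compound).
# Trial structure and parameters are illustrative, not from the paper.
import numpy as np

n_steps = 10                 # CS at step 0; US (reward = 1) delivered entering step n_steps
alpha, gamma = 0.2, 0.95
V = np.zeros(n_steps + 1)    # V[0..n_steps-1]: microstates; V[n_steps]: post-reward state

for trial in range(300):
    rpe = np.zeros(n_steps + 1)
    # RPE at CS onset: the CS is assumed unpredicted, so the pre-CS value stays 0.
    rpe[0] = 0.0 + gamma * V[0] - 0.0
    for t in range(n_steps):
        r_next = 1.0 if t == n_steps - 1 else 0.0    # reward arrives with the US
        rpe[t + 1] = r_next + gamma * V[t + 1] - V[t]
        V[t] += alpha * rpe[t + 1]
    if trial in (0, 299):
        print(f"trial {trial:3d}: RPE at CS = {rpe[0]:.2f}, RPE at US = {rpe[n_steps]:.2f}")

# Early in training the RPE is concentrated at the US; after learning it
# appears at the CS, matching the dopamine response pattern in Fig. 1b.
```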

Fig. 1: a Diagram of a simple trace conditioning task. A conditioned stimulus (CS) such as a visual grating is paired, after a delay ΔT, with an unconditioned stimulus (US) such as a water reward. b According to the canonical view, dopaminergic (DA) neurons in the ventral tegmental area (VTA) respond only to the US before training, and only to the CS after training. c To represent the delay period, temporal difference (TD) models generally assume neural “microstates” which span the time between cue and reward. In the simplest case of the complete serial compound (left), the microstimuli do not overlap, and each one uniquely represents a different interval. In general, though (e.g., microstimuli, right), these microstates can overlap with each other and decay over time. d A weighted sum of these microstates determines the learned value function \(V\).
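To make panel d concrete, the following is a minimal sketch of TD learning with a fixed temporal basis, in which the value function is a weighted sum of overlapping, decaying microstimuli triggered by the CS. The Gaussian-bump basis and all parameters are illustrative assumptions, not the specific basis used in any particular model.

```python
# Hedged sketch of Fig. 1c,d: value as a weighted sum of overlapping,
# decaying "microstimulus" basis functions; all choices are illustrative.
import numpy as np

n_steps, n_basis = 20, 10
alpha, gamma = 0.05, 0.98
t = np.arange(n_steps)
centers = np.linspace(0, n_steps - 1, n_basis)
decay = np.exp(-t / n_steps)                       # microstimuli shrink with elapsed time
X = decay[:, None] * np.exp(-0.5 * ((t[:, None] - centers[None, :]) / 1.5) ** 2)
# X[t, i] = activation of microstimulus i at time t after CS onset

w = np.zeros(n_basis)                              # learned weights on the basis functions
reward = np.zeros(n_steps + 1)
reward[n_steps] = 1.0                              # US delivered after the delay

for trial in range(500):
    for step in range(n_steps):
        v_now = X[step] @ w                        # V(t) = weighted sum of microstates
        v_next = X[step + 1] @ w if step + 1 < n_steps else 0.0
        rpe = reward[step + 1] + gamma * v_next - v_now
        w += alpha * rpe * X[step]                 # TD update applied to the weights

V = X @ w                                          # learned value ramps toward the US
```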