Duration between rewards controls the rate of behavioral and dopaminergic learning

Animals

All experiments and procedures were performed in accordance with guidelines from the National Institutes of Health (NIH) Guide for the Care and Use of Laboratory Animals and approved by the UCSF Institutional Animal Care and Use Committee. In total, 101 adult (>11 weeks at time of experiments; median: 13 weeks) wild-type male and female C57BL/6J mice (JAX; RRID: IMSR_JAX:000664) were used across 13 experimental groups: 30-s ITI (n = 6; 3 F/3 M), 60-s ITI (n = 19; 13 behavior-only: 7 F/6 M and 6 dopamine + behavior: 4 F/2 M), 300-s ITI (n = 6; 3 F/3 M), 600-s ITI (n = 19; 12 behavior-only: 5 F/7 M and 7 dopamine + behavior: 5 F/2 M), 3,600-s ITI (n = 5; all dopamine + behavior: 3 F/3 M), 60-s ITI-few trials (n = 18; 12 behavior-only: 6 F/6 M and 6 dopamine + behavior: 3 F/3 M), 60-s ITI-few trials with context extinction (n = 6; 3 F/3 M), 60-s ITI with CS− (n = 6; 3 F/3 M), 600-s ITI with background milk (n = 6; 3 F/3 M), 60-s ITI-50% (n = 10; 2 behavior-only: 0 F/2 M and 8 dopamine + behavior: 4 F/4 M), 60-s ITI-10% (n = 10; 5 F/5 M, all dopamine + behavior), 45-s ISI (n = 8; 4 F/4 M, all dopamine + behavior) and 135-s ISI (n = 7; 3 F/4 M, all dopamine + behavior). One 60-s ITI mouse implanted with an optic fiber was excluded from all dopamine analyses due to a missed fiber placement (Extended Data Fig. 3). One 60-s ITI mouse that was implanted with an optic fiber failed to learn the cue–reward association (Extended Data Fig. 2c) and was excluded from dopamine analyses comparing behavioral and dopaminergic learning (Fig. 2). Two 60-s ITI-50% mice implanted with dopamine fibers failed to learn the cue–reward association (Fig. 7b and Extended Data Fig. 10f) and were excluded from comparisons against 60-s ITI mice and omission dip calculations (Figs. 7c and 8d–f and Extended Data Fig. 10).
While small sample sizes preclude a rigorous analysis of sex differences across all conditions tested, no significant difference in ‘trials to learn’ was found between females and males in either the 60-s or 600-s ITI group, the two conditions best powered to detect differences. Thus, sexes were pooled for all analyses.

All cue–reward conditioned mice were head-fixed during conditioning and underwent surgery before behavior experiments either to implant a custom head ring for head fixation (behavior-only) or to inject viral vector and implant an optic fiber and head ring (dopamine + behavior; see ‘Surgery’). All cue–shock conditioned mice were freely moving during experiments but underwent surgery to inject viral vector and implant optic fibers. Mice were >7.5 weeks old at time of surgery (median, 9 weeks). Following surgery, mice were given at least a week to recover before beginning water deprivation. Mice implanted with optic fibers did not begin experiments until >3.5 weeks following surgery to allow time for virus expression. During water deprivation (cue–reward conditioned mice only), mice were given ad libitum access to food but were water deprived to ~85–90% of pre-deprivation body weight and maintained in that weight range throughout experiments through daily adjustments to water allotment. Mice were weighed and monitored daily for the duration of deprivation. After surgery, mice were randomly assigned to experimental groups and those with only a head ring implant were group housed in cages containing mice from multiple experimental groups, while fiber-implanted mice were single housed. Mice were housed on a reverse 12-h light–dark cycle with lights off from 8:00 to 20:00, and all behavior was run during the dark cycle. The mouse holding room was maintained at ~23–24 °C with 40–50% humidity.

Surgery

Surgery was performed under aseptic conditions. Mice were anesthetized with isoflurane (5% induction, 1–2% throughout surgery) and placed in a stereotaxic device (Kopf Instruments) and kept warm with a heating pad. Before incision, mice were administered carprofen (5 mg per kg body weight, subcutaneously (s.c.)) for pain relief, saline (0.3 ml, s.c.) to prevent dehydration and local lidocaine (1 mg per kg body weight, s.c.) to the scalp for local anesthesia. All mice were implanted with a custom-designed head ring (5-mm inner diameter, 11-mm outer diameter, 3-mm height) on the skull for head fixation. The ring was secured to the skull with dental acrylic supported by screws. Following surgery, mice were given buprenorphine (0.1 mg per kg body weight, s.c.) for pain relief.

To measure dopamine release in a subset of mice, 500 nl of an adeno-associated viral (AAV) vector encoding the dopamine sensor dLight1.3b (AAVDJ-CAG-dLight1.3b, 3.9 × 1013 genome copies (GCs) per ml diluted in sterile saline to a final titer of 3.9 × 1012 GCs per ml or AAV5-CAG-dLight1.3b, 2.4 × 1013 GCs per ml diluted to 4–5 × 1012 GCs per ml) was injected unilaterally into the nucleus accumbens core (from bregma: AP, 1.3; ML, ±1.4; DV, −4.55), in either the right or the left hemisphere, counterbalanced across groups. Viral vectors were injected through a small glass pipette with a Nanoject III (Drummond Scientific) at a rate of 1 nl s−1. The injection pipette was kept in place for 5–10 min to allow diffusion, then slowly retracted to prevent backflow up the injection tract. Following injection, an optic fiber (NA 0.66, 400 μm, Doric Lenses) was implanted 200–350 μm above the virus injection site. Following fiber implant, the head ring was secured to skull as above. After experiments, fiber-implanted mice were transcardially perfused, and brains were fixed in 4% paraformaldehyde. Brains were sectioned at 50 µm and imaged on a Keyence microscope to verify fiber placement. Histology images presented (Extended Data Fig. 3) represent composites imaged with a ×10 objective. Stitched images of full brain slices were then cropped to focus on fiber placements in the nucleus accumbens (Extended Data Fig. 3b) or dorsal striatum (Extended Data Fig. 3c).

Cue–reward conditioning

For experiments in Figs. 1–4, all animals were conditioned with an identical trial structure (see below), differing only in ITI and in the number of trial presentations, which was adjusted to keep total conditioning time roughly equal between groups (~1 h). For Figs. 1 and 2, 60-s ITI mice were run for 50 trials a day with a variable ITI with a mean of 60 s (uniformly distributed from 48 s to 72 s). The 600-s ITI mice were run for 6 trials a day with a variable ITI with a mean of 600 s (uniformly distributed from 480 s to 720 s). For Fig. 3, 30-s ITI mice were run for 100 trials a day with a variable ITI with a mean of 30 s (uniformly distributed from 24 s to 36 s). The 300-s ITI mice were run for 11 trials a day with a variable ITI with a mean of 300 s (uniformly distributed from 240 s to 360 s). For Fig. 4, 3,600-s ITI mice were run for 2 trials a day with a fixed ITI of 3,600 s and, unlike other groups, the session lasted 2 h.

For experiments in Fig. 6, additional groups of mice were conditioned with parameters matching those of 60-s and/or 600-s ITI groups to control for the influence of factors that varied along with ITI (IRI) manipulations. The 60-s ITI-few mice were run for 6 trials a day (same as 600-s ITI mice) with a mean ITI of 60 s (uniformly distributed from 48 s to 72 s; same as 60-s ITI) to control for the difference in total trial experiences per day between 60-s ITI and 600-s ITI mice. Unlike other groups, sessions lasted ~6.5 min. The 60-s ITI-few mice with context extinction were conditioned similarly to 60-s ITI-few mice but remained in the experimental context for ~55 min following the end of conditioning trials, matching 600-s ITI group’s time in context and number of cue–reward experiences, while the rate of rewards during trials matched the 60-s ITI group. The 60-s ITI with CS− mice were conditioned similarly to 600-s ITI mice, for 6 (CS+) trials a day with a variable (CS+) ITI with a mean of 600 s (uniformly distributed from 480 s to 720 s); however, during the interval between CS+ trials, distractor CS− cues (0.25-s, 3-kHz constant tone, delivered through a piezo speaker: https://www.adafruit.com/product/1739) were presented. CS− cues were not followed by reward delivery and were delivered on a variable interval (exponentially distributed) with a mean of 60 s to approximate the rate of cue delivery in 60-s ITI mice. All mice could hear and respond to the CS− cue as evidenced by some generalized licking to the CS− during conditioning (Extended Data Fig. 9c,d). 
The 600-s ITI with background chocolate milk mice were conditioned similarly to 600-s ITI mice but, during the interval between cue–sucrose trials (mean of 600 s, uniformly distributed from 480 s to 720 s), mice received two uncued deliveries of chocolate milk (Nesquik Low Fat Chocolate Milk) separated from the previous sucrose or chocolate milk delivery by a variable interval with a mean of 180 s (uniformly distributed from 144 s to 216 s) to test whether cue–sucrose learning rate is affected by the general or identity-specific rate of rewards. Volume of chocolate milk was calibrated to match that of sucrose reward delivery (2–3 µl). Mice readily consumed chocolate milk rewards upon delivery (Extended Data Fig. 9h).

For experiments in Fig. 7, 60-s ITI-50% mice were conditioned identically to 60-s ITI mice (variable ITI with a mean of 60 s, uniformly distributed from 48 s to 72 s), except rewards were delivered with 50% reward probability for 50 trials with ~25 rewards a day to disambiguate the (CS+) ICI from the IRI. Reducing the reward probability by 50% led to a doubling of the IRI to ~120 s on average across a session while maintaining the ITI and ICI. The 60-s ITI-50% mice were conditioned for 12 days. The 60-s ITI-10% mice were conditioned similarly, but with a 10% probability of reward, increasing the IRI tenfold relative to 60-s ITI mice, similarly to 600-s ITI mice. The 60 s ITI-10% mice were conditioned for 32 days.

Trials (CS+) consisted of a 0.25-s, 12-kHz constant tone played through a piezo speaker (https://www.adafruit.com/product/1740), followed by a 1-s delay (trace period), after which sucrose-sweetened water (2–3 µl; 15% wt/vol) was delivered through a gravity-fed solenoid to a lick spout in front of the mouse, controlled by custom MATLAB and Arduino scripts64. After each outcome, there was a fixed 3-s period to allow reward consumption. The lick spout was positioned close to the animals such that they could sense, but were not touched by, delivery of the reward. Licks were detected through a complete-the-circuit design and recorded in MATLAB. Occasionally, certain mice made long unbroken contacts with the spout (as measured by lick off–lick on time, due to grabbing the spout with their hands or not breaking contact with their tongues), occluding our ability to measure multiple licks during the period of contact. This was not corrected for, as it generally happened following reward delivery or during the ITI and thus did not affect measurements of cue-evoked licks, our main variable of interest.

Mice were not habituated to the head-fixation apparatus or sucrose delivery before conditioning to minimize uncued reward exposure, which, we hypothesize, could affect retrospective contingency calculations during initial cue–reward learning. For the majority of mice, the first trial was their first experience of liquid sucrose reward. An initial subset of behavior-only 600-s ITI mice (n = 6) ran with a fixed ITI of 600 s and was given a single uncued reward delivery before conditioning on day 1. No gross difference in learning compared to subsequent groups was detected, and data were pooled. For all other groups on day 1, mice were placed in the head-fixation apparatus and conditioning commenced. Because a minority of animals from each condition did not initially consume sucrose at time of reward delivery, for all analyses, ‘trial 1’ was defined as the first trial in which a mouse licked to consume sucrose within 5 s of reward delivery. This design choice did not affect our main conclusions as analyzing ‘trials to learn’ in 30-s–3,600-s ITI mice without dropping any initial trials from analysis shifted the mean learned trial by <1 trial. To appropriately count omission trials, this was not done for partial reinforcement experiments, and trial 1 started with the first trial the animal was presented with regardless of licking behavior.

Mice were run for at least 8 days of conditioning, and trial analyses included the first 800 (30-s ITI), 400 (60-s ITI), 80 (300-s ITI), 40 (600-s ITI, 60-s ITI-few, 60-s ITI-few with context extinction, 60-s ITI with CS−, 600-s ITI with background chocolate milk) or 7–8 (3,600-s ITI) trials. For 60-s ITI-50% and 60-s ITI-10% mice, trial analyses included the first 600 and 1,600 trials, respectively. For omission trial-specific analyses (60-s ITI-50%), all omission trials occurring within the first 600 trials were analyzed (~300).

Cue–shock conditioning

For cue–shock conditioning (Extended Data Fig. 5), two groups of freely moving mice were conditioned with an identical trial structure (15-s cue, 2-s trace period, 1-s shock) but differed in ISI and in the number of trials a day. Compared to cue–reward conditioned mice, a longer cue period was necessary to measure freezing during the cue. The 45-s ISI mice received 13 trials a day with a mean ISI of 45.33 s (ITI of 27.33 s), while 135-s ISI mice received 5 trials a day with a mean ISI of 136 s (ITI of 118 s). ISI values for each group varied within ±20% of the mean. Each conditioning session began with 300 s before the onset of trial 1 for both groups; this time is included in the analysis of total time to learn. The groups were matched so that each spent the same amount of conditioning time in the chamber during each session, leading to the different trial numbers. Three sessions of conditioning were conducted for each group on three consecutive days. Before the first conditioning day, mice were handled for 2 days and, on the third day, were acclimated to the photometry cables and habituated in the recording chambers for 20 min.

Conditioning took place in Med-Associates chambers with electric shock grid floors controlled by MED-PC. The cue was a 5-kHz tone (80 dB) and the shock was a scrambled electric shock (0.3 mA) delivered through the floor grid. Both the intensity of tone and shock were measured each day before recording. Each conditioning session was done in the same context (purple light, shock grid bottom, vanilla scent). Top-down videos of the chambers were recorded in each session for movement and freezing analysis.

Fiber photometry

For cue–reward conditioning, fluorescent dLight signals were collected using either a Doric Fiber Photometry Console or pyPhotometry65 system. For both systems, light from 470-nm (~40 µW) and 405-nm (~25 µW) LEDs integrated into a fluorescence filter minicube (Doric Lenses) was passed through a low-autofluorescence patchcord (400 µm, 0.57 NA, Doric Lenses) to the mouse. Emission light was collected through the same patchcord, bandpass filtered through the minicube and measured with a single integrated detector. For the Doric system, excitation LED output was sinusoidally modulated by a Doric Fiber Photometry Console running Neuroscience Studio (v5.4 or v6.4) at 530 Hz (470 nm) and 209 Hz (405 nm). The console demodulated the incoming detector signal producing separate emission signals for 470 nm of excitation (dopamine) and 405 nm of excitation (dopamine-insensitive isosbestic control). Signals were sampled at 12 kHz and subsequently downsampled to 120 Hz following low-pass filtering at 12 Hz. For the pyPhotometry system, 405-nm and 470-nm excitation LEDs were modulated in time, rather than frequency, with separate brief 0.75-ms pulses used to separate isosbestic and signal channels. Data were sampled at 130 Hz and low-pass filtered at 12 Hz to match data from the Doric system. Due to a software error during file saving in the Doric system, the final trial was not recorded on two occasions (one 60-s ITI, one 600-s ITI) and was excluded from analysis. This error occurred either well before (60-s ITI) or well after (600-s ITI) the emergence of learning and thus had minimal effect on the resulting analysis. For one 3,600-s ITI animal, a pyPhotometry system crash during the ITI between trials 1 and 2 on day 5 resulted in ~15 min of photometry data loss during the ITI but did not affect analysis focused on cue and reward delivery epoch. 
A transistor–transistor logic pulse signaling behavior session start and stop was recorded by the photometry software to sync and align photometry and behavior data across hardware.
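
As a rough sketch of the low-pass-and-downsample step described above (pure NumPy; the windowed-sinc filter design and the synthetic trace are illustrative assumptions, not the Doric or pyPhotometry implementation):

```python
import numpy as np

def lowpass_downsample(raw, fs_in=12_000, fs_out=120, cutoff=12.0, ntaps=6001):
    """Windowed-sinc low-pass at `cutoff` Hz, then decimate fs_in -> fs_out."""
    n = np.arange(ntaps) - (ntaps - 1) / 2
    h = np.sinc(2 * cutoff / fs_in * n) * np.hanning(ntaps)
    h /= h.sum()                              # unity gain at DC
    return np.convolve(raw, h, mode="same")[::fs_in // fs_out]

# synthetic 1-s trace: 2-Hz "signal" plus 60-Hz contamination
t = np.arange(0, 1, 1 / 12_000)
raw = np.sin(2 * np.pi * 2 * t) + 0.5 * np.sin(2 * np.pi * 60 * t)
down = lowpass_downsample(raw)   # 120 samples; 60-Hz component removed
```

The 2-Hz component survives essentially unchanged while the 60-Hz component falls well into the stopband.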

For cue–shock conditioning, fluorescent dLight signals were collected using an RWD R821 Tricolor MultiChannel Fiber Photometry System running OFRS software (version 2.0.0.33169). Excitation 470-nm (dopamine, ~40 µW) and 410-nm (isosbestic, ~15 µW) channels were separated through modulation in time. Each signal was turned on and off sequentially at an overall sampling rate of 60 Hz for an effective sampling rate of 30 Hz for the combined signals. Emission signals were filtered through dichroic filters in the system and detected with a CMOS camera. Transistor–transistor logic pulses sent from MED-PC to the photometry system during cue and shock synchronized the photometry signals with behavior.

Analysis

Cue–reward behavior

The behavioral measure of learning here was licking in response to the cue before reward delivery. As mice learn the cue–reward association, cue presentation elicits anticipatory licking behavior toward the reward spout. To measure the cue-evoked change in licking behavior over baseline, the number of licks in the 1.25-s baseline period before cue onset was subtracted from the number of licks in the 1.25-s period from cue onset to reward delivery to calculate the change in licking behavior to the cue (cue-evoked licks). When this number was converted to a rate, it was reported as ‘Δ lick rate to cue’. To binarize cue-evoked licking behavior, we also measured the proportion of mice in each group that made more than one cue-evoked lick on each trial across conditioning (Extended Data Figs. 1 and 2). To visualize average trial licking behavior for each session in example animal plots (Figs. 1d, 2c, 3b, 4b, 7h and 8c and Extended Data Figs. 1a, 4a and 10i,j) or reward delivery aligned group averages (Extended Data Figs. 8f and 9h), lick PSTHs were generated by binning licks into 0.1-s bins, converting to a rate, and averaging across trials. The resulting average lick rate trace was smoothed with a Gaussian filter (sigma = 0.75) to aid visualization.
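
In code terms, the cue-evoked lick measure reduces to a windowed count difference. A minimal Python sketch (function names and the example lick times are hypothetical, not the paper's MATLAB code):

```python
import numpy as np

def cue_evoked_licks(lick_times, cue_time, window=1.25):
    """Licks in the cue-to-reward window minus licks in the pre-cue baseline."""
    lick_times = np.asarray(lick_times)
    baseline = np.sum((lick_times >= cue_time - window) & (lick_times < cue_time))
    evoked = np.sum((lick_times >= cue_time) & (lick_times < cue_time + window))
    return int(evoked - baseline)

def delta_lick_rate(lick_times, cue_time, window=1.25):
    """Cue-evoked lick count converted to Hz ('Δ lick rate to cue')."""
    return cue_evoked_licks(lick_times, cue_time, window) / window

# hypothetical trial: cue at t = 10 s, one baseline lick, four anticipatory licks
licks = [9.0, 10.3, 10.6, 10.9, 11.1]
print(cue_evoked_licks(licks, 10.0))   # → 3
```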

To calculate the trial at which animals show evidence of learning, we first took the cumsum of the cue-evoked licks23,35,66,67,68. Then drawing a diagonal from beginning to the end of the cumsum curve, we calculated the first trial that occurred within 75% of the maximum distance from the curve to the diagonal, which corresponded to the trial after which cue-evoked licking behavior emerged (Extended Data Fig. 2a–c). This trial was designated the ‘learned trial’. Occasionally after learning, cue licks taper off. If at the calculated learned trial the diagonal line was underneath the cumsum curve, which means that the mouse’s lick behavior was decreasing at that point rather than increasing, we iteratively reran the algorithm by drawing the diagonal from the beginning to the point on the cumsum curve corresponding to the previously calculated trial until at the new calculated trial the diagonal was above the cumsum curve (corresponding to the trial in which lick behavior begins to increase). Note that we use the first trial within 75% of maximum distance rather than the overall maximum distance (which would be the largest inflection point in the curve) to account for variability in post-learning behavior that occasionally caused the maximum distance from the diagonal to be at a point after a mouse has consistently licked to the cue for many trials; however, this choice did not affect the main conclusion of the analysis in Fig. 1 that 600-s ITI mice learn in ten times fewer trials than 60-s ITI mice (Extended Data Fig. 2d). Mice that did not show a > 0.5-Hz average increase in lick rate to cue for at least two sessions were classified as non-learners (Fig. 7b and Extended Data Figs. 2c, 6c and 10f) and were not considered in comparisons of learned trials (Figs. 1g, 3e and 7c and Extended Data Figs. 9b and 10c–e). 
Due to the lower average lick rates in 60-s ITI-10% animals, compared to all other groups tested, we did not segregate this group into learners and non-learners. Learned trial analyses were run on all animals in this group because despite the lower lick rates, all animals had positive slopes in the cumsum curves of licking behavior demonstrating consistent cue-evoked licking across trials (Extended Data Fig. 10g,h). To determine the ‘rewards to learn’ for animals conditioned with partial reinforcement (Fig. 7 and Extended Data Fig. 10), the trials to learn were calculated, and then the number of rewards delivered before the learned trial were counted for each animal. To measure the steepness of individual animal learning curves, we calculated the abruptness of change at the learned trial as the distance from the cumsum curve to the diagonal described above. This distance was calculated in normalized units where the top of the diagonal was set to equal 1 (Extended Data Fig. 2h). Cumsum data are occasionally displayed divided by the number of trials (yielding a y axis that corresponds to average response across all prior conditioning trials) to better compare across groups that experienced different numbers of trials.
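
A simplified Python sketch of the cumsum-and-diagonal procedure (the iterative re-run for post-learning taper and the non-learner screen are omitted; variable names are ours):

```python
import numpy as np

def learned_trial(cue_evoked_licks, frac=0.75):
    """First trial within `frac` of the maximum distance from the cumsum
    curve to the start-to-end diagonal (simplified; no iterative re-run)."""
    c = np.cumsum(np.asarray(cue_evoked_licks, dtype=float))
    trials = np.arange(1, len(c) + 1)
    diagonal = c[-1] * trials / len(c)   # line from origin to curve end
    dist = diagonal - c                  # positive where curve sags below diagonal
    return int(trials[np.argmax(dist >= frac * dist.max())])

# toy data: no cue licking for 10 trials, then 3 licks/trial after learning
licks = [0] * 10 + [3] * 10
print(learned_trial(licks))   # → 8
```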

To quantify the relationship between learning rate and IRI (Fig. 3g), the mean trials to learn for 30-s, 60-s, 300-s and 600-s ITI groups were plotted against the IRI (mean ITI + 4.25 s (1.25 s trial period + 3 s consummatory period)) on a log–log plot, and a linear least-squares regression was used to determine the best-fit line yielding the equation: log(trials_to_learn) = (−1.0593)log(IRI) + 3.8753. The slope and intercept determined here were used to calculate the predicted trials or rewards to learn for 3,600-s ITI (3,604.25-s IRI; Fig. 4e,f), 600-s ITI with background milk with a general IRI (604.25 s) or an identity-specific IRI (204.25 s; Extended Data Fig. 9f) and 60-s ITI-50% as predicted by the ICI (64.25 s) or IRI (128.5 s; Extended Data Fig. 10a).
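
A sketch of this fit and the downstream prediction (assuming base-10 logarithms; the group means below are placeholder values for illustration, while the slope and intercept used in the prediction step are those reported above):

```python
import numpy as np

# IRI = mean ITI + 4.25 s (1.25-s trial period + 3-s consummatory period)
iri = np.array([30.0, 60.0, 300.0, 600.0]) + 4.25
trials_to_learn = np.array([250.0, 120.0, 26.0, 13.0])   # placeholder group means

slope, intercept = np.polyfit(np.log10(iri), np.log10(trials_to_learn), 1)

# prediction using the paper's reported fit: log(trials) = (-1.0593)log(IRI) + 3.8753
predict = lambda iri_s: 10 ** (-1.0593 * np.log10(iri_s) + 3.8753)
print(float(predict(3604.25)))   # predicted trials to learn at the 3,600-s ITI
```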

To determine the total conditioning time until learning (Figs. 1h, 3h and 5b), the cumulative duration of all conditioning time (ITI + trial periods) from conditioning start up to, but not including, the trial period following the calculated ‘learned trial’ was summed for each individual animal.

For analysis of ITI lick rates (Extended Data Fig. 8g–k), lick rate was calculated from the period beginning with either the start of the session or the end of the prior consumption bout (consumption bout defined as the period from the first lick following reward delivery through all licks in which the interval between consecutive ‘lick off’ to ‘lick on’ was ≤500 ms) and ending with the onset of the following cue. ITI lick rate before learned trial was calculated as the median of the ITI lick rate for every ITI preceding the animal’s learned trial. Animals that did not show evidence of learning were excluded from this analysis. One additional 600-s ITI mouse with many long (>10-s) contacts with the spout during the ITI across days (presumed to be due to holding the spout) was also excluded from this analysis.

Cue–shock behavior

Our main behavioral measure for cue–shock conditioning was freezing during the shock-predictive cue. To analyze motion and freezing to the cue, top-down videos (640 × 480, ~30 fps) of each conditioning session were analyzed using ezTrack69. Empty-chamber calibration videos were used to determine a motion threshold noise cutoff of 11.5. Motion was calculated as the number of pixels with frame-to-frame grayscale value changes exceeding the motion threshold. Freezing was defined as at least 10 consecutive frames with motion below 500. To analyze freezing to the cue, the percentage of frames coded as ‘freezing’ from cue onset to offset was determined. We subtracted the baseline of the percentage of frames coded as freezing during a baseline time period equivalent to cue duration (15 s) immediately preceding cue onset. Motion to the cue was defined as the average motion from cue onset to offset and was similarly baseline subtracted. To determine the trial at which animals learned the cue–shock association, the same algorithm used to determine the learned trial in cue–reward conditioning was used on the cumsum of the freezing to cue, and this trial was used to calculate total conditioning time until learning, similarly to cue–reward conditioned mice (Extended Data Fig. 5b–d). Similarly to cue–reward conditioning, average cumsum curves are plotted on trial units scaled by the ratio of ISI to display curves as a function of conditioning time (Extended Data Fig. 5f,g).
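
The freezing criterion above (≥10 consecutive sub-threshold frames) can be sketched as follows (a simplified stand-in for the ezTrack pipeline; the pre-cue baseline subtraction is left out and the function name is ours):

```python
import numpy as np

def freezing_to_cue(motion, cue_on, cue_off, freeze_thresh=500, min_frames=10):
    """Percentage of cue frames spent freezing; a frame counts as freezing when
    it lies in a run of >= min_frames frames with motion below freeze_thresh."""
    below = np.append(np.asarray(motion) < freeze_thresh, False)
    freezing = np.zeros(len(motion), dtype=bool)
    start = None
    for i, b in enumerate(below):
        if b and start is None:
            start = i                       # run of sub-threshold frames begins
        elif not b and start is not None:
            if i - start >= min_frames:     # run long enough to count as freezing
                freezing[start:i] = True
            start = None
    return 100.0 * freezing[cue_on:cue_off].mean()

# toy session: 5 moving frames, 20 still frames, 5 moving frames
motion = [1000] * 5 + [100] * 20 + [1000] * 5
print(freezing_to_cue(motion, 0, 30))   # → 66.67% of frames frozen
```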

Dopamine

To analyze the signals, a session-wide dF/F was calculated by applying a least-squares linear fit to the 405-nm signal to scale and align it to the 470-nm signal. The resulting fitted 405-nm signal was then used to normalize the 470-nm signal. Thus, dF/F is defined as dF/F = (470-nm signal − fitted 405-nm signal)/fitted 405-nm signal, expressed as a percentage70. Cue-evoked dopamine was measured as the area under the curve (AUC) of the dopamine signal for 0.5 s following cue onset minus the AUC of the baseline period 0.5 s directly preceding cue onset. Reward-evoked dopamine was measured as the AUC 0.5 s following the first detected lick after reward delivery minus the AUC of the pre-cue baseline period described above. If the onset and offset of a detected lick spanned reward delivery time, the reward AUC was calculated from time of reward delivery. For quantifying dopamine dips in response to omitted rewards (Fig. 8d–f), the AUC of a 2-s baseline window was subtracted from the AUC of a 2-s window beginning 1.25 s following cue onset (time of reward delivery in rewarded trials). A longer duration window was used to measure dips to account for the slower kinetics and broader shape of dips relative to cue responses. All dopamine responses reported in main figures are AUC measurements, but peak measurements are also plotted as a comparison point (Extended Data Figs. 4 and 6). To measure cue and reward peak dopamine responses, the mean dopamine signal during the baseline period was subtracted from the maximum value of the dopamine signal during the cue and reward windows described above for AUC measurements. Similarly to AUC measurements, peak responses were also normalized to the mean of the maximum three reward responses in each animal. To facilitate comparisons across mice with differing levels of virus expression, cue and reward dopamine measurements per mouse were normalized to the average of the three maximum reward responses in that mouse. 
For omission responses, dopamine measurements were normalized to the individual animal average of the three maximum reward responses recorded in a 2-s window following the first lick after reward delivery, to match the window used for dip measurements. All presented dopamine values represent these individual maximum-reward-normalized measurements, aside from example mice in which dopamine is plotted as %dF/F and cumsum plots in Figs. 2h and 7i and Extended Data Fig. 4c in which data are normalized to the maximum value of each animal's cumsum curve as described below. Maximum rather than initial reward responses were chosen because the reward response initially increased across early conditioning trials, reaching its maximum after different numbers of trials in different conditions (Extended Data Figs. 4g,h and 6m,p).
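
The dF/F normalization and the baseline-subtracted AUC measure can be sketched as follows (a minimal version with our own function names; windows match those described above):

```python
import numpy as np

def dff(sig470, sig405):
    """dF/F (%): least-squares fit of the 405-nm channel to the 470-nm channel,
    then (470 - fitted 405) / fitted 405."""
    slope, intercept = np.polyfit(sig405, sig470, 1)
    fitted = slope * np.asarray(sig405) + intercept
    return 100.0 * (np.asarray(sig470) - fitted) / fitted

def evoked_auc(trace, fs, event_idx, win=0.5):
    """AUC in `win` s after an event minus an equal-length pre-event baseline."""
    n = int(win * fs)
    post = trace[event_idx:event_idx + n].sum() / fs
    pre = trace[event_idx - n:event_idx].sum() / fs
    return post - pre

# toy trace at 120 Hz: flat baseline, brief response after the event at index 120
trace = np.zeros(240)
trace[120:180] = 1.0
print(evoked_auc(trace, fs=120, event_idx=120))   # → 0.5
```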

To calculate the trial at which dopamine responses to the cue develop (dopamine learned trial), we took the cumsum of the normalized cue dopamine response described above. A diagonal was drawn from trial 1 through the point on the cumsum curve at 1.5 times the behavior learned trial to account for decreasing cue responses with extended training71. The same algorithm described above to determine the behavior learned trial was run on the cue dopamine curve. The lag between dopamine and behavioral learned trial (the number of trials between the development of dopamine responses to the cue and the emergence of behavioral learning) was defined as the behavior learned trial minus the dopamine learned trial (Fig. 2f). Omission dip learned trials (Fig. 8e,f) were calculated using the same algorithm on omission dip responses to detect the negative-going inflection point across all omission trials. One outlier 60-s ITI-10% mouse was excluded from the analysis due to consistent negative dopamine dips to the cue precluding our ability to detect the point at which the cue-evoked increase emerges (Extended Data Fig. 10j).

For cue–shock conditioned mice, the AUC during the last 14 s of the cue response was used as the main measure of cue-driven dopamine response (Extended Data Fig. 5j–l). The first second of the cue response was not included due to the presence of cue onset responses that were variable across animals and present on the first trial of conditioning. The dopamine learned trial was calculated for these animals using the same algorithm as the one used for cue–reward responses, but it was used to detect the negative-going inflection point in the cumsum curves due to the cue response evolving a dip during conditioning (Extended Data Fig. 5i–l). To average trial PSTHs across animals for cue–shock conditioned mice (Extended Data Fig. 5i), each animal’s PSTH was divided by the average of the three maximum peak values from trial onset through 2 s following shock termination. Two cue–shock conditioned mice (one 45-s ISI and one 135-s ISI) were excluded from dopamine analyses due to the absence of a consistent dip during the cue throughout conditioning (Extended Data Fig. 5l).

For one 60-s ITI dLight animal, during an initial conditioning session, a software crash caused the loss of lick data for 50 trials experienced by the animal. An additional 13 trials were presented to the animal that day and recorded following the crash. Photometry data were recorded for all 63 trials. Because the crash occurred before the emergence of learning and cue-evoked licking behavior (as confirmed by both online observation by experimenter before crash and a −0.14-Hz average cue-evoked change in lick rate for the 13 trials recorded after crash), the 50 trials in which data were lost were coded as 0 cue-evoked licks. All 63 trials the animal experienced were included in analyses.

To visualize the average relationship between dopamine responses and licking behavior across learning in 60-s and 600-s ITI mice with variability in individual learning rates, signals were aligned to the behavior learned trial and plotted through 250 or 25 trials after learning (Fig. 2h and Extended Data Fig. 4c). For aligned cumsum plots, data were normalized by the value from trial 400 (60-s ITI) or trial 40 (600-s ITI).

To quantify the relationship between dopaminergic learning rate and IRI (Extended Data Fig. 6e), the mean dopamine trials to learn for the 60-s and 600-s ITI groups were plotted against the IRI (mean ITI + 4.25 s (1.25-s trial period + 3-s consummatory period)) on a log–log plot, and the line between the two means was calculated as was done for behavior, yielding the equation: log(trials_to_learn_dopamine) = (−1.0359)log(IRI) + 3.4338. The slope and intercept determined here were used to calculate the predicted trials or rewards to learn dopamine for 3,600-s ITI (3,604.25-s IRI; Extended Data Fig. 6i) and 60-s ITI-50% as predicted by the ICI (64.25 s) or IRI (128.5 s; Extended Data Fig. 10b).

Theory and simulations

ANCCR: intuitive derivation of scaling of retrospective learning rate

We previously proposed a new learning model named ANCCR based on the learning of retrospective associations35. ANCCR operates by identifying cues that consistently precede meaningful events such as rewards. Thus, it learns whether a cue consistently precedes a reward (that is, a retrospective cue–reward association). This retrospective association provides a means to estimate whether the reward consistently follows the cue (that is, the prospective cue–reward association). The core principle of ANCCR is that a cue–reward association is learned and cached as a retrospective predecessor representation (denoted by M←cr), and then converted to a prospective successor representation (denoted by M→cr) using a Bayes’ rule-like normalization: \({M}_{\to {\rm{cr}}}={M}_{\leftarrow {\rm{cr}}}\frac{{M}_{\leftarrow r-}}{{M}_{\leftarrow c-}}\). M←r- is proportional to the baseline rate of reward in the environment, and M←c- is proportional to the baseline rate of the cue. Here, we provide a quick intuitive derivation of the scaling of learning rate by the IRI. Please note that all the above variables are assumed to be conditioned on the experimental context; thus, they should strictly be written as M→cr|context, M←cr|context, M←c-|context and M←r-|context. Because listing these conditional dependencies is notationally cumbersome, we omit them in our treatment.

Each term on the right-hand side of the Bayes’ normalization (that is, M←cr, M←r- and M←c-) is updated using a delta-rule-like update in ANCCR. Specifically, in equation (1):

$${M}_{\leftarrow {\rm{cr}}}\equiv (1-\alpha ){M}_{\leftarrow {\rm{cr}}}+\alpha {E}_{c};\quad \text{updates at reward times}$$

(1)

where Ec is the eligibility trace of the cue, α is the learning rate for the retrospective update and \(\equiv\) denotes an update; and equation (2):

$${M}_{\leftarrow x-}\equiv \left(1-{\alpha }_{0}\right){M}_{\leftarrow x-}+{\alpha }_{0}{E}_{x};\quad \text{updates every }dt$$

(2)

where x is any event type (for example, cue or reward) and Ex is its eligibility trace. As can be seen, M←cr updates at the time of every reward with a learning rate of α, and M←x- updates every dt (that is, continually) with a learning rate of α0. Both update rules determine a corresponding timescale of history for each variable (M←cr or M←x-), defined as the timescale over which a past event exerts influence on the current value of M←cr or M←x-. The timescale over which one presentation of the cue influences future values of \({M}_{\leftarrow {\rm{cr}}}\), that is, its timescale of history, depends on both α and how frequently the reward occurs. On the other hand, the corresponding timescale for M←x- depends on α0 and dt.
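A minimal sketch of these two updates in code (variable names are ours, not those of the published implementation):

```python
# Sketch of the delta-rule updates in equations (1) and (2); variable names
# are illustrative, not those of the published ANCCR code.
def update_predecessor(M_cr: float, E_c: float, alpha: float) -> float:
    """Equation (1): update M<-cr at a reward time with learning rate alpha."""
    return (1 - alpha) * M_cr + alpha * E_c

def update_baseline(M_x: float, E_x: float, alpha0: float) -> float:
    """Equation (2): update a baseline term M<-x- once per time step dt."""
    return (1 - alpha0) * M_x + alpha0 * E_x

# Repeated updates drive the estimate toward the eligibility trace value,
# with a timescale of history set by the learning rate.
M = 0.0
for _ in range(10_000):
    M = update_predecessor(M, E_c=1.0, alpha=0.01)
```

Each update moves the stored association a fraction α (or α0) of the way toward the current eligibility trace, so a larger rate implies a shorter memory of past events.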

For the Bayes’ rule-like normalization to work in a (possibly) nonstationary environment, all terms on the right-hand side (that is, M←cr, M←r- and M←c-) should be calculated over the same timescale of history. This is because normalizing a predecessor representation calculated over 1 h (say) by baseline rates of reward and cue calculated over 1 min (say) is obviously inappropriate if the environment has the potential to change during that hour. Thus, the quantitative relationship between learning rate and IRI can be obtained by setting the timescales of history for M←cr, M←r- and M←c- to be equal to each other. As shown in equations (1) and (2), the baseline rates of reward and cue are updated continually (that is, every time step dt) by a delta rule with a baseline learning rate α0. Assuming that the time constant of decay of the eligibility trace is very short, a single occurrence of x will have a net influence on M←x- of \({\left(1-{\alpha }_{0}\right)}^{n}\) after n time steps, an exponentially decaying function of time. Equating this influence with an exponential time decay of \({e}^{-\frac{t}{\tau }}\), one can calculate the time constant of decay as \(\tau =\frac{-{dt}}{\mathrm{ln}\left(1-{\alpha }_{0}\right)}\). Thus, the net timescale of history for the calculation of the baseline rate of events (cues or rewards) is \(\frac{-{dt}}{\mathrm{ln}\left(1-{\alpha }_{0}\right)}\), where the numerator is the time interval between consecutive delta-rule updates with learning rate α0. The timescale of history for the predecessor representation M←cr has a similar expression, with the numerator being the time interval between consecutive updates (which equals the IRI, because updates occur only at reward times) and the learning rate in the denominator being α, the learning rate of the associative update. Thus, the timescale of history for the predecessor representation is \(\frac{-{\rm{IRI}}}{{\rm{ln}}(1-\alpha )}\).
Setting both these timescales of history to be equal, one can show that the learning rate of associative update should be \(\alpha =1-{(1-{\alpha }_{0})}^{\frac{{\rm{IRI}}}{{\rm{dt}}}}\). For small learning rates, this expression simplifies to \(\alpha ={\alpha }_{0}\frac{{\rm{IRI}}}{{\rm{dt}}}\).
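This equality can be verified numerically; the check below is our own illustration, with α0 and dt taking the values used later in the ANCCR simulations:

```python
import math

# Verify that alpha = 1 - (1 - alpha0)**(IRI/dt) equates the timescale of
# history of the predecessor representation with that of the baseline rates.
alpha0, dt = 4e-5, 0.2  # illustrative values (see 'ANCCR simulations')

def history_timescale(update_interval: float, rate: float) -> float:
    """tau = -interval / ln(1 - rate) for a delta rule updating every interval."""
    return -update_interval / math.log(1 - rate)

tau_baseline = history_timescale(dt, alpha0)

for iri in (64.25, 604.25, 3604.25):
    alpha = 1 - (1 - alpha0) ** (iri / dt)
    # the two timescales agree; for small rates, alpha ~ alpha0 * IRI / dt
    assert abs(history_timescale(iri, alpha) - tau_baseline) < 1e-6 * tau_baseline
```

The agreement is exact by construction, since ln(1 − α) = (IRI/dt) · ln(1 − α0) under the scaling rule; the loop simply confirms the algebra at several IRIs.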

For a more formal derivation that accounts for the time constant of eligibility trace72, see Supplementary Note 1.

Comparison of learning models

To determine whether a model of associative learning could capture the experimentally observed proportional scaling across ITI conditions, we simulated three likely candidates, all of which account in some way for the time between cue–reward trial experiences: the microstimulus implementation of TDRL41, Wagner’s SOP42,43 and ANCCR35. For each model, we simulated the experimental conditions from the 30-s through the 3,600-s ITI and tested combinations of parameters to determine which could best replicate the quantitative, proportional scaling of learning rate by IRI observed experimentally. The measure against which specific model instances were compared was the ‘trials to learn’ for each of the 30-s, 60-s, 300-s, 600-s and 3,600-s ITI groups. Simulations of each model were based on published versions35,41,43 with the adjustments to ANCCR described above and below. To generate behavior, all models assumed that behavior emerged once the association quantity crossed a threshold; this quantity corresponded to cue ‘value’ in both TDRL and SOP and to net contingency (NCcue→reward) in ANCCR. Thus, ‘learned trial’ is defined in TDRL and SOP as the first trial when value > threshold, and in ANCCR as the first trial where net contingency > threshold. While we recognize that action selection would likely involve other processes, this threshold crossing was implemented to ensure that the models generated ‘learned trials’ through comparable operations. For each combination of parameters for all three models, all five ITI conditions (30 s–3,600 s) were simulated for the number of trials that experimental animals experienced over 8 days of conditioning. Each case was iterated 20 times.

To determine the parameter combination from each model that best fit the experimental data, we calculated the residual sum of squares (RSS) of the trials to learn from each simulated parameter combination against the experimental trials to learn for each IRI (Fig. 3g). RSS was calculated on log-transformed data to account for the wide variation in trials to learn across IRIs. The simulation with the lowest RSS was deemed the best fit to the experimental results.

After determining the best-fit parameter combination for each model (TDRL, SOP and ANCCR), we measured the AIC as AIC = 2k + n × ln(meanRSS), where n is the sample size (number of animals), k is the number of parameters in the model and meanRSS is the mean RSS between the trials to learn for that parameter combination and the experimental data. A lower AIC value represents a better fit to the data after accounting for the number of parameters needed to fit. Note that all data presented in figures (Extended Data Fig. 7) and text and used to calculate model weights represent the AIC calculated without penalizing models for parameters, that is, with k = 0 substituted into the equation above, yielding AIC = n × ln(meanRSS). This was implemented as a conservative measure because only a few parameters from most models had the potential to affect the simulated results, and the best-fit model (ANCCR) is the one with the fewest parameters. For AIC values penalizing the total number of parameters per model, refer to Supplementary Table 1.

The best model among TDRL, SOP and ANCCR was then determined as the one with the minimum AIC. A relative weight for each model compared to the best model was then calculated as model weight = e^(−0.5 × (AIC − AICmin)). For additional simulation results presented in Extended Data Fig. 7, model weights were calculated using the best-fit AIC from the other two models presented in Fig. 5.
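The full comparison arithmetic (log-transformed RSS, the k = 0 AIC variant and relative weights) can be sketched as below; the trials-to-learn numbers are made up for illustration and are not the paper's values, and a single RSS stands in for meanRSS:

```python
import math
import numpy as np

def log_rss(sim_trials, exp_trials):
    """RSS between log-transformed simulated and experimental trials to learn."""
    return float(np.sum((np.log(sim_trials) - np.log(exp_trials)) ** 2))

def aic(mean_rss: float, n: int, k: int = 0) -> float:
    """AIC = 2k + n*ln(meanRSS); k = 0 gives the parameter-free variant."""
    return 2 * k + n * math.log(mean_rss)

def model_weights(aics: dict) -> dict:
    """Weight of each model relative to the minimum-AIC model."""
    best = min(aics.values())
    return {name: math.exp(-0.5 * (a - best)) for name, a in aics.items()}

# Hypothetical trials to learn per IRI condition (30 s ... 3,600 s)
exp_trials = [700, 270, 70, 36, 7]
sims = {"TDRL": [900, 400, 60, 20, 3],
        "SOP": [800, 300, 90, 30, 5],
        "ANCCR": [720, 260, 72, 35, 8]}
aics = {name: aic(log_rss(sim, exp_trials), n=55) for name, sim in sims.items()}
weights = model_weights(aics)  # the best-fitting model gets weight 1.0
```

Working on log-transformed trials to learn keeps the short-ITI conditions, with their hundreds of trials, from dominating the residuals.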

To determine whether the time to learn increased with increasing ITIs in each model, the time to learn for each ITI condition for each simulation was calculated by multiplying the number of trials and the number of ITIs experienced before the first trial following the learned trial by the trial duration and mean ITI duration, respectively, and summing those numbers (Fig. 5e,h,k and Extended Data Fig. 7d,f). To determine if the time to learn for each ITI condition increased with increasing ITI duration, a regression was fit to the time to learn for the 30-s through 600-s ITI groups and slopes were compared to a similar regression fit through the behavior data (Extended Data Fig. 7a).
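The time-to-learn computation can be sketched as follows (a simplified version of our own that assumes one ITI precedes every trial, so the counts of trials and ITIs before the first trial after the learned trial are equal; the original analysis may count these slightly differently):

```python
# Hypothetical helper mirroring the time-to-learn calculation described above.
# Assumption: one ITI precedes each trial, so both counts equal learned_trial.
def time_to_learn(learned_trial: int, trial_dur_s: float, mean_iti_s: float) -> float:
    """Total conditioning time before the first trial after the learned trial."""
    return learned_trial * trial_dur_s + learned_trial * mean_iti_s

# e.g. learning on trial 40 with 4.25-s trials and a 600-s mean ITI
t_600 = time_to_learn(40, trial_dur_s=4.25, mean_iti_s=600.0)
```

Because the ITI term dominates, a model in which trials to learn does not fall proportionally with ITI will show a time to learn that grows steeply with ITI duration, which is the slope comparison described above.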

TDRL simulations

TDRL assumes that animals assign value to each moment following an event (for example, cue) to predict future reward. Each event elicits multiple states, and the value of each time step can be expressed as a weighted sum of activated states at that moment. If the prediction from the previous moment is different from what is experienced in the current moment, the model updates the value of the previous moment based on this RPE, assumed to be signaled by dopamine. Depending on how the model represents a state, TDRL can be further divided into subtypes. Here, for a representative TDRL algorithm, we used the microstimulus model41 because it naturally accounts for the ITI. This model assumes that time states are Gaussian functions of increasing width following each event (cue or reward). The following model parameters were fixed for every iteration: bin size (dt) = 0.25 s (set to cue duration); decay parameter of eligibility trace (\(\lambda\)) = 0.99 (set to a high value to allow rapid credit assignment to earlier states); width of Gaussian function (\(\sigma\)) = 0.08. For the following parameters, we swept across a range to determine whether any combination could best explain proportional IRI scaling of learning rate: threshold for behavior generation = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]; decay parameter of event memory (\(d\)) = [0.8, 0.9, 0.99, 0.999, 0.9999]; temporal discounting factor (\(\gamma\)) = [0.8, 0.9, 0.99, 0.999, 0.9999]; number of states elicited by each event (\(m\)) = [3, 10, 100, 1,000]; learning rate (\(\alpha\)) = [0.001, 0.01, 0.1]. The best-fit-to-behavior parameter combination (Fig. 5c–e and Extended Data Fig. 7b) was: threshold = 0.3, \(\alpha\) = 0.1, \(\gamma\) = 0.99, \(m\) = 3 and \(d\) = 0.9.
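A minimal sketch of the microstimulus state representation under these settings follows; this is our own implementation of the general scheme (a decaying memory trace read out through Gaussian basis functions), not the paper's simulation code:

```python
import numpy as np

def microstimuli(t_steps: int, m: int, d: float, sigma: float = 0.08) -> np.ndarray:
    """Return a (t_steps, m) matrix of microstimulus activations after an event.

    The event leaves a memory trace y(t) = d**t that decays each time step;
    each of the m microstimuli is a Gaussian bump over the trace height y,
    so later portions of the interval are represented by broader, weaker states.
    """
    centers = np.linspace(1, 1 / m, m)   # evenly spaced bump centers over y
    y = d ** np.arange(t_steps)          # decaying memory trace, y(0) = 1
    return y[:, None] * np.exp(-((y[:, None] - centers[None, :]) ** 2)
                               / (2 * sigma ** 2))

X = microstimuli(t_steps=40, m=3, d=0.9)  # state features following one event
```

Value at each time step is then a weighted sum of the active features, with the weights updated by the TD error.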

For comparisons between models (Fig. 5 and Extended Data Fig. 7), simulations were run for the same number of trials as experimental groups (30 s: 800, 60 s: 400, 300 s: 88, 600 s: 48, 3,600 s: 16). Because value and RPE did not asymptote in all ITI conditions for the best-fit model when simulating the same number of trials as experimental groups, we again ran the simulation with the best-fit parameters for at least 400 trials per ITI group to determine the asymptotic levels of RPE and value. We also searched for the best-fit parameter combination (Extended Data Fig. 7c) when ITI conditions 60 s–3,600 s were run for at least 400 trials. The best-fit TDRL parameter combination when each ITI consisted of at least 400 trials per group was: threshold = 0.4, \(\alpha\) = 0.10, \(\gamma\) = 0.80, \(m\) = 3, \(d\) = 0.999. AIC and model weight comparisons for this model were run against the best-fit SOP and ANCCR models (from Fig. 5f–k and Extended Data Fig. 7e,g).

In principle, a similar rule derived in ANCCR could be applied ad hoc to any model of associative learning. Here, we demonstrate that applying such a rule to TDRL improves fit to experimental results. Using the best-fit model parameters determined during the initial TDRL parameter sweep described above (Fig. 5c–e and Extended Data Fig. 7b), we replaced the learning rate, α, based on the equation α = 1 – e (−k· IRI) and performed another parameter sweep to determine the best fit k. We searched over the range k = [0.00015, 0.0002, 0.00025, 0.0003, 0.00035, 0.0004] (the range matching experimentally observed learning rate) and found that the best fit to behavior results from k = 0.0003 (Extended Data Fig. 7d). Because value and RPE did not asymptote in all ITI conditions for the best-fit model when simulating the same number of trials as experimental groups, we again ran the simulation with the same parameters for 2,400 trials in total to determine the asymptotic cue-evoked RPE for this parameter combination.

SOP simulations

In SOP, cues or rewards evoke processing nodes consisting of many elements. These stimulus representations are dynamic: presentation of a stimulus moves a portion of elements from the inactive (I) state (and only that state) into the primary active state (A1). Elements then decay into the secondary active state (A2; a refractory state) and then decay back to the inactive state while the stimulus is absent. Elements transition between states according to individually specified probabilities. Cue–reward associations (value) are strengthened when cue elements in A1cue and reward elements in A1reward overlap in time, and weakened when cue elements in A1cue and reward elements in A2reward overlap in time. Following learning, cues evoke conditioned responding by directly activating reward elements to their A2 state. One way in which SOP has been hypothesized to explain the impact of the ITI on learning is that more time between trials allows more elements to decay to the inactive state (as opposed to the refractory A2 state), allowing a greater number of elements to transition to the A1 active state upon the next cue and reward presentation. We swept through parameter combinations to determine whether any set of parameters could capture the quantitative scaling observed in the experimental results.

The relevant parameters in the model controlling the transition probabilities from I → A1 → A2 → I are p1, pd1 and pd2, respectively. p1cs, pd1cs and pd2cs refer to the transition probabilities controlling the cue representation, while p1us, pd1us and pd2us refer to the transition probabilities controlling transitions between reward representation states. The following parameters were fixed for every iteration of SOP: time step (dt) = 0.25 s (set to cue duration); reward magnitude of CS in A1 (r1) = 1; reward magnitude of CS in A2 (r2) = 0.5; scale factor for magnitude of activation for coincidence of CS and unconditioned stimulus (US) in A1 (Lplus) = 0.2; scale factor for magnitude of inhibition for coincidence of CS in A1 and US in A2 (Lminus) = 0.1; and p1cs = 0.1 and p1us = 0.6, based on previously published work43. The following parameter combinations, reflecting the variables hypothesized to drive the trial spacing effect, were varied: threshold for behavior generation = [0.01, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]; pd1us = [0.01, 0.2, 0.25, 0.5, 0.75]; pd2us = [0.0001, 0.001, 0.01, 0.1]; pd1cs = [0.01, 0.2, 0.25, 0.5, 0.75]; pd2cs = [0.0001, 0.001, 0.01, 0.1]. Because SOP implementations assume that pd1 > pd2 (ref. 43; that is, the decay from the A1 active state to the A2 state should be faster than the decay from the A2 active state to the inactive state), we constrained our results to parameter combinations that satisfied this inequality. The parameter combination providing the best fit to behavior (Fig. 5f–h and Extended Data Fig. 7e) was: threshold = 0.1, pd1us = 0.25, pd2us = 0.1, pd1cs = 0.1 and pd2cs = 0.0001. Relaxing this constraint, the best-fit parameters were: threshold = 0.6, pd1us = 0.01, pd2us = 0.01, pd1cs = 0.1 and pd2cs = 0.0001 (Extended Data Fig. 7f).
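The element-state dynamics can be sketched as follows. This is a toy implementation of our own, not the published SOP code; transition masks are computed before any state changes so each element makes at most one transition per step:

```python
import numpy as np

rng = np.random.default_rng(0)

def sop_step(states: np.ndarray, present: bool,
             p1: float, pd1: float, pd2: float) -> np.ndarray:
    """One time step of SOP element dynamics, coding I=0, A1=1, A2=2.

    Decay A1 -> A2 with probability pd1 and A2 -> I with probability pd2;
    if the stimulus is present, recruit inactive elements into A1 with
    probability p1 (only the I state can be recruited, never A2).
    """
    u = rng.random(states.size)
    a1, a2, inactive = states == 1, states == 2, states == 0
    states[a1 & (u < pd1)] = 2
    states[a2 & (u < pd2)] = 0
    if present:
        states[inactive & (u < p1)] = 1
    return states

# e.g. a cue onset recruits roughly p1cs = 10% of inactive cue elements to A1
states = sop_step(np.zeros(1000, dtype=int), present=True,
                  p1=0.1, pd1=0.25, pd2=0.001)
```

Because A2 elements are skipped by recruitment, a short ITI leaves many elements stranded in the refractory A2 pool, shrinking the A1 response on the next trial; this is the trial spacing mechanism described above.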

ANCCR simulations

In ANCCR, we derived a scaling rule for the retrospective learning rate (α) and the eligibility trace time constant (T) from the core principle of Bayes’ rule conversion of a retrospective to a prospective association (Supplementary Note 1). For the simulations considered here, these rules simplify to \(\alpha =1-{(1-{\alpha }_{0})}^{\frac{{\rm{IRI}}}{{dt}}}\) and T = k × IRI. For ANCCR, the parameters swept to identify the best-fit model were: threshold = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]; α0 = [1 × 10−4, 8 × 10−5, 6 × 10−5, 4 × 10−5, 2 × 10−5, 1 × 10−5, 8 × 10−6]; k = [0.1, 0.3, 0.5, 0.7]. The best-fit parameters were: threshold = 0.4; α0 = 4 × 10−5; k = 0.5. The following parameters were fixed: w = 0.5; dt = 0.2 (same as in ref. 35). The dopamine response to the first reward was relatively high (although this response increases with repeated reward experience, consistent with our previous demonstration35). Two possibilities exist to account for this. One is that there is a Bayesian prior for M←rr and M←r-, and the other is that part of the innate meaningfulness of a reward is signaled by dopamine. For simplicity, we assumed the latter and added an innate meaningfulness of 1 to the dopamine reward response and 0 to the dopamine cue response.
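In code, the two scaling rules with the fixed and best-fit values above are simply (function names are ours):

```python
# The two ANCCR scaling rules with the fixed/best-fit values from the text
# (alpha0 = 4e-5, dt = 0.2 s, k = 0.5); an illustrative sketch only.
def anccr_learning_rate(iri_s: float, alpha0: float = 4e-5, dt: float = 0.2) -> float:
    """alpha = 1 - (1 - alpha0)**(IRI/dt)."""
    return 1 - (1 - alpha0) ** (iri_s / dt)

def eligibility_time_constant(iri_s: float, k: float = 0.5) -> float:
    """T = k * IRI."""
    return k * iri_s

alpha_60 = anccr_learning_rate(64.25)  # roughly 0.013 for the 60-s ITI group
```

Both the per-reward learning rate and the eligibility trace time constant thus grow with the IRI, which is what produces the proportional scaling of trials to learn.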

Statistics and reproducibility

No statistical test was used to predetermine sample sizes. Sample sizes were chosen based on n values in similar published studies. Blinding was not possible during data acquisition because experimenters had to use specific conditioning protocols based on grouping. Experimenters were not blind to groups during data analysis, but were blind to group identity during histology for fiber placement verification. Animals excluded from specific analyses are described above and noted in figure legends. Statistical analyses were performed in Python 3.12 using the scipy.stats (v1.16.2) or Pingouin73 (v0.5.5) packages. Welch’s t-test and Welch’s ANOVA were used to compare experimental groups, so as not to assume equal variances between the populations (Fig. 1g,j,k). To test for equality of variances, F-tests were run using a custom script. Nonparametric tests (Kruskal–Wallis H, Mann–Whitney U) were used to compare simulation results, due to the presence of conditions with 0 variance, and for learned trial comparisons with the 60-s-10% group, due to the skewed distribution of their data. For the eight experimental comparisons performed in Fig. 6, the false discovery rate was controlled using the Benjamini–Yekutieli method to adjust P values. For comparison of asymptotic dopamine levels (Fig. 4j) and comparison of regression slopes for time to learn (Extended Data Fig. 7a), a Benjamini–Hochberg procedure was used to adjust P values. All other multiple comparisons were corrected by adjusting P values with Bonferroni’s correction (Fig. 7j and Extended Data Figs. 7, 8b, 9b,f and 10a,b,k). All statistical tests were two tailed. N values reported represent individual animals or, in the case of simulations, the number of iterations. All linear regressions presented were fit with a least-squares method using the ‘scipy.stats.linregress’ function (Fig. 3g and Extended Data Figs. 6e, 7a and 8i–k). For sigmoid fits to cue and omission dopamine responses (Fig. 8d), ‘scipy.optimize.curve_fit’ was used to determine the parameters that best fit the data to the equation y = L / (1 + e^(−k × (x − x0))) + b. Full statistical test information is presented in Supplementary Table 1, including test statistics, n values, degrees of freedom and both corrected and uncorrected P values. Time courses of the cumsum or average of the lick and/or dopamine data are presented as the mean across animals/iterations ± s.e.m. Bar graphs are presented as the mean across animals/iterations ± s.e.m. with individual animal (or iteration) data points. In the box plots (Fig. 7j and Extended Data Fig. 10k), the line represents the median, box edges represent the IQR and whiskers extend to data within 1.5 times the IQR from the box. Results were considered significant at an alpha of 0.05. *P < 0.05, **P < 0.01, ***P < 0.001, ****P < 0.0001; NS (nonsignificant) denotes P > 0.05.
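The sigmoid fit can be reproduced on synthetic data as follows; the data here are generated for illustration and are not the paper's measurements:

```python
import numpy as np
from scipy.optimize import curve_fit

def sigmoid(x, L, k, x0, b):
    """y = L / (1 + exp(-k * (x - x0))) + b."""
    return L / (1 + np.exp(-k * (x - x0))) + b

# Synthetic 'dopamine response vs trial' data with known parameters
rng = np.random.default_rng(1)
x = np.linspace(0, 100, 60)
y = sigmoid(x, L=2.0, k=0.15, x0=40.0, b=0.1) + 0.02 * rng.normal(size=x.size)

# Initial guesses (p0) help curve_fit converge on sigmoidal data
popt, _ = curve_fit(sigmoid, x, y, p0=[1.0, 0.1, 50.0, 0.0])
L_fit, k_fit, x0_fit, b_fit = popt
```

The fitted x0 gives the trial at the half-maximum of the response, a convenient summary of when the sigmoidal transition occurs.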

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.


