The design of algorithms for multi-agent reinforcement learning (MARL) in imperfect-information games (scenarios such as poker, where players act in sequence and cannot see each other's private information) has historically relied on manual iteration: researchers identify weighting schemes, discount rules, and equilibrium solvers through intuition and trial and error. Researchers at Google DeepMind propose AlphaEvolve, an LLM-powered evolutionary coding agent that replaces this manual process with automated search.
The research team applies this framework to two established paradigms: Counterfactual Regret Minimization (CFR) and Policy Space Response Oracles (PSRO). In both cases, the system discovers new algorithm variants that compete with or outperform existing, hand-designed, state-of-the-art baselines. All experiments were performed using the OpenSpiel framework.
Background: CFR and PSRO
CFR is an iterative algorithm that decomposes regret minimization across information sets. At each iteration, it accumulates "counterfactual regrets", i.e., how much the player could have gained by playing differently, and derives a new policy proportional to the accumulated positive regrets. After many iterations, the time-averaged strategy converges to a Nash equilibrium (NE). Variants such as DCFR (Discounted CFR) and PCFR+ (Predictive CFR+) improve convergence by applying specific discounting or predictive update rules, all developed through manual design.
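A minimal sketch of this update at a single information set may help; the tabular structure and function names below are illustrative, not taken from the paper:

```python
import numpy as np

def regret_matching_policy(cumulative_regrets: np.ndarray) -> np.ndarray:
    """Derive the current policy proportional to accumulated positive regrets."""
    positive = np.maximum(cumulative_regrets, 0.0)
    total = positive.sum()
    if total > 0:
        return positive / total
    # If no action has positive regret, fall back to the uniform policy.
    return np.full_like(cumulative_regrets, 1.0 / len(cumulative_regrets))

def cfr_infoset_update(cumulative_regrets, avg_policy_sum, action_values, reach_weight):
    """One CFR-style update at an information set.

    action_values: counterfactual value of each action
    reach_weight:  weight used when accumulating the average policy
    """
    policy = regret_matching_policy(cumulative_regrets)
    node_value = policy @ action_values
    # Instantaneous counterfactual regret: gain from deviating to each action.
    instant_regret = action_values - node_value
    cumulative_regrets += instant_regret
    # The time-averaged strategy is what converges to a Nash equilibrium.
    avg_policy_sum += reach_weight * policy
    return cumulative_regrets, avg_policy_sum
```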
PSRO operates at a higher level of abstraction. It maintains a population of policies for each player, constructs a payoff tensor (the metagame) by computing the expected utility for every combination of population policies, and uses a meta-strategy solver to produce a probability distribution over each population. Best responses are trained against that distribution and iteratively added to the populations. The meta-strategy solver (how the population distribution is computed) is the central design choice that this paper targets for automated discovery. In all experiments, the authors use exact best-response oracles (computed by value iteration) and exact payoff values for every metagame entry, eliminating Monte Carlo sampling noise from the results.
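A high-level sketch of that loop is shown below; every callable is a placeholder injected by the caller, and none of these names come from the paper or from OpenSpiel:

```python
def psro(game, num_players, initial_policy, compute_exact_payoffs,
         meta_strategy_solver, exact_best_response, num_iterations):
    """Skeleton PSRO loop: grow per-player policy populations against a metagame."""
    populations = [[initial_policy(game, p)] for p in range(num_players)]
    meta_strategies = None
    for _ in range(num_iterations):
        # Metagame: exact expected payoff for every combination of population policies.
        payoff_tensor = compute_exact_payoffs(game, populations)
        # Meta-strategy solver: probability distribution over each player's population.
        meta_strategies = meta_strategy_solver(payoff_tensor)
        # Exact best response against the opponents' meta-strategies,
        # added to each player's population.
        for p in range(num_players):
            br = exact_best_response(game, p, populations, meta_strategies)
            populations[p].append(br)
    return populations, meta_strategies
```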
AlphaEvolve Framework
AlphaEvolve is a distributed evolutionary system that uses LLMs, rather than numerical parameter perturbations, to modify source code. The process: the population is initialized with standard implementations (CFR+ as the seed for the CFR experiment, Uniform as the seed for both PSRO solver classes). At each generation, a parent algorithm is selected based on fitness. Its source code is passed to an LLM (Gemini 2.5 Pro) with a prompt requesting modifications. The resulting candidates are evaluated on the proxy (training) games, and valid candidates are added to the population. AlphaEvolve supports multi-objective optimization: if multiple fitness metrics are defined, one is randomly selected per generation to guide parent sampling.
The fitness signal is the negative exploitability after K iterations, evaluated on a fixed set of training games: 3-player Kuhn Poker, 2-player Leduc Poker, 4-card Goofspiel, and 5-sided Liars Dice. The final evaluation is performed on a separate test set of larger games not seen during evolution.
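A schematic of the evolutionary loop described above might look like the following; the LLM call and the exploitability evaluator are stand-ins passed in by the caller, not AlphaEvolve's actual API:

```python
import random

def alphaevolve_loop(seed_source, training_games, llm_mutate, evaluate_fitness,
                     fitness_metrics, num_generations, k_iterations=1000):
    """Schematic LLM-driven evolution over algorithm source code (not the real system)."""
    population = [{"source": seed_source,
                   "fitness": evaluate_fitness(seed_source, training_games, k_iterations)}]
    for _ in range(num_generations):
        # Multi-objective: pick one fitness metric at random to guide parent sampling.
        metric = random.choice(fitness_metrics)
        parent = max(population, key=lambda cand: cand["fitness"][metric])
        # The LLM (Gemini 2.5 Pro in the paper) rewrites the parent's source code.
        child_source = llm_mutate(parent["source"])
        try:
            # Fitness: negative exploitability after K iterations on the training games.
            fitness = evaluate_fitness(child_source, training_games, k_iterations)
        except Exception:
            continue  # invalid candidates are discarded
        population.append({"source": child_source, "fitness": fitness})
    return max(population, key=lambda cand: cand["fitness"][fitness_metrics[0]])
```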
For CFR, the evolvable search space consists of three Python classes: RegretAccumulator, PolicyFromRegretAccumulator, and PolicyAccumulator. These control regret accumulation, current-policy derivation, and average-policy accumulation, respectively. This interface is expressive enough to represent all known CFR variants as special cases. For PSRO, the evolvable components are TrainMetaStrategySolver and EvalMetaStrategySolver, the meta-strategy solvers used during oracle training and exploitability evaluation, respectively.
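The evolvable interface might look roughly like the skeleton below; only the class names are reported in the paper, so the method names and signatures here are guesses:

```python
class RegretAccumulator:
    """Evolvable: how instantaneous counterfactual regrets are folded into cumulative regrets."""
    def update(self, cumulative_regrets, instant_regrets, iteration):
        raise NotImplementedError

class PolicyFromRegretAccumulator:
    """Evolvable: how the current policy is derived from cumulative regrets."""
    def policy(self, cumulative_regrets, iteration):
        raise NotImplementedError

class PolicyAccumulator:
    """Evolvable: how current policies are folded into the time-averaged strategy."""
    def update(self, avg_policy_sum, current_policy, reach_weight, iteration):
        raise NotImplementedError

# PSRO side: meta-strategy solvers used during oracle training vs. exploitability evaluation.
class TrainMetaStrategySolver:
    def solve(self, payoff_tensor, iteration):
        raise NotImplementedError

class EvalMetaStrategySolver:
    def solve(self, payoff_tensor, iteration):
        raise NotImplementedError
```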
Discovered algorithm 1: VAD-CFR
The evolved CFR variant is Volatility-Adaptive Discounted CFR (VAD-CFR). Rather than the linear averaging or static discounting used elsewhere in the CFR family, the search yielded three distinct mechanisms (an illustrative sketch follows the list):
- Volatility-adaptive discounting. Instead of fixed discount exponents α and β applied to cumulative regrets (as in DCFR), VAD-CFR uses an exponentially weighted moving average (EWMA) of instantaneous regret magnitudes to track the variability of the learning process. When volatility is high, the discount strengthens, so the algorithm forgets volatile history faster; lower volatility means more history is retained. The EWMA damping coefficient is 0.1, with base α = 1.5 and base β = −0.1.
- Asymmetric instantaneous boost. Positive instantaneous regrets are multiplied by 1.1 before being added to cumulative regrets. This asymmetry applies to instantaneous updates rather than accumulated history, making the algorithm more responsive to currently promising actions.
- Hard warm start with regret-weighted averaging. Policy averaging is deferred entirely until iteration 500, while regret accumulation continues normally during this phase. Once averaging begins, policies are weighted by a combination of temporal weight and instantaneous regret magnitude, favoring high-information iterations when building the average strategy. The 500-iteration threshold was generated by the LLM without knowledge of the 1000-iteration evaluation horizon.
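The sketch below illustrates the discounting and boost mechanisms using the reported constants (EWMA coefficient 0.1, base α = 1.5, base β = −0.1, boost factor 1.1). How volatility modulates the discount exponents is an assumption on our part, and the 500-iteration warm start lives in the policy accumulator, which is omitted; this is not the discovered source code.

```python
import numpy as np

class VADStyleRegretAccumulator:
    """Illustrative volatility-adaptive discounting (not the discovered code)."""
    EWMA_COEF = 0.1   # reported EWMA damping coefficient
    BASE_ALPHA = 1.5  # reported base alpha (positive-regret discount exponent)
    BASE_BETA = -0.1  # reported base beta (negative-regret discount exponent)
    BOOST = 1.1       # reported asymmetric boost on positive instantaneous regrets

    def __init__(self, num_actions: int):
        self.cumulative = np.zeros(num_actions)
        self.volatility = 0.0  # EWMA of instantaneous regret magnitude

    def update(self, instant_regret: np.ndarray, iteration: int) -> np.ndarray:
        # Track the volatility of the learning process.
        magnitude = np.abs(instant_regret).mean()
        self.volatility = (1 - self.EWMA_COEF) * self.volatility + self.EWMA_COEF * magnitude
        # ASSUMPTION: higher volatility strengthens the discount (faster forgetting).
        alpha = self.BASE_ALPHA / (1.0 + self.volatility)
        beta = self.BASE_BETA - self.volatility
        t = float(max(iteration, 1))
        pos_discount = t ** alpha / (t ** alpha + 1.0)   # DCFR-style positive discount
        neg_discount = t ** beta / (t ** beta + 1.0)     # DCFR-style negative discount
        discount = np.where(self.cumulative > 0, pos_discount, neg_discount)
        # Asymmetric boost: positive instantaneous regrets are amplified by 1.1.
        boosted = np.where(instant_regret > 0, self.BOOST * instant_regret, instant_regret)
        self.cumulative = discount * self.cumulative + boosted
        return self.cumulative
```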
VAD-CFR is benchmarked against standard CFR, CFR+, Linear CFR (LCFR), DCFR, PCFR+, DPCFR+, and HS-PCFR+(30) over K = 1000 iterations, with exploitability computed exactly. Across the full 11-game evaluation, VAD-CFR matches or outperforms the state-of-the-art baselines in 10 of 11 games; 4-player Kuhn Poker is the only exception.
Also discovered: AOD-CFR. Earlier runs with a different training set (2-player Kuhn Poker, 2-player Leduc Poker, 4-card Goofspiel, and 4-sided Liars Dice) produced a second variant, Asymmetric Optimistic Discounting CFR (AOD-CFR). It uses a linear schedule for discounting cumulative regrets (α transitions from 1.0 → 2.5 over 500 iterations, β from 0.5 → 0.0), sign-dependent scaling of instantaneous regrets, trend-based policy optimism via an exponential moving average of cumulative regrets, and polynomial policy averaging with an exponent γ scaling from 1.0 → 5.0. The research team reports competitive performance from more conventional mechanisms than those in VAD-CFR.
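AOD-CFR's linear schedules can be pictured as simple interpolations over the first 500 iterations; the sketch below covers only the schedules, with the sign-dependent scaling and optimism terms omitted:

```python
def linear_schedule(start: float, end: float, t: int, horizon: int = 500) -> float:
    """Interpolate linearly from start to end over the first `horizon` iterations."""
    frac = min(t / horizon, 1.0)
    return start + frac * (end - start)

def aod_cfr_params(t: int):
    """Schedules as described in the paper (functional details beyond the endpoints are assumed)."""
    alpha = linear_schedule(1.0, 2.5, t)  # discount exponent for positive cumulative regrets
    beta = linear_schedule(0.5, 0.0, t)   # discount exponent for negative cumulative regrets
    gamma = linear_schedule(1.0, 5.0, t)  # polynomial policy-averaging exponent
    return alpha, beta, gamma
```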
Discovered algorithm 2: SHOR-PSRO
The evolved PSRO variant is Smoothed Hybrid Optimistic Regret PSRO (SHOR-PSRO). The search produced a hybrid meta-solver that builds its meta-strategy by linearly blending two components at each iteration of an internal solver:
- σ_ORM (optimistic regret matching): provides regret-minimization stability. Payoff gains are computed, normalized, and diversity-adjusted, then used to update cumulative regrets via regret matching. A momentum term is applied to the payoff gains.
- σ_Softmax (smoothed best pure strategy): a Boltzmann distribution over pure strategies, biased toward high-payoff strategies. A temperature parameter controls the concentration; lower temperatures concentrate the distribution more sharply on the best pure strategy.
σ_hybrid = (1 − λ) · σ_ORM + λ · σ_Softmax
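A minimal sketch of the blending step, with the ORM component passed in as an input (its internals are summarized in the bullets above) and only the softmax component spelled out:

```python
import numpy as np

def softmax_over_payoffs(expected_payoffs: np.ndarray, temperature: float) -> np.ndarray:
    """Boltzmann distribution over pure strategies; lower temperature concentrates on the best one."""
    logits = expected_payoffs / max(temperature, 1e-8)
    logits -= logits.max()  # numerical stability
    weights = np.exp(logits)
    return weights / weights.sum()

def hybrid_meta_strategy(sigma_orm: np.ndarray, expected_payoffs: np.ndarray,
                         lam: float, temperature: float) -> np.ndarray:
    """sigma_hybrid = (1 - lambda) * sigma_ORM + lambda * sigma_Softmax."""
    sigma_softmax = softmax_over_payoffs(expected_payoffs, temperature)
    return (1.0 - lam) * sigma_orm + lam * sigma_softmax
```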
During training, the solver uses a dynamic annealing schedule over the outer PSRO iterations. The mixing coefficient λ anneals from 0.3 → 0.05 (transitioning from greedy exploitation to equilibrium discovery), the diversity bonus decays from 0.05 → 0.001 (enabling early population exploration and later stage refinement), and the softmax temperature decreases from 0.5 → 0.01. The number of iterations of the internal solver also varies depending on the population size. The training solver returns a time-averaged strategy across internal iterations to ensure stability.
During evaluation, the solver uses fixed parameters: λ = 0.01, diversity bonus = 0.0, temperature = 0.001. It performs more internal iterations (a base of 8000, scaled with population size), returns the last-iterate strategy rather than the average, and yields a sharper, less noisy exploitability estimate. This training/evaluation asymmetry was itself a product of the search rather than a human design choice.
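One way to express that asymmetry is as two solver configurations. The annealing endpoints and evaluation constants come from the paper; the linear interpolation and the training base iteration count are assumptions:

```python
from dataclasses import dataclass

@dataclass
class SHORConfig:
    lam: float             # weight lambda on the softmax component
    diversity_bonus: float
    temperature: float
    base_inner_iters: int  # scaled with population size in the paper
    return_average: bool   # time-averaged strategy (training) vs. last iterate (evaluation)

def training_config(progress: float) -> SHORConfig:
    """progress in [0, 1] over the outer PSRO iterations; annealing endpoints are from the paper."""
    def interp(start: float, end: float) -> float:
        return start + progress * (end - start)
    return SHORConfig(
        lam=interp(0.3, 0.05),
        diversity_bonus=interp(0.05, 0.001),
        temperature=interp(0.5, 0.01),
        base_inner_iters=1000,  # placeholder: the training base count is not reported
        return_average=True,
    )

# Evaluation uses fixed parameters and the last-iterate strategy.
EVAL_CONFIG = SHORConfig(lam=0.01, diversity_bonus=0.0, temperature=0.001,
                         base_inner_iters=8000, return_average=False)
```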
SHOR-PSRO is benchmarked against Uniform, Nash (via a linear program for two-player games), AlphaRank, Projected Replicator Dynamics (PRD), and Regret Matching (RM) using K = 100 PSRO iterations. Across the full 11-game evaluation, SHOR-PSRO matches or exceeds the state-of-the-art baselines in 8 of 11 games.
Experimental setup
The evaluation protocol separates training and test games to assess generalization. The training set for both the CFR and PSRO experiments consists of 3-player Kuhn Poker, 2-player Leduc Poker, 4-card Goofspiel, and 5-sided Liars Dice. The test set used in the main body of the paper consists of 4-player Kuhn Poker, 3-player Leduc Poker, 5-card Goofspiel, and 6-sided Liars Dice; these are larger, more complex variants not seen during evolution. Complete results for all 11 games are included in the appendix. The discovered algorithms are frozen after the training (discovery) phase; no modifications are made before test evaluation begins.
Key takeaways
- AlphaEvolve automates algorithm design: instead of tuning hyperparameters, it uses Gemini 2.5 Pro as a mutation operator to evolve the actual Python source code of MARL algorithms, discovering entirely new update rules rather than variations of existing ones.
- VAD-CFR replaces static discounting with volatility awareness: it tracks the magnitude of instantaneous regrets via an EWMA and dynamically adjusts its discount rates. It also defers policy averaging entirely until iteration 500, a threshold discovered without the LLM being told that the evaluation horizon was 1000 iterations.
- SHOR-PSRO automates the exploration-to-exploitation transition: by annealing the blending factor between optimistic regret matching and a softmax over the best pure strategies throughout training, it removes the need to manually tune when the PSRO meta-solver shifts from population diversity to equilibrium refinement.
- Generalization is tested, not assumed: both algorithms are developed on one set of four games and evaluated on a separate set of larger, unseen games. VAD-CFR holds up in 10 of 11 games and SHOR-PSRO in 8 of 11, with no re-tuning between training and testing.
- The discovered mechanisms are not intuitive by design: a hard warm start at 500 iterations, asymmetric boosting of positive regrets by exactly 1.1, and separate training/evaluation solver configurations are not the kinds of choices human researchers typically arrive at, which is the paper's central argument for automated search in this design space.

