Discovery of the reward function for embodied reinforcement learning agents



Preliminaries

Notations

Throughout the article, the following symbols are used. \({{{\mathcal{S}}}}\): state space. \({{{\mathcal{A}}}}\): action space. π: policy. θ: parameters of π. R: reward function. ω: parameters of R. Π: policy space. \({{{\mathcal{R}}}}\): reward function space. π*: optimal policy. R*: optimal reward function. \({\pi }_{R}^{*}\): learned optimal policy under reward function R. \({\pi }_{{R}^{*}}^{*}\): learned optimal policy under the optimal reward function R*, with \({\pi }_{{R}^{*}}^{*}={\pi }^{*}\). \({V}_{R}^{\pi }\): state value of policy π under reward function R. \({Q}_{R}^{\pi }\): state-action value of policy π under reward function R. \({A}_{R}^{\pi }\): state-action advantage of policy π under reward function R. \({G}_{R}^{\pi }\): discounted expected return of policy π under reward function R. ξ: trajectory, containing (s(0), a(0), r(0), …, s(t), a(t), r(t), …), where r(t) = R(s(t), a(t)). δ: a small positive constant. ∣ ⋅ ∣: absolute value of a real number.

The general RL problem without a designed reward function is the world model denoted as \({{{\mathcal{M}}}}=({{{\mathcal{S}}}},{{{\mathcal{A}}}},{\rho }_{0},T,\gamma )\). The initial state distribution is ρ0, which is the probability distribution over the initial states. The state transition function is \(T:{{{\mathcal{S}}}}\times {{{\mathcal{A}}}}\times {{{\mathcal{S}}}}\to [0,1]\), and \(T(s,a,{s}^{{\prime} })\) specifies the transition probability from state s to \({s}^{{\prime} }\) when the embodied RL agent chooses action a. The discount factor γ ∈ [0, 1) determines the relative importance of immediate reward signals compared to future reward signals. We denote the world model \({{{\mathcal{M}}}}\) with a designed reward function R as \({{{\mathcal{M}}}}[R]\), which is a Markov decision process (MDP). In an MDP, an agent observes state s, chooses action a based on policy π, receives a reward signal r from the reward function R, and optimizes its policy π to \({\pi }_{R}^{*}\) by maximizing \({G}_{R}^{\pi }\):

$${\pi }_{R}^{*}\in \arg {\max }_{\pi \in \Pi }{G}_{R}^{\pi }={{\mathbb{E}}}_{\xi \sim \pi }\left[{\sum }_{t=0}^{\infty }{\gamma }^{t}\cdot R({s}^{(t)},{a}^{(t)})\right],$$

(1)

where ξ ~ π indicates that the trajectory ξ is generated by an agent that follows policy π while interacting with the world model. Here, we assume that when the optimal reward function R* is used to guide the agent’s policy learning, we have \({\pi }_{{R}^{*}}^{*}={\pi }^{*}\).
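
To make the objective in equation (1) concrete, the following minimal Python sketch shows how \({G}_{R}^{\pi }\) can be estimated by Monte Carlo from trajectories sampled with π. The helper names (`discounted_return`, `estimate_return`) are illustrative and not part of the paper.

```python
import numpy as np

def discounted_return(rewards, gamma=0.99):
    """Discounted return sum_t gamma^t * r(t) of a single finite rollout,
    which truncates the infinite sum in equation (1)."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

def estimate_return(reward_sequences, gamma=0.99):
    """Monte Carlo estimate of G_R^pi: average the discounted return over
    several trajectories xi ~ pi, where each element of `reward_sequences`
    is the list of rewards r(t) = R(s(t), a(t)) along one trajectory."""
    return float(np.mean([discounted_return(rs, gamma) for rs in reward_sequences]))
```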

The optimal reward function problem

Before formally defining the optimal reward function, we first define proximality between two policies and between two reward functions.

Definition 1

(Proximal). For two policies π1 and π2, if the absolute difference between \({G}_{R}^{{\pi }_{1}}\) and \({G}_{R}^{{\pi }_{2}}\) is bounded by a small positive constant, i.e., \(\left\vert {G}_{R}^{{\pi }_{1}}-{G}_{R}^{{\pi }_{2}}\right\vert \le \delta\), then π1 and π2 are proximal with R, denoted as \({\pi }_{1}{\sim }_{R}{\pi }_{2}\). For two reward functions R1 and R2, if the absolute difference between \({G}_{{R}_{1}}^{\pi }\) and \({G}_{{R}_{2}}^{\pi }\) is bounded by a small positive constant for all policies π ∈ Π, i.e., \(\left\vert {G}_{{R}_{1}}^{\pi }-{G}_{{R}_{2}}^{\pi }\right\vert \le \delta,\forall \pi \in \Pi\), then R1 and R2 are proximal, denoted as R1 ~ R2.

Proposition 1

Given a reward function R, if R is proximal to the optimal reward function R*, then the learned policy \({\pi }_{R}^{*}\) is proximal to the optimal policy π* with both R and R*.

Proof

See Supplementary Theoretical Results. □

Definition 1 shows that when the absolute difference between \({G}_{R}^{\pi }\) and \({G}_{{R}^{*}}^{\pi }\) is bounded by a small constant for any policy π, the reward function R qualifies as an approximately optimal reward function. However, in many practical applications, we are only interested in the performance of the learned optimal policy, i.e., \({\pi }_{R}^{*}\), rather than the entire policy space. Therefore, we relax the optimality condition based on Proposition 1 and only require that the expected returns of the policies \({\pi }_{R}^{*}\) and π* be approximately equal under R*. That is, the optimal reward function R* can be obtained by minimizing the absolute difference between \({G}_{{R}^{*}}^{{\pi }_{R}^{*}}\) and \({G}_{{R}^{*}}^{{\pi }^{*}}\), which is termed the regret \({\mathfrak{L}}\) in this work:

$${\mathfrak{L}}({\pi }_{R}^{*},{\pi }^{*}| {R}^{*})=\left| {G}_{{R}^{*}}^{{\pi }^{*}}-{G}_{{R}^{*}}^{{\pi }_{R}^{*}}\right| .$$

(2)
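
As a small illustration only (not code from the paper), Definition 1 and the regret in equation (2) reduce to simple comparisons of scalar return estimates such as those computed above:

```python
def is_proximal(g1, g2, delta=1e-2):
    """Definition 1: two policies (or, for a fixed policy, two reward
    functions) are proximal if their expected returns differ by at most
    the small constant delta."""
    return abs(g1 - g2) <= delta

def regret(g_optimal, g_learned):
    """Equation (2): regret of the learned policy pi_R^* relative to pi^*,
    with both returns evaluated under the optimal reward function R^*."""
    return abs(g_optimal - g_learned)
```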

Definition 2

A reward function R is defined as the optimal reward function if it minimizes the expected regret \({\mathfrak{L}}({\pi }_{R}^{*},{\pi }^{*}| {R}^{*})\) over the possible reward space \({{{\mathcal{R}}}}\), where the policy \({\pi }_{R}^{*}\) is learned by an embodied RL agent interacting with the world model \({{{\mathcal{M}}}}[R]\). Formally, it is expressed as:

$${R}^{*}=\arg {\min }_{R\in {{{\mathcal{R}}}}}{{\mathbb{E}}}_{{\pi }_{R}^{*} \sim {{{\mathcal{M}}}}[R]}\left[{\mathfrak{L}}({\pi }_{R}^{*},{\pi }^{*}| {R}^{*})\right],$$

(3)

where \({{{\mathcal{R}}}}\) denotes the space of all possible reward functions and \({\pi }_{R}^{*} \sim {{{\mathcal{M}}}}[R]\) indicates that the policy \({\pi }_{R}^{*}\) is derived under the dynamics influenced by the reward function R.

The optimal reward function problem (ORFP) is defined in Definition 2. However, given an MDP \({{{\mathcal{M}}}}[R]\), a direct search for the optimal reward function via equation (3) remains challenging for the following two reasons:

  1. In practice, the target optimal reward function R* is unknown, which makes it impossible to directly evaluate and compare the regrets of different candidate reward functions.

  2. Determining the optimal policy π* for a given reward function R is not straightforward, rendering it challenging to compute the regret \({\mathfrak{L}}({\pi }_{R}^{*},{\pi }^{*}| {R}^{*})\) via equation (2).

To address the first challenge, an additional reward function \(\overline{R}\) is introduced to provide a unified evaluation of regret. The reward function \(\overline{R}\) can simply be defined to be both fixed and sparse. In this work, it is associated with the ultimate goal of the embodied RL agent and gives a reward only when the task is completed. We then measure the regret of different reward functions under \(\overline{R}\):

$${\mathfrak{L}}({\pi }_{R}^{*},{\pi }^{*}| \overline{R})=\left| {G}_{\overline{R}}^{{\pi }^{*}}-{G}_{\overline{R}}^{{\pi }_{R}^{*}}\right| .$$

(4)
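
A goal-based sparse evaluation reward of this kind can be sketched as follows; `task_completed` is a hypothetical, task-specific predicate supplied by the environment and is not defined in the paper.

```python
def r_bar(state, action, task_completed):
    """A fixed, sparse evaluation reward R_bar: a single positive signal
    when the agent's ultimate task goal is reached, and zero otherwise.
    The action is unused because the reward is purely goal based."""
    return 1.0 if task_completed(state) else 0.0
```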

Although many researchers have sought to approximate regret65, direct regret computation and minimization using equation (2) remain challenging. However, noting that the optimization objective of the optimal policy π* under the reward function \(\overline{R}\) is deterministic because \(\overline{R}\) is fixed66, the term \({G}_{\overline{R}}^{{\pi }^{*}}\) is a constant that does not depend on R. We therefore simplify the objective function in equation (3) and reformulate the discovery of the optimal reward function from the minimization of regret to the maximization of the objective value \({G}_{\overline{R}}^{{\pi }_{R}^{*}}\) by omitting \({G}_{\overline{R}}^{{\pi }^{*}}\):

$${R}^{*}= \arg {\min }_{R\in {{{\mathcal{R}}}}}{{\mathbb{E}}}_{{\pi }_{R}^{*} \sim {{{\mathcal{M}}}}[R]}\left| {G}_{\overline{R}}^{{\pi }^{*}}-{G}_{\overline{R}}^{{\pi }_{R}^{*}}\right|,\\= \arg {\max }_{R\in {{{\mathcal{R}}}}}{{\mathbb{E}}}_{{\pi }_{R}^{*} \sim {{{\mathcal{M}}}}[R]}\left[{G}_{\overline{R}}^{{\pi }_{R}^{*}}\right].$$

(5)

Intuitively, the definition of the optimal reward function in equation (5) holds for most RL problems.

The bilevel optimization framework

We next integrate the discovery of the optimal reward function in equation (5) into the RL process and cast it as a bilevel optimization framework. With ω and θ denoting the parameters of the reward function and the policy, respectively, the bilevel optimization framework is formulated as follows:

$${\omega }^{*}=\arg {\max }_{\omega \in \Omega }{{{{\mathcal{J}}}}}^{{{{\rm{upper}}}}}(\theta,\omega )\triangleq {{\mathbb{E}}}_{{\pi }_{{\theta }^{*}} \sim {{{\mathcal{M}}}}\left[{R}_{\omega }\right]}\left[{G}_{\overline{R}}^{{\pi }_{{\theta }^{*}}}\right],$$

(6)

$${\theta }^{*}=\arg {\max }_{\theta \in \Theta }{{{{\mathcal{J}}}}}^{{{{\rm{lower}}}}}(\theta,\omega )\triangleq {{\mathbb{E}}}_{s \sim {\rho }^{{\pi }_{\theta }},a \sim {\pi }_{\theta }(\cdot | s)}\left[{\sum }_{t=0}^{\infty }{\gamma }^{t}\cdot {R}_{\omega }({s}^{(t)},{a}^{(t)})\right],$$

(7)

where Ω and Θ denote the spaces of the possible reward function parameters and the policy parameters, respectively. \(s \sim {\rho }^{{\pi }_{\theta }}\) denotes that the state s is sampled from the distribution \({\rho }^{{\pi }_{\theta }}\), and a ~ πθ(⋅ ∣ s) indicates that action a is sampled from policy πθ given state s. To avoid notation bloat, we use \({\pi }_{{\theta }^{*}}\) below to denote the optimal policy \({\pi }_{R}^{*}\) of reward function R, even though R does not appear explicitly in the notation.

The solutions of equations (6)–(7) can be sought using a straightforward alternating optimization method that iteratively fixes either ω or θ and optimizes the other. Specifically, given the reward function Rω and policy πθ, we begin by fixing Rω and optimizing the policy from πθ to \({\pi }_{{\theta }^{{\prime} }}\) according to equation (7). This step is a standard process to which existing RL algorithms can be applied to solve the lower-level problem. Subsequently, we fix the policy and optimize the reward function from Rω to \({R}_{{\omega }^{{\prime} }}\) using equation (6). This alternating process continues until the optimal policy and optimal reward function are discovered.
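
The alternating scheme can be summarized with the following sketch. The callables `rollout`, `update_policy`, and `update_reward` are placeholders for sampling trajectories from \({{{\mathcal{M}}}}[{R}_{\omega }]\), the lower-level step of equation (10), and the upper-level step of equation (21) derived below; none of them are provided by the paper.

```python
def bilevel_optimize(rollout, update_policy, update_reward,
                     theta, omega, iterations=1000):
    """Alternating optimization of equations (6)-(7): fix R_omega and improve
    the policy, then fix the policy and improve R_omega."""
    for _ in range(iterations):
        batch = rollout(theta, omega)                # trajectories of pi_theta in M[R_omega]
        theta = update_policy(theta, omega, batch)   # lower level, equation (10)
        batch = rollout(theta, omega)                # fresh samples under the updated policy
        omega = update_reward(omega, theta, batch)   # upper level, equation (21)
    return theta, omega
```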

The lower-level problem

Given that the reward function Rω is fixed, the lower-level problem is a standard RL policy optimization problem. In this problem, the embodied RL agent receives reward signals from the upper-level reward function Rω and then interacts with the world model \({{{\mathcal{M}}}}\). Thus, the expected cumulative return \({{{{\mathcal{J}}}}}^{{{{\rm{lower}}}}}(\theta,\omega )\) of the lower-level problem is defined by equation (7), and we rewrite it here:

$${{{{\mathcal{J}}}}}^{{{{\rm{lower}}}}}(\theta,\omega )\triangleq {{\mathbb{E}}}_{s \sim {\rho }^{{\pi }_{\theta }},a \sim {\pi }_{\theta }(\cdot | s)}\left[{\sum }_{t=0}^{\infty }{\gamma }^{t}\cdot {R}_{\omega }({s}^{(t)},{a}^{(t)})\right].$$

(8)

Lemma 1

(ref. 67). Given a standard RL problem denoted by a world model \({{{\mathcal{M}}}}[R]=({{{\mathcal{S}}}},{{{\mathcal{A}}}},{\rho }_{0},T,\gamma,{R}_{\omega })\) with the objective function defined in equation (8), the policy gradient can be defined as follows:

$${\nabla }_{\theta }{{{{\mathcal{J}}}}}^{{{{\rm{lower}}}}}(\theta,\omega )={{\mathbb{E}}}_{{\mu }_{s,a}}\left[{\nabla }_{\theta }\log {\pi }_{\theta }(s,a)\cdot {Q}_{{R}_{\omega }}^{{\pi }_{\theta }}\left(s,a\right)\right],$$

(9)

where \({Q}_{R}^{{\pi }_{\theta }}\) is the state–action value function of πθ in the world model \({{{\mathcal{M}}}}[R]\) and μs,a is shorthand for \(s \sim {\rho }^{{\pi }_{\theta }},a \sim {\pi }_{\theta }(\cdot | s)\).

We directly derive the gradient update form for the policy parameter θ of the lower-level optimization step based on Lemma 1:

$${\theta }^{{\prime} } =\theta+\alpha \cdot {[{\nabla }_{\theta }{{{{\mathcal{J}}}}}^{{{{\rm{lower}}}}}(\theta,\omega )]}_{\theta }\\ =\theta+\alpha \cdot {{\mathbb{E}}}_{{\mu }_{s,a}}\left[{\left[{\nabla }_{\theta }\log {\pi }_{\theta }(s,a)\right]}_{\theta }\cdot {Q}_{{R}_{\omega }}^{{\pi }_{\theta }}(s,a)\right],$$

(10)

where α denotes the learning rate during lower-level policy optimization.
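
For concreteness, a minimal instantiation of the update in equation (10) for a tabular softmax policy might look as follows; the sampled estimate `q_hat` of \({Q}_{{R}_{\omega }}^{{\pi }_{\theta }}(s,a)\) (for example, a discounted return-to-go) is an assumption of this sketch, not the paper's estimator.

```python
import numpy as np

def softmax_policy(theta, s):
    """Action probabilities pi_theta(. | s) for a tabular parameterization
    theta of shape (num_states, num_actions)."""
    logits = theta[s]
    p = np.exp(logits - logits.max())
    return p / p.sum()

def lower_level_step(theta, batch, alpha=0.01):
    """One policy-gradient step of equation (10). `batch` holds tuples
    (s, a, q_hat), where q_hat estimates Q_{R_omega}^{pi_theta}(s, a)."""
    grad = np.zeros_like(theta)
    for s, a, q_hat in batch:
        p = softmax_policy(theta, s)
        grad_log = -p                  # d log pi(a|s) / d theta[s, :]
        grad_log[a] += 1.0
        grad[s] += grad_log * q_hat
    return theta + alpha * grad / max(len(batch), 1)
```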

The upper-level problem

The upper-level objective of equation (6) is defined by reference to equation (5), which we restate here:

$${{{{\mathcal{J}}}}}^{{{{\rm{upper}}}}}(\theta,\omega )\triangleq {{\mathbb{E}}}_{{\pi }_{{\theta }^{*}} \sim {{{\mathcal{M}}}}\left[{R}_{\omega }\right]}\left[{G}_{\overline{R}}^{{\pi }_{{\theta }^{*}}}\right].$$

(11)

As in the lower-level optimization, we optimize the reward parameter ω via gradient updates. For a fixed \({\theta }^{{\prime} }\) and the current ω, we update the reward parameter as follows:

$${\omega }^{{\prime} }=\omega+\beta \cdot {\left[{\nabla }_{\omega }{{{{\mathcal{J}}}}}^{{{{\rm{upper}}}}}(\theta,\omega )\right]}_{\omega },$$

(12)

where β is the learning rate of the upper-level problem.

As the policy parameter θ is influenced by the reward parameter ω during RL, we model the policy parameters as a meta function of the reward parameters, i.e., θ(ω). After constructing this meta function, we establish a relationship between the reward parameters ω and the policy parameters θ, which in turn enables the pseudo-updating of the reward function parameters ω in the upper-level optimization step, as follows:

$${\left[{\nabla }_{\omega }{{{{\mathcal{J}}}}}^{{{{\rm{upper}}}}}\right]}_{\omega }={\left[{\nabla }_{\theta }{{{{\mathcal{J}}}}}^{{{{\rm{upper}}}}}\right]}_{{\theta }^{{\prime} }}\cdot {\left[{\nabla }_{\omega }\theta (\omega )\right]}_{\omega }.$$

(13)
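
The chain rule in equation (13) can be checked on a toy scalar example; both θ(ω) and \({{{{\mathcal{J}}}}}^{{{{\rm{upper}}}}}\) below are made-up differentiable functions used purely for illustration.

```python
import numpy as np

theta_of = lambda w: 2.0 * w + 1.0        # hypothetical meta function theta(omega)
j_upper  = lambda th: -(th - 3.0) ** 2    # hypothetical upper-level objective

def grad_fd(f, x, eps=1e-5):
    """Central finite-difference derivative of a scalar function."""
    return (f(x + eps) - f(x - eps)) / (2.0 * eps)

w = 0.5
chain  = grad_fd(j_upper, theta_of(w)) * grad_fd(theta_of, w)   # right-hand side of equation (13)
direct = grad_fd(lambda w_: j_upper(theta_of(w_)), w)           # gradient taken through theta(omega)
assert np.isclose(chain, direct, atol=1e-4)                     # the two gradients agree
```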

Assumption 1

Assume that the policy distribution is continuous and differentiable with respect to the policy parameters.

On the basis of Assumption 1, we prove Theorem 1, which addresses the optimization of the reward function parameters.

Theorem 1

Given the bilevel optimization problem in equations (6)–(7), the gradient of the upper-level objective \({{{{\mathcal{J}}}}}^{{{{\rm{upper}}}}}\) with respect to the reward parameters ω is:

$${\left[{\nabla }_{\omega }{{{{\mathcal{J}}}}}^{{{{\rm{upper}}}}}\right]}_{\omega }={{\mathbb{E}}}_{{\mu }_{s,a}^{{\prime} }}\left[{Q}_{\overline{R}}^{{\pi }_{\theta }}\cdot {\left[{\nabla }_{\theta }\log {\pi }_{\theta }\right]}_{{\theta }^{{\prime} }}\right]\cdot {{\mathbb{E}}}_{{\mu }_{s,a}}\left[{\left[{\nabla }_{\omega }{Q}_{{R}_{\omega }}^{{\pi }_{\theta }}\right]}_{\omega }\cdot {\left[{\nabla }_{\theta }\log {\pi }_{\theta }\right]}_{\theta }\right],$$

(14)

where μs,a and \({\mu }_{s,a}^{{\prime} }\) are the policy distributions of the previous and current steps respectively.

Proof

See Supplementary Theoretical Results. □

On the basis of Theorem 1, in the upper level of the optimization framework, the reward function parameters can be updated as follows:

$${\omega }^{{\prime} } =\omega+\beta \cdot {\left[{\nabla }_{\omega }{{{{\mathcal{J}}}}}^{{{{\rm{upper}}}}}\right]}_{\omega }\\ =\omega+\beta \cdot {{\mathbb{E}}}_{{\mu }_{s,a}^{{\prime} }}\left[{Q}_{\overline{R}}^{{\pi }_{{\theta }^{{\prime} }}}\cdot {\left[{\nabla }_{\theta }\log {\pi }_{\theta }\right]}_{{\theta }^{{\prime} }}\right]\cdot {{\mathbb{E}}}_{{\mu }_{s,a}}\left[{\left[{\nabla }_{\omega }{Q}_{{R}_{\omega }}^{{\pi }_{\theta }}\right]}_{\omega }\cdot {\left[{\nabla }_{\theta }\log {\pi }_{\theta }\right]}_{\theta }\right].$$

(15)

However, it is difficult to obtain the policy gradient for value-based RL methods such as the deep Q-network (DQN)9, and for methods that are both value based and policy based such as the soft actor–critic (SAC)62. Hence, we further develop Theorem 1 and extend its applicability.

Theorem 2

Given the bilevel optimization problem in equations (6)–(7), the gradient of the upper-level objective \({{{{\mathcal{J}}}}}^{{{{\rm{upper}}}}}\) with respect to the reward parameters ω in Theorem 1 can be formulated as:

$${\left[{\nabla }_{\omega }{{{{\mathcal{J}}}}}^{{{{\rm{upper}}}}}\right]}_{\omega }={{\mathbb{E}}}_{{\mu }_{s,a}}\left[{\nabla }_{\omega }{\left[{{\mathbb{E}}}_{{\mu }_{s,a}^{{\prime} }}{A}_{\overline{R}}^{{\pi }_{{\theta }^{{\prime} }}}(s,a)\cdot {A}_{{R}_{\omega }}^{{\pi }_{\theta }}(s,a)\right]}_{\omega }\right],$$

(16)

where \({A}_{R}^{{\pi }_{\theta }}(s,a)\) is the advantage function of policy πθ under the world model \({{{\mathcal{M}}}}[R]\).

Proof

See Supplementary Theoretical Results. □

On the basis of Theorem 2, the optimization term in equation (15) can be formulated as:

$${\omega }^{{\prime} }=\omega+\beta \cdot {{\mathbb{E}}}_{{\mu }_{s,a}}\left[{\nabla }_{\omega }{\left[{{\mathbb{E}}}_{{\mu }_{s,a}^{{\prime} }}{A}_{\overline{R}}^{{\pi }_{{\theta }^{{\prime} }}}(s,a)\cdot {A}_{{R}_{\omega }}^{{\pi }_{\theta }}(s,a)\right]}_{\omega }\right],$$

(17)

where the advantage function \({A}_{R}^{{\pi }_{\theta }}(s,a)\) is defined as the value advantage of the agent that selects action a in state s, expressed as:

$${A}_{R}^{{\pi }_{\theta }}(s,a)={Q}_{R}^{{\pi }_{\theta }}(s,a)-{V}_{R}^{{\pi }_{\theta }}(s),$$

(18)

where \({Q}_{R}^{{\pi }_{\theta }}(s,a)\) represents the expected return of taking action a in state s and \({V}_{R}^{{\pi }_{\theta }}(s)\) denotes the expected return of following policy πθ in state s.
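
When Q and V are not both maintained explicitly, the advantage in equation (18) is commonly approximated from samples by the one-step temporal-difference residual; the sketch below shows this standard surrogate, which is an assumption here rather than necessarily the estimator used in the paper.

```python
def td_advantage(r, v_s, v_next, gamma=0.99, terminal=False):
    """One-step estimate of A(s, a) = Q(s, a) - V(s) via the TD residual
    r + gamma * V(s') - V(s), with bootstrapping disabled at episode end."""
    target = r if terminal else r + gamma * v_next
    return target - v_s
```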

However, we cannot obtain the distribution \({\mu }_{s,a}^{{\prime} }\) when optimizing the reward parameters from ω to \({\omega }^{{\prime} }\), even if the policy has been optimized to \({\pi }_{{\theta }^{{\prime} }}\). The reason is that during RL we approximate this distribution with trajectories stored from the agent's interactions with the world model, and these trajectories were collected under πθ rather than \({\pi }_{{\theta }^{{\prime} }}\). To address this limitation, we approximate the expectations of \({\pi }_{{\theta }^{{\prime} }}\) in equation (17) using samples from πθ, as suggested by Lemma 2.

Assumption 2

Assume that the support of policy π covers the support of policy \({\pi }^{{\prime} }\), meaning that for any state-action pair (s, a) for which \({\pi }^{{\prime} }(s,a) > 0\), we also have π(s, a) > 0.

Assumption 2 guarantees that the importance sampling ratio \(\frac{{\pi }_{{\theta }^{{\prime} }}}{{\pi }_{\theta }}\) of Lemma 2 is always well defined, finite, and computationally stable during optimization.

Lemma 2

(ref. 68). Given two policies π and \({\pi }^{{\prime} }\), the expectation of a function f(s, a) under \({\pi }^{{\prime} }\) can be estimated using samples from π and the importance sampling ratio. Specifically,

$${{\mathbb{E}}}_{{\pi }^{{\prime} }}[ \, f(s,a)]={{\mathbb{E}}}_{\pi }\left[\frac{{\pi }^{{\prime} }(s,a)}{\pi (s,a)}f(s,a)\right],$$

(19)

where \(\frac{{\pi }^{{\prime} }(s,a)}{\pi (s,a)}\) is the importance sampling weight, assuming that π(s, a) > 0 for all actions a taken under \({\pi }^{{\prime} }\).
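
Lemma 2 can be verified numerically on a tiny two-action example with made-up distributions; the only requirement is that the behaviour policy π assigns positive probability wherever \({\pi }^{{\prime} }\) does (Assumption 2).

```python
import numpy as np

rng = np.random.default_rng(0)
pi     = np.array([0.7, 0.3])   # behaviour policy pi(a)
pi_new = np.array([0.4, 0.6])   # target policy pi'(a)
f      = np.array([1.0, 5.0])   # f(a), e.g. an advantage estimate

a = rng.choice(2, size=100_000, p=pi)                 # samples drawn from pi
is_estimate = np.mean((pi_new[a] / pi[a]) * f[a])     # reweighted estimate of E_{pi'}[f]
exact = float(np.dot(pi_new, f))                      # 0.4 * 1 + 0.6 * 5 = 3.4
print(round(is_estimate, 2), exact)                   # both approximately 3.4
```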

Use of the importance sampling weight ensures consistent reweighting of samples from π to \({\pi }^{{\prime} }\), enabling unbiased gradient estimation even with different policy distributions. Therefore, on the basis of Lemma 2, we approximate the gradient term in equation (17) as:

$${\omega }^{{\prime} }= \omega+\beta \cdot {\left[{\nabla }_{\omega }{{{{\mathcal{J}}}}}^{{{{\rm{upper}}}}}\right]}_{\omega }\\= \omega+\beta \cdot {{\mathbb{E}}}_{{\mu }_{s,a}}\left[{\nabla }_{\omega }{\left[{{\mathbb{E}}}_{{\mu }_{s,a}}\left[\frac{{\pi }_{{\theta }^{{\prime} }}}{{\pi }_{\theta }}\cdot {A}_{\overline{R}}^{{\pi }_{\theta }}(s,a)\right]\cdot {A}_{{R}_{\omega }}^{{\pi }_{\theta }}(s,a)\right]}_{\omega }\right].$$

(20)

Assumption 3

Assume that the policy and the reward function remain stationary between two consecutive optimization steps.

In practical implementations, policy updates can reasonably be regarded as stable, implying that the policy distributions before and after an update remain nearly identical. Consequently, Assumption 3 further simplifies equation (20). On the basis of Assumptions 2–3 and Lemma 2, in the upper level of the optimization framework, the reward function parameter update of equation (17) can be approximated as follows:

$${\omega }^{{\prime} }= \omega+\beta \cdot {\left[{\nabla }_{\omega }{{{{\mathcal{J}}}}}^{{{{\rm{upper}}}}}\right]}_{\omega }\\ \approx \,\omega+\beta \cdot {{\mathbb{E}}}_{{\mu }_{s,a}}\left[{\nabla }_{\omega }{\left[{{\mathbb{E}}}_{{\mu }_{s,a}}\left[{A}_{\overline{R}}^{{\pi }_{\theta }}(s,a)\right]\cdot {A}_{{R}_{\omega }}^{{\pi }_{\theta }}(s,a)\right]}_{\omega }\right]\\ \approx \,\omega+\beta \cdot {{\mathbb{E}}}_{{\mu }_{s,a}}\left[{\nabla }_{\omega }{\left[{A}_{\overline{R}}^{{\pi }_{\theta }}(s,a)\cdot {A}_{{R}_{\omega }}^{{\pi }_{\theta }}(s,a)\right]}_{\omega }\right].$$

(21)

In contrast to equation (17), the proposed gradient update in equation (21) does not require the policy gradient. Rather, it relies only on value functions, enabling the update method for the reward parameter ω to be applied to both policy-based and value-based RL agents.
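
A minimal sketch of the practical upper-level step in equation (21) is given below. The advantages under \(\overline{R}\) are assumed to be precomputed for a batch of state-action pairs, `adv_under_reward(omega)` is a hypothetical estimator returning the corresponding advantages under \({R}_{\omega }\), and the ω-gradient of the surrogate objective is taken by finite differences rather than automatic differentiation.

```python
import numpy as np

def upper_level_step(omega, adv_bar, adv_under_reward, beta=1e-3, eps=1e-4):
    """One reward-parameter update approximating equation (21): gradient
    ascent on the surrogate objective E[A_bar(s, a) * A_{R_omega}(s, a)].
    `adv_bar` is a vector of advantages under the sparse reward R_bar and
    `adv_under_reward(w)` returns advantages under R_w for the same batch."""
    omega = np.asarray(omega, dtype=float).copy()

    def objective(w):
        return float(np.mean(adv_bar * adv_under_reward(w)))

    grad = np.zeros_like(omega)
    for i in range(omega.size):
        step = np.zeros_like(omega)
        step[i] = eps
        grad[i] = (objective(omega + step) - objective(omega - step)) / (2.0 * eps)
    return omega + beta * grad
```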


