Deep Reinforcement Learning: Policy Gradient and Actor-Critic


In this post, we review the basic policy gradient algorithm (REINFORCE) for deep reinforcement learning and the actor-critic algorithm. Most of the content is derived from CS 285 at UC Berkeley.

The basic reinforcement learning loop: fit a model / estimate the return → improve the policy → generate samples → repeat.

Policy Gradient Introduction

In this part, we first focus on the fully observable model with a finite horizon, i.e., \(o_t = s_{t}\) and \(t\) is bounded. At a very high level, the policy gradient method is very simple: it parametrizes the policy \(\pi_{\theta}(a|s)\) by \(\theta\) and uses samples (interactions with the environment) to update the parameter \(\theta\).

Next, we derive the policy gradient algorithm (REINFORCE) and introduce some techniques to improve the performance of the policy gradient.

Evaluating the objective

First, we use \(\tau = \{s_1,a_1,s_2,a_2,\dots,s_T,a_T\}\) to denote the trajectory of one trial (episode), \(r(\tau)\) to denote the total reward of trajectory \(\tau\), and \(p_{\theta}(\tau)\) or \(\pi_{\theta}(\tau)\) to denote the probability that trajectory \(\tau\) appears under the policy \(\pi_{\theta}(a|s)\).

Recall that our reinforcement learning objective can be written as

\[\theta^* = \arg\max_{\theta} \mathbb E_{\tau\sim p_{\theta}(\tau)}\left[\sum_t r(s_{t},a_{t})\right].\]

We define \(J(\theta) = \mathbb E_{\tau\sim p_{\theta}(\tau)}\left[\sum_t r(s_{t},a_{t})\right]\), and the objective becomes \(\theta^* = \arg\max_{\theta} J(\theta)\). We want to compute the gradient of \(J(\theta)\), and then we can update our policy \(\pi_{\theta}\).

First, we have

\[J(\theta) = \mathbb E_{\tau\sim p_{\theta}(\tau)}\left[\sum_t r(s_{t},a_{t})\right] = \int \pi_{\theta}(\tau)r(\tau)d\tau.\]

From basic calculus (the log-derivative trick), we have

\[\pi_{\theta}(\tau)\nabla_{\theta}\log\pi_{\theta}(\tau) = \pi_{\theta}(\tau)\frac{\nabla_{\theta}\pi_{\theta}(\tau)}{\pi_{\theta}(\tau)} = \nabla_{\theta}\pi_{\theta}(\tau).\]

Then, plugging this into the derivative of \(J(\theta)\), we get

\[\nabla_{\theta}J(\theta) = \int\nabla_{\theta}\pi_{\theta}(\tau)r(\tau)d\tau = \mathbb E_{\tau\sim\pi_{\theta}(\tau)}\left[\nabla_{\theta}\log\pi_{\theta}(\tau)r(\tau)\right].\]

Then, recall that from the Markovian assumption, we have

\[\pi_{\theta}(s_1,a_1,\dots,s_T,a_T) = p(s_1)\prod_{t=1}^T \pi_{\theta}(a_{t} | s_{t})p(s_{t+1} | s_{t},a_{t}).\]

Taking the logarithms on both sides, we can get

\[\log\pi_{\theta}(\tau) = \log p(s_1) + \sum_{t=1}^T(\log\pi_{\theta}(a_{t} | s_{t})+\log p(s_{t+1}|s_{t},a_{t})).\]

Taking the gradient, the first and third terms vanish because they do not depend on \(\theta\), and we obtain the following formula for the policy gradient.

\[\nabla_{\theta} J(\theta) = \mathbb E_{\tau\sim\pi_{\theta}(\tau)}\left[\left(\sum_{t=1}^T\nabla_{\theta}\log\pi_{\theta}(a_{t}|s_{t})\right)\left(\sum_{t=1}^T r(s_{t},a_{t})\right)\right]\]

Evaluating the policy gradient and the REINFORCE algorithm

Now, given the formula for the policy gradient, we can estimate the expectation with samples. Specifically, we have the following estimator.

\[\nabla_{\theta} J(\theta)\approx \frac{1}{N}\sum_{i=1}^N\left(\sum_{t=1}^T\nabla_{\theta}\log\pi_{\theta}(a_{i,t}|s_{i,t})\right)\left(\sum_{t=1}^T r(s_{i,t},a_{i,t})\right),\]

where \(i\) denotes the index of the trajectory.

Then, given an estimate of the policy gradient, we can improve our policy by gradient ascent: \(\theta\leftarrow\theta + \alpha\nabla_{\theta}J(\theta)\).

Putting them all together, we have a simple REINFORCE algorithm (Monte-Carlo policy gradient).

  1. For \(k=1,2,\dots\)
    1. sample \(\{\tau^i\}\) from \(\pi_{\theta}(a_{t}|s_{t})\) (run the policy)
    2. \[\nabla_{\theta} J(\theta)\approx \frac{1}{N}\sum_{i=1}^N\left(\sum_{t=1}^T\nabla_{\theta}\log\pi_{\theta}(a_{i,t}|s_{i,t})\right)\left(\sum_{t=1}^T r(s_{i,t},a_{i,t})\right)\]
    3. \[\theta\leftarrow\theta + \alpha\nabla_{\theta}J(\theta)\]
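To make the update concrete, here is a minimal PyTorch-style sketch of one REINFORCE step for a discrete-action policy. The `policy` network, `optimizer`, and the trajectory dictionary layout are hypothetical names introduced only for illustration, not something defined in this post.

```python
import torch

def reinforce_update(policy, optimizer, trajectories):
    """One REINFORCE update.

    `trajectories` is a list of N dicts, each holding tensors
    'states' (T x state_dim), 'actions' (T,), 'rewards' (T,).
    This layout is hypothetical and only for illustration.
    """
    per_traj_terms = []
    for traj in trajectories:
        logits = policy(traj["states"])                      # T x n_actions
        dist = torch.distributions.Categorical(logits=logits)
        log_probs = dist.log_prob(traj["actions"])           # log pi_theta(a_t | s_t) per step
        total_reward = traj["rewards"].sum()                 # sum_t r(s_t, a_t)
        per_traj_terms.append(log_probs.sum() * total_reward)
    # surrogate objective whose gradient equals the REINFORCE estimator;
    # we minimize the negative to perform gradient ascent on J(theta)
    loss = -torch.stack(per_traj_terms).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```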

Note that this algorithm may not perform very well because of its high variance. We need variance reduction techniques to improve it, but before introducing them, let us first see an example of the REINFORCE algorithm and its generalization to the partially observable case.

Now, we compute the policy gradient for Gaussian policies:

\begin{align} \pi_{\theta}(a_{t}|s_{t}) =& \mathcal N(f_{\theta}(s_{t}),\Sigma), \newline \log\pi_{\theta}(a_{t}|s_{t}) =& -\frac{1}{2}(f_{\theta}(s_{t})-a_{t})^T\Sigma^{-1}(f_{\theta}(s_{t})-a_{t}) + \text{const}, \newline \nabla_{\theta}\log\pi_{\theta}(a_{t}|s_{t}) =& -\Sigma^{-1}(f_{\theta}(s_{t})-a_{t})\frac{df}{d\theta}. \end{align}
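In practice we rarely code this gradient by hand. A minimal sketch, assuming a hypothetical mean network `f_theta` and a fixed unit standard deviation, shows how the same log-probability (and hence its gradient) can be obtained with `torch.distributions` and autodiff:

```python
import torch

state_dim, action_dim = 4, 2
f_theta = torch.nn.Linear(state_dim, action_dim)  # hypothetical mean network f_theta(s_t)
sigma = torch.ones(action_dim)                    # fixed unit std (Sigma = I), for simplicity

s_t = torch.randn(state_dim)                      # a dummy state
a_t = torch.randn(action_dim)                     # a dummy action

dist = torch.distributions.Normal(f_theta(s_t), sigma)
log_prob = dist.log_prob(a_t).sum()               # log pi_theta(a_t | s_t) for a diagonal Gaussian
log_prob.backward()                               # gradients now sit in f_theta.weight.grad / f_theta.bias.grad
```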

The REINFORCE algorithm simply increases the probability of good trajectories with high reward and decreases the probability of bad trajectories with low reward. The intuition can be summarized as: good stuff is made more likely, bad stuff is made less likely, which simply formalizes the notion of “trial and error”.

It is also very easy to generalize the REINFORCE algorithm to the partially observable case: we just replace \(s_{t}\) with \(o_t\) in the policy. The estimate of the policy gradient is given as follows.

\[\nabla_{\theta} J(\theta)\approx \frac{1}{N}\sum_{i=1}^N\left(\sum_{t=1}^T\nabla_{\theta}\log\pi_{\theta}(a_{i,t}|o_{i,t})\right)\left(\sum_{t=1}^T r(s_{i,t},a_{i,t})\right).\]

Variance Reduction Techniques

What is the problem with the policy gradient? As mentioned before, the policy gradient suffers from very high variance.

Recall that the policy gradient is

\[\nabla_{\theta}J(\theta)\approx\frac{1}{N}\sum_{i=1}^N\nabla_{\theta}\log\pi_{\theta}(\tau_i)r(\tau_i).\]

The variance is very large because \(r(\tau)\) can be large and varies a lot across trajectories. Moreover, adding a constant to the reward function changes the behavior of the estimator. For example, if we add a large constant to the reward so that all rewards become positive, then in each iteration the policy gradient method will try to increase the probability of every sampled trajectory.

There are two techniques that reduce the variance of the policy gradient method, causality and baselines, and we introduce them one by one.

Causality

The intuition behind causality is very simple: the policy at time \(t'\) cannot affect the reward at time \(t < t'\). Then, instead of using the whole \(r(\tau)\) to update \(\pi_{\theta}\) at time \(t\), we can focus only on the reward obtained after time \(t\). Formally, we have the following formula.

\begin{align} \nabla_{\theta} J(\theta)\approx & \frac{1}{N}\sum_{i=1}^N\sum_{t=1}^T\nabla_{\theta}\log\pi_{\theta}(a_{i,t}|s_{i,t})\left(\sum_{t'=1}^T r(s_{i,t'},a_{i,t'})\right) \newline \approx & \frac{1}{N}\sum_{i=1}^N\sum_{t=1}^T\nabla_{\theta}\log\pi_{\theta}(a_{i,t}|s_{i,t})\left(\sum_{t'=t}^T r(s_{i,t'},a_{i,t'})\right) \end{align}

The last term can be viewed as “reward to go”, and we use \(\hat Q_{i,t} = \sum_{t'=t}^T r(s_{i,t'},a_{i,t'})\) to denote that quantity.

It is easy to see that causality helps reduce the variance: summing fewer reward terms gives smaller numbers, and smaller numbers lead to smaller variance. In practice this basically always helps.
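A minimal NumPy sketch of the "reward to go" \(\hat Q_{i,t}\) for a single trajectory; it is just a reverse cumulative sum, and the function name is illustrative:

```python
import numpy as np

def reward_to_go(rewards):
    """rewards: array of r(s_t, a_t) for t = 1..T.
    Returns Q_hat with Q_hat[t] = sum_{t' >= t} r(s_{t'}, a_{t'})."""
    return np.cumsum(rewards[::-1])[::-1]

# example: rewards [1, 2, 3] give reward-to-go [6, 5, 3]
print(reward_to_go(np.array([1.0, 2.0, 3.0])))
```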

Baselines

The baseline technique subtracts a constant from the total reward of each trajectory.

We want

\[\nabla_{\theta}J(\theta)\approx\frac{1}{N}\sum_{i=1}^N\nabla_{\theta}\log\pi_{\theta}(\tau_i)[r(\tau_i)-b],\]

where \(b\) is some constant. We may simply choose the average reward \(b = \frac{1}{N}\sum_{i=1}^N r(\tau_i)\).

Then, we show that subtracting a constant baseline does not bias the policy gradient. We have \begin{align} \mathbb E\left[\nabla_{\theta}\log\pi_{\theta}(\tau)b\right] = & \int\pi_{\theta}(\tau)\nabla_{\theta}\log\pi_{\theta}(\tau)b\,d\tau \newline = & b\int\nabla_{\theta}\pi_{\theta}(\tau)d\tau \newline = & b\nabla_{\theta}\int \pi_{\theta}(\tau)d\tau \newline = & b\nabla_{\theta}1 = 0. \end{align}

The previous argument shows that subtracting a baseline keeps the estimator unbiased. The average reward is not the best baseline, but it often works well in practice. A more careful (per-parameter) variance analysis shows that \(b = \frac{\mathbb E[g(\tau)^2r(\tau)]}{\mathbb E[g(\tau)^2]}\) is the optimal baseline, where \(g(\tau) = \nabla_{\theta}\log\pi_{\theta}(\tau)\).
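A small NumPy sketch of both baselines for a batch of trajectories; the array layout and function names are assumptions made for illustration, and the optimal baseline is computed per parameter as in the analysis above:

```python
import numpy as np

def average_reward_baseline(traj_rewards):
    """traj_rewards: length-N array of total rewards r(tau_i)."""
    return traj_rewards.mean()

def optimal_baseline(grad_log_probs, traj_rewards):
    """Per-parameter optimal baseline b_k = E[g_k^2 r] / E[g_k^2].

    grad_log_probs: N x d array whose i-th row is g(tau_i) = grad_theta log pi_theta(tau_i).
    traj_rewards:   length-N array of r(tau_i).
    """
    g_sq = grad_log_probs ** 2                         # N x d
    num = (g_sq * traj_rewards[:, None]).mean(axis=0)  # estimate of E[g_k^2 r]
    den = g_sq.mean(axis=0) + 1e-8                     # estimate of E[g_k^2]
    return num / den                                   # one baseline per parameter
```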

Off-policy learning and importance sampling

Policy gradient is on-policy: the expectation in \(\nabla_{\theta} J(\theta) = \mathbb E_{\tau\sim\pi_{\theta}(\tau)}\left[\nabla_{\theta}\log\pi_{\theta}(\tau)r(\tau)\right]\) is taken under the current policy \(\pi_{\theta}\).

Changing the policy changes the probability distribution under which the expectation is taken, so we must collect new samples after every update. Since neural networks change only a little bit with each gradient step, on-policy learning can be extremely inefficient.

Now recall that \(J(\theta) = \mathbb E_{\tau\sim \pi_{\theta}(\tau)}[r(\tau)]\). What if we don't have samples from \(\pi_{\theta}(\tau)\) but have samples from \(\bar\pi(\tau)\) instead?

Importance sampling: \begin{align} \mathbb E_{x\sim p(x)}[f(x)] =& \int p(x)f(x)dx \newline =& \int\frac{q(x)}{q(x)}p(x)f(x)dx \newline =& \int q(x)\frac{p(x)}{q(x)}f(x)dx \newline =& \mathbb E_{x\sim q(x)}\left[\frac{p(x)}{q(x)}f(x)\right] \end{align}

Now, using importance sampling,

\(J(\theta) = \mathbb E_{\tau\sim\bar\pi(\tau)}\left[\frac{\pi_{\theta}(\tau)}{\bar\pi(\tau)}r(\tau)\right]\).

Now, recalling that \(\pi_{\theta}(\tau) = p(s_1)\prod_{t=1}^T \pi_{\theta}(a_{t}|s_{t})p(s_{t+1}|s_{t},a_{t})\), we know that
\[\frac{\pi_{\theta}(\tau)}{\bar\pi(\tau)} = \frac{p(s_1)\prod_{t=1}^T \pi_{\theta}(a_{t}|s_{t})p(s_{t+1}|s_{t},a_{t})}{p(s_1)\prod_{t=1}^T \bar\pi(a_{t}|s_{t})p(s_{t+1}|s_{t},a_{t})} = \frac{\prod_{t=1}^T \pi_{\theta}(a_{t}|s_{t})}{\prod_{t=1}^T\bar\pi(a_{t}|s_{t})},\]
so the unknown transition probabilities cancel.
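Since the ratio only involves the policies, it can be computed from per-step log-probabilities. A minimal sketch (array layout assumed for illustration), computed in log space for numerical stability:

```python
import numpy as np

def trajectory_weight(logp_new, logp_old):
    """prod_t pi_theta'(a_t|s_t) / pi_theta(a_t|s_t) for one trajectory.
    logp_new / logp_old: length-T arrays of per-step log-probabilities
    under the new and old policies."""
    return np.exp(np.sum(logp_new - logp_old))

def stepwise_weights(logp_new, logp_old):
    """Cumulative ratios prod_{t' <= t} pi_theta'/pi_theta, one per time step,
    as used in the causality-aware estimator later in this section."""
    return np.exp(np.cumsum(logp_new - logp_old))
```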

Deriving the policy gradient with importance sampling

Suppose our samples come from the current policy, i.e., \(\bar\pi = \pi_{\theta}\), and recall

\[J(\theta) = \mathbb E_{\tau\sim\pi_{\theta}(\tau)}\left[r(\tau)\right].\]

Can we estimate the value of some new parameters \(\theta'\)? Using the importance weights above,

\[J(\theta') = \mathbb E_{\tau\sim\pi_{\theta}(\tau)}\left[\frac{\prod_{t=1}^T \pi_{\theta'}(a_{t}|s_{t})}{\prod_{t=1}^T \pi_{\theta}(a_{t}|s_{t})}r(\tau)\right]\]

Taking the gradient with respect to \(\theta'\),

\[\nabla_{\theta'}J(\theta') = \mathbb E_{\tau\sim\pi_{\theta}(\tau)}\left[\frac{\nabla_{\theta'}\pi_{\theta'}(\tau)}{\pi_{\theta}(\tau)}r(\tau)\right] = \mathbb E_{\tau\sim\pi_{\theta}(\tau)}\left[\frac{\pi_{\theta'}(\tau)}{\pi_{\theta}(\tau)}\nabla_{\theta'}\log\pi_{\theta'}(\tau)r(\tau)\right]\]

When \(\theta = \theta'\), this recovers exactly the standard policy gradient. If \(\theta \neq\theta'\),

\begin{align} \nabla_{\theta'}J(\theta') =& \mathbb E_{\tau\sim\pi_{\theta}(\tau)}\left[\frac{\pi_{\theta'}(\tau)}{\pi_{\theta}(\tau)}\nabla_{\theta'}\log\pi_{\theta'}(\tau)r(\tau)\right] \newline =& \mathbb E_{\tau\sim\pi_{\theta}(\tau)}\left[\left(\prod_{t=1}^T\frac{ \pi_{\theta'}(a_{t}|s_{t})}{\pi_{\theta}(a_{t}|s_{t})}\right)\left(\sum_{t=1}^T\nabla_{\theta'}\log\pi_{\theta'}(a_{t}|s_{t})\right)\left(\sum_{t=1}^T r(s_{t},a_{t})\right)\right] \newline =& \mathbb E_{\tau\sim\pi_{\theta}(\tau)}\left[\sum_{t=1}^T\nabla_{\theta'}\log\pi_{\theta'}(a_{t}|s_{t})\left(\prod_{t'=1}^t\frac{ \pi_{\theta'}(a_{t'}|s_{t'})}{\pi_{\theta}(a_{t'}|s_{t'})}\right)\left(\sum_{t'=t}^{T} r(s_{t'},a_{t'})\left(\prod_{t''=t}^{t'}\frac{ \pi_{\theta'}(a_{t''}|s_{t''})}{\pi_{\theta}(a_{t''}|s_{t''})}\right)\right)\right] \end{align}

where the last line uses causality (the importance weight for the policy term only involves past actions, and future rewards carry their own weights).

Policy gradient with automatic differentiation

\[\nabla_{\theta}J(\theta) \approx \frac{1}{N}\sum_{i=1}^N\sum_{t=1}^T\nabla_{\theta}\log\pi_{\theta}(a_{i,t}|s_{i,t})\hat Q_{i,t}\]

Just implement a “pseudo-loss” as a weighted maximum likelihood:

\[\tilde J(\theta) \approx \frac{1}{N}\sum_{i=1}^N\sum_{t=1}^T\log\pi_{\theta}(a_{i,t}|s_{i,t})\hat Q_{i,t}\]
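A minimal PyTorch sketch of this pseudo-loss for a discrete-action policy; `policy` and the flattened batch layout are illustrative assumptions, and maximizing \(\tilde J\) is implemented by minimizing its negative:

```python
import torch

def pseudo_loss(policy, states, actions, q_hats):
    """Weighted negative log-likelihood whose gradient is (minus) the policy
    gradient estimate, up to a constant factor absorbed into the learning rate.

    states:  (N*T) x state_dim, actions: (N*T,), q_hats: (N*T,) reward-to-go,
    all flattened over trajectories and time (hypothetical layout).
    """
    logits = policy(states)
    log_probs = torch.distributions.Categorical(logits=logits).log_prob(actions)
    return -(log_probs * q_hats).mean()

# usage sketch:
# loss = pseudo_loss(policy, states, actions, q_hats)
# optimizer.zero_grad(); loss.backward(); optimizer.step()
```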

Actor-Critic Introduction

Review of policy gradient and intuition for actor-critic

\[\nabla_{\theta}J(\theta) \approx \frac{1}{N}\sum_{i=1}^N\sum_{t=1}^T\nabla_{\theta}\log\pi_{\theta}(a_{i,t}|s_{i,t})\hat Q_{i,t}\]

\(\hat Q_{i,t}\): an estimate of the expected reward to go if we take action \(a_{i,t}\) in state \(s_{i,t}\) and then follow the policy. Can we get a better estimate?

\(Q(s_{t},a_{t}) = \sum_{t'=t}^{T}\mathbb E_{\pi_{\theta}}[r(s_{t'},a_{t'})\mid s_{t},a_{t}]\) is the true expected reward to go, which gives
\[\nabla_{\theta}J(\theta) \approx \frac{1}{N}\sum_{i=1}^N\sum_{t=1}^T\nabla_{\theta}\log\pi_{\theta}(a_{i,t}|s_{i,t})Q(s_{i,t},a_{i,t})\]

What about the baseline? Previously, we set \(b = \frac{1}{N}\sum_i\hat Q_{i,t}\), and now we want to set the baseline \(b_t = \frac{1}{N}\sum_i Q(s_{i,t},a_{i,t})\).

This averaged baseline is approximately the value function
\[V(s_{t}) = \mathbb E_{a_{t}\sim\pi_{\theta}(a_{t}|s_{t})}[Q(s_{t},a_{t})].\]

Note that the Q-function and the value function are both defined with respect to a policy \(\pi\), and we have

\[Q^{\pi}(s_{t},a_{t}) = \sum_{t'=t}^T\mathbb E_{\pi_{\theta}}[r(s_{t'},a_{t'})\mid s_{t},a_{t}],\]
\[V^{\pi}(s_{t}) = \mathbb E_{a_{t}\sim \pi_{\theta}(a_{t}|s_{t})}[Q^{\pi}(s_{t},a_{t})],\]

\(A^{\pi}(s_{t},a_{t}) = Q^{\pi}(s_{t},a_{t}) - V^{\pi}(s_{t})\): the advantage, i.e., how much better action \(a_{t}\) is than the average action under the policy at \(s_{t}\).

Then,

\[\nabla_{\theta}J(\theta) \approx \frac{1}{N}\sum_{i=1}^N\sum_{t=1}^T\nabla_{\theta}\log\pi_{\theta}(a_{i,t}|s_{i,t})A^{\pi}(s_{i,t},a_{i,t})\]

The better the estimate, the lower the variance.

Value function fitting

Which of \(Q^{\pi}\), \(V^{\pi}\), \(A^{\pi}\) should we fit?

\[Q^{\pi}(s_{t},a_{t}) = r(s_{t},a_{t}) + \sum_{t'=t+1}^T\mathbb E_{\pi_{\theta}}[r(s_{t'},a_{t'})\mid s_{t},a_{t}]\approx r(s_{t},a_{t}) + V^{\pi}(s_{t+1}),\]

where the last step plugs in the actual sampled next state \(s_{t+1}\) instead of taking the expectation over it; this single-sample approximation is still unbiased. Then, we can write the advantage \(A^{\pi}\) as

\[A^{\pi}(s_{t},a_{t}) \approx r(s_{t},a_{t}) + V^{\pi}(s_{t+1})-V^{\pi}(s_{t}).\]

Let's fit \(V^{\pi}\). We use a neural network \(\hat V^{\pi}_{\phi}(s)\) with parameters \(\phi\) to represent it. For now, \(\phi\) (the value-function parameters) and \(\theta\) (the policy parameters) are separate; later they can share parameters.

Policy evaluation

\[V^{\pi}(s_{t}) = \sum_{t'=t}^T\mathbb E_{\pi_{\theta}}[r(s_{t'},a_{t'})\mid s_{t}],\]
\[J(\theta) = \mathbb E_{s_1\sim p(s_1)}[V^{\pi}(s_1)].\]

How can we perform policy evaluation? One simple option is Monte Carlo policy evaluation (this is what policy gradient does):

\[V^{\pi}(s_{t})\approx\sum_{t'=t}^Tr(s_{t'},a_{t'})\] \[V^{\pi}(s_{t})\approx\frac{1}{N}\sum_{i=1}^N\sum_{t'=t}^T r(s_{t'},a_{t'})\]

The second, multi-rollout estimate requires the ability to "reset" the simulator back to state \(s_{t}\), which is usually not possible.

Monte Carlo evaluation with function approximation

\[V^{\pi}(s_{t})\approx\frac{1}{N}\sum_{i=1}^N\sum_{t'=t}^T r(s_{i,t'},a_{i,t'})\]

This is not as good as the multi-rollout estimate \(V^{\pi}(s_{t})\approx\frac{1}{N}\sum_{i=1}^N\sum_{t'=t}^T r(s_{t'},a_{t'})\) (which would require resetting the simulator), but it is still pretty good.

Training data: \(\{(s_{i,t},\sum_{t'=t}^T r(s_{i,t'},a_{i,t'}))\} = \{(s_{i,t},y_{i,t})\}\)

Supervised regression: \(\mathcal L(\phi) = \frac{1}{2}\sum_{j}\left\|\hat V^{\pi}_{\phi}(s_j)-y_j\right\|^2\).
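A minimal PyTorch sketch of this regression step; `value_net` (a network mapping a state to a scalar value) and the data layout are assumptions for illustration:

```python
import torch

def fit_value_function(value_net, optimizer, states, targets, n_steps=50):
    """Fit V_hat_phi by regression onto Monte Carlo targets.

    states:  M x state_dim tensor of visited states s_{i,t}
    targets: length-M tensor of returns y_{i,t} = sum_{t' >= t} r(s_{i,t'}, a_{i,t'})
    """
    for _ in range(n_steps):
        pred = value_net(states).squeeze(-1)          # V_hat_phi(s_j)
        loss = 0.5 * ((pred - targets) ** 2).sum()    # L(phi) = 1/2 sum_j ||V_hat_phi(s_j) - y_j||^2
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```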

Can we do better?

Ideal target: \[y_{i,t} = \sum_{t'=t}^T \mathbb E_{\pi_{\theta}}[r(s_{t'},a_{t'})\mid s_{i,t}] \approx r(s_{i,t},a_{i,t}) + V^{\pi}(s_{i,t+1})\approx r(s_{i,t},a_{i,t})+\hat V^{\pi}_{\phi}(s_{i,t+1})\]

The last step directly plugs in the previously fitted value function, which gives a lower-variance target.

Monte Carlo target: \(y_{i,t} = \sum_{t'=t}^T r(s_{i,t'},a_{i,t'})\)

Training data: \(\{(s_{i,t},\, r(s_{i,t},a_{i,t}) + \hat V^{\pi}_{\phi}(s_{i,t+1}))\} = \{(s_{i,t},y_{i,t})\}\)

Supervised regression: \(\mathcal L(\phi) = \frac{1}{2}\sum_{j}\left\|\hat V^{\pi}_{\phi}(s_j)-y_j\right\|^2\).

This is sometimes referred to as a “bootstrapped” estimate.

Actor-critic algorithm

batch actor-critic algorithm:

  1. sample \(\{s_i,a_i\}\) from \(\pi_{\theta}(a|s)\) (run the policy)
  2. fit \(\hat V^{\pi}_{\phi}(s)\) to sampled reward sums
  3. evaluate \(\hat A^{\pi}(s_i,a_i) = r(s_i,a_i) + \hat V^{\pi}_{\phi}(s'_i) - \hat V^{\pi}_{\phi}(s_i)\)
  4. \[\nabla_{\theta} J(\theta)\approx \frac{1}{N}\sum_{i=1}^N\left(\nabla_{\theta}\log\pi_{\theta}(a_{i}|s_{i})\right)\hat A^{\pi}(s_{i},a_{i})\]
  5. \[\theta\leftarrow\theta + \alpha\nabla_{\theta}J(\theta)\]

Here, recall that we have training data \(\{(s_{i,t},\, r(s_{i,t},a_{i,t}) + \hat V^{\pi}_{\phi}(s_{i,t+1}))\} = \{(s_{i,t},y_{i,t})\}\).

Supervised regression: \(\mathcal L(\phi) = \frac{1}{2}\sum_{j}\left\|\hat V^{\pi}_{\phi}(s_j)-y_j\right\|^2\).
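Putting the pieces together, here is a minimal PyTorch-style sketch of one batch actor-critic update using the bootstrapped targets above; all network, optimizer, and batch names are hypothetical, and a discrete action space is assumed:

```python
import torch

def batch_actor_critic_step(policy, value_net, pi_opt, v_opt, batch):
    """One batch actor-critic update; `batch` holds tensors 'states', 'actions',
    'rewards', 'next_states', flattened over the sampled trajectories."""
    s, a = batch["states"], batch["actions"]
    r, s_next = batch["rewards"], batch["next_states"]

    # step 2: fit V_hat_phi to the bootstrapped targets y = r + V_hat_phi(s')
    with torch.no_grad():
        y = r + value_net(s_next).squeeze(-1)
    v_loss = 0.5 * ((value_net(s).squeeze(-1) - y) ** 2).mean()
    v_opt.zero_grad(); v_loss.backward(); v_opt.step()

    # step 3: evaluate the advantage A_hat(s, a) = r + V_hat_phi(s') - V_hat_phi(s)
    with torch.no_grad():
        adv = r + value_net(s_next).squeeze(-1) - value_net(s).squeeze(-1)

    # steps 4-5: policy gradient step through the advantage-weighted pseudo-loss
    log_probs = torch.distributions.Categorical(logits=policy(s)).log_prob(a)
    pi_loss = -(log_probs * adv).mean()
    pi_opt.zero_grad(); pi_loss.backward(); pi_opt.step()
```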

Aside: discount factors

When the trajectories are long or potentially infinite, the reward sums can blow up; in that case we use a discount factor. We set

\(y_{i,t} \approx r(s_{i,t},a_{i,t}) + \gamma\hat V^{\pi}_{\phi}(s_{i,t+1})\), where the discount factor \(\gamma\in [0,1]\) (0.99 usually works well).

With the critic, the gradient estimate becomes

\[\nabla_{\theta} J(\theta)\approx \frac{1}{N}\sum_{i=1}^N\left(\nabla_{\theta}\log\pi_{\theta}(a_{i}|s_{i})\right)\left(r(s_i,a_i) + \gamma\hat V^{\pi}_{\phi}(s'_i) - \hat V^{\pi}_{\phi}(s_i)\right)\]

batch actor-critic algorithm with discount:

  1. sample \(\{s_i,a_i\}\) from \(\pi_{\theta}(a|s)\) (run the policy)
  2. fit \(\hat V^{\pi}_{\phi}(s)\) to sampled reward sums
  3. evaluate \(\hat A^{\pi}(s_i,a_i) = r(s_i,a_i) + \gamma\hat V^{\pi}_{\phi}(s'_i) - \hat V^{\pi}_{\phi}(s_i)\)
  4. \[\nabla_{\theta} J(\theta)\approx \frac{1}{N}\sum_{i=1}^N\left(\nabla_{\theta}\log\pi_{\theta}(a_{i}|s_{i})\right)\hat A^{\pi}(s_{i},a_{i})\]
  5. \[\theta\leftarrow\theta + \alpha\nabla_{\theta}J(\theta)\]

online actor-critic algorithm with discount:

  1. take action \(a\sim \pi_{\theta}(a|s)\), get the transition \((s,a,s',r)\)
  2. update \(\hat V^{\pi}_{\phi}\) using target \(r+\gamma \hat V^{\pi}_{\phi}(s')\)
  3. evaluate \(\hat A^{\pi}(s,a) = r(s,a) + \gamma\hat V^{\pi}_{\phi}(s') - \hat V^{\pi}_{\phi}(s)\)
  4. \[\nabla_{\theta} J(\theta)\approx \left(\nabla_{\theta}\log\pi_{\theta}(a|s)\right)\hat A^{\pi}(s,a)\]
  5. \[\theta\leftarrow\theta + \alpha\nabla_{\theta}J(\theta)\]

Architecture design

two-network design: simple and stable, but no feature sharing between the actor and the critic

one-network design: a shared network with two output heads (one for the policy, one for the value); enables feature sharing, but harder to tune the hyper-parameters

online actor-critic algorithm in practice

The fully online actor-critic (one transition per update) does not work well with deep neural networks; in step 2 we would like to update the critic with a batch of data. In practice, the data is collected by multiple workers in parallel.

Critics as state-dependent baselines

For the actor-critic algorithm, the gradient is estimated by

\[\nabla_{\theta} J(\theta)\approx \frac{1}{N}\sum_{i=1}^N\left(\nabla_{\theta}\log\pi_{\theta}(a_{i}|s_{i})\right)\left(r(s_i,a_i) + \gamma\hat V^{\pi}_{\phi}(s'_i) - \hat V^{\pi}_{\phi}(s_i)\right)\]

The good news is that this estimator has low variance (due to the critic), but it is biased whenever the critic is not perfect.

For the original policy gradient

\[\nabla_{\theta} J(\theta)\approx\frac{1}{N}\sum_{i=1}^N\sum_{t=1}^T\nabla_{\theta}\log\pi_{\theta}(a_{i,t}|s_{i,t})\left(\left(\sum_{t'=t}^T \gamma^{t'-t} r(s_{i,t'},a_{i,t'})\right)-b\right)\]

This estimator has no bias, but it has higher variance (because of the single-sample reward-to-go estimate). Combining the two ideas, we can use the critic as a state-dependent baseline:

\[\nabla_{\theta} J(\theta)\approx\frac{1}{N}\sum_{i=1}^N\sum_{t=1}^T\nabla_{\theta}\log\pi_{\theta}(a_{i,t}|s_{i,t})\left(\left(\sum_{t'=t}^T \gamma^{t'-t} r(s_{i,t'},a_{i,t'})\right)-\hat V^{\pi}_{\phi}(s_{i,t})\right)\]

This estimator is unbiased and has lower variance than using a constant baseline.

n-step estimation

\[\hat A^{\pi}_C(s_{t},a_{t}) = r(s_{t},a_{t}) + \gamma \hat V^{\pi}_{\phi}(s_{t+1}) - \hat V^{\pi}_{\phi}(s_{t})\] \[\hat A^{\pi}_{MC}(s_{t},a_{t}) = \sum_{t'=t}^{\infty}\gamma^{t'-t}r(s_{t'},a_{t'}) - \hat V^{\pi}_{\phi}(s_{t})\] \[\hat A^{\pi}_{n}(s_{t},a_{t}) = \sum_{t'=t}^{t+n-1}\gamma^{t'-t}r(s_{t'},a_{t'}) + \gamma^n \hat V^{\pi}_{\phi}(s_{t+n}) - \hat V^{\pi}_{\phi}(s_{t})\]

Choosing \(n > 1\) often works better: it trades a little bias (from the critic) for lower variance.
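A minimal NumPy sketch of the n-step advantage for one trajectory, following the formula above (summing \(n\) rewards and bootstrapping with the critic at \(s_{t+n}\)); the names and the truncation at the end of the trajectory are illustrative choices:

```python
import numpy as np

def n_step_advantage(rewards, values, t, n, gamma=0.99):
    """n-step advantage at time t for one trajectory.

    rewards: length-T array of r(s_t, a_t); values: length-T array of V_hat_phi(s_t).
    A_n(s_t, a_t) = sum_{t'=t}^{t+n-1} gamma^{t'-t} r_{t'} + gamma^n V_hat_phi(s_{t+n}) - V_hat_phi(s_t),
    truncated if the trajectory ends before time t+n.
    """
    T = len(rewards)
    end = min(t + n, T)
    ret = sum(gamma ** (k - t) * rewards[k] for k in range(t, end))
    if t + n < T:
        ret += gamma ** n * values[t + n]   # bootstrap with the critic after n steps
    return ret - values[t]
```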

Generalized advantage estimation

\[\hat A^{\pi}_{n}(s_{t},a_{t}) = \sum_{t'=t}^{t+n-1}\gamma^{t'-t}r(s_{t'},a_{t'}) + \gamma^n \hat V^{\pi}_{\phi}(s_{t+n}) - \hat V^{\pi}_{\phi}(s_{t})\]

\(\hat A^{\pi}_{GAE}(s_{t},a_{t}) = \sum_{n=1}^{\infty}w_n\hat A^{\pi}_{n}(s_{t},a_{t})\),

We can choose \(w_n\propto \lambda^{n-1}\), which weights shorter horizons more heavily when \(\lambda < 1\).
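With this choice of weights, the estimator can be computed with the standard recursion from the GAE paper, \(\hat A^{GAE}_{t} = \delta_{t} + \gamma\lambda\hat A^{GAE}_{t+1}\), where \(\delta_{t} = r(s_{t},a_{t}) + \gamma\hat V^{\pi}_{\phi}(s_{t+1}) - \hat V^{\pi}_{\phi}(s_{t})\). A minimal NumPy sketch (the function name and argument layout are illustrative):

```python
import numpy as np

def gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized advantage estimation for one trajectory.

    rewards: length-T array of r(s_t, a_t);
    values:  length-(T+1) array of V_hat_phi(s_1), ..., V_hat_phi(s_{T+1})
             (use 0 for the value after a terminal state).
    """
    T = len(rewards)
    adv = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD error delta_t
        running = delta + gamma * lam * running                 # A_t = delta_t + gamma*lambda*A_{t+1}
        adv[t] = running
    return adv
```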

References

[1] CS 285 at UC Berkeley