Understanding Bias in RL

Inferring something about data beyond what has been observed is an age-old problem in statistics. A natural follow-up question is how to measure the "goodness" (or rather the "badness") of a system tasked with predicting unseen data. We call this quantity the error. The two underlying components of this error are bias and variance, also known as the structural error and the estimation error:

  1. Structural Error - This error is incurred when the true underlying relationship in the data has a different structure than the predicted function. For example, the data may be non-linear while the predicted function is linear.

  2. Estimation Error - This error arises from fitting the model on finite data, and it is usually large when the training dataset is very small; increasing the number of training examples decreases it. Some estimation error is always incurred when fitting a model so that it generalizes.

Note: There is a trade-off between these two errors. Using a rich, high-dimensional function class for prediction might reduce structural error, but the same dataset must now be used to fit many more parameters, incurring estimation error. On the other hand, using the bare minimum number of parameters keeps estimation error low but introduces structural error.
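Here is a minimal numpy sketch (not from the original post; the polynomial degrees, sample sizes and noise level are arbitrary choices) that illustrates the trade-off by fitting polynomials of different degrees to noisy samples of a sine curve:

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_and_test_error(degree, n_train, n_test=10_000, noise=0.1):
    """Fit a degree-`degree` polynomial to noisy sine data and
    return its mean squared error on a large held-out set."""
    x_tr = rng.uniform(-np.pi, np.pi, n_train)
    y_tr = np.sin(x_tr) + noise * rng.standard_normal(n_train)
    coeffs = np.polyfit(x_tr, y_tr, degree)          # least-squares fit
    x_te = rng.uniform(-np.pi, np.pi, n_test)
    y_te = np.sin(x_te)                              # noiseless target
    return np.mean((np.polyval(coeffs, x_te) - y_te) ** 2)

for degree in (1, 3, 9):
    for n_train in (20, 200):
        err = np.mean([fit_and_test_error(degree, n_train) for _ in range(50)])
        print(f"degree={degree:2d}  n_train={n_train:4d}  test MSE={err:.4f}")
# A degree-1 fit stays poor no matter how much data it sees (structural error),
# while a degree-9 fit is noisy with 20 points but improves with 200
# (estimation error shrinking with more data).
```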

Estimators

In Fisher (1922), Fisher introduced the method of Maximum Likelihood Estimation (MLE), which can be understood through its application to a toy problem:

Given an $n$-variate normal distribution with identity covariance and a single sample from it, can you estimate its mean?

The MLE method proposes that a good estimate of the mean is simply the sample itself. This holds up well for low-dimensional Gaussians, but in higher dimensions it fails drastically (the notion of "good" is made precise below).

A good estimator can be defined as one that keeps the squared L2 distance $\mathcal L$ between the true mean and the estimate small:

$$\mathcal{L}(x)=\lVert x-\mu\rVert^2.$$

That was a simple toy problem. Now let's extend it to $n$ independent samples $X_1,\ldots,X_n$, where $X_i\sim\mathcal{N}(\theta_i,1)$ for each $i=1,\ldots,n$. To estimate $\theta_i$ for all $n$ random variables given only the noisy sample for each $i$, the obvious estimator is $\hat\theta_i=X_i$.

The "goodness" of this estimator can be measured by its expected loss:

$$\mathcal L(\hat\theta)=\mathbb{E}\left[\lVert\hat\theta-\theta\rVert^2\right].$$

This trivial estimator turns out to be inadmissible whenever $n\geq3$: there exists another estimator $\hat\theta'$ such that $\mathcal L(\hat\theta')\leq\mathcal L(\hat\theta)$ for every $\theta$, with strict inequality for some $\theta$.

Implicit Bias and Variance

Here we show that bias and variance are implicit in the very definition of the error that captures both the structural and the estimation error. To see this, start from the formula for the variance:

$$\text{Var}[\theta]=\mathbb{E}[\theta^2]-\mathbb{E}[\theta]^2.$$

Applying this to the estimation error $\theta-X$, where $X$ is our estimate of $\theta$:

$$\text{Var}[\theta-X]=\mathbb{E}\left[(\theta-X)^2\right]-\mathbb{E}[\theta-X]^2.$$

Rearranging the last equation gives:

$$\mathbb{E}\left[(\theta-X)^2\right]=\underbrace{\mathbb{E}[\theta-X]^2}_{\text{bias}^2}+\underbrace{\text{Var}[\theta-X]}_{\text{variance}}\implies \text{error}=\text{bias}^2+\text{variance},$$

where $\mathbb{E}\left[(\theta-X)^2\right]$ is exactly the squared-error loss from before. This is the mathematics behind the implicit presence of bias and variance in the loss function (squared error here). Note also that the error grows quadratically in the bias but only linearly in the variance, so the error is typically more sensitive to changes in bias.
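As a quick numeric check of this decomposition (not from the post; the true mean and the deliberately biased shrinkage coefficient 0.8 are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

theta = 2.0                                    # true mean
n_samples = 1_000_000
X = 0.8 * rng.normal(theta, 1.0, n_samples)    # a deliberately biased estimator of theta

err = theta - X
mse      = np.mean(err ** 2)                   # E[(theta - X)^2]
bias_sq  = np.mean(err) ** 2                   # E[theta - X]^2
variance = np.var(err)                         # Var[theta - X]

print(f"E[(theta-X)^2]    = {mse:.4f}")
print(f"bias^2 + variance = {bias_sq + variance:.4f}")   # matches the line above
```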

An error term such as the summed squared error $\sum_{i=1}^n\mathbb{E}\left[\lVert\hat\theta_i-\theta_i\rVert^2\right]$, which adds up the loss of each individual estimate, induces a bias-variance tradeoff. Under this loss the naive estimator $\hat\theta_i=X_i$ is inadmissible: a shrinkage factor introduced by James and Stein decreases variance by introducing a bit of bias, thereby reducing the overall error.

Stein's paradox

Stein's estimator, given below, exploits the then-surprising fact that bias can be traded for variance to improve the error:

$$\theta_{\text{JS},i}=\left(1-\frac{(p-2)\sigma^2}{\lVert X\rVert^2_2}\right)X_i.$$

Instead of the usual $\hat\theta_i=X_i$, there is a new term $\left(1-\frac{(p-2)\sigma^2}{\lVert X\rVert^2_2}\right)$, called the "shrinkage factor", that pulls the data points $X_i$ towards the origin.[1]
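A minimal simulation of the effect (not the author's own code; the dimension, true mean vector and number of trials are arbitrary), comparing the risk of the MLE $\hat\theta_i=X_i$ against the James-Stein estimate:

```python
import numpy as np

rng = np.random.default_rng(2)

p, sigma2, trials = 10, 1.0, 20_000
theta = rng.normal(0.0, 1.0, p)          # arbitrary true mean vector

mse_mle, mse_js = 0.0, 0.0
for _ in range(trials):
    X = theta + rng.standard_normal(p) * np.sqrt(sigma2)   # one noisy observation of theta
    shrink = 1.0 - (p - 2) * sigma2 / np.sum(X ** 2)        # James-Stein shrinkage factor
    theta_js = shrink * X
    mse_mle += np.sum((X - theta) ** 2)
    mse_js  += np.sum((theta_js - theta) ** 2)

print(f"MLE risk         : {mse_mle / trials:.3f}")   # close to p = 10
print(f"James-Stein risk : {mse_js / trials:.3f}")    # strictly smaller for p >= 3
```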

Stein's influence in RL

Policy gradients

Recall that the stochastic reinforcement learning objective is to learn a policy $\pi_\theta(\mathbf a_t\mid\mathbf s_t)$ that maps states to a distribution over actions. I will be assuming that the reader (you) knows a bit about RL, but in case you don't, refer to any deep RL book and you shall receive that wisdom soon enough. In policy gradients, we train a policy such as a deep neural network with weights $\theta$, and the weights that maximize the objective function are denoted $\theta^*$. The objective function that the model optimizes is $J(\theta)$, formulated below:

$$J(\theta)=E_{\tau\sim\pi_\theta(\tau)}[r(\tau)]$$

where a trajectory $\tau$ is the sequence of state-action pairs seen during a rollout of $T$ time-steps, and the probability of a trajectory $\tau$ is defined as:

$$\pi_\theta(\tau)=p(s_0,a_0,\ldots,s_{T-1},a_{T-1})=p(s_0)\pi_\theta(a_0|s_0)\prod_{t=1}^{T-1}p(s_t|s_{t-1},a_{t-1})\pi_\theta(a_t|s_t)$$

and the total reward accumulated over the rollout of $\tau$ is:

$$r(\tau)=r(s_0,a_0,\ldots,s_{T-1},a_{T-1})=\sum\limits_{t=0}^{T-1}r(s_t,a_t).$$

The policy gradient approach is to directly take the gradient of this objective:

$$\begin{aligned}\nabla_\theta J(\theta)&=\nabla_\theta\int \pi_\theta(\tau)r(\tau)d\tau \\&=\int\pi_\theta(\tau)\nabla_\theta\log\pi_\theta(\tau)r(\tau)d\tau\\&=E_{\tau\sim\pi_\theta(\tau)}[\nabla_\theta\log\pi_\theta(\tau)r(\tau)]. \end{aligned}$$

In practice, the expectation over trajectories $\tau$ can be approximated from a batch of $N$ sampled trajectories:

$$\begin{aligned} \nabla_\theta J(\theta)&\approx\frac{1}{N}\sum\limits_{i=1}^N\nabla_\theta\log\pi_\theta(\tau_i)r(\tau_i) \\&=\frac{1}{N}\sum\limits_{i=1}^N\left(\sum\limits_{t=0}^{T-1}\nabla_\theta\log\pi_\theta(a_{it}|s_{it})\right)\left(\sum\limits_{t=0}^{T-1}r(s_{it},a_{it})\right). \end{aligned}$$

Here we see that the policy $\pi_\theta$ is a probability distribution over the action space, conditioned on the state. In the agent-environment loop, the agent samples an action $a_t$ from $\pi_\theta(\cdot|s_t)$ and the environment responds with a reward $r(s_t,a_t)$.
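To make the estimator concrete, here is a self-contained sketch (a toy one-step bandit rather than a full MDP; the reward vector and all names are made up) of the score-function estimator above, using a softmax policy over a few discrete actions:

```python
import numpy as np

rng = np.random.default_rng(3)

K = 4
true_rewards = np.array([1.0, 2.0, 0.5, 3.0])   # deterministic reward per arm (made up)
theta = np.zeros(K)                             # softmax logits

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def pg_estimate(theta, n_samples):
    """Monte Carlo estimate of grad_theta E_{a~pi}[r(a)] via the log-derivative trick."""
    pi = softmax(theta)
    grad = np.zeros(K)
    for _ in range(n_samples):
        a = rng.choice(K, p=pi)
        r = true_rewards[a]
        grad_log_pi = -pi.copy()       # d/dtheta log softmax(theta)[a] = e_a - pi
        grad_log_pi[a] += 1.0
        grad += grad_log_pi * r
    return grad / n_samples

pi = softmax(theta)
exact = np.array([pi[a] * (true_rewards[a] - pi @ true_rewards) for a in range(K)])
print("exact gradient     :", np.round(exact, 3))
print("estimate, N=100    :", np.round(pg_estimate(theta, 100), 3))
print("estimate, N=100000 :", np.round(pg_estimate(theta, 100_000), 3))
# The small-sample estimate is noticeably noisy -- this is the variance problem
# discussed in the next section.
```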

What is wrong with PG?

The variance of the returns collected over a few sampled trajectories (out of the space of all possible trajectories) is high, which means the Monte Carlo estimate of the gradient built from those samples is very noisy, and this gets worse as trajectories differ more in performance.[2]

Figure: the different colored trajectories are plotted along with the running reward variance. Different trajectories clearly lead to significantly different returns; even in agents like PPO (the one used in the figure), learned trajectories have high return variance.

A relatively high-performing trajectory shifts the distribution much more than a relatively poorly-performing one. The trick is to reduce the magnitude of the returns while maintaining the relative ordering of rewards amongst trajectories. So the task is to reduce variance. One way we achieve this is by discarding rewards obtained at earlier time steps.

As covered below, the causality trick removes terms from the reward sum, thereby reducing variance, and adding baselines normalizes the sampled rewards, again reducing variance. Similar considerations apply to off-policy policy gradient methods, where the model is trained on data from a different policy and learns new parameters $\theta'$ (similar to fine-tuning in DL terms).

What else is wrong with Policy Gradients?

Notice that there is no regularizing term in the objective function (unlike many well-behaved ML objectives), so a single gradient step can change the parameters a lot.

The key problem is that some parameters are more sensitive than others: the policy can change a lot from a small step along some parameter directions and barely at all along others, so constraining how much each update changes the policy is essential for reliable performance.

The solution is to rescale the gradient rather than the parameters during training, so the regularization acts on the policy distribution instead! This is the idea behind the natural policy gradient, shown in (Peters & Schaal, 2008).
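Below is a tiny sketch of that underlying idea only (not the full natural actor-critic algorithm of Peters & Schaal): rescale the vanilla gradient by the damped inverse Fisher information of a softmax policy. The damping constant and example numbers are made up for illustration.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def natural_gradient(theta, vanilla_grad, damping=1e-3):
    """Rescale the policy gradient by the inverse Fisher information matrix
    F = E_{a~pi}[grad log pi(a) grad log pi(a)^T] of a softmax policy."""
    pi = softmax(theta)
    F = np.diag(pi) - np.outer(pi, pi)     # Fisher matrix of a categorical softmax
    F += damping * np.eye(len(pi))         # damping: F is singular for softmax logits
    return np.linalg.solve(F, vanilla_grad)

theta = np.array([0.0, 0.0, 0.0, 2.0])
g = np.array([0.01, 0.02, 0.005, -0.03])   # some hypothetical vanilla gradient
print("vanilla :", g)
print("natural :", np.round(natural_gradient(theta, g), 3))
# Directions along which the policy distribution barely moves get amplified,
# directions that would change it a lot get shrunk.
```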

[1] Doing so introduces more bias while reducing variance, but then why doesn't this work for dimensions $n\leq2$?
[2] You can find the code for this here.

Variance Reduction

Reward-to-go

One way to reduce the variance of the policy gradient is to exploit causality: the notion that the policy cannot affect rewards in the past. This yields the following modified estimator, where the sum of rewards no longer includes the rewards obtained before the time step at which the policy is queried:

$$\nabla_\theta J(\theta)\approx\frac{1}{N}\sum\limits_{i=1}^N\sum\limits_{t=0}^{T-1}\nabla_\theta\log\pi_\theta(a_{it}|s_{it})\left(\sum\limits_{t'=t}^{T-1}r(s_{it'},a_{it'})\right).$$

This sum of rewards is a sample estimate of the Q-function ($\hat{Q}^\pi_{i,t}$), and is referred to as the "reward-to-go". The Q-function can also be defined as the estimate of the expected reward if we take action $\mathbf a_{i,t}$ in state $\mathbf s_{i,t}$.
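A short helper (hypothetical name) computing the reward-to-go for one trajectory; a reversed cumulative sum does the job:

```python
import numpy as np

def reward_to_go(rewards):
    """rewards: array of r(s_t, a_t) for t = 0..T-1.
    Returns an array whose t-th entry is sum_{t' >= t} r(s_t', a_t')."""
    return np.cumsum(rewards[::-1])[::-1]

rewards = np.array([1.0, 0.0, 2.0, 1.0])
print(reward_to_go(rewards))   # [4. 3. 3. 1.]
```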

Discounting

Multiplying the rewards by a discount factor $\gamma$ can be interpreted as encouraging the agent to focus more on rewards that are closer in time, and less on rewards that are further in the future. This can also be thought of as a means of reducing variance (because there is more variance possible in futures that are further away). The discount factor can be incorporated in two ways, as shown below.

The first way applies the discount to the rewards of the full trajectory:

$$\nabla_\theta J(\theta)\approx\frac{1}{N}\sum\limits_{i=1}^N\left(\sum\limits_{t=0}^{T-1}\nabla_\theta\log\pi_\theta(a_{it}|s_{it})\right)\left(\sum\limits_{t'=0}^{T-1}\gamma^{t'}r(s_{it'},a_{it'})\right).$$

The second way applies the discount to the reward-to-go, i.e. using $\gamma^{t'-t}$ inside the per-time-step sum, as in the baseline-adjusted estimator below.
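A sketch of the two variants (function names are made up): discounting the full-trajectory return versus discounting the reward-to-go from each time step.

```python
import numpy as np

def discounted_return(rewards, gamma):
    """Single scalar: sum_t gamma^t * r_t, applied to the full trajectory."""
    t = np.arange(len(rewards))
    return np.sum(gamma ** t * rewards)

def discounted_reward_to_go(rewards, gamma):
    """Array whose t-th entry is sum_{t' >= t} gamma^(t'-t) * r_t'."""
    out = np.zeros_like(rewards, dtype=float)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        out[t] = running
    return out

rewards = np.array([1.0, 0.0, 2.0, 1.0])
print(discounted_return(rewards, gamma=0.9))        # 1 + 0 + 0.81*2 + 0.729*1
print(discounted_reward_to_go(rewards, gamma=0.9))  # per-time-step discounted sums
```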

Baselines

Another variance reduction method is to subtract a baseline (a quantity that is constant with respect to $\tau$) from the sum of rewards:

$$\nabla_\theta J(\theta)=\nabla_\theta E_{\tau\sim\pi_\theta(\tau)}[r(\tau)-b].$$

This leaves the policy gradient unbiased because

$$\nabla_\theta E_{\tau\sim\pi_\theta(\tau)}[b]=0.$$
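A quick numeric check of this identity for the softmax policy used earlier: the expected score $\mathbb{E}_{a\sim\pi}[\nabla_\theta\log\pi_\theta(a)]$ is zero, so any constant baseline contributes nothing in expectation (the logits and baseline value here are arbitrary).

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

theta = np.array([0.3, -1.2, 0.7, 0.0])
pi = softmax(theta)

# E_{a~pi}[ grad_theta log pi(a) ] = sum_a pi(a) * (e_a - pi) = 0
expected_score = sum(p * (np.eye(len(pi))[a] - pi) for a, p in enumerate(pi))
print(np.round(expected_score, 12))        # ~[0. 0. 0. 0.]

b = 5.0                                    # any constant baseline
print(np.round(b * expected_score, 12))    # still ~zero: the baseline adds no bias
```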

A common choice of baseline is a learned value function $V^\pi_\phi$, trained to approximate the sum of future rewards starting from a particular state:

$$V_{\phi}^\pi(s_t)\approx\sum\limits_{t'=t}^{T-1}E_{\pi_\theta}[r(s_{t'},a_{t'})|s_t],$$

so the approximate policy gradient now looks like this:

$$\nabla_\theta J(\theta)\approx\frac{1}{N}\sum\limits_{i=1}^N\sum\limits_{t=0}^{T-1}\nabla_\theta\log\pi_\theta(a_{it}|s_{it})\left(\left(\sum\limits_{t'=t}^{T-1}\gamma^{t'-t}r(s_{it'},a_{it'})\right)-V_{\phi}^\pi(s_{it})\right).$$

Generalized Advantage Estimation

The quantity $\left(\sum_{t'=t}^{T-1}\gamma^{t'-t}r(s_{t'},a_{t'})\right)-V_{\phi}^\pi(s_t)$ from the previous policy gradient expression (dropping the $i$ index for clarity) can be interpreted as an estimate of the advantage function:

$$A^\pi(s_t,a_t)=Q^\pi(s_t,a_t)-V^\pi(s_t),$$

where $Q^\pi(s_t,a_t)$ is estimated using Monte Carlo returns and $V^\pi(s_t)$ is estimated using the learned value function $V_\phi^\pi$. $A^\pi(s_t,a_t)$ is called the advantage function. We can further reduce variance by also using $V^\pi_\phi$ in place of the Monte Carlo returns to estimate the advantage function as:

$$A^\pi(s_t,a_t)\approx\delta_t=r(s_t,a_t)+\gamma V_\phi^\pi(s_{t+1})-V^\pi_\phi(s_t),$$

with the edge case $\delta_{T-1}=r(s_{T-1},a_{T-1})-V^\pi_\phi(s_{T-1})$. However, this comes at the cost of introducing bias into our policy gradient estimate, due to modelling errors in $V_\phi^\pi$. We can instead use a combination of $n$-step Monte Carlo returns and $V_\phi^\pi$ to estimate the advantage function as:

$$A_n^\pi(s_t,a_t)=\sum\limits_{t'=t}^{t+n}\gamma^{t'-t}r(s_{t'},a_{t'})+\gamma^{n+1}V_\phi^\pi(s_{t+n+1})-V^\pi_\phi(s_t).$$

Increasing $n$ incorporates the Monte Carlo returns more heavily in the advantage estimate, which lowers bias and increases variance, while decreasing $n$ does the opposite. Note that $n=T-t-1$ recovers the unbiased but higher-variance Monte Carlo advantage estimate used above, while $n=0$ recovers the lower-variance but higher-bias estimate $\delta_t$.
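A direct transcription of this estimator (array names are hypothetical); `values` holds $V^\pi_\phi(s_0),\ldots,V^\pi_\phi(s_T)$ with the terminal value set to zero:

```python
import numpy as np

def n_step_advantage(rewards, values, t, n, gamma):
    """A_n(s_t,a_t) = sum_{t'=t}^{t+n} gamma^(t'-t) r_t' + gamma^(n+1) V(s_{t+n+1}) - V(s_t).
    rewards: r_0..r_{T-1}; values: V(s_0)..V(s_T) with V(s_T) = 0 at the terminal state."""
    T = len(rewards)
    n = min(n, T - t - 1)                      # clip so we never index past the end
    ks = np.arange(n + 1)
    ret = np.sum(gamma ** ks * rewards[t:t + n + 1])
    return ret + gamma ** (n + 1) * values[t + n + 1] - values[t]

rewards = np.array([1.0, 0.0, 2.0, 1.0])
values  = np.array([2.5, 2.0, 2.2, 1.0, 0.0])   # made-up value estimates, terminal = 0
for n in (0, 1, 3):
    print(f"n={n}: A_n =", round(n_step_advantage(rewards, values, t=0, n=n, gamma=0.99), 3))
```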

We can combine multiple $n$-step advantage estimates as an exponentially weighted sum, which is known as the generalized advantage estimator (GAE). Let $\lambda\in[0,1]$. Then we define:

$$A^\pi_\text{GAE}(s_t,a_t)=\frac{1-\lambda^{T-t-1}}{1-\lambda}\sum\limits_{n=1}^{T-t-1}\lambda^{n-1}A_n^\pi(s_t,a_t),$$

where $\frac{1-\lambda^{T-t-1}}{1-\lambda}$ is a normalizing constant. Note that a higher $\lambda$ emphasizes advantage estimates with higher values of $n$, and a lower $\lambda$ does the opposite. Thus, $\lambda$ serves as a control for the bias-variance tradeoff, where increasing $\lambda$ decreases bias and increases variance. In the infinite horizon case ($T=\infty$), we can show:

$$\begin{aligned} A_\text{GAE}^\pi(s_t,a_t)&=\frac{1}{1-\lambda}\sum\limits_{n=1}^\infty\lambda^{n-1}A_n^\pi(s_t,a_t)\\&=\sum\limits_{t'=t}^\infty (\gamma\lambda)^{t'-t}\delta_{t'}, \end{aligned}$$

where we have omitted the derivation for brevity (see the GAE paper for details). In the finite horizon case, we can write:

$$A^\pi_\text{GAE}(s_t,a_t)=\sum\limits_{t'=t}^{T-1}(\gamma\lambda)^{t'-t}\delta_{t'},$$

which suggests an efficient way to implement the generalized advantage estimator, since we can compute it recursively:

$$A^\pi_\text{GAE}(s_t,a_t)=\delta_t+\gamma\lambda A^\pi_\text{GAE}(s_{t+1},a_{t+1}).$$
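This recursion translates into a single backward pass; a minimal sketch (hypothetical array names), again with `values` of length $T+1$ and a zero terminal value:

```python
import numpy as np

def gae(rewards, values, gamma, lam):
    """Generalized advantage estimates via the backward recursion
    A_t = delta_t + gamma * lam * A_{t+1},  delta_t = r_t + gamma*V(s_{t+1}) - V(s_t).
    rewards: r_0..r_{T-1}; values: V(s_0)..V(s_T) with V(s_T) = 0."""
    T = len(rewards)
    adv = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        adv[t] = running
    return adv

rewards = np.array([1.0, 0.0, 2.0, 1.0])
values  = np.array([2.5, 2.0, 2.2, 1.0, 0.0])
print(np.round(gae(rewards, values, gamma=0.99, lam=0.95), 3))
# lam=0 gives the one-step TD errors delta_t; lam=1 gives the Monte Carlo advantage.
```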

Similar variance considerations apply to off-policy policy gradient methods, where the model is trained on data gathered by a different policy and learns new parameters $\theta'$ (similar to fine-tuning in DL terms).

So we see that:

By using the causality trick, we drop terms from the reward sum without changing the expectation, thereby reducing variance; and by adding baselines, we normalize the sampled rewards, again reducing variance.

In GAE, the advantage function $A^\pi(s_t,a_t)$ takes the place of the baseline-adjusted return.

If the advantage estimate is inaccurate, the entire policy gradient can be biased, which is OK: as with Stein's estimator, we trade a bit of bias for a large reduction in variance.

References

Further Reading
