GRPO

Juan Vera

February 2025

Proximal Policy Optimization

PPO optimizes agents by the following surrogate objective:

$$ J(\pi_\theta) = \mathbb{E}_t\left[ \min\left(u_t(\theta)\,\Phi_t,\ \text{clip}(u_t(\theta),\, 1 - \epsilon,\, 1 + \epsilon)\,\Phi_t\right) \right] $$

where

~ $u_t(\theta) = \dfrac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}$
~ $t$ indexes the state-action pair at time step $t$ of a trajectory $\tau$, which is generated by the policy $\pi_\theta$ as $\mathcal{T} \sim \pi_\theta$, where we have $n$ trajectories $\tau \in \mathcal{T}$
~ $\Phi_t$ is a generic per-step credit signal (e.g., the return or an advantage estimate).

Note that the $\text{clip}$ function is an implicit, indirect way to limit the $\text{KL}$ divergence between the current policy and a reference policy (the model at the prior set of parameters $\theta$ during training).
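As a concrete reference, here is a minimal sketch of the clipped surrogate in PyTorch, assuming per-step log-probabilities under the current and old policies and a per-step credit signal $\Phi_t$ are already available (all tensor names and values are illustrative):

```python
# Minimal sketch of the PPO clipped surrogate objective (assumed shapes: one trajectory of T steps).
import torch

def ppo_clipped_objective(logp_new, logp_old, phi, eps=0.2):
    """logp_new, logp_old, phi: tensors of shape (T,)."""
    u = torch.exp(logp_new - logp_old)            # probability ratio u_t(theta)
    unclipped = u * phi
    clipped = torch.clamp(u, 1 - eps, 1 + eps) * phi
    return torch.min(unclipped, clipped).mean()   # E_t[min(...)], to be maximized

# usage with dummy values
logp_new = torch.randn(8, requires_grad=True)
logp_old = logp_new.detach() + 0.05 * torch.randn(8)
phi = torch.randn(8)
J = ppo_clipped_objective(logp_new, logp_old, phi)
(-J).backward()                                   # gradient ascent on J via minimizing -J
```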

Framing this in the setting of language models, we can construct the objective as:

$$ J_{\text{PPO}}(\theta) = \mathbb{E}_{t}\left[\frac{1}{|o|} \sum_{t=1}^{|o|} \min \left( \frac{\pi_{\theta}(o_t \mid q, o_{<t})}{\pi_{\theta_{\text{old}}}(o_t \mid q, o_{<t})} A_t,\ \text{clip} \left( \frac{\pi_{\theta}(o_t \mid q, o_{<t})}{\pi_{\theta_{\text{old}}}(o_t \mid q, o_{<t})}, 1 - \epsilon, 1 + \epsilon \right) A_t \right) \right] $$

where

~ $t$ indexes a position in the query-completion pair (the tokens generated so far $o_{<t}$ and the current query $q$)
~ $q \sim P(Q),\ o \sim \pi_{\theta_{\text{old}}}(o \mid q)$
~ $o_t$ is equivalent to the "action": the token output by the language model at position $t$
~ $q$ is the input or query
~ $o_{<t}$ is the previously generated sequence of tokens.

$A_t$ is the advantage, which can be computed using Generalized Advantage Estimation (GAE) based on a learned critic that estimates the value function $V$.
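A hedged sketch of the token-level version for a language model, assuming the logits for the sampled completion under both policies and per-token advantages $A_t$ are already available (the helper names are illustrative, not from any particular library):

```python
# Token-level PPO objective sketch for a language model.
import torch
import torch.nn.functional as F

def token_logprobs(logits, token_ids):
    """Log-probability of each sampled token: logits (T, V), token_ids (T,) of dtype long."""
    logp = F.log_softmax(logits, dim=-1)
    return logp.gather(-1, token_ids.unsqueeze(-1)).squeeze(-1)

def ppo_lm_objective(logits_new, logits_old, completion_ids, advantages, eps=0.2):
    logp_new = token_logprobs(logits_new, completion_ids)
    logp_old = token_logprobs(logits_old, completion_ids).detach()
    ratio = torch.exp(logp_new - logp_old)                     # pi_theta / pi_theta_old per token
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * advantages
    return torch.min(unclipped, clipped).mean()                # (1/|o|) sum over the |o| tokens
```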

In PPO, we need to train a value function alongside the policy model. The issue is that the reward model can become over-optimized: it begins to overfit to a poor policy network and starts providing high reward signal to actions that should be tied to low reward. This follows from the fact that we aim to $\max J(\theta)$; if we train the reward model alongside an untrained policy, we are prone to incorrect optimization.

So, we can introduce a penalty based on the KL divergence as:

$$ J_{\text{PPO}}(\theta) = \mathbb{E}_{t}\left[\frac{1}{|o|} \sum_{t=1}^{|o|} \min \left( \frac{\pi_{\theta}(o_t \mid q, o_{<t})}{\pi_{\theta_{\text{old}}}(o_t \mid q, o_{<t})} A_t,\ \text{clip} \left( \frac{\pi_{\theta}(o_t \mid q, o_{<t})}{\pi_{\theta_{\text{old}}}(o_t \mid q, o_{<t})}, 1 - \epsilon, 1 + \epsilon \right) A_t \right) - \beta\,\text{KL}\!\left(\pi_\theta \,\|\, \pi_{\theta_{\text{old}}}\right)\right] $$

where $\beta\,\text{KL}(\pi_\theta \,\|\, \pi_{\theta_{\text{old}}})$ is a penalty equal to a fraction ($\beta$) of the divergence of the current policy from the old policy.

The intuition behind this is that we don't want $\pi_\theta$ to diverge too far from $\pi_{\theta_{\text{old}}}$, which would cause unstable training. Since the $\text{KL}$ divergence measures how far apart the two policies are, subtracting a fraction ($\beta$) of it reduces the objective in proportion to the amount of divergence.

Therefore, the policy gradients $\nabla_\theta J_{\text{PPO}}$ are also damped, preventing unstable / large updates.
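As an illustration, a per-token KL penalty could be added along the lines of the sketch below. The particular estimator (ratio minus log-ratio minus one) is an assumption here, chosen because it is non-negative and low-variance, and $\beta = 0.04$ is an arbitrary example value:

```python
# Sketch of a per-token KL penalty added to the PPO / GRPO objective.
import torch

def kl_penalty(logp_new, logp_old, beta=0.04):
    """logp_new, logp_old: per-token log-probs on the same sampled tokens, shape (T,)."""
    log_ratio = logp_old - logp_new                  # log(pi_old / pi_theta) per token
    kl_est = torch.exp(log_ratio) - log_ratio - 1.0  # >= 0, estimates KL(pi_theta || pi_old)
    return beta * kl_est.mean()

# objective = ppo_lm_objective(...) - kl_penalty(logp_new, logp_old)
```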

The reason we want to prevent unstable updates is to keep the policy network from overfitting to the critic network (which yields inaccurate estimates while it is still poorly trained), before the critic has properly fit the approximate reward function.

PPO makes use of an advantage function to reduce variance in the gradient updates ($A_t = R_t - V(s_t)$), but learning a value function can significantly increase memory and computation costs, as the value model needs to be of similar size to the policy.

In LLMs, the reward model assigns a reward only to the final token, making the reward signal sparse with respect to the intermediate predictions and making it hard for the value function to estimate values at each token in the sequence (you don't want the intermediate sentences in the output to make little sense, with only the ending forming a sensible output).
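A toy illustration of that sparsity, with hypothetical numbers:

```python
# Toy illustration of the sparse reward signal: for a completion of T tokens, the
# reward model scores only the finished sequence, so every intermediate token gets 0.
T = 6
final_reward = 0.83                               # scalar score from the reward model
per_token_rewards = [0.0] * (T - 1) + [final_reward]
# -> [0.0, 0.0, 0.0, 0.0, 0.0, 0.83]; the value function must spread credit backwards
```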

Group Relative Policy Optimization

GRPO removes the need to approximate the value function, and instead uses the average reward of multiple outputs sampled in response to the same question as the baseline.

For each question $q$, GRPO samples a group of outputs $\{o_1, o_2, \dots, o_G\}$ from the old policy and then optimizes the policy model by maximizing the objective:

$$ J_{\text{GRPO}}(\theta) = \mathbb{E}_{q \sim P(Q),\, \{o_i\}_{i=1}^{G} \sim \pi_{\theta_{\text{old}}}(O \mid q)} \left[ \frac{1}{G} \sum_{i=1}^{G} \frac{1}{|o_i|} \sum_{t=1}^{|o_i|} \min \left( \frac{\pi_{\theta}(o_{i,t} \mid q, o_{i,<t})}{\pi_{\theta_{\text{old}}}(o_{i,t} \mid q, o_{i,<t})} \hat{A}_{i,t},\ \text{clip} \left( \frac{\pi_{\theta}(o_{i,t} \mid q, o_{i,<t})}{\pi_{\theta_{\text{old}}}(o_{i,t} \mid q, o_{i,<t})}, 1 - \epsilon, 1 + \epsilon \right) \hat{A}_{i,t} \right) - \beta\, \text{KL}\!\left(\pi_{\theta} \,\|\, \pi_{\theta_{\text{old}}}\right) \right] $$

$$ (1) \hspace{.25in} \hat{A}_{i, t} = \frac{r_i - \text{mean}(r)}{\text{std}(r)} $$

GRPO uses only a reward model ($r$) as the critic, to estimate the immediate reward of each output, removing the extra computational cost that a value model would require.

Instead, GRPO uses the average reward of multiple sampled outputs, produced in response to the same question (query, $q$), as the baseline.
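A minimal sketch of this group-relative baseline, assuming one scalar reward per sampled output (dummy values below):

```python
# Group-relative advantage from eq. (1): normalize each reward within its group.
import torch

def group_relative_advantages(rewards, eps=1e-6):
    """rewards: tensor of shape (G,), one scalar reward per sampled output o_i."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

rewards = torch.tensor([0.1, 0.9, 0.4, 0.7])   # dummy rewards for G = 4 outputs
adv = group_relative_advantages(rewards)        # \hat{A}_i for each output
```

Every token of output $o_i$ then shares the same advantage $\hat{A}_{i,t}$.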

Consider the typical advantage function in PPO as:

$$ A^{\pi}(s, a) = Q^{\pi}(s,a) - V^{\pi}(s) $$

where $Q^\pi(s,a)$ is the expected return from the current state $s$ when taking the specific action $a$ (and following $\pi$ thereafter).

Note that $Q^{\pi}(s,a)$ is defined recursively over all possible future states and actions after the transition $a \rightarrow s'$, as can be seen via its expanded Bellman equation:

$$ Q^\pi(s, a) = \mathbb{E}_{s' \sim P(\cdot \mid s, a)} \left[ R(s, a, s') + \gamma\, \mathbb{E}_{a' \sim \pi(\cdot \mid s')} \left[ \mathbb{E}_{s'' \sim P(\cdot \mid s', a')} \left[ R(s', a', s'') + \gamma\, \mathbb{E}_{a'' \sim \pi(\cdot \mid s'')}\left[ Q^\pi(s'', a'') \right] \right] \right] \right] $$

or more concisely as:

$$ Q^\pi(s, a) = \mathbb{E}_{s' \sim P(\cdot \mid s, a)} \left[ R(s, a, s') + \gamma\, \mathbb{E}_{a' \sim \pi(\cdot \mid s')} \left[ Q^\pi(s', a') \right] \right] $$

In PPO, we typically don't have an accurate way to estimate the state-action value function $Q$. This is because $Q$ depends on the actions that would be taken at future states $s_{t+n}$, where $n$ runs over the length of the rollout.

Given the immense number of possible actions $a$ that can be taken from the current state $s$ in a continuous action space, and the exponentially growing set of possible $a'$ as we consider different successor states $s'$, estimating $Q$ directly becomes computationally infeasible.

So PPO typically estimates the $Q$-function in terms of the value function, which allows the advantage function to be written in terms of $V$.

Given the two Bellman equations for $Q$ and $V$, we can rewrite them in terms of each other as:

$$ V^\pi(s) = \mathbb{E}_{a \sim \pi(\cdot \mid s)} \left[ \mathbb{E}_{s' \sim P(\cdot \mid s, a)} \left[ R(s, a, s') + \gamma V^\pi(s') \right] \right] = \mathbb{E}_{a \sim \pi(\cdot \mid s)}\left[Q^{\pi}(s, a)\right] $$

$$ Q^\pi(s, a) = \mathbb{E}_{s' \sim P(\cdot \mid s, a)} \left[ R(s, a, s') + \gamma\, \mathbb{E}_{a' \sim \pi(\cdot \mid s')} \left[ Q^\pi(s', a') \right] \right] = \mathbb{E}_{s' \sim P(\cdot \mid s, a)} \left[ R(s, a, s') + \gamma V^{\pi}(s')\right] $$

and therefore, an advantage function can be rewritten as:

$$ (2) \hspace{.25in} A^{\pi}(s, a) = \left[R(s, a, s') + \gamma V^{\pi}(s')\right] - V^{\pi}(s) $$

where the Generalized Advantage Estimation (GAE), which estimates this advantage function, can then be constructed as:

$$ A_t^{\text{GAE}} = \sum_{l = 0}^{T - t}(\gamma\lambda)^l\, \delta_{t+l}, \qquad \delta_t = r_t + \gamma V^{\pi}(s_{t+1}) - V^{\pi}(s_t) $$

Note that the construction of $\delta_t$ is the advantage function (eq. 2); the weights $(\gamma\lambda)^l$, with discount factor $\gamma$ and mixing parameter $\lambda$, down-weight later residuals to prioritize short-term rewards.
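For reference, a short sketch of computing $A_t^{\text{GAE}}$ by iterating backwards over a trajectory, assuming per-step rewards and value estimates (including a bootstrap value for the final state) are already available:

```python
# Sketch of Generalized Advantage Estimation over one trajectory.
import torch

def gae(rewards, values, gamma=0.99, lam=0.95):
    """rewards: (T,); values: (T+1,), including a bootstrap value for the final state."""
    T = rewards.shape[0]
    adv = torch.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]   # TD residual delta_t
        running = delta + gamma * lam * running                   # discounted sum of residuals
        adv[t] = running
    return adv
```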

Even here, using $A_t^{\text{GAE}}$ can become computationally expensive, as the value model that estimates $V^{\pi}(\cdot)$ is typically of similar size to the policy network $\pi$. For state-of-the-art language modelling, DeepSeekMath-7B for instance, you'd need a value model of similar capacity, which can be computationally infeasible to train.

Another example is DeepSeek-V3-671B: a huge policy network, and likely a correspondingly large value network. Training a large value network alongside the LLM, on top of the multiple rollouts needed to construct a set of trajectories $\mathcal{T}$ over an enormous action space, can be very costly.

So GRPO simply removes the need for $\text{GAE}$ and instead estimates the advantage as the difference between the reward for the current output $o_i$ and the mean reward of the group of outputs (collected via rollouts from the old policy), normalized by the group's standard deviation, as seen in eq. (1).

All that's left is to $\max J_{\text{GRPO}}(\theta)$: compute $\nabla_\theta J_{\text{GRPO}}(\theta)$ and train $\pi$ via backpropagation and gradient ascent.
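Putting the pieces together, a hedged end-to-end sketch of one GRPO loss computation might look as follows. It follows eq. (1) directly and does not correspond to any specific library's API; the per-token log-probs are assumed to be computed as in the earlier token-level sketch, and all names and hyperparameters are illustrative:

```python
# End-to-end sketch of the GRPO loss for one query q with G sampled outputs.
import torch

def grpo_loss(logp_new, logp_old, rewards, beta=0.04, eps=0.2):
    """
    logp_new, logp_old: lists of length G, each a (|o_i|,) tensor of per-token log-probs
    rewards: (G,) tensor with one scalar reward per sampled output o_i
    """
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-6)   # group-relative \hat{A}_i
    per_output = []
    for lp_new, lp_old, a in zip(logp_new, logp_old, adv):
        ratio = torch.exp(lp_new - lp_old)
        surr = torch.min(ratio * a, torch.clamp(ratio, 1 - eps, 1 + eps) * a)
        log_r = lp_old - lp_new
        kl = torch.exp(log_r) - log_r - 1.0           # per-token KL estimate (assumed estimator)
        per_output.append((surr - beta * kl).mean())  # (1/|o_i|) sum over tokens
    return -torch.stack(per_output).mean()            # negate so a standard optimizer ascends J
```

Minimizing this loss with a standard optimizer (gradient descent on the negation) implements the gradient-ascent step on $J_{\text{GRPO}}$.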