GRPO
Juan Vera
February 2025
Proximal Policy Optimization
PPO optimizes agents by maximizing the following surrogate objective:

$$\mathcal{J}_{PPO}(\theta) = \mathbb{E}_{(s_t, a_t) \sim \pi_{\theta_{old}}}\left[ \min\left( \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)}\, A_t,\ \operatorname{clip}\!\left( \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)},\ 1-\epsilon,\ 1+\epsilon \right) A_t \right) \right]$$

where

- $\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)}$ is the probability ratio between the current policy $\pi_\theta$ and the old policy $\pi_{\theta_{old}}$
- $(s_t, a_t)$ is the state-action pair at time step $t$ in the trajectory $\tau$, which is constructed by the old policy $\pi_{\theta_{old}}$, as $\tau = (s_0, a_0, s_1, a_1, \ldots)$, where we have $a_t \sim \pi_{\theta_{old}}(\cdot \mid s_t)$
- $r$ is a reward function, left unspecified for now, from which the advantage $A_t$ is computed.

Note that the $\operatorname{clip}$ function is an implicit, indirect way to limit the divergence of the current policy from a reference policy $\pi_{\theta_{old}}$ (the model at the prior set of parameters during training).
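As a concrete reference, here is a minimal PyTorch sketch of this clipped surrogate (the function and tensor names are illustrative assumptions, not from the original):

```python
import torch

def ppo_clipped_objective(logp_new, logp_old, advantages, eps=0.2):
    """Minimal sketch of the PPO clipped surrogate (to be maximized).

    logp_new:   log pi_theta(a_t | s_t)     for the sampled actions, shape (T,)
    logp_old:   log pi_theta_old(a_t | s_t) for the same actions,    shape (T,)
    advantages: advantage estimates A_t,                              shape (T,)
    """
    ratio = torch.exp(logp_new - logp_old)                 # pi_theta / pi_theta_old
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * advantages
    return torch.min(unclipped, clipped).mean()            # maximize this quantity
```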
Framing this in a setting for language models, we can construct this as:

$$\mathcal{J}_{PPO}(\theta) = \mathbb{E}_{q \sim P(Q),\ o \sim \pi_{\theta_{old}}(O \mid q)}\left[ \frac{1}{|o|} \sum_{t=1}^{|o|} \min\left( \frac{\pi_\theta(o_t \mid q, o_{<t})}{\pi_{\theta_{old}}(o_t \mid q, o_{<t})}\, A_t,\ \operatorname{clip}\!\left( \frac{\pi_\theta(o_t \mid q, o_{<t})}{\pi_{\theta_{old}}(o_t \mid q, o_{<t})},\ 1-\epsilon,\ 1+\epsilon \right) A_t \right) \right]$$

where

- $(q, o)$ is the query-answer pair of the language model (the answer sampled from the old policy and the current query, as $o \sim \pi_{\theta_{old}}(O \mid q)$ and $q \sim P(Q)$ respectively)
- $\frac{\pi_\theta(o_t \mid q, o_{<t})}{\pi_{\theta_{old}}(o_t \mid q, o_{<t})}$ is the per-token probability ratio between the current and old policies
- $o_t$ is equivalent to the "action", as the output token of the language model
- $q$ is the input or query
- $o_{<t}$ is the previous sequence of output tokens.

$A_t$ is the advantage function, which can be computed using Generalized Advantage Estimation (GAE) based on a learned critic which estimates the value function $V_\psi$.
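To make the token-level quantities concrete, the sketch below gathers the per-token log-probabilities $\log \pi(o_t \mid q, o_{<t})$ that the ratio above is built from. It assumes a Hugging Face-style causal LM interface (`model(input_ids).logits`); the function and argument names are hypothetical.

```python
import torch
import torch.nn.functional as F

def sequence_token_logprobs(model, input_ids, query_len):
    """Per-token log pi(o_t | q, o_<t) for the output portion of a sequence.

    input_ids: (1, L) query tokens followed by output tokens
    query_len: number of query tokens (the remaining tokens are o_1..o_T)
    """
    # no_grad is appropriate for the old / reference policy;
    # drop it when gradients through pi_theta are needed.
    with torch.no_grad():
        logits = model(input_ids).logits                      # (1, L, vocab)
    logprobs = F.log_softmax(logits[:, :-1, :], dim=-1)        # position t predicts token t+1
    targets = input_ids[:, 1:]                                 # shifted targets
    token_logprobs = logprobs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    return token_logprobs[:, query_len - 1:]                   # keep only the output tokens
```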
In PPO, we need to train a value function alongside the policy model. The issue is that the reward model can become over-optimized, such that it begins to overfit to a poor policy network and provides a high reward signal to actions which should be tied to low reward (this follows since we aim to maximize $\mathcal{J}_{PPO}(\theta)$, and if we train the reward model alongside an untrained policy, we are prone to incorrect optimization).
So, we can introduce a penalty based on the KL divergence into the per-token reward as:

$$r_t = r_\varphi(q, o_{\leq t}) - \beta \log \frac{\pi_\theta(o_t \mid q, o_{<t})}{\pi_{ref}(o_t \mid q, o_{<t})}$$

where $\beta \log \frac{\pi_\theta(o_t \mid q, o_{<t})}{\pi_{ref}(o_t \mid q, o_{<t})}$ is a penalty based on a fraction of the divergence of the current policy from the reference (old) policy $\pi_{ref}$.

The intuition behind this is that we don't want $\pi_\theta$ to diverge from $\pi_{ref}$ too much, causing unstable training. Therefore, if the KL divergence is the measure of divergence between the two, adding a penalty as a fraction ($\beta$) of the divergence will reduce the reward proportional to the amount of divergence, and consequently the objective function will be reduced.
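A minimal sketch of this shaped per-token reward, assuming we already have per-token log-probabilities under the current and reference policies and per-token scores from a reward model $r_\varphi$ (all tensor names here are illustrative):

```python
import torch

def kl_shaped_rewards(reward_model_scores, logp_current, logp_ref, beta=0.04):
    """Per-token reward with a KL-style penalty against the reference policy.

    reward_model_scores: r_phi(q, o_<=t) per token,        shape (T,)
    logp_current:        log pi_theta(o_t | q, o_<t),       shape (T,)
    logp_ref:            log pi_ref(o_t | q, o_<t),         shape (T,)
    """
    kl_penalty = logp_current - logp_ref          # per-token log-ratio
    return reward_model_scores - beta * kl_penalty
```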
Therefore, the policy gradients $\nabla_\theta \mathcal{J}(\theta)$ are also reduced in magnitude, which prevents unstable / large updates.
The reason we want to prevent unstable updates is to keep the policy network from overfitting to a critic network (which yields incorrect value estimates while it isn't yet well trained), before the critic network has properly fit the approximate reward function.
PPO makes use of an advantage function $A_t$ to reduce the variance in the gradient updates, but learning a value function $V_\psi$ can significantly increase memory and computation costs (as the value network needs to be of similar size to the policy network).
In LLMs, the reward model only assigns a reward to the final token, making the reward signal sparse with respect to the intermediate predictions and making it hard for the value function to estimate values at each token in the sequence (you don't want the intermediate sentences in the output to make little sense, with only the ending forming a sensible output).
Group Relative Policy Optimization
GRPO removes the need for the approximation of the value function, and instead uses the average reward of multiple sampled outputs, in response to the same question, as the baseline.
For each question $q$, GRPO samples a group of outputs $\{o_1, o_2, \ldots, o_G\}$ from the old policy $\pi_{\theta_{old}}$ and then optimizes the policy model by maximizing the objective:

$$\mathcal{J}_{GRPO}(\theta) = \mathbb{E}_{q \sim P(Q),\ \{o_i\}_{i=1}^{G} \sim \pi_{\theta_{old}}(O \mid q)}\left[ \frac{1}{G} \sum_{i=1}^{G} \frac{1}{|o_i|} \sum_{t=1}^{|o_i|} \left( \min\left( \frac{\pi_\theta(o_{i,t} \mid q, o_{i,<t})}{\pi_{\theta_{old}}(o_{i,t} \mid q, o_{i,<t})}\, \hat{A}_{i,t},\ \operatorname{clip}\!\left( \frac{\pi_\theta(o_{i,t} \mid q, o_{i,<t})}{\pi_{\theta_{old}}(o_{i,t} \mid q, o_{i,<t})},\ 1-\epsilon,\ 1+\epsilon \right) \hat{A}_{i,t} \right) - \beta\, \mathbb{D}_{KL}\!\left[ \pi_\theta \,\|\, \pi_{ref} \right] \right) \right] \quad (1)$$
GRPO only uses a reward model ($r_\varphi$) as the critic, to estimate the reward at time step $t$, removing the extra computational cost that a value model would require.
Instead, GRPO uses the average reward of multiple sampled outputs, produced in response to the same question (query $q$), as the baseline.
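A minimal sketch of this group-relative baseline, assuming one scalar reward per sampled output; normalizing by the group's standard deviation as well as its mean is an assumption here (subtracting the group mean alone also matches the description above):

```python
import torch

def group_relative_advantages(rewards):
    """Advantage of each output relative to its group of G samples.

    rewards: shape (G,) -- one scalar reward per sampled output o_1..o_G
    Returns one advantage per output, to be broadcast to every token of that output.
    """
    mean, std = rewards.mean(), rewards.std()
    return (rewards - mean) / (std + 1e-8)    # \hat{A}_i for each output in the group
```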
Consider the typical advantage function in PPO:

$$A(s_t, a_t) = Q(s_t, a_t) - V(s_t) \quad (2)$$

where $Q(s_t, a_t)$ is the estimated return of being in the current state $s_t$ and taking the specific action $a_t$.

Note that $Q(s_t, a_t)$ is defined recursively over all possible future states and actions after taking the action $a_t$, which can be seen via its expanded Bellman equation:

$$Q(s_t, a_t) = r(s_t, a_t) + \gamma \sum_{s_{t+1}} P(s_{t+1} \mid s_t, a_t) \sum_{a_{t+1}} \pi(a_{t+1} \mid s_{t+1})\, Q(s_{t+1}, a_{t+1})$$

or more concisely as:

$$Q(s_t, a_t) = \mathbb{E}_{s_{t+1},\, a_{t+1}}\left[ r_t + \gamma\, Q(s_{t+1}, a_{t+1}) \right]$$
In PPO, we typically don't have an accurate way to estimate the state-action value function $Q(s_t, a_t)$. This is because $Q(s_t, a_t)$ relies on actions that are yet to be taken at steps $t+1, \ldots, T$, where $T$ is the length of your rollout.
Noting the immense number of possible actions that can be taken in a continuous action space at the current state $s_t$, and the exponentially increasing set of possible trajectories as we consider further time steps, this becomes computationally infeasible.
So PPO typically estimates the $Q$-function in terms of the value function $V$, which allows the advantage function to be written in terms of $V$.
Given the two Bellman equations for $Q$ and $V$, we can rewrite them in terms of each other as:

$$Q(s_t, a_t) = \mathbb{E}_{s_{t+1}}\left[ r_t + \gamma\, V(s_{t+1}) \right], \qquad V(s_t) = \mathbb{E}_{a_t \sim \pi(\cdot \mid s_t)}\left[ Q(s_t, a_t) \right]$$

and therefore, the advantage function can be rewritten as:

$$A(s_t, a_t) = r_t + \gamma\, V(s_{t+1}) - V(s_t)$$
where then Generalized Advantage Estimation (GAE), which estimates the advantage function, can be constructed as:

$$\hat{A}_t^{GAE(\gamma, \lambda)} = \sum_{l=0}^{\infty} (\gamma \lambda)^l\, \delta_{t+l}, \qquad \delta_t = r_t + \gamma\, V(s_{t+1}) - V(s_t)$$

Note that the construction for $\delta_t$ is the advantage function (eq. 2), rewritten in terms of $V$ with a discount factor $\gamma$, to prioritize short-term rewards.
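A minimal sketch of computing GAE over a finite rollout using the standard reverse recursion (the shapes and the bootstrap value for the final state are assumptions):

```python
import torch

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over a finite rollout of length T.

    rewards: r_t for t = 0..T-1, shape (T,)
    values:  V(s_t) for t = 0..T, shape (T+1,) (last entry bootstraps the final state)
    """
    T = rewards.shape[0]
    advantages = torch.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]   # one-step advantage delta_t
        gae = delta + gamma * lam * gae                          # accumulates (gamma*lam)^l weights
        advantages[t] = gae
    return advantages
```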
Even here, using GAE can become computationally expensive, as the value model which estimates $V$ is typically of similar size to the policy network $\pi_\theta$. In SOTA language modelling, for DeepSeekMath-7B for instance, you'd need a value model of similar capacity, which can be computationally infeasible to train.
Or take another example, DeepSeek-V3-671B: a huge policy network, which would require a correspondingly large value network. Training a large value network alongside the LLM, while also taking into account the multiple rollouts needed to construct a set of trajectories $\tau$ over such a large action space, can be very costly.
So GRPO simply rids us of the need for $V_\psi$ and instead estimates the advantage as the difference between the reward for the current output and the average reward of the group of outputs (collected via rollouts from the old policy), as seen in eq. 1.
All that's left is to sample a group of outputs per query, compute $\mathcal{J}_{GRPO}(\theta)$, and train via backpropagation and gradient ascent.
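Putting the pieces together, here is a compact sketch of one GRPO update on a single query's group of outputs. The inputs (per-token log-probabilities under the current, old, and reference policies, plus the group-relative advantages) mirror the functions sketched above; the exact per-token KL estimator used here is an assumption, not necessarily the DeepSeek implementation:

```python
import torch

def grpo_step(policy_logps, old_logps, ref_logps, advantages, optimizer,
              eps=0.2, beta=0.04):
    """One GRPO update on a group of G sampled outputs for the same query.

    policy_logps: list of G tensors, log pi_theta(o_{i,t} | q, o_{i,<t}) per token
                  (must be differentiable w.r.t. theta)
    old_logps:    same shapes, from pi_theta_old (detached)
    ref_logps:    same shapes, from pi_ref (detached)
    advantages:   tensor of shape (G,), group-relative advantage per output
    """
    per_output_terms = []
    for lp, lp_old, lp_ref, adv in zip(policy_logps, old_logps, ref_logps, advantages):
        ratio = torch.exp(lp - lp_old)
        unclipped = ratio * adv
        clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * adv
        # per-token estimate of D_KL[pi_theta || pi_ref] (one common unbiased estimator)
        kl = torch.exp(lp_ref - lp) - (lp_ref - lp) - 1
        per_output_terms.append((torch.min(unclipped, clipped) - beta * kl).mean())
    objective = torch.stack(per_output_terms).mean()    # average over the group of G outputs
    loss = -objective                                   # gradient ascent via minimizing -J
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return objective.item()
```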