[FEATURE] Add GRPO Support #900

Open
tmostak opened this issue Feb 20, 2025 · 4 comments
Labels
type/feature Feature request

Comments

@tmostak

tmostak commented Feb 20, 2025

🚀 Feature

Add GRPO Support

Motivation

With the release of DeepSeek's R1 model, GRPO has been shown to be a powerful way to instill reasoning capabilities in models for cases where there is either labeled data or a verifier. This request is to add support to train a model with GRPO, perhaps with a focus on building reasoning abilities.

@tmostak tmostak added the type/feature Feature request label Feb 20, 2025
@sarthak247
Contributor

Heyaaaaa!
I would like to take this one. I've contributed to llmstudio before, so I'm slightly familiar with the code base (#683). I've been a bit occupied with life lately, but I'm ready to start contributing to h2o and other open-source projects again, and I think this could be a good point to get back into the open-source landscape.

I've read a bit about GRPO and DeepSeek, but I might need some support to pull this through :)
Some reading materials or sample code implementations would be great to begin with.

Regards,
Sarthak

@sarthak247
Contributor

So, I was going through the paper, and this is what I understand so far (a rough sketch follows the list):

  • Initialize the model and, for each question/instruction, generate multiple completions (4-16).
  • Score these completions with some reward function, then normalize the scores across the group of completions.
  • Update the policy based on the normalized scores (the group-relative advantages).
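
A minimal sketch of the group-relative advantage step from the second and third bullets, assuming the rewards for each completion have already been computed (function and tensor names here are illustrative, not from the codebase):

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-4) -> torch.Tensor:
    """Normalize rewards within each group of completions.

    rewards: shape (num_prompts, group_size), one reward per sampled completion.
    Returns advantages of the same shape: (r - group mean) / (group std + eps).
    """
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + eps)

# Example: 2 prompts, 4 completions each, already scored by some reward function.
rewards = torch.tensor([[0.0, 1.0, 1.0, 0.0],
                        [0.2, 0.9, 0.4, 0.5]])
print(grpo_advantages(rewards))  # completions above their group mean get positive advantage
```

The policy update then plugs these advantages into a PPO-style clipped objective, with no separate value network: the group mean acts as the baseline.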

While I do have a rough boilerplate in my mind, I have some questions about which reward functions we should use, and how. There are other implementations with reward functions, such as unsloth and Hugging Face TRL, which could be imported and used directly.

Or do we need to write them from scratch and then use our own reward functions for GRPO training? Most of these reward functions are actually pretty similar: reward the completion length, check that the expected formatting tags are present, math correctness, code, reasoning steps, etc. A few illustrative examples are sketched below.
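
A few hedged examples of the kind of reward functions mentioned above; the tag names (`<think>`, `<answer>`) and the weights are placeholders, not anything prescribed by the paper or the existing codebase:

```python
import re

def format_reward(completion: str) -> float:
    """Reward completions that wrap reasoning and answer in the expected tags."""
    pattern = r"<think>.*?</think>\s*<answer>.*?</answer>"
    return 1.0 if re.search(pattern, completion, flags=re.DOTALL) else 0.0

def length_reward(completion: str, target_len: int = 512) -> float:
    """Mildly reward longer reasoning, capped so it cannot dominate the total."""
    return min(len(completion.split()) / target_len, 1.0)

def answer_reward(completion: str, reference: str) -> float:
    """Verifier-style reward: 1.0 if the extracted answer matches the label."""
    match = re.search(r"<answer>(.*?)</answer>", completion, flags=re.DOTALL)
    return 1.0 if match and match.group(1).strip() == reference.strip() else 0.0

def total_reward(completion: str, reference: str) -> float:
    return (format_reward(completion)
            + 0.1 * length_reward(completion)
            + answer_reward(completion, reference))
```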

@psinger
Collaborator

psinger commented Mar 6, 2025

I think for starters you could rely on some basic reward functions that are also used in other examples, as you mentioned.

To make it really flexible, we would also need some way for the user to specify their own reward functions, but that can happen as a follow-up.

One thing I would suggest is to look into using vllm for generating the candidates; otherwise it will be too slow.
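
A rough sketch of what vllm-based candidate generation could look like; the model name and sampling settings are placeholders, and keeping the vllm weights in sync with the policy being trained is where the real integration work would be:

```python
from vllm import LLM, SamplingParams

# Placeholder model; in practice this would be the current policy checkpoint.
llm = LLM(model="Qwen/Qwen2.5-1.5B-Instruct", dtype="bfloat16")

sampling = SamplingParams(
    n=8,              # completions per prompt -- the GRPO "group"
    temperature=0.9,  # keep sampling diverse so rewards differ within a group
    max_tokens=1024,
)

prompts = ["Solve: 12 * 17 = ?", "Explain why the sky is blue."]
outputs = llm.generate(prompts, sampling)

for out in outputs:
    completions = [c.text for c in out.outputs]  # 8 candidates per prompt
```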

@sarthak247
Contributor

@psinger Seconding this. Most of the implementations I have seen rely on vllm for the generation part, as it makes it much faster to get up and running. As for the reward functions, for now I will just add some of them, and, as you mentioned, later on the user can choose which ones they want and which ones they don't.
