[FEATURE] Add GRPO Support #900

Open
tmostak opened this issue Feb 20, 2025 · 4 comments
Labels
type/feature Feature request

Comments

@tmostak

tmostak commented Feb 20, 2025

🚀 Feature

Add GRPO Support

Motivation

With the release of DeepSeek's R1 model, GRPO has been shown to be a powerful way to instill reasoning capabilities in models for cases where there is either labeled data or a verifier. This request is to add support to train a model with GRPO, perhaps with a focus on building reasoning abilities.

@tmostak tmostak added the type/feature Feature request label Feb 20, 2025
@sarthak247
Contributor

Heyaaaaa!
I would like to take this one. I've contributed to llmstudio before, so I'm slightly familiar with the code base (#683). I've been a bit occupied with life lately, but I'm ready to start contributing to h2o and other open-source projects again, and I think this could be a good point to get back into the open-source landscape.

I've read a bit about GRPO and DeepSeek, but I might need some support to pull this through :)
Some reading materials or sample code implementations would be great to begin with.

Regards,
Sarthak

@sarthak247
Contributor

So, I was going through the paper, and this is what I understand so far (a rough sketch follows the list):

  • Initialize the model and, for each question/instruction, generate multiple completions (4-16).
  • Score these completions with some reward function, then normalize the scores across the group of completions.
  • Update the policy based on the normalized scores (the group-relative advantages).
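
A minimal sketch of the group-relative advantage step from the second and third bullets, assuming the rewards for each completion have already been computed (function and tensor names here are illustrative, not from the codebase):

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-4) -> torch.Tensor:
    """Normalize rewards within each group of completions.

    rewards: shape (num_prompts, group_size), one reward per sampled completion.
    Returns advantages of the same shape: (r - group mean) / (group std + eps).
    """
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + eps)

# Example: 2 prompts, 4 completions each, already scored by some reward function.
rewards = torch.tensor([[0.0, 1.0, 1.0, 0.0],
                        [0.2, 0.9, 0.4, 0.5]])
print(grpo_advantages(rewards))  # completions above their group mean get positive advantage
```

The policy update then plugs these advantages into a PPO-style clipped objective, with no separate value network: the group mean acts as the baseline.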

While I do have a rough boilerplate in my mind, I have some questions about which reward functions we should use, and how. There are other implementations with reward functions, such as unsloth and Hugging Face TRL, which could be imported and used directly.

Or do we need to write them from scratch and then use our own reward functions for GRPO training? Most of these reward functions are actually pretty similar: reward the completion length, check that the expected formatting tags are present, math correctness, code, reasoning steps, etc. A few illustrative examples are sketched below.
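
A few hedged examples of the kind of reward functions mentioned above; the tag names (`<think>`, `<answer>`) and the weights are placeholders, not anything prescribed by the paper or the existing codebase:

```python
import re

def format_reward(completion: str) -> float:
    """Reward completions that wrap reasoning and answer in the expected tags."""
    pattern = r"<think>.*?</think>\s*<answer>.*?</answer>"
    return 1.0 if re.search(pattern, completion, flags=re.DOTALL) else 0.0

def length_reward(completion: str, target_len: int = 512) -> float:
    """Mildly reward longer reasoning, capped so it cannot dominate the total."""
    return min(len(completion.split()) / target_len, 1.0)

def answer_reward(completion: str, reference: str) -> float:
    """Verifier-style reward: 1.0 if the extracted answer matches the label."""
    match = re.search(r"<answer>(.*?)</answer>", completion, flags=re.DOTALL)
    return 1.0 if match and match.group(1).strip() == reference.strip() else 0.0

def total_reward(completion: str, reference: str) -> float:
    return (format_reward(completion)
            + 0.1 * length_reward(completion)
            + answer_reward(completion, reference))
```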

@psinger
Collaborator

psinger commented Mar 6, 2025

I think for starters you could rely on some basic reward functions that are also used in other examples, as you mentioned.

To make it really flexible, we would also need some way for the user to specify their own reward functions, but that can happen as a follow-up.

One thing I would suggest is to look into using vllm for generating the candidates; otherwise it will be too slow.
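
A rough sketch of what vllm-based candidate generation could look like; the model name and sampling settings are placeholders, and keeping the vllm weights in sync with the policy being trained is where the real integration work would be:

```python
from vllm import LLM, SamplingParams

# Placeholder model; in practice this would be the current policy checkpoint.
llm = LLM(model="Qwen/Qwen2.5-1.5B-Instruct", dtype="bfloat16")

sampling = SamplingParams(
    n=8,              # completions per prompt -- the GRPO "group"
    temperature=0.9,  # keep sampling diverse so rewards differ within a group
    max_tokens=1024,
)

prompts = ["Solve: 12 * 17 = ?", "Explain why the sky is blue."]
outputs = llm.generate(prompts, sampling)

for out in outputs:
    completions = [c.text for c in out.outputs]  # 8 candidates per prompt
```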

@sarthak247
Contributor

@psinger Seconding this. Most of the implementations I have seen rely on vllm for the generation part, as it makes it much faster to get up and running. As for the reward functions, for now I will just add some of them, and, as you mentioned, later on the user can choose which ones they want and which ones they don't.
