[FEATURE] Add GRPO Support #900
Comments
Heyaaaaa! I've read a bit about GRPO and DeepSeek, but I might need some support to pull this through : ) Regards,
So, I was going through the paper, and this is what I understand so far.
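As far as I understand it, the core idea is that GRPO drops the separate critic/value model and instead normalizes the rewards of each prompt's sampled completions within their own group. A minimal sketch of that advantage computation (the function name and shapes are just for illustration):

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Group-relative advantages as described in the GRPO paper.

    rewards: shape (num_prompts, group_size) -- one scalar reward per sampled
    completion, grouped by prompt. Each completion's advantage is its reward
    normalized by the mean and std of its own group, so no critic is needed.
    """
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + eps)
```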
While I do have a rough boilerplate in mind, I had some questions about which reward functions we should use and how. There are other implementations, like unsloth and Hugging Face TRL, that ship reward functions which can be imported and used directly. Or do we need to write our own reward functions from scratch and use those for GRPO training? Most of these reward functions are actually pretty similar: reward the completion length, check that the expected format is followed (e.g. that the reasoning and answer tags are present), math correctness, code, reasoning steps, etc.
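Something like this is roughly what I have in mind for the basic reward functions (the tag names, signatures, and scoring are just placeholders for illustration, not tied to any specific trainer's API):

```python
import re

def format_reward(completions: list[str]) -> list[float]:
    """Reward completions that follow a <think>...</think><answer>...</answer> layout."""
    pattern = re.compile(r"<think>.*?</think>\s*<answer>.*?</answer>", re.DOTALL)
    return [1.0 if pattern.search(c) else 0.0 for c in completions]

def length_reward(completions: list[str], target_len: int = 512) -> list[float]:
    """Mildly reward longer reasoning, capped at a target character length."""
    return [min(len(c), target_len) / target_len for c in completions]

def math_answer_reward(completions: list[str], answers: list[str]) -> list[float]:
    """Exact-match check of the text inside the answer tags against the label."""
    scores = []
    for completion, answer in zip(completions, answers):
        match = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
        scores.append(1.0 if match and match.group(1).strip() == answer.strip() else 0.0)
    return scores
```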
I think for starters you could rely on some basic reward functions that are also used in the other examples you mentioned. But to make it really flexible we would need some way for the user to specify such reward functions; that can happen as a follow-up. One thing I would suggest is looking into the option of using vllm for generating the candidates, otherwise it will be too slow.
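For example, something along these lines with vllm, where each prompt gets a whole group of candidates in a single call (the model name and sampling settings below are just placeholders):

```python
from vllm import LLM, SamplingParams

# Hypothetical settings; the model path and group size would come from the experiment config.
llm = LLM(model="h2oai/h2o-danube3-4b-chat", dtype="bfloat16")
sampling = SamplingParams(n=8, temperature=1.0, top_p=0.95, max_tokens=1024)

prompts = ["Solve: 12 * 7 = ?"]
outputs = llm.generate(prompts, sampling)

# One group of candidate completions per prompt, ready to be scored by the reward functions.
groups = [[candidate.text for candidate in out.outputs] for out in outputs]
```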
@psinger Second this. Most of the implementations I have seen rely on vllm for the generation part, since it makes it much faster to get things up and running. As for the reward functions, for now I will just add some of them, and like you mentioned, later on the user can choose which ones they want and which ones they don't.
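For the user-facing selection, one option could be a simple registry keyed by name, so a config entry just lists which reward functions to apply (everything below is a sketch; the names and scoring are made up):

```python
from typing import Callable

RewardFn = Callable[[list[str]], list[float]]

# Hypothetical registry; keys and functions are illustrative placeholders
# for whatever reward functions end up shipping with the trainer.
REWARD_REGISTRY: dict[str, RewardFn] = {
    "format": lambda completions: [1.0 if "<answer>" in c else 0.0 for c in completions],
    "length": lambda completions: [min(len(c), 512) / 512 for c in completions],
}

def combined_reward(completions: list[str], selected: list[str]) -> list[float]:
    """Sum the rewards from the user-selected functions, e.g. selected=["format", "length"]."""
    totals = [0.0] * len(completions)
    for name in selected:
        for i, score in enumerate(REWARD_REGISTRY[name](completions)):
            totals[i] += score
    return totals
```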
🚀 Feature
Add GRPO Support
Motivation
With the release of DeepSeek's R1 model, GRPO has been shown to be a powerful way to instill reasoning capabilities in models for cases where there is either labeled data or a verifier. This request is to add support to train a model with GRPO, perhaps with a focus on building reasoning abilities.