Cot loss masking #1298
Conversation
We should think carefully here about how to integrate this in a way that doesn't make the code too difficult to modify / maintain moving forward. Right now there are lots of little places that need to get updated to manage the CoT loss masking you implemented, which makes things quite brittle. First suggestion: I think it's fine if resizing the model is a separate step (a separate script) from the actual training.
Second suggestion: could we handle the loss masking by having a dataset that generates the right-length start/stop values? That way we wouldn't need to have all the downstream code be aware of the "reasoning" and "data" tokens. This should look similar to the way we do completion-only fine-tuning (see the sketch at the end of this comment). Also, can you explain the loss a bit? I notice it only incurs loss after the "[DATA]" token. But this part I don't understand:
Could you explain it or point to a reference?
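
To make the second suggestion concrete, here is a minimal sketch of the completion-only-style masking I have in mind (not code from this PR; the `[DATA]` token id is a placeholder, and the `-100` ignore index is an assumption matching the default `ignore_index` of `torch.nn.CrossEntropyLoss`):

```python
# Sketch: the dataset emits labels with everything up to and including the
# [DATA] token masked out, so the training loop stays unaware of
# reasoning/data tokens and just computes a standard cross-entropy loss.
import torch

IGNORE_INDEX = -100  # positions with this label contribute no loss

def mask_labels_before_data(input_ids: torch.Tensor, data_token_id: int) -> torch.Tensor:
    """Return labels where everything up to and including [DATA] is masked."""
    labels = input_ids.clone()
    positions = (input_ids == data_token_id).nonzero(as_tuple=True)[0]
    if len(positions) > 0:
        cutoff = positions[0].item()          # first occurrence of [DATA]
        labels[: cutoff + 1] = IGNORE_INDEX   # no loss on prompt + reasoning + [DATA]
    return labels

# Example: [prompt/reasoning tokens] [DATA] [answer tokens]
input_ids = torch.tensor([5, 6, 7, 42, 8, 9])  # 42 stands in for the [DATA] token id
labels = mask_labels_before_data(input_ids, data_token_id=42)
# labels == [-100, -100, -100, -100, 8, 9] -> loss only on the answer tokens
```

With labels prepared like this at the dataset level, the rest of the training code never needs to distinguish reasoning tokens from data tokens.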
It would be better to show results on some benchmarks in comparison to the base model first.
This PR implements CoT loss masking and appends additional tokens to the model vocabulary, with embedding resizing. For now, only special tokens can be added.
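
As a rough illustration of the vocabulary-extension step (a sketch using the Hugging Face transformers API with placeholder checkpoint and token names; the actual code in this PR may differ):

```python
# Sketch: register new special tokens and grow the embedding matrices so the
# model covers the enlarged vocabulary. Checkpoint and token names are
# placeholders for illustration only.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Add the new special tokens to the tokenizer.
num_added = tokenizer.add_special_tokens(
    {"additional_special_tokens": ["[REASONING]", "[DATA]"]}
)

# Resize the input/output embeddings to match the new vocabulary size.
if num_added > 0:
    model.resize_token_embeddings(len(tokenizer))
```

Keeping this as its own script, as suggested above, would let the resized checkpoint be produced once and then reused by training runs.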