
Cot loss masking #1298

Open

wants to merge 9 commits into main
Conversation

paNikitin

This PR implements CoT loss masking and appending additional tokens to the model vocabulary, with embedding resizing.
For now, only special tokens can be added.

@paNikitin changed the title from "Cot loss" to "Cot loss masking" on Feb 23, 2025
@awni
Member

awni commented Feb 27, 2025

We should think carefully here about how to integrate this in a way that doesn't make the code too difficult to modify and maintain going forward. Right now there are lots of little places that need to be updated to manage the CoT loss masking you implemented, which makes things quite brittle.

First suggestion: I think it's fine if resizing the model is a separate step (a separate script) from the actual training:

  • Step 1: resize the model/tokenizer into a new directory
  • Step 2: train a model from the directory you made
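
A minimal sketch of what that Step 1 script could look like, assuming an mlx_lm Llama-style model where the embedding table lives at model.model.embed_tokens and the tokenizer wraps a Hugging Face tokenizer; the paths, attribute names, and token names below are assumptions, not the PR's code:

    # Hypothetical Step 1: add special tokens, grow the embedding matrix
    # to match, and write everything to a new directory.
    import mlx.core as mx
    from mlx_lm import load

    model, tokenizer = load("path/to/base-model")  # hypothetical path

    # Assumes the mlx_lm wrapper exposes the underlying HF tokenizer.
    added = tokenizer._tokenizer.add_special_tokens(
        {"additional_special_tokens": ["[REASONING]", "[DATA]"]}
    )

    if added > 0:
        embed = model.model.embed_tokens  # assumed attribute path
        old = embed.weight
        # Initialize the new rows from the mean embedding (a common heuristic).
        mean = mx.mean(old, axis=0, keepdims=True)
        new_rows = mx.broadcast_to(mean, (added, old.shape[1]))
        embed.weight = mx.concatenate([old, new_rows], axis=0)
        # NB: a model with an untied lm_head would need the same treatment.

    tokenizer._tokenizer.save_pretrained("path/to/resized-model")
    # Persisting the resized weights would go through mlx_lm's save/convert
    # utilities; omitted here.

Training (Step 2) then loads "path/to/resized-model" like any other model directory.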

Second suggestion: I'm wondering if we can handle the loss masking by having the dataset generate the right start/stop values for each sequence. That way we wouldn't need to have all the downstream code be aware of the "reasoning" and "data" tokens. This should look similar to the way we do completion-only fine-tuning; a rough sketch follows.
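
A rough sketch of that idea, assuming the dataset yields, per sequence, the offset where loss should start (everything here, including the function name and mask construction, is an assumption rather than the PR's code):

    import mlx.core as mx
    import mlx.nn as nn

    def masked_loss(model, inputs, targets, loss_start, lengths):
        # loss_start[i]: first target position that should incur loss
        # (e.g. the position right after "[DATA]"); lengths[i]: sequence length.
        logits = model(inputs)
        ce = nn.losses.cross_entropy(logits, targets, reduction="none")

        steps = mx.arange(targets.shape[1])[None, :]
        mask = (steps >= loss_start[:, None]) & (steps < lengths[:, None])
        mask = mask.astype(ce.dtype)

        ntoks = mx.maximum(mask.sum(axis=1), 1)  # avoid divide-by-zero
        return ((ce * mask).sum(axis=1) / ntoks).mean()

With that, the trainer only ever sees (start, length) pairs, exactly like the completion-only path, and none of the downstream code needs to know about the special tokens.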

Also, I'm wondering if you can explain the loss a bit. I notice it only incurs loss after the "[DATA]" token, but I don't understand this part:

    # masking loss before [DATA]; applying penalty for invalid seq
    valid_loss = (ce * loss_mask).sum(axis=1) / (mx.sum(loss_mask, axis=1) + 1e-8)
    final_loss = mx.where(valid_seq, valid_loss, penalty)  # 10.0 as invalid penalty

Could you explain it or point to a reference?

@paNikitin
Author

paNikitin commented Feb 27, 2025

  1. I completely agree; there is a lot to refine.

  2. Yes, that makes sense.

  3. The idea is to align the model's pretrained knowledge with a teacher model's reasoning without breaking that knowledge (rather than aligning on the intermediate reasoning steps). That is why the loss is only computed after the final-response token, i.e. after [DATA]; see the sketch below.
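
To make that concrete, a toy sketch of where the mask turns on; the token id is hypothetical and this is an illustration, not the PR's actual code:

    import mlx.core as mx

    DATA_ID = 32001  # hypothetical id of the added "[DATA]" token

    def loss_mask_after_data(tokens):
        # 1.0 for every position strictly after the first [DATA], else 0.0.
        is_data = (tokens == DATA_ID).astype(mx.int32)
        seen = mx.cumsum(is_data, axis=1)
        mask = (seen > 0).astype(mx.float32) * (1 - is_data.astype(mx.float32))
        # Sequences with no [DATA] at all are "invalid" and receive the
        # fixed penalty instead of a token-averaged loss.
        valid_seq = mx.any(tokens == DATA_ID, axis=1)
        return mask, valid_seq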

I'd better show results on some benchmarks in comparison to the base model first.
