llama3.2 fine tuning generates repeated pattern towards the end of one epoch #735
Comments
I see something similar when fine-tuning on a custom dataset. Did you plot your training loss with wandb? I wonder whether the learning rate needs adjusting. Also, is it possible to schedule the learning rate decay?
Your loss seems fine; maybe train longer or increase the learning rate? Repetitive answers usually mean that the model is still adapting to the new domain.
Yup, those are probably the way out. It's just that A100 rates are high, so I'd like to check whether anyone has seen and solved a similar problem.
I added them manually. This is another weird thing about this repo: almost all other fine-tuning repos have enabled a cosine learning rate schedule (idefics, intern, qwen, Aria, etc.), but this one has not. It makes me worried that the fine-tuning script in this repo has not been well tested.
Yeah, I agree. My fine-tuned model performs worse than a smaller fine-tuned LLaVA-OneVision model on my custom dataset. The loss doesn't go down as much. Let's see if there are any significant updates in the coming weeks.
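For anyone following along, here is a minimal sketch of what a cosine schedule with linear warmup computes at each step. It is not the code the commenter added and not part of this repo; the function name `cosine_lr` and the default values (`base_lr=1e-5`, `warmup_steps=0`, `min_lr=0.0`) are made up for illustration.

```python
import math


def cosine_lr(step, total_steps, base_lr=1e-5, warmup_steps=0, min_lr=0.0):
    """Learning rate at `step`: linear warmup, then cosine decay to `min_lr`.

    All names and defaults here are hypothetical, chosen for illustration.
    """
    if warmup_steps > 0 and step < warmup_steps:
        # Linear warmup: ramps up and reaches base_lr at the last warmup step.
        return base_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clamped to at most 1.
    decay_steps = max(1, total_steps - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / decay_steps)
    # Half a cosine period: starts at base_lr, ends at min_lr.
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    # Print the schedule at a few points of a hypothetical 1000-step run.
    for s in (0, 50, 100, 550, 1000):
        print(s, cosine_lr(s, total_steps=1000, base_lr=1e-5, warmup_steps=100))
```

PyTorch ships the same decay shape as `torch.optim.lr_scheduler.CosineAnnealingLR`, so in a real training loop that is probably the simpler option than hand-rolling the formula.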
add these between line 118 and 120
and move |
System Info
PyTorch version: 2.4.1+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
GPU Type and number: A100 80GB x 1
Information
🐛 Describe the bug
GPUS=1 PER_DEVICE_BATCH_SIZE=2 nohup sh src/llama_lora_finetune.sh
Error logs
I was fine-tuning meta-llama/Llama-3.2-11B-Vision-Instruct on data from https://archive.physionet.org/mimic2/ (170k image-text pairs). Checkpoints up to 0.7 of one epoch generate output text as expected, but checkpoints from 0.8 of an epoch onward generate a repeated pattern, as below.
Expected behavior
The model should generate normal output at 0.8 of an epoch of training and after.
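One way to pin down where the repetition starts is to score the output of each checkpoint with a simple repetition metric. The sketch below is only an illustration: the helper name `repeated_ngram_fraction`, the choice of word 4-grams, and the sample strings are all assumptions, not output from the model in this issue.

```python
from collections import Counter


def repeated_ngram_fraction(text, n=4):
    """Fraction of word n-grams in `text` that duplicate an earlier n-gram.

    Hypothetical helper: 0.0 means no n-gram repeats, and values close
    to 1.0 mean the text is mostly one repeated pattern.
    """
    words = text.split()
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not ngrams:
        # Text shorter than n words has no n-grams to compare.
        return 0.0
    counts = Counter(ngrams)
    # Every occurrence beyond the first counts as a repeat.
    repeated = sum(c - 1 for c in counts.values())
    return repeated / len(ngrams)


if __name__ == "__main__":
    # Made-up strings standing in for generations from two checkpoints.
    varied = "one two three four five six seven eight"
    looping = "no change " * 20
    print(repeated_ngram_fraction(varied))
    print(repeated_ngram_fraction(looping))
```

Running this over a fixed set of prompts for every saved checkpoint would show whether the score jumps between 0.7 and 0.8 of an epoch or climbs gradually.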