llama3.2 fine tuning generates repeated pattern towards the end of one epoch #735

ruian1 · 2024-10-18T18:04:42Z

System Info

PyTorch version: 2.4.1+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
GPU Type and number: A100 80GB x 1

Information

The official example scripts
My own modified scripts

🐛 Describe the bug

GPUS=1 PER_DEVICE_BATCH_SIZE=2 nohup sh src/llama_lora_finetune.sh

Error logs

I was fine-tuning meta-llama/Llama-3.2-11B-Vision-Instruct on https://archive.physionet.org/mimic2/ with 170k image-text pairs. Checkpoints up to 0.7 of one epoch generate output text as expected, but from 0.8 epoch onward the checkpoints generate a repeated pattern like the one below:

sp leading jack jack jack coach Port jack jack jackzens jack jack pit jack jackrap jack jack Port jackansk jack jack jackrex jackeman jack jack jack jack jack ピ jackleading sp jackrex jack jack jack jack jack jack jack jack jack jack jack jack jack jack jack jackrex jack jack jackeman pit pit jack jack jack jackleading jack jack pig jack jack pit jack jack event jack jack jack pit jackstorybook jackeman jack jack leading jackchl jack jack jack jack jackjack sp leading jack jack jackleading jack jack jack pigleading jack ピ jack pit pit jack jack ピ jack jack jackrexindow jack jack jack jack jack jack jack jack jackzens jack pitansk jackrap jack jack jack leadingsid pit jack jack jack jack jack jack jack pit jack pit jack jack jack jackeman jack pit pit jack jack jack jack jack jack jack jack jackjack jackjack jack jack jack pit jack pit jack jack jack jack jack event jack jack jack pit jack jack697storybookrex jack jack jack jack jack leading pit pit jack jack jack jack jackzens jack jack jack pit jack jack jack jack jack pit jack jack jack jack jack jack pit697 jackleading jack jack jack pit pit jack jack jack jack jack jack jack jackrexrap jack jack jackjack jack jack jack jack jack pitrapeman jack jack event coach jack jack jack jack jack jack Pose jack jackrap jack jack Pose jack jack jack jack jackjack pit jack jack event pit pit jack jack jack coach jack jack jack jack pit Pose jack pig jack jackzens_ENUMstorybook jack jack jackrapsid pit jack pit jack jack jack jackjack jack jack jack jack jack jackrexindow jack jack jack jack jack coach jack jack jack jack jack jackeman pit jack pit jack pit pitrap jack jackleading jack jack jack jack jack jack jack jackrap jack jack jack jack coach pit jack jack jack coach jackansk jack jack jack pit pig jack jack jack jack jack jack jack jackrap pit jack jackzensansk pit jacksid jack jack jack coach jack jack jack jack jack jack jack jack jackansk ピ jackrap jack jack jack jack jack jack jackzens jack_ENUM jack pit jack jack 
jack jack jack jack jack jackjack pig ピ pit coach jack jack pit jack jack jack jack jackchl jack coach jack jack jack jack jack jack jack jack jack jack jack pit pitjack jackjack jack jackrex jack jack jackstorybook jackeman pit jack jack jack jack Pose jack jack jack jack leading jack jack jack Pose jack jack jack jack pig jack pit event jack jack jack coach jack jack jack pitrex302 jack jack jack jack jack jack jack jack pit jack jack pigzens jack jackrap

Expected behavior

Expecting the model to generate normal output at 0.8 epoch training and after.
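
A minimal sketch of how each saved checkpoint's sample output could be screened automatically for this kind of collapse. The function names, the whitespace tokenization, and both thresholds are assumptions chosen for illustration, not something from the fine-tuning scripts. The "healthy" sentence is made up, and the "collapsed" string is the start of the output pasted above.

```python
from collections import Counter


def repetition_stats(text, n=2):
    """Return (share of the most frequent token, fraction of distinct n-grams)."""
    tokens = text.split()
    if not tokens:
        return 0.0, 0.0
    top_share = Counter(tokens).most_common(1)[0][1] / len(tokens)
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    distinct_n = len(set(ngrams)) / len(ngrams) if ngrams else 1.0
    return top_share, distinct_n


def looks_degenerate(text, max_top_share=0.3, min_distinct_n=0.5):
    """Flag text dominated by one token or by repeated bigrams (hypothetical thresholds)."""
    top_share, distinct_n = repetition_stats(text)
    return top_share > max_top_share or distinct_n < min_distinct_n


collapsed = "sp leading jack jack jack coach Port jack jack jackzens jack jack pit jack"
healthy = "The chest radiograph shows clear lungs with no focal consolidation or effusion."
print(looks_degenerate(collapsed), looks_degenerate(healthy))
```

Running a check like this on a fixed prompt at every checkpoint would narrow down the step at which the output degrades, instead of only comparing the 0.7 and 0.8 epoch checkpoints.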


NicoZenith · 2024-10-18T22:56:34Z

I get a similar observation on my fine-tuning on custom dataset. Did you plot your training loss with wandb?

I wonder whether this is a learning rate adjustment. Also is there a possibility to schedule the learning rate decay?

ruian1 · 2024-10-21T03:57:23Z

> I get a similar observation on my fine-tuning on custom dataset. Did you plot your training loss with wandb?
>
> I wonder whether this is a learning rate adjustment. Also is there a possibility to schedule the learning rate decay?

I logged to TensorBoard; take a look at my loss below. I had to set smoothing to 1.0 so you can see how it drops. I used a cosine LR schedule with a warmup ratio of 0.03 and a weight decay of 0.01. What I don't understand is why the model became corrupted somewhere between 0.7 and 0.8 of the epoch.
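
As a side note on reading heavily smoothed curves, here is a small sketch of a debiased exponential moving average, similar in spirit to a TensorBoard-style smoothing slider. The exact formula TensorBoard uses is an assumption here, and the loss values are synthetic. It shows how a single large spike nearly disappears under heavy smoothing.

```python
def ema_smooth(values, weight):
    """Debiased exponential moving average; weight must be in [0, 1)."""
    smoothed, last = [], 0.0
    for i, v in enumerate(values, start=1):
        last = weight * last + (1.0 - weight) * v
        # Divide by (1 - weight**i) so early points are not pulled toward zero.
        smoothed.append(last / (1.0 - weight ** i))
    return smoothed


# Synthetic loss: slow linear decay with one large spike at step 700 (made-up numbers).
raw = [2.0 - 0.001 * s for s in range(1000)]
raw[700] = 12.0

heavy = ema_smooth(raw, 0.999)  # very strong smoothing
light = ema_smooth(raw, 0.6)    # mild smoothing

print("raw max:", max(raw))
print("heavy-smoothed max:", round(max(heavy), 3))
print("light-smoothed max:", round(max(light), 3))
```

If the curve only looks clean at very high smoothing, it may be worth also inspecting the raw or lightly smoothed loss around 0.7 to 0.8 of the epoch.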

NicoZenith · 2024-10-21T07:02:51Z

Your loss seems to be fine; maybe train longer or increase the learning rate? Repetitive answers usually mean that the model is still adapting to the new domain.
Btw, how do you set up the warmup ratio and cosine schedule? They are not available arguments in the fine-tuning script, as far as I know.

ruian1 · 2024-10-21T16:57:27Z

> Your loss seems to be fine, maybe train longer or increase learning rate? Repetitive answers usually mean that the model is still adapting to the new domain.

Yup, those are probably the way out. It's just that A100 rates are high, so I'd like to check whether anyone has seen and solved a similar problem.

> Btw how do you set up warmup ratio and cosine schedule? They are not available arguments in the fine tuning script, as far as I know

I added them manually. This is another weird thing about this repo: almost all other fine-tuning repos (idefics, intern, qwen, Aria, etc.) have enabled a cosine learning rate schedule, but this one has not. It makes me worried that the fine-tuning script in this repo has not been well tested.

NicoZenith · 2024-10-25T09:19:41Z

Yeah, I agree. My fine-tuned model performs worse than a smaller fine-tuned LLaVA-Onevision model on my custom dataset; the loss doesn't go down as far. Let's see if there are any significant updates in the coming weeks.
How do you add the learning rate scheduler manually?

ruian1 · 2024-11-01T21:32:24Z

> yeah I agree, my finetuned model performs worse than a smaller LLaVA-Onevision finetuned model on my custom dataset. The loss doesn't manage to go as much down. Let's see if there are any significant updates in the coming weeks
> How do you add the learning rate scheduler manually?

Add these between lines 118 and 120:
https://github.com/meta-llama/llama-recipes/blob/main/src/llama_recipes/utils/train_utils.py#L118

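    # Needs at module level:
    # from torch.optim.lr_scheduler import LambdaLR, CosineAnnealingLR, SequentialLR
    epoch_times = []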
    checkpoint_times = []
    results = {}
    best_val_loss = float("inf")
    total_train_steps = 0
    max_steps_reached = False  # Flag to indicate max training steps reached
    # Start the training loop

    update_steps = 0  # Counter for steps where the model parameters are updated

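    # Scheduler steps are counted in optimizer updates, not micro-batches
    total_length = len(train_dataloader) // gradient_accumulation_steps
    total_steps = train_config.num_epochs * total_length
    # warmup_ratio is not a stock train_config field; it has to be added as well
    warmup_steps = int(train_config.warmup_ratio * total_steps)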

    print(f"total_length: {total_length}, total_steps: {total_steps}, warmup_steps: {warmup_steps}")

    def lr_lambda(current_step):
        if current_step < warmup_steps:
            return float(current_step) / float(max(1, warmup_steps))  # Linear warmup
        else:
            return 1.0

    warmup_scheduler = LambdaLR(optimizer, lr_lambda)
    cosine_scheduler = CosineAnnealingLR(
        optimizer, T_max=total_steps - warmup_steps, eta_min=0.0
    )
    lr_scheduler = SequentialLR(
        optimizer,
        schedulers=[warmup_scheduler, cosine_scheduler],
        milestones=[warmup_steps],
    )

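Then move lr_scheduler.step() to right after optimizer.zero_grad(), so the schedule advances once per optimizer update, matching how total_steps is counted above.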
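
To sanity-check a schedule like this without a GPU, the following sketch computes the intended learning-rate multiplier in closed form (linear warmup, then cosine decay to zero) using only the standard library. The function name is hypothetical. The step counts assume the 170k pairs and per-device batch size of 2 mentioned in this thread, plus one epoch and no gradient accumulation, which are assumptions.

```python
import math


def warmup_cosine_multiplier(step, warmup_steps, total_steps):
    """LR multiplier in [0, 1]: linear warmup, then cosine decay to 0."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    decay_steps = max(1, total_steps - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / decay_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))


# Assumed setup: 170k samples, batch size 2, one epoch, no gradient accumulation.
total_steps = 170_000 // 2
warmup_steps = int(0.03 * total_steps)

for frac in (0.0, 0.03, 0.5, 0.7, 0.8, 1.0):
    step = int(frac * total_steps)
    mult = warmup_cosine_multiplier(step, warmup_steps, total_steps)
    print(f"{frac:.2f} epoch -> step {step}, lr multiplier {mult:.3f}")
```

The same function could also be wrapped in a lambda and passed to torch.optim.lr_scheduler.LambdaLR as a single scheduler, as an alternative to chaining two schedulers with SequentialLR. Printing the multiplier alongside the logged learning rate is one way to confirm that the scheduler is actually being stepped as intended.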

xuhang-2 · 2024-11-11T02:45:24Z

Hi there. I also hit this problem when running SFT on the 11B model. The loss is below:

The output is a repeating pattern.
When I check the instruct model without SFT, the output is still a repeating pattern.
