
[LTXV LoRA Training] W&B charts reading #288

dorpxam opened this issue Mar 4, 2025 · 5 comments

dorpxam commented Mar 4, 2025

Hi. Thanks a lot for the latest finetrainers version, the refactoring (dataset, etc.) looks great. I am currently doing some (local) tests before proceeding with heavier training. I need the community's help with reading the step/epoch loss in the W&B charts.

I've built a small subset of videos to test a 'concept LoRA': 28 videos of 257 frames at 896x512. All videos are 'women walking in a street', with different backgrounds, dresses, and so on. Because I run on my local RTX 4080 (16GB), I do the training using the legacy script with optimizations: --precompute_conditions / --layerwise_upcasting_modules transformer.

I run the training on WSL2 (Ubuntu) using DeepSpeed with --gradient_accumulation_steps set to 1. --train_epochs is set to 200, for 5600 steps (1 epoch = 28 steps). I save a checkpoint at each epoch for further analysis of the evolution of the training, because I do not use validation steps during training; instead I will compute a post-training validation video for each checkpoint to see how the training evolves.

I use a LoRA rank of 256 (with alpha set to 128). The learning rate is 2e-5 with the AdamW optimizer and a weight decay of 1e-4.
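In peft terms, that would correspond to roughly the following config (the target_modules list here is just a placeholder, not verified for the LTXV transformer), giving an effective LoRA scale of lora_alpha / r = 0.5:

    from peft import LoraConfig

    # rank 256 with alpha 128 -> effective LoRA scaling of lora_alpha / r = 0.5
    lora_config = LoraConfig(
        r=256,
        lora_alpha=128,
        target_modules=["to_q", "to_k", "to_v", "to_out.0"],  # placeholder module names
    )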

Here is the W&B chart. In the screenshot the best loss is seen at epoch 18 (step 504) and achieves an epoch loss of 0.274. Since my screenshot, the best epoch loss is 0.272 at epoch 51 (step 1428).

Image

I have a few (maybe stupid and naive) questions about the charts:

  • I use a 'Running Average' curve (set to 100) over the main 'Time Weighted EMA' loss values. Is this a good way to check the overall trend of convergence?
  • Do the current charts look (almost) reasonable to you based on your own experience?
  • Why are there some large spikes in the grad_norm chart? Is this common?
  • In this LoRA training, it seems that the best loss is somewhere around 0.3 (+/- 30%). Is this a decent value for a LoRA? What is the best loss you can achieve in LoRA training?

Thank you very much for your help.

@a-r-r-o-w (Owner)

Wow, thanks for the super detailed thread!

I use a 'Running Average' curve (set to 100) over the main 'Time Weighted EMA' loss values. Is this a good way to check the overall trend of convergence?

With diffusion models, especially at small-scale training, the loss is meaningless. It is more or less just random noise. This is because the task you're trying to teach the model is hard -- it's like saying: here is some data point with t amount of noise added (where t is random from 0 to 1000), please predict what the data point would look like with t - 1 noise.

As t is random, the loss is going to be all over the place. With longer training runs, you will notice the spikiness reduce over time and get a gradual, bumpy-but-downwards curve. A better signal to observe is:

  • validation loss on a held-out dataset (we haven't yet implemented an evaluate method though)
  • loss over training steps for fixed intervals of timesteps (for example, make a plot for timesteps between 0-100, 100-200, 200-300, ...). Working on adding this but was facing some timeout issues, so I need to find the time to debug (a rough sketch of the bucketing idea follows after the snippet below):
    # TODO(aryan): handle non-SchedulerWrapper schedulers (probably not required eventually) since they might not be dicts
    # TODO(aryan): causes NCCL hang for some reason. look into later
    # logs.update(self.lr_scheduler.get_last_lr())
    # timesteps_table = wandb.Table(data=timesteps_buffer, columns=["step", "timesteps"])
    # logs["timesteps"] = wandb.plot.scatter(
    #     timesteps_table, "step", "timesteps", title="Timesteps distribution"
    # )
    timesteps_buffer = []
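For reference, per-timestep-bucket loss logging could look roughly like this (illustrative only, not the finetrainers implementation; the helper names are made up):

    # Rough sketch: accumulate per-sample losses into 100-timestep buckets and
    # log the bucket means to W&B at regular intervals.
    import collections
    import wandb

    BUCKET_SIZE = 100
    _bucket_losses = collections.defaultdict(list)

    def record_bucket_loss(timesteps, losses):
        # timesteps/losses: 1D tensors of equal length for the current batch
        for t, loss in zip(timesteps.tolist(), losses.tolist()):
            _bucket_losses[int(t) // BUCKET_SIZE].append(loss)

    def log_bucket_losses(step):
        logs = {}
        for bucket, values in sorted(_bucket_losses.items()):
            lo, hi = bucket * BUCKET_SIZE, (bucket + 1) * BUCKET_SIZE
            logs[f"loss/timesteps_{lo}_{hi}"] = sum(values) / len(values)
        wandb.log(logs, step=step)
        _bucket_losses.clear()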

Do the current charts look (almost) reasonable to you based on your own experience?

Perfectly reasonable and looks correct. I see that for your dataset, the grad norm peaks at ~0.6. This is okay, but if you want to prevent big weight changes, you can set --max_grad_norm to something like 0.3. Typically, okay values are up to ~10, but in my experience, clipping at 0.5-2 works best. You can also apply other gradient clipping strategies such as the one mentioned in Playground V3: https://arxiv.org/abs/2409.10695

Image
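For reference, clipping by global gradient norm in plain PyTorch looks roughly like this (toy model standing in for the LoRA parameters; this is what --max_grad_norm controls, not the exact finetrainers code path):

    import torch

    model = torch.nn.Linear(8, 8)  # toy stand-in for the trainable LoRA parameters
    optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5, weight_decay=1e-4)
    max_grad_norm = 0.3            # a more conservative value than the usual default of 1.0

    x = torch.randn(4, 8)
    loss = model(x).pow(2).mean()
    loss.backward()

    # clip_grad_norm_ rescales gradients in place and returns the *pre-clipping*
    # global norm -- the value you see plotted as grad_norm.
    grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=max_grad_norm)
    optimizer.step()
    optimizer.zero_grad()
    print(float(grad_norm))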

Why are there some large spikes in the grad_norm chart? Is this common?

I don't know the true reason (lacking a bit of the necessary literature), but my guess is that some data points at bigger timesteps (i.e. more noise added to the original data point) result in worse predictions by the model, possibly resulting in bigger gradients. A spiky graph like this is okay, but you can try smoothing it out a little more by lowering max_grad_norm (try experimenting with the exact same settings with max_grad_norm=1.0 vs max_grad_norm=0.3 and comparing the generation quality and the convergence time needed to reach the same quality).

In this LoRA training, it seems that the best loss is somewhere around 0.3 (+/- 30%). Is this a decent value for a LoRA? What is the best loss you can achieve in LoRA training?

It does not really have any meaning here, as mentioned above. Just looking at the validation samples and going by eye would be better for small-scale LoRA training. In my experience, the best checkpoint is almost always the one with the lowest validation loss in the timestep ranges 700-800, 800-900, 900-1000. This is because the first few denoising steps are the most critical to generation quality (even for inference, we try to skew the sigmas distribution towards 1.0 using a quadratic schedule [see the Flux sigmas schedule]). Finetrainers does not yet plot all this information, but I'll try to have it soon.
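As a toy illustration of that skew (not the exact Flux schedule), a quadratic remap of an evenly spaced sigma schedule packs most of the steps near sigma = 1.0:

    import numpy as np

    num_inference_steps = 50
    linear = np.linspace(1.0, 0.0, num_inference_steps)   # evenly spaced sigmas, 1 -> 0
    quadratic = 1.0 - (1.0 - linear) ** 2                  # values cluster near 1.0

    print(linear[:4])     # ~1.00, 0.98, 0.96, 0.94 -- uniform spacing
    print(quadratic[:4])  # ~1.00, 0.9996, 0.9983, 0.9963 -- much denser near sigma = 1.0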

@a-r-r-o-w (Owner)

Also, for LTX, I would definitely recommend using the latest main branch instead of the v0.0.1 release. This is because we added first-frame conditioning, which wasn't supported before. I've found it to consistently result in better generation quality.


dorpxam commented Mar 5, 2025

Thanks a lot for all these details, and a big, big thanks for the first-frame conditioning!!! I will see if I can do a training run using validation. The memory usage is very low with finetrainers and the speed is incredibly fast. I've reduced the corpus to 12 videos and got a better mean loss value, under 0.25, but for now the result is very weird for full-body anatomy. So I will test a character LoRA (which makes it easier to do a good visual check for similarity) with the latest version of finetrainers. I will publish some results here for information.

So, if I understand correctly, LoRA training does not need more than a thousand steps? I agree, because my very first test with finetrainers using only ten images (not videos) for a LoRA training was better around 600-700 steps (so 60-70 epochs), but not really convincing in terms of similarity compared to the original model (probably a problem with the choice of corpus images).

Because checkpointing happens at fixed steps/epochs, maybe a cool option to add would be --save_best_loss_in_range [from_step,to_step], allowing an even better checkpoint to be recorded if a better loss appears between two regular checkpoints. Something like the EarlyStopping callback in Keras/TF that you can override to make an EarlyStoppingAtMinLoss, for example, but here without stopping: a save is simply triggered whenever a loss value is better than the previously recorded one, and only within a specific range. A rough sketch of the idea is below.
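Something like this, purely as a sketch (the class and hook names are hypothetical; nothing like this exists in finetrainers yet):

    # Hypothetical callback: save a checkpoint whenever the step loss improves
    # on the best value seen so far, but only inside a given step window.
    class SaveBestLossInRange:
        def __init__(self, from_step, to_step, save_fn):
            self.from_step = from_step
            self.to_step = to_step
            self.save_fn = save_fn              # reuse the regular checkpointing routine
            self.best_loss = float("inf")

        def on_step_end(self, step, loss):
            # Only consider steps inside the requested [from_step, to_step] window.
            if not (self.from_step <= step <= self.to_step):
                return
            if loss < self.best_loss:
                self.best_loss = loss
                self.save_fn(step)              # save/overwrite a single "best" checkpoint

The save_fn would just reuse the existing checkpointing code and overwrite a single "best" checkpoint, so only one extra ~7GB file sits on disk instead of one per step.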

In my screenshot, I have the best loss at step 504 with a loss of 0.27405, but if you check the minima, the best loss is at step 468 with a loss of 0.18262 (a drop of roughly 0.09, about a third). But unfortunately, only steps 448 and 476 are recorded. The only way to capture the very best loss in any range would be to record a checkpoint at each step, which is not only insane because it kills the training speed, but what a hell! A checkpoint is something like 7GB on my hard drive (here, for 3K steps, that is already more than half a terabyte with a checkpoint at each epoch -- imagine at each step!!!).

I really think this could be a good option, assuming that an ideal range can easily be estimated based on the size of the corpus.

@a-r-r-o-w (Owner)

LoRA training does not need more than a thousand steps?

It really depends on the kinds of effects you want. If the model is already doing somewhat reasonably with the kinds of generations you want, it will take only a few thousand training steps to make it learn the exact effect. For significantly harder things, it can take many thousands of steps.

Because checkpointing happens at fixed steps/epochs, maybe a cool option to add would be --save_best_loss_in_range [from_step,to_step], allowing an even better checkpoint to be recorded if a better loss appears between two regular checkpoints. Something like the EarlyStopping callback in Keras/TF that you can override to make an EarlyStoppingAtMinLoss, for example, but here without stopping: a save is simply triggered whenever a loss value is better than the previously recorded one, and only within a specific range.

I'll have to think about how to do the callbacks. Since I'm trying to make this more of a library for training any diffusion model, this is a great recommendation and I will take it into account eventually!

In my screenshot, I have the best loss at step 504 with a loss of 0.27405, but if you check the minima, the best loss is at step 468 with a loss of 0.18262 (a drop of roughly 0.09, about a third). But unfortunately, only steps 448 and 476 are recorded. The only way to capture the very best loss in any range would be to record a checkpoint at each step, which is not only insane because it kills the training speed, but what a hell! A checkpoint is something like 7GB on my hard drive (here, for 3K steps, that is already more than half a terabyte with a checkpoint at each epoch -- imagine at each step!!!).

The best loss here has no meaning, to be honest, especially at the per-step level. Think about what happens when the randomly sampled timestep is low: less noise is added to the original video, making the prediction task almost trivial -- it will result in a low loss. Since the timesteps are essentially random, the loss curve will look random. I would recommend playing around a bit for a few thousand steps. I've pushed a couple of checkpoints recently with WandB logs attached in the model descriptions here: https://huggingface.co/finetrainers. They might help you get a sense of what to expect. This is a particularly good example of why loss is meaningless in few-step settings: https://wandb.ai/aryanvs/finetrainers-cogview4
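A quick toy example of this effect (illustrative only, not the training code):

    # The smaller the sampled timestep, the less noise is added, so even a
    # trivial "return the input" prediction is already close to the target and
    # the per-step loss looks low -- regardless of how good the model is.
    import torch

    torch.manual_seed(0)
    x0 = torch.randn(8, 64)              # "clean" latents
    noise = torch.randn_like(x0)
    t = torch.rand(8, 1)                 # random timestep per sample (0 = no noise)
    xt = (1.0 - t) * x0 + t * noise      # noisier as t -> 1

    trivial_prediction = xt              # a model that does nothing
    loss_per_sample = ((trivial_prediction - x0) ** 2).mean(dim=1)
    print(torch.cat([t, loss_per_sample.unsqueeze(1)], dim=1))  # small t -> small loss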


dorpxam commented Mar 8, 2025

You're right. I think there are a lot of parameters to take into consideration, and the W&B plots as well as the average loss are not very intuitive. For example, I've done a complete training run of a 'character LoRA' using the latest version and default parameters, similar to the ones in the examples directory.

The resulting chart for ~7K steps is:

Image

Even if it looks good, with convergence starting around 1.5K steps, unfortunately the result is not what was expected.

The dataset is 36 videos at 512x768 with durations ranging from 1 to 10 seconds. I record a checkpoint every 10 epochs (360 steps), and this is, for example, one post-training validation from step 720 to step 7200 (samples 1 to 18):

Image

This is a test using the same seed and the same prompt as one sample video in the corpus, which looks like this:

Image

Prompt: A woman with blonde hair styled in an updo stands amidst lush green palm leaves, her gaze directed towards the camera. She is adorned in a colorful, patterned bikini top featuring intricate paisley designs and lace-up details. The vibrant hues of her attire contrast beautifully against the verdant backdrop of the tropical foliage. The scene is bathed in natural sunlight, casting soft shadows that highlight the textures of both her attire and the surrounding leaves, while a warm glow enhances the serene ambiance. The camera remains stationary, capturing her poised stance and the tranquil beauty of the tropical setting. The scene is captured in real-life footage.

We can observe that the 'character' characteristics transfer in at sample 4 (step 2160, loss 0.429), where the model already overfits, while the best similarity (hair, face, bikini) comes at sample 15 (step 5400, loss 0.373).

Similar observations come from other tests, like this one (original video frame on the left):

Image

So, I decided to purge the corpus from 36 videos down to 28, keeping the most diversity. I also decided to change the learning rate to a more 'aggressive' one: 2e-4. To test the 'best loss' trick, I just hacked state_checkpoint.py to record not only the regular checkpoints, but all the best checkpoints after step 480 (the 20th epoch for a 24-video corpus).

The resulting chart from W&B after a little more than 3K steps:

Image

The collected checkpoints (step | loss):

REGULAR:

 240 | 0.3099142909049988
 480 | 0.5274916291236877
 720 | 0.4725000560283661
 960 | 0.44087541103363037
1200 | 0.29765814542770386
1440 | 0.27066466212272644
1680 | 0.27854734659194946
1920 | 0.17857927083969116
2160 | 0.3707681894302368
2400 | 0.3354491889476776
2640 | 0.4812169075012207
2880 | 0.23259779810905457
3120 | 0.2744404077529907

HACK:

 484 | 0.26656320691108704
 493 | 0.2528856098651886
 503 | 0.25103336572647095
 519 | 0.22497935593128204
 563 | 0.2245454341173172
 609 | 0.21948173642158508
 610 | 0.17619693279266357
1763 | 0.17162838578224182
1769 | 0.16234809160232544
1853 | 0.15810555219650269
1943 | 0.1543356329202652
2086 | 0.15105760097503662
2100 | 0.14727002382278442
2105 | 0.14326956868171692
2153 | 0.14221204817295074
2219 | 0.13765119016170502
2309 | 0.13728569447994232
2419 | 0.13613183796405792
2468 | 0.1351986825466156
2478 | 0.1348101794719696
2556 | 0.13059274852275848
2597 | 0.12541541457176208
2646 | 0.12515924870967865
2663 | 0.12095408141613007
2738 | 0.11594921350479126
2816 | 0.11423702538013458
2825 | 0.11247419565916061
2828 | 0.10601576417684555
3023 | 0.10420376807451248
3053 | 0.10133016854524612

For a more intuitive view, here is a chart of these values; the upper line is the regular checkpoints, the lower line the forced ones:

Image

I have not done intensive testing yet, but I can already confirm that the best (lowest) loss is not necessarily the best checkpoint. Here are two video samples exported with diffusers (257 frames / 30 fps / 50 denoising steps, no STG or enhancement):

REGULAR CHECKPOINT | STEP 720 | LOSS 0.472
https://github.com/user-attachments/assets/aa9a2a5a-b4f6-439a-a188-8ed5cd67b7e8

FORCED CHECKPOINT | STEP 610 | LOSS 0.176
https://github.com/user-attachments/assets/dde940d6-9cfb-4678-be7b-0edf66aa8882

We can see that a low loss value does not guarantee quality; here the regular checkpoint at 0.472 is clearly better, even if the result is still not the best.

Good news: version 0.9.5 of LTXV is here and it's cool. Even faster, and it seems to give better rendering results. I need to extend my investigations to find the best way to train this model.

Maybe it would be cool to open a Discussions section on your GitHub for sharing experiments, so we don't spam the issues section.

Thanks a lot for everything.

PS: I will test the Wan 1.3B LoRA training too ;)
