[LTXV LoRA Training] W&B charts reading #288
Wow, thanks for the super detailed thread!
With diffusion models, especially at small-scale training, the loss is meaningless. It is more or less just random noise. This is because the task that you're trying to teach the model is hard -- it's like saying: here is some data point with As
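(To make that concrete, here is a minimal, self-contained Python sketch -- dummy tensors and a dummy predictor, not finetrainers code -- showing how the per-step loss is dominated by whichever timestep happened to be sampled:)

```python
import torch

def dummy_model(noisy, t):
    # A deliberately bad predictor; stands in for a partially trained network.
    return noisy * 0.1

clean = torch.randn(8, 16)  # pretend latents
losses = []
for _ in range(200):
    t = torch.rand(())                      # random timestep in [0, 1]
    noise = torch.randn_like(clean)
    noisy = (1 - t) * clean + t * noise     # more noise at larger t (flow-matching style)
    target = noise - clean                  # velocity target (rectified-flow convention)
    pred = dummy_model(noisy, t)
    losses.append(((pred - target) ** 2).mean().item())

# Even with a fixed model, the per-step loss swings wildly with the sampled t,
# so the training curve looks like noise.
print(min(losses), max(losses))
```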
Perfectly reasonable and looks correct. I see that for your dataset, grad norm peaks at ~0.6. This is okay, but if you want to prevent big weight changes, you can set
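(The suggestion above is cut off; as a general illustration of capping big weight updates, here is a minimal gradient-norm clipping sketch in plain PyTorch -- the 0.5 threshold and the tiny model are placeholders, not finetrainers settings:)

```python
import torch

model = torch.nn.Linear(16, 16)   # placeholder for the trainable LoRA parameters
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

loss = model(torch.randn(4, 16)).pow(2).mean()
loss.backward()

# Rescale gradients so their global norm never exceeds 0.5,
# which caps how large any single weight update can be.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=0.5)
optimizer.step()
optimizer.zero_grad()
```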
I don't know the true reason (I'm lacking a bit of the necessary literature), but my guess is that some data points at bigger timesteps (i.e. more noise added to the original data point) result in worse predictions by the model, possibly resulting in bigger gradients. A spiky graph like this is okay, but you can try smoothing it out a little more by lowering the
It does not really have any meaning here, as mentioned above. Just looking at the validation samples and going by eye would be better for small-scale LoRA training. In my experience, the best checkpoint is almost always the one with the lowest validation loss in the timestep ranges 700-800, 800-900, 900-1000. This is because the first few denoising steps are the most critical to generation quality (even for inference, we try to skew the sigmas distribution towards 1.0 using a quadratic schedule [see the Flux sigmas schedule]). Finetrainers does not yet plot all this information, but I'll try to have it soon.
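(A rough sketch of what per-timestep-bucket validation loss could look like; the function name, model signature, and linear noising are assumptions for illustration, not the finetrainers implementation:)

```python
import torch
from collections import defaultdict

def bucketed_val_loss(model, val_latents, num_buckets=10):
    """Average validation loss per timestep bucket (0-100, 100-200, ..., 900-1000)."""
    sums, counts = defaultdict(float), defaultdict(int)
    for clean in val_latents:
        t = torch.randint(0, 1000, (1,)).item()      # sampled timestep
        sigma = t / 1000.0                            # noise level in [0, 1)
        noise = torch.randn_like(clean)
        noisy = (1 - sigma) * clean + sigma * noise
        target = noise - clean
        pred = model(noisy, t)                        # hypothetical model signature
        bucket = t // (1000 // num_buckets)
        sums[bucket] += ((pred - target) ** 2).mean().item()
        counts[bucket] += 1
    return {b: sums[b] / counts[b] for b in sorted(sums)}

# Per the comment above, the 700-800 / 800-900 / 900-1000 buckets are the ones
# worth comparing across checkpoints.
```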
Also, for LTX, I would definitely recommend using the latest
Thanks a lot for all these details, and a big big thanks for the first-frame conditioning!!! I will try to do a training run using validation. The memory usage is very low with finetrainers and the speed is incredibly fast. I've reduced the corpus to 12 videos and get a better mean loss value, under 0.25, but for now the result is very weird for full-body anatomy. So I will test a character LoRA (which also makes it easier to do a visual check for similarity) with the latest version of finetrainers. I will publish some results here for information.

So, if I understand correctly, the LoRA training does not need more than a thousand steps? I agree, because my very first test with finetrainers using only ten images (not videos) for a LoRA training was better around 600-700 steps (so 60-70 epochs), but not really convincing regarding similarity to the original model (probably a problem with the choice of corpus images).

Because checkpoints are saved at fixed steps/epochs, maybe a cool option to add would be this: in my screenshot, I have the best loss at step 504 with a loss of 0.27405, but if you check the minima, the best loss is at step 468 with a loss of 0.18262 (so roughly 33% lower). Unfortunately, only steps 448 and 476 are recorded. The only way to capture the lowest loss in any range is to record a checkpoint at every step, which is not only insane because it kills the training speed, but also because a checkpoint is something like 7GB on my hard drive (here, for 3K steps, saving at every epoch is already more than half a terabyte -- imagine at every step!). I really think this could be a good option, assuming an ideal range can be easily estimated based on the size of the corpus.
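(Sketching the suggested option as a small hypothetical helper -- not an existing finetrainers feature: save a checkpoint only when the loss hits a new minimum, and never more often than every `min_interval` steps, so the 7GB-per-checkpoint cost stays bounded.)

```python
class BestLossCheckpointer:
    """Hypothetical helper: write a checkpoint only on a new best loss,
    at most once every `min_interval` steps."""

    def __init__(self, save_fn, min_interval=50):
        self.save_fn = save_fn            # callable(step) that actually writes the checkpoint
        self.min_interval = min_interval
        self.best = float("inf")
        self.last_saved_step = -10**9

    def update(self, step, loss):
        if loss < self.best and step - self.last_saved_step >= self.min_interval:
            self.best = loss
            self.last_saved_step = step
            self.save_fn(step)
```

It could be driven from whatever per-step (or per-epoch) loss the trainer already logs to W&B.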
It really depends on the kinds of effects you want. If the model is already doing somewhat reasonably with the kinds of generations you want, it will take only a few thousand training steps to make it learn the exact effect. For significantly harder things, it can take well over several thousand steps.
I think I'll have to think about how to do the callbacks. Since I'm trying to make this more of a library for training any diffusion model, this is a great recommendation and I will take it into account eventually!
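(One possible shape for that, sketched as a Python protocol; purely hypothetical, not an API that exists in finetrainers today. The `BestLossCheckpointer` idea above could hang off `on_step_end`.)

```python
from typing import Protocol

class TrainerCallback(Protocol):
    """Hypothetical callback interface a training library could expose."""

    def on_step_end(self, step: int, loss: float) -> None: ...
    def on_epoch_end(self, epoch: int, mean_loss: float) -> None: ...
    def on_checkpoint(self, step: int, path: str) -> None: ...
```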
The best loss here has no meaning tbh, especially at the per-step level. Think about what happens when the randomly sampled timestep is low: less noise is added to the original video, which makes the prediction task almost trivial -- it will result in a low loss. Since the timesteps are essentially random, the loss curve will look random. I would recommend playing around a bit for a few thousand steps. I've pushed a couple of checkpoints recently with WandB logs attached in the model description here: https://huggingface.co/finetrainers. They might help you get a sense of what to expect. This is a particularly good example of why loss is meaningless in few-step settings: https://wandb.ai/aryanvs/finetrainers-cogview4
You're right. I think there are a lot of parameters that come into play, and the W&B plots as well as the average loss are not very intuitive. For example, I've done a complete training of a 'character LoRA' using the latest version, with default parameters similar to the ones in the examples directory. The resulting chart for 7K steps is: Even if it looks good, starting to converge around 1.5K steps, unfortunately the result is not what was expected. The dataset is 36 videos of 512x768 with durations from 1 to 10 seconds. I record a checkpoint every 10 epochs (360 steps), and this is, for example, one post-training validation from 720 to 7200 steps (1 to 18). This is a test using the same seed and the same prompt as the sample video shot in the corpus, which looks like this:
We can observe that while the 'character' characteristics transfer in 4 (step 2160 - loss 0.429), by which point the model has already overfit, the best similarity (hair, face, bikini) comes in 15 (step 5400 - loss 0.373). Similar observations come up in other tests like this one (original video image at left): So, I decided to purge the corpus from 36 videos down to 28 videos while keeping the most diversity, and to change the learning rate to a more 'aggressive' one: The resulting W&B chart up to a little more than 3K steps: The collected checkpoints (step | loss):
For a more intuitive view, here is a chart of these values; the upper line is the regular checkpoints, the lower line the forced ones: I have not done intensive testing yet, but I can already confirm that the best (lowest) loss is not necessarily the best checkpoint. Here are two video samples exported from diffusers (257 frames / 30 fps / 50 denoising steps, no STG or enhancement):

REGULAR CHECKPOINT | STEP 720 | LOSS 0.472
FORCED CHECKPOINT | STEP 610 | LOSS 0.176

We can see that a low loss value does not determine quality; here the regular checkpoint at 0.472 is clearly better, even if the result is still not at its best. Good news: version 0.9.5 of LTXV is here and it's cool. It's even faster and seems to have better rendering results. I need to extend my investigations to find the best way to train this model. Maybe it would be cool to open a discussions section in your GitHub for sharing experiments, so as not to spam the 'issues' section. Thanks a lot for everything. PS: I will test Wan 1.3B LoRA training too ;)
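(For anyone wanting to reproduce these post-training samples, here is a hedged sketch of LoRA inference with the diffusers LTX pipeline, assuming a recent diffusers release with LTX-Video support; the checkpoint path, prompt, and seed are placeholders:)

```python
import torch
from diffusers import LTXPipeline
from diffusers.utils import export_to_video

pipe = LTXPipeline.from_pretrained("Lightricks/LTX-Video", torch_dtype=torch.bfloat16)
pipe.load_lora_weights("path/to/lora-checkpoint")   # placeholder: the trained LoRA weights
pipe.to("cuda")

video = pipe(
    prompt="a woman walking in a street",           # placeholder prompt
    num_frames=257,
    num_inference_steps=50,
    generator=torch.Generator("cuda").manual_seed(42),
).frames[0]

export_to_video(video, "sample.mp4", fps=30)
```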
Hi. Thanks a lot for the latest finetrainers version; the refactoring (dataset...) looks great. I am currently doing some (local) tests before proceeding with heavier training. I need community help with reading the step/epoch loss in the W&B charts.
I've built a little subset of videos to test a 'concept LoRA': 28 videos of 257 frames at 896x512. All videos are 'women walking in a street', with different backgrounds, dresses, and so on.

Because I run on my local RTX 4080 (16GB), I do the training using the legacy script and the optimizations `--precompute_conditions` / `--layerwise_upcasting_modules transformer`. I run the training on WSL2 (Ubuntu) using deepspeed with `--gradient_accumulation_steps` set to 1. `--train_epochs` is set to `200` for 5600 steps (1 epoch = 28 steps). I save a checkpoint at each epoch for a later analysis of the evolution of the training, because I do not use validation steps during training; I will compute a post-training validation video for each checkpoint to see the training evolution.

I use a LoRA `rank` of `256` (with an `alpha` of `128`). The `lr` value is `2e-5` with the `adamw` optimizer and a `weight decay` of `1e-4`.
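(For reference, roughly the same LoRA and optimizer settings expressed directly with PEFT and PyTorch; the target modules are an assumption for illustration, not necessarily what finetrainers selects internally:)

```python
import torch
from peft import LoraConfig

# LoRA settings matching the run described above; target_modules is an assumed
# list of attention projections, chosen only for illustration.
lora_config = LoraConfig(
    r=256,
    lora_alpha=128,
    target_modules=["to_q", "to_k", "to_v", "to_out.0"],
)

# Optimizer settings from the run: AdamW, lr 2e-5, weight decay 1e-4.
def make_optimizer(trainable_params):
    return torch.optim.AdamW(trainable_params, lr=2e-5, weight_decay=1e-4)
```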
Here is the W&B chart. In the screenshot, the best loss is seen at epoch 18 (step 504) and achieves a `0.274` epoch loss. Since my screenshot, the best epoch loss is `0.272` at epoch 51 (step 1428).

I have a few (maybe stupid and naive) questions about the charts: the epoch loss hovers around `0.3` (+/- 30%); is this a decent value for a LoRA? What is the best loss you can achieve in LoRA training?

Thank you very much for your help.