
question about lora train #191

Open
syyxsxx opened this issue Jan 7, 2025 · 9 comments

@syyxsxx

syyxsxx commented Jan 7, 2025

System Info

diffusers: installed from source

Information

  • The official example scripts
  • My own modified scripts

Reproduction

I trained LoRAs with both CogVideoX and HunyuanVideo using the same dataset. The CogVideoX LoRA works well, but the HunyuanVideo LoRA has no effect at all. Is there anything I should pay attention to, or any parameters I should adjust?

Expected behavior

The issue gets resolved.

@a-r-r-o-w
Owner

a-r-r-o-w commented Jan 7, 2025

This is expected behaviour and surprised me too.

HunyuanVideo has been hard to train LoRAs for with good results unless the training is run for more steps. CogVideoX is able to learn concepts in a lower number of steps.

For example, CogVideoX starts to learn the Steamboat Disney dataset well within 2000-3000 steps, but the same takes HunyuanVideo about 7000-8000 steps. This is most likely because we don't have the optimal training hyperparameters for Hunyuan yet, so I would encourage more experimentation with higher step counts, different hyperparameters, etc.

I've tried cakify-effect training for HunyuanVideo (5000 steps, lr 1e-5, 183 videos, resolutions 17x512x768, 49x512x768, 81x512x768, adamw), but the results are not very promising. After speaking with a few others working on this, it seems that multi-aspect-ratio training and lots of examples work better. The most promising run (not mine, but someone from an art community) first trained only on images of cakes being cut, followed by videos, so I would maybe try doing the same. Here are some examples from my best training run of 5000 steps:

fc681c8a-b3a8-484a-a01a-0fbf04de93df.mp4
b.mp4
d5169e4f-1ef4-49d8-8b11-1a4f89aa34ea.mp4
2653f1f0-caf1-4e0c-a8a6-7cfe30d54250.mp4

Another thing that's worked for me is training with just images. This was discovered to be possible by someone else, so I won't take any credit, but it may also lead to a lower amount of motion in videos generated with the LoRA enabled. These are my results from a 10000-step run with 400 images of fake Pokémon:

output.mp4

Left to right is 4000 steps, 6000 steps, 8000 steps, and 10000 steps. As can be seen, until 4000 steps the model did not learn the type of creatures I wanted to generate at all, but it eventually converged. This may point to a need for learning_rate/weight_decay/optimizer tuning.

@Symbiomatrix

@a-r-r-o-w Would it perhaps be effective to train the LoRA simultaneously on a subject from images alongside a regularisation dataset containing random video samples unrelated to it, to prevent overfitting to stills?

@cseti007


Hi!
So far I've achieved the best results when I trained with both images and videos.
When I trained only with videos, I also found that it took about 6000 steps to get good results. However, when I used both images and videos together, I achieved good results sooner.
I haven't tried it with your repo yet; I used diffusion-pipe for this. Is it possible to use images and videos together with yours as well?

@syyxsxx
Author

syyxsxx commented Jan 14, 2025

@a-r-r-o-w
Hi, thank you for your reply.
I trained LoRAs using different learning rates (1e-4, 1e-5, 1e-6) on different datasets with 2000 steps on 8×H100, but none of them yielded good results.
What other hyperparameters can I adjust, or do you have any suggestions?

@a-r-r-o-w
Owner

@a-r-r-o-w Would it perhaps be effective to train the LoRA simultaneously on a subject from images alongside a regularisation dataset containing random video samples unrelated to it, to prevent overfitting to stills?

@Symbiomatrix Yes, I do plan to add support for a prior loss soon. We need to work on some improvements to the data loading experience first, after which I'll address this.
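
For concreteness, here is a minimal sketch of what a DreamBooth-style prior-preservation step could look like, assuming a generic denoising model and two dataloaders (subject and regularisation). `model`, the batch keys, and `prior_weight` are placeholders for illustration, not finetrainers APIs:

```python
# Minimal sketch of prior preservation, assuming a generic diffusion/flow
# training step. Everything named here (model, batch keys, prior_weight)
# is a placeholder, not an actual finetrainers interface.
import torch.nn.functional as F

def training_step(model, subject_batch, prior_batch, prior_weight=1.0):
    # Subject loss: pushes the LoRA toward the concept being learned.
    subject_pred = model(subject_batch["noisy_latents"],
                         subject_batch["timesteps"],
                         subject_batch["text_embeds"])
    subject_loss = F.mse_loss(subject_pred, subject_batch["target"])

    # Prior loss: unrelated regularisation videos keep the base model's
    # distribution (and its motion) from being forgotten or overfit to stills.
    prior_pred = model(prior_batch["noisy_latents"],
                       prior_batch["timesteps"],
                       prior_batch["text_embeds"])
    prior_loss = F.mse_loss(prior_pred, prior_batch["target"])

    return subject_loss + prior_weight * prior_loss
```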

Hi!
So far I've achieved the best results when I trained with both images and videos.
When I trained only with videos, I also found that it took about 6000 steps to get good results. However, when I used both images and videos together, I achieved good results sooner.
I haven't tried it with your repo yet; I used diffusion-pipe for this. Is it possible to use images and videos together with yours as well?

Hi @cseti007, nice to see you here! Yes, loading both images and videos should be possible. The dataset format is the same; you just need to point to the image files like you do the video files. There's a simple example here - you can combine videos and images however you want, as long as the metadata points to the correct files. The docs may also be helpful, though they're badly written.
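
As a rough illustration of the idea (not finetrainers' actual dataset code), one way to mix the two is to treat every image as a one-frame clip, so images and videos flow through the same training path. The class name and the `samples` metadata below are hypothetical:

```python
# Sketch of a mixed image/video dataset: images become single-frame clips so
# both media types share one loading path. Hypothetical layout, for illustration.
from pathlib import Path
from PIL import Image
from torch.utils.data import Dataset
from torchvision.io import read_video
import torchvision.transforms.functional as TF

IMAGE_EXTS = {".jpg", ".jpeg", ".png", ".webp"}

class MixedImageVideoDataset(Dataset):
    def __init__(self, root, samples):
        # `samples` is a list of (relative_path, caption) pairs -- a stand-in
        # for whatever metadata file the training script actually reads.
        self.root = Path(root)
        self.samples = samples

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        rel_path, caption = self.samples[idx]
        path = self.root / rel_path
        if path.suffix.lower() in IMAGE_EXTS:
            # Image -> (1, C, H, W): a single-frame "video".
            frames = TF.to_tensor(Image.open(path).convert("RGB")).unsqueeze(0)
        else:
            # Video -> (T, C, H, W), normalised to [0, 1].
            video, _, _ = read_video(str(path), pts_unit="sec")  # (T, H, W, C) uint8
            frames = video.permute(0, 3, 1, 2).float() / 255.0
        return {"frames": frames, "caption": caption}
```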

I trained LoRAs using different learning rates (1e-4, 1e-5, 1e-6) on different datasets with 2000 steps on 8×H100, but none of them yielded good results.
What other hyperparameters can I adjust, or do you have any suggestions?

I really don't have any perfect recommendations, tbh, and am still exploring myself to find settings that work fast (in a low number of training steps) and produce the exact effect/character I'm looking for. An LR between 1e-4 and 1e-6 works best. You can also try lowering weight decay to something like 1e-4 or 1e-5 to reduce the weight penalty on the LoRA when trying to overfit it to something specific. Other than that, there's not really much to play with without a better understanding of the training dynamics of each model...
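
For reference, a hedged example of the kind of optimizer settings discussed above: AdamW over the LoRA parameters only, with a lowered weight decay. The values are illustrative starting points and `model` is a placeholder, not a validated recommendation:

```python
# Illustrative AdamW setup for LoRA-only parameters; `model` is assumed to be
# a transformer with LoRA adapters already injected (placeholder).
import torch

lora_params = [p for n, p in model.named_parameters()
               if "lora" in n and p.requires_grad]

optimizer = torch.optim.AdamW(
    lora_params,
    lr=1e-5,            # somewhere in the 1e-4 to 1e-6 range mentioned above
    betas=(0.9, 0.95),
    weight_decay=1e-4,  # lowered to reduce the weight penalty on the LoRA
)
```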

You should probably use logit_normal for --flow_weighting_scheme if you're not already, since that is what both LTXV and HunyuanVideo were trained with. You could also try to understand what removing each layer of Hunyuan does, and target training to specific layers for specific data. You can also try playing with --flow_shift for training - the HunyuanVideo paper has a nice section mentioning values between 7 and 17 for inference, but I don't know for sure whether modifying this helps training.
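
To make the two flags concrete, here's a small sketch of logit-normal timestep sampling and the flow-shift transform as they're commonly described for flow-matching models. This is an illustration of the math, not finetrainers' actual implementation, and the default values are assumptions:

```python
# Logit-normal timestep sampling plus a flow-shift warp, as commonly used by
# flow-matching models. Defaults here are illustrative assumptions.
import torch

def sample_timesteps(batch_size, logit_mean=0.0, logit_std=1.0, flow_shift=7.0):
    # Logit-normal: sample in logit space and squash to (0, 1) with a sigmoid,
    # concentrating samples around the middle of the noise schedule.
    u = torch.randn(batch_size) * logit_std + logit_mean
    t = torch.sigmoid(u)
    # Flow shift: warp t toward the high-noise end; larger shift values put
    # more weight on noisier timesteps (the paper discusses 7-17 for inference).
    t = flow_shift * t / (1.0 + (flow_shift - 1.0) * t)
    return t
```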

@julia-0105

@a-r-r-o-w Hello, I have trained a LoRA for 10000 steps using LR 1e-5 and the black-and-white Mickey Mouse video dataset, but when using the prompts in the prompt file, the LoRA has no significant effect. What difference do you see between videos generated with and without the LoRA when you say it "works best"? Also, could you share the prompts you used for generation? Thank you very much!

@zqh0253

zqh0253 commented Jan 17, 2025

What difference do you see between videos generated with and without the LoRA when you say it "works best"?

Hey, have you resolved the issue? I faced a similar problem as well. I trained a LoRA for 5,000 steps (batch size of 8, learning rate of 3e-5, beta values of 0.9 and 0.95, and weight decay of 1e-5) using the CogVideoX-2b model on the Mickey Mouse video dataset, but the LoRA didn't show any significant impact.

@zqh0253

zqh0253 commented Jan 17, 2025


@a-r-r-o-w Could you let me know which version of CogVideoX you were fine-tuning and the batch size you used for that? I experimented with CogVideoX-2b (5,000 steps, batch size of 8, learning rate of 3e-5, beta values of 0.9 and 0.95, and weight decay of 1e-5) and found that LoRA training didn't yield noticeable effects. Thanks in advance!

@zhangvia

Did you try adamw8bit? It seems the loss doesn't decrease with adamw8bit the way it does with adamw, and the results are worse when training the same number of steps as with adamw. @a-r-r-o-w
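
For anyone comparing the two, a minimal sketch of swapping the optimizer class while keeping the same hyperparameters (requires bitsandbytes; `lora_params` and the hyperparameter values are placeholders):

```python
# Same hyperparameters, different optimizer class. The 8-bit variant keeps
# optimizer states in 8 bits to save memory, which can change convergence.
import torch
import bitsandbytes as bnb

common = dict(lr=1e-5, betas=(0.9, 0.95), weight_decay=1e-4)

opt_adamw = torch.optim.AdamW(lora_params, **common)        # full-precision states
opt_adamw8bit = bnb.optim.AdamW8bit(lora_params, **common)  # 8-bit states
```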
