[Wan] Potential bug. #306

Closed
dorpxam opened this issue Mar 9, 2025 · 3 comments · Fixed by #307

Comments
@dorpxam

dorpxam commented Mar 9, 2025

I've updated to the latest main branch and started a Wan LoRA training on the same corpus as a previous test. I don't remember when or with which version of the repo, but I had previously managed to start training on the Wan 1.3B model. I stopped that training after a few hundred steps; it was running at about twenty seconds per iteration, but it did run on my 16GB GPU.

By the way, I made a request to kijai about loading the LoRA with his ComfyUI node. You can download one of the previously trained checkpoints from one of my messages there.

kijai/ComfyUI-WanVideoWrapper#176

Here, I use the --enable_precomputation flag and, as you can see, precomputation takes ~30 minutes on my GPU. But right after precomputation finishes, I hit this bug.

+ export WANDB_MODE=online
+ WANDB_MODE=online
+ export NCCL_P2P_DISABLE=1
+ NCCL_P2P_DISABLE=1
+ export TORCH_NCCL_ENABLE_MONITORING=0
+ TORCH_NCCL_ENABLE_MONITORING=0
+ export FINETRAINERS_LOG_LEVEL=DEBUG
+ FINETRAINERS_LOG_LEVEL=DEBUG
+ BACKEND=ptd
+ NUM_GPUS=1
+ CUDA_VISIBLE_DEVICES=0
+ TRAINING_DATASET_CONFIG=scripts/wan/elizabeth/training.json
+ VALIDATION_DATASET_FILE=scripts/wan/elizabeth/validation.json
+ DDP_1='--parallel_backend ptd --pp_degree 1 --dp_degree 1 --dp_shards 1 --cp_degree 1 --tp_degree 1'
+ DDP_2='--parallel_backend ptd --pp_degree 1 --dp_degree 2 --dp_shards 1 --cp_degree 1 --tp_degree 1'
+ DDP_4='--parallel_backend ptd --pp_degree 1 --dp_degree 4 --dp_shards 1 --cp_degree 1 --tp_degree 1'
+ FSDP_2='--parallel_backend ptd --pp_degree 1 --dp_degree 1 --dp_shards 2 --cp_degree 1 --tp_degree 1'
+ FSDP_4='--parallel_backend ptd --pp_degree 1 --dp_degree 1 --dp_shards 4 --cp_degree 1 --tp_degree 1'
+ HSDP_2_2='--parallel_backend ptd --pp_degree 1 --dp_degree 2 --dp_shards 2 --cp_degree 1 --tp_degree 1'
+ parallel_cmd=($DDP_1)
+ model_cmd=(--model_name "wan" --pretrained_model_name_or_path "Wan-AI/Wan2.1-T2V-1.3B-Diffusers")
+ dataset_cmd=(--dataset_config $TRAINING_DATASET_CONFIG --dataset_shuffle_buffer_size 24 --precomputation_items 24 --precomputation_once --enable_precomputation)
+ dataloader_cmd=(--dataloader_num_workers 0)
+ diffusion_cmd=(--flow_weighting_scheme "logit_normal")
+ training_cmd=(--training_type "lora" --seed 42 --batch_size 1 --train_steps 2400 --rank 32 --lora_alpha 32 --target_modules "blocks.*(to_q|to_k|to_v|to_out.0)" --gradient_accumulation_steps 1 --gradient_checkpointing --checkpointing_steps 96 --checkpointing_limit 1000 --resume_from_checkpoint latest --enable_slicing --enable_tiling)
+ optimizer_cmd=(--optimizer "adamw" --lr 5e-5 --lr_scheduler "constant_with_warmup" --lr_warmup_steps 240 --lr_num_cycles 1 --beta1 0.9 --beta2 0.99 --weight_decay 1e-4 --epsilon 1e-8 --max_grad_norm 1.0)
+ validation_cmd=()
+ miscellaneous_cmd=(--tracker_name "finetrainers-wan" --output_dir "/mnt/f/training/wan/elizabeth" --init_timeout 600 --nccl_timeout 600 --report_to "wandb")
+ '[' ptd == accelerate ']'
+ '[' ptd == ptd ']'
+ export CUDA_VISIBLE_DEVICES=0
+ CUDA_VISIBLE_DEVICES=0
+ torchrun --standalone --nnodes=1 --nproc_per_node=1 --rdzv_backend c10d --rdzv_endpoint=localhost:0 train.py --parallel_backend ptd --pp_degree 1 --dp_degree 1 --dp_shards 1 --cp_degree 1 --tp_degree 1 --model_name wan --pretrained_model_name_or_path Wan-AI/Wan2.1-T2V-1.3B-Diffusers --dataset_config scripts/wan/elizabeth/training.json --dataset_shuffle_buffer_size 24 --precomputation_items 24 --precomputation_once --enable_precomputation --dataloader_num_workers 0 --flow_weighting_scheme logit_normal --training_type lora --seed 42 --batch_size 1 --train_steps 2400 --rank 32 --lora_alpha 32 --target_modules 'blocks.*(to_q|to_k|to_v|to_out.0)' --gradient_accumulation_steps 1 --gradient_checkpointing --checkpointing_steps 96 --checkpointing_limit 1000 --resume_from_checkpoint latest --enable_slicing --enable_tiling --optimizer adamw --lr 5e-5 --lr_scheduler constant_with_warmup --lr_warmup_steps 240 --lr_num_cycles 1 --beta1 0.9 --beta2 0.99 --weight_decay 1e-4 --epsilon 1e-8 --max_grad_norm 1.0 --tracker_name finetrainers-wan --output_dir /mnt/f/training/wan/elizabeth --init_timeout 600 --nccl_timeout 600 --report_to wandb
2025-03-09 10:40:17,200 - finetrainers - DEBUG - Successfully imported bitsandbytes version 0.45.3
DEBUG:finetrainers:Successfully imported bitsandbytes version 0.45.3
2025-03-09 10:40:17,203 - finetrainers - DEBUG - Remaining unparsed arguments: []
DEBUG:finetrainers:Remaining unparsed arguments: []
2025-03-09 10:40:17,844 - finetrainers - INFO - Initialized parallel state with:
  - World size: 1
  - Pipeline parallel degree: 1
  - Data parallel degree: 1
  - Context parallel degree: 1
  - Tensor parallel degree: 1
  - Data parallel shards: 1

INFO:finetrainers:Initialized parallel state with:
  - World size: 1
  - Pipeline parallel degree: 1
  - Data parallel degree: 1
  - Context parallel degree: 1
  - Tensor parallel degree: 1
  - Data parallel shards: 1

2025-03-09 10:40:17,845 - finetrainers - DEBUG - Device mesh: DeviceMesh('cuda', 0)
DEBUG:finetrainers:Device mesh: DeviceMesh('cuda', 0)
2025-03-09 10:40:17,845 - finetrainers - DEBUG - Enabling determinism: {'global_rank': 0, 'seed': 42}
DEBUG:finetrainers:Enabling determinism: {'global_rank': 0, 'seed': 42}
2025-03-09 10:40:17,846 - finetrainers - INFO - Initializing models
INFO:finetrainers:Initializing models
Fetching 2 files: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 26214.40it/s]
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00,  3.27it/s]
2025-03-09 10:40:18,851 - finetrainers - INFO - Initializing trainable parameters
INFO:finetrainers:Initializing trainable parameters
2025-03-09 10:40:18,851 - finetrainers - INFO - Finetuning transformer with PEFT parameters
INFO:finetrainers:Finetuning transformer with PEFT parameters
2025-03-09 10:40:19,908 - finetrainers - INFO - Initializing optimizer and lr scheduler
INFO:finetrainers:Initializing optimizer and lr scheduler
2025-03-09 10:40:19,911 - finetrainers - DEBUG - PytorchDTensorParallelBackend::prepare_optimizer completed!
DEBUG:finetrainers:PytorchDTensorParallelBackend::prepare_optimizer completed!
2025-03-09 10:40:19,912 - finetrainers - INFO - Initialized FineTrainers
INFO:finetrainers:Initialized FineTrainers
2025-03-09 10:40:19,912 - finetrainers - INFO - Initializing trackers: ['wandb']. Logging to log_dir='logs'
INFO:finetrainers:Initializing trackers: ['wandb']. Logging to log_dir='logs'
wandb: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.
wandb: Currently logged in as: maxprod2021 to https://api.wandb.ai. Use `wandb login --relogin` to force relogin
wandb: Tracking run with wandb version 0.19.8
wandb: Run data is saved locally in logs/wandb/run-20250309_104020-rupjw9m3
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run confused-feather-21
wandb: ⭐️ View project at https://wandb.ai/------/finetrainers-wan
wandb: 🚀 View run at https://wandb.ai/------/finetrainers-wan/runs/------
2025-03-09 10:40:20,912 - finetrainers - INFO - WandB logging enabled
INFO:finetrainers:WandB logging enabled
2025-03-09 10:40:20,913 - finetrainers - INFO - Initializing dataset and dataloader
INFO:finetrainers:Initializing dataset and dataloader
2025-03-09 10:40:20,914 - finetrainers - INFO - Training configured to use 1 datasets
INFO:finetrainers:Training configured to use 1 datasets
Resolving data files: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 25/25 [00:00<00:00, 18832.18it/s]
2025-03-09 10:40:21,999 - finetrainers - INFO - Initialized dataset: /mnt/f/datasets/wan/elizabeth
INFO:finetrainers:Initialized dataset: /mnt/f/datasets/wan/elizabeth
2025-03-09 10:40:22,000 - finetrainers - DEBUG - PytorchDTensorParallelBackend::prepare_dataset completed!
DEBUG:finetrainers:PytorchDTensorParallelBackend::prepare_dataset completed!
2025-03-09 10:40:22,000 - finetrainers - INFO - Initializing IterableDatasetPreprocessingWrapper for the dataset with the following configuration:
  - Dataset Type: video
  - ID Token: renruthtebazile
  - Image Resolution Buckets: None
  - Video Resolution Buckets: [[27, 512, 768], [29, 512, 768], [33, 512, 768], [39, 512, 768], [41, 512, 768], [46, 512, 768], [49, 512, 768], [53, 512, 768], [58, 512, 768], [62, 512, 768], [83, 512, 768], [96, 512, 768], [116, 512, 768], [168, 512, 768], [257, 512, 768]]
  - Reshape Mode: bicubic
  - Remove Common LLM Caption Prefixes: False

INFO:finetrainers:Initializing IterableDatasetPreprocessingWrapper for the dataset with the following configuration:
  - Dataset Type: video
  - ID Token: renruthtebazile
  - Image Resolution Buckets: None
  - Video Resolution Buckets: [[27, 512, 768], [29, 512, 768], [33, 512, 768], [39, 512, 768], [41, 512, 768], [46, 512, 768], [49, 512, 768], [53, 512, 768], [58, 512, 768], [62, 512, 768], [83, 512, 768], [96, 512, 768], [116, 512, 768], [168, 512, 768], [257, 512, 768]]
  - Reshape Mode: bicubic
  - Remove Common LLM Caption Prefixes: False

2025-03-09 10:40:22,000 - finetrainers - INFO - Initializing IterableCombinedDataset with the following configuration:
  - Number of Datasets: 1
  - Buffer Size: 24
  - Shuffle: True

INFO:finetrainers:Initializing IterableCombinedDataset with the following configuration:
  - Number of Datasets: 1
  - Buffer Size: 24
  - Shuffle: True

2025-03-09 10:40:22,000 - finetrainers - DEBUG - PytorchDTensorParallelBackend::prepare_dataloader completed!
DEBUG:finetrainers:PytorchDTensorParallelBackend::prepare_dataloader completed!
2025-03-09 10:40:22,000 - finetrainers - INFO - Checkpointing enabled. Checkpoints will be stored in '/mnt/f/training/wan/elizabeth'
INFO:finetrainers:Checkpointing enabled. Checkpoints will be stored in '/mnt/f/training/wan/elizabeth'
2025-03-09 10:40:22,003 - finetrainers - INFO - Loading checkpoint from '/mnt/f/training/wan/elizabeth/finetrainers_step_0' at step 0
INFO:finetrainers:Loading checkpoint from '/mnt/f/training/wan/elizabeth/finetrainers_step_0' at step 0
2025-03-09 10:40:27,893 - finetrainers - INFO - Loaded checkpoint in 8.70 seconds.
INFO:finetrainers:Loaded checkpoint in 8.70 seconds.
2025-03-09 10:40:27,893 - finetrainers - INFO - Starting training
INFO:finetrainers:Starting training
2025-03-09 10:40:27,894 - finetrainers - INFO - Memory before training start: {
    "memory_allocated": 2.738,
    "memory_reserved": 2.967,
    "max_memory_allocated": 2.739,
    "max_memory_reserved": 2.967
}
INFO:finetrainers:Memory before training start: {
    "memory_allocated": 2.738,
    "memory_reserved": 2.967,
    "max_memory_allocated": 2.739,
    "max_memory_reserved": 2.967
}
2025-03-09 10:40:27,894 - finetrainers - INFO - Training configuration: {
    "trainable parameters": 23592960,
    "train steps": 2400,
    "per-replica batch size": 1,
    "global batch size": 1,
    "gradient accumulation steps": 1
}
INFO:finetrainers:Training configuration: {
    "trainable parameters": 23592960,
    "train steps": 2400,
    "per-replica batch size": 1,
    "global batch size": 1,
    "gradient accumulation steps": 1
}
Training steps:   0%|          | 0/2400 [00:00<?, ?it/s]
2025-03-09 10:40:27,947 - finetrainers - DEBUG - Deleting files: []
DEBUG:finetrainers:Deleting files: []
2025-03-09 10:40:27,947 - finetrainers - INFO - Precomputed condition & latent data exhausted. Loading & preprocessing new data.
INFO:finetrainers:Precomputed condition & latent data exhausted. Loading & preprocessing new data.
/home/dorpxam/anaconda3/envs/finetrainers/lib/python3.11/site-packages/torch/distributed/checkpoint/filesystem.py:490: UserWarning: Detected an existing checkpoint in /mnt/f/training/wan/elizabeth/finetrainers_step_0/.metadata, overwriting since self.overwrite=True. Past version 2.5 of PyTorch, `overwrite` will default to False. Set this variable to True to maintain this functionality or False to raise when an existing checkpoint is found.
  warnings.warn(
2025-03-09 10:40:56,886 - finetrainers - INFO - Saved checkpoint in 31.79 seconds at step 0. Directory: /mnt/f/training/wan/elizabeth/finetrainers_step_0
INFO:finetrainers:Saved checkpoint in 31.79 seconds at step 0. Directory: /mnt/f/training/wan/elizabeth/finetrainers_step_0
Downloading shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:00<00:00, 9226.36it/s]
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:02<00:00,  2.30it/s]
Loading checkpoint shards: 100%|██████████| 5/5 [00:02<00:00,  2.34it/s]
2025-03-09 10:41:18,582 - finetrainers - INFO - Starting IterableCombinedDataset with 1 datasets
INFO:finetrainers:Starting IterableCombinedDataset with 1 datasets
2025-03-09 10:41:18,583 - finetrainers - INFO - Starting IterableDatasetPreprocessingWrapper for the dataset
INFO:finetrainers:Starting IterableDatasetPreprocessingWrapper for the dataset
Filling buffer from data iterator 0: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 24/24 [00:05<00:00,  4.18it/s]
Processing data on rank 0: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 24/24 [00:11<00:00,  2.01it/s]
Processing data on rank 0: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 24/24 [25:26<00:00, 63.60s/it]
2025-03-09 11:06:57,670 - finetrainers - INFO - Loading checkpoint from '/mnt/f/training/wan/elizabeth/finetrainers_step_0' at step 0
INFO:finetrainers:Loading checkpoint from '/mnt/f/training/wan/elizabeth/finetrainers_step_0' at step 0
2025-03-09 11:07:03,759 - finetrainers - INFO - Loaded checkpoint in 8.79 seconds.
INFO:finetrainers:Loaded checkpoint in 8.79 seconds.
2025-03-09 11:07:03,802 - finetrainers - DEBUG - Starting training step (1/2400)
DEBUG:finetrainers:Starting training step (1/2400)
2025-03-09 11:07:03,857 - finetrainers - ERROR - Error during training: 'NoneType' object is not callable
ERROR:finetrainers:Error during training: 'NoneType' object is not callable
wandb:
wandb: 🚀 View run confused-feather-21 at: https://wandb.ai/------/finetrainers-wan/runs/------
wandb: ⭐️ View project at: https://wandb.ai/------/finetrainers-wan
wandb: Synced 5 W&B file(s), 0 media file(s), 0 artifact file(s) and 0 other file(s)
wandb: Find logs at: logs/wandb/run-20250309_104020-rupjw9m3/logs
2025-03-09 11:07:04,962 - finetrainers - ERROR - An error occurred during training: 'NoneType' object is not callable
ERROR:finetrainers:An error occurred during training: 'NoneType' object is not callable
2025-03-09 11:07:04,962 - finetrainers - ERROR - Traceback (most recent call last):
  File "/home/dorpxam/ai/finetrainers/train.py", line 70, in main
    trainer.run()
  File "/home/dorpxam/ai/finetrainers/finetrainers/trainer/sft_trainer/trainer.py", line 97, in run
    raise e
  File "/home/dorpxam/ai/finetrainers/finetrainers/trainer/sft_trainer/trainer.py", line 92, in run
    self._train()
  File "/home/dorpxam/ai/finetrainers/finetrainers/trainer/sft_trainer/trainer.py", line 467, in _train
    pred, target, sigmas = self.model_specification.forward(
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dorpxam/ai/finetrainers/finetrainers/models/wan/base_specification.py", line 301, in forward
    pred = transformer(
           ^^^^^^^^^^^^
TypeError: 'NoneType' object is not callable

ERROR:finetrainers:Traceback (most recent call last):
  File "/home/dorpxam/ai/finetrainers/train.py", line 70, in main
    trainer.run()
  File "/home/dorpxam/ai/finetrainers/finetrainers/trainer/sft_trainer/trainer.py", line 97, in run
    raise e
  File "/home/dorpxam/ai/finetrainers/finetrainers/trainer/sft_trainer/trainer.py", line 92, in run
    self._train()
  File "/home/dorpxam/ai/finetrainers/finetrainers/trainer/sft_trainer/trainer.py", line 467, in _train
    pred, target, sigmas = self.model_specification.forward(
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dorpxam/ai/finetrainers/finetrainers/models/wan/base_specification.py", line 301, in forward
    pred = transformer(
           ^^^^^^^^^^^^
TypeError: 'NoneType' object is not callable

Training steps:   0%|                                                                                                                                                                                                                                                                               | 0/2400 [26:37<?, ?it/s]
+ echo -ne '-------------------- Finished executing script --------------------\n\n'
-------------------- Finished executing script --------------------
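
For what it's worth, the error itself is the standard Python failure when the variable that should hold the model holds `None` instead. Below is a purely illustrative, assumed minimal sketch of that failure mode; it is not the actual finetrainers code, and the real cause is whatever leaves the transformer unset before the first training step:

```python
# Hypothetical minimal reproduction -- NOT the actual finetrainers code.
# If the `transformer` reference is None when the first training step runs
# (for example, released during precomputation and never restored), calling
# it raises exactly the TypeError shown in the traceback above.
transformer = None

try:
    pred = transformer(hidden_states=None)
except TypeError as err:
    print(err)  # 'NoneType' object is not callable
```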
@a-r-r-o-w
Owner

Hey, sorry for the inconvenience. I just fixed the issue. It wasn't caught by the unit tests because of a bug in the tests that caused everything to pass by default. I'll make some more improvements addressing the actual cause of this problem soon.
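
For illustration, one way a test suite can end up passing by default (a hypothetical sketch, not the actual finetrainers test code):

```python
# Hypothetical sketch -- not the real finetrainers tests. Any exception raised
# by the code under test (including the AssertionError) is caught and ignored,
# so the test is reported as passing no matter what happens.
def test_training_step_runs():
    try:
        result = run_training_step()  # hypothetical helper under test
        assert result is not None
    except Exception:
        pass  # swallows failures -> the test can never fail
```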

BTW, please hold off a bit longer on training with Wan. A change was made upstream in diffusers that breaks current training: huggingface/diffusers#10998. I'll work on the fix ASAP.

@a-r-r-o-w
Owner

Opened #308 to fix the scaling-related changes from upstream. I've queued a run to verify it's correct and produces the same results as before; I'll update here once that's done.

@dorpxam
Author

dorpxam commented Mar 9, 2025

No problem, I understand perfectly. Take your time, man. I want to check training on the Wan model, but I'm also very interested in SkyReels training, if the I2V model fits under 16GB ;)
