[Wan] Potential bug. #306

Closed
dorpxam opened this issue Mar 9, 2025 · 3 comments · Fixed by #307

Comments
@dorpxam

dorpxam commented Mar 9, 2025

I've updated to the latest main branch and started a Wan LoRA training on the same corpus as a previous test. I don't remember when or with which version of the repo, but I had previously managed to start training on the Wan 1.3B model. I stopped that training after a few hundred steps; it was running at about twenty seconds per iteration, but it did run on my 16GB GPU.

By the way, I made a request to kijai about loading the LoRA with his ComfyUI node. You can download one of the previously trained checkpoints from one of my messages there.

kijai/ComfyUI-WanVideoWrapper#176

Here, I use the --enable_precomputation flag and, as you can see, precomputation takes ~30 minutes on my GPU. But right after precomputation finishes, I hit this bug.

+ export WANDB_MODE=online
+ WANDB_MODE=online
+ export NCCL_P2P_DISABLE=1
+ NCCL_P2P_DISABLE=1
+ export TORCH_NCCL_ENABLE_MONITORING=0
+ TORCH_NCCL_ENABLE_MONITORING=0
+ export FINETRAINERS_LOG_LEVEL=DEBUG
+ FINETRAINERS_LOG_LEVEL=DEBUG
+ BACKEND=ptd
+ NUM_GPUS=1
+ CUDA_VISIBLE_DEVICES=0
+ TRAINING_DATASET_CONFIG=scripts/wan/elizabeth/training.json
+ VALIDATION_DATASET_FILE=scripts/wan/elizabeth/validation.json
+ DDP_1='--parallel_backend ptd --pp_degree 1 --dp_degree 1 --dp_shards 1 --cp_degree 1 --tp_degree 1'
+ DDP_2='--parallel_backend ptd --pp_degree 1 --dp_degree 2 --dp_shards 1 --cp_degree 1 --tp_degree 1'
+ DDP_4='--parallel_backend ptd --pp_degree 1 --dp_degree 4 --dp_shards 1 --cp_degree 1 --tp_degree 1'
+ FSDP_2='--parallel_backend ptd --pp_degree 1 --dp_degree 1 --dp_shards 2 --cp_degree 1 --tp_degree 1'
+ FSDP_4='--parallel_backend ptd --pp_degree 1 --dp_degree 1 --dp_shards 4 --cp_degree 1 --tp_degree 1'
+ HSDP_2_2='--parallel_backend ptd --pp_degree 1 --dp_degree 2 --dp_shards 2 --cp_degree 1 --tp_degree 1'
+ parallel_cmd=($DDP_1)
+ model_cmd=(--model_name "wan" --pretrained_model_name_or_path "Wan-AI/Wan2.1-T2V-1.3B-Diffusers")
+ dataset_cmd=(--dataset_config $TRAINING_DATASET_CONFIG --dataset_shuffle_buffer_size 24 --precomputation_items 24 --precomputation_once --enable_precomputation)
+ dataloader_cmd=(--dataloader_num_workers 0)
+ diffusion_cmd=(--flow_weighting_scheme "logit_normal")
+ training_cmd=(--training_type "lora" --seed 42 --batch_size 1 --train_steps 2400 --rank 32 --lora_alpha 32 --target_modules "blocks.*(to_q|to_k|to_v|to_out.0)" --gradient_accumulation_steps 1 --gradient_checkpointing --checkpointing_steps 96 --checkpointing_limit 1000 --resume_from_checkpoint latest --enable_slicing --enable_tiling)
+ optimizer_cmd=(--optimizer "adamw" --lr 5e-5 --lr_scheduler "constant_with_warmup" --lr_warmup_steps 240 --lr_num_cycles 1 --beta1 0.9 --beta2 0.99 --weight_decay 1e-4 --epsilon 1e-8 --max_grad_norm 1.0)
+ validation_cmd=()
+ miscellaneous_cmd=(--tracker_name "finetrainers-wan" --output_dir "/mnt/f/training/wan/elizabeth" --init_timeout 600 --nccl_timeout 600 --report_to "wandb")
+ '[' ptd == accelerate ']'
+ '[' ptd == ptd ']'
+ export CUDA_VISIBLE_DEVICES=0
+ CUDA_VISIBLE_DEVICES=0
+ torchrun --standalone --nnodes=1 --nproc_per_node=1 --rdzv_backend c10d --rdzv_endpoint=localhost:0 train.py --parallel_backend ptd --pp_degree 1 --dp_degree 1 --dp_shards 1 --cp_degree 1 --tp_degree 1 --model_name wan --pretrained_model_name_or_path Wan-AI/Wan2.1-T2V-1.3B-Diffusers --dataset_config scripts/wan/elizabeth/training.json --dataset_shuffle_buffer_size 24 --precomputation_items 24 --precomputation_once --enable_precomputation --dataloader_num_workers 0 --flow_weighting_scheme logit_normal --training_type lora --seed 42 --batch_size 1 --train_steps 2400 --rank 32 --lora_alpha 32 --target_modules 'blocks.*(to_q|to_k|to_v|to_out.0)' --gradient_accumulation_steps 1 --gradient_checkpointing --checkpointing_steps 96 --checkpointing_limit 1000 --resume_from_checkpoint latest --enable_slicing --enable_tiling --optimizer adamw --lr 5e-5 --lr_scheduler constant_with_warmup --lr_warmup_steps 240 --lr_num_cycles 1 --beta1 0.9 --beta2 0.99 --weight_decay 1e-4 --epsilon 1e-8 --max_grad_norm 1.0 --tracker_name finetrainers-wan --output_dir /mnt/f/training/wan/elizabeth --init_timeout 600 --nccl_timeout 600 --report_to wandb
2025-03-09 10:40:17,200 - finetrainers - DEBUG - Successfully imported bitsandbytes version 0.45.3
DEBUG:finetrainers:Successfully imported bitsandbytes version 0.45.3
2025-03-09 10:40:17,203 - finetrainers - DEBUG - Remaining unparsed arguments: []
DEBUG:finetrainers:Remaining unparsed arguments: []
2025-03-09 10:40:17,844 - finetrainers - INFO - Initialized parallel state with:
  - World size: 1
  - Pipeline parallel degree: 1
  - Data parallel degree: 1
  - Context parallel degree: 1
  - Tensor parallel degree: 1
  - Data parallel shards: 1

INFO:finetrainers:Initialized parallel state with:
  - World size: 1
  - Pipeline parallel degree: 1
  - Data parallel degree: 1
  - Context parallel degree: 1
  - Tensor parallel degree: 1
  - Data parallel shards: 1

2025-03-09 10:40:17,845 - finetrainers - DEBUG - Device mesh: DeviceMesh('cuda', 0)
DEBUG:finetrainers:Device mesh: DeviceMesh('cuda', 0)
2025-03-09 10:40:17,845 - finetrainers - DEBUG - Enabling determinism: {'global_rank': 0, 'seed': 42}
DEBUG:finetrainers:Enabling determinism: {'global_rank': 0, 'seed': 42}
2025-03-09 10:40:17,846 - finetrainers - INFO - Initializing models
INFO:finetrainers:Initializing models
Fetching 2 files: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 26214.40it/s]
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00,  3.27it/s]
2025-03-09 10:40:18,851 - finetrainers - INFO - Initializing trainable parameters
INFO:finetrainers:Initializing trainable parameters
2025-03-09 10:40:18,851 - finetrainers - INFO - Finetuning transformer with PEFT parameters
INFO:finetrainers:Finetuning transformer with PEFT parameters
2025-03-09 10:40:19,908 - finetrainers - INFO - Initializing optimizer and lr scheduler
INFO:finetrainers:Initializing optimizer and lr scheduler
2025-03-09 10:40:19,911 - finetrainers - DEBUG - PytorchDTensorParallelBackend::prepare_optimizer completed!
DEBUG:finetrainers:PytorchDTensorParallelBackend::prepare_optimizer completed!
2025-03-09 10:40:19,912 - finetrainers - INFO - Initialized FineTrainers
INFO:finetrainers:Initialized FineTrainers
2025-03-09 10:40:19,912 - finetrainers - INFO - Initializing trackers: ['wandb']. Logging to log_dir='logs'
INFO:finetrainers:Initializing trackers: ['wandb']. Logging to log_dir='logs'
wandb: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.
wandb: Currently logged in as: maxprod2021 to https://api.wandb.ai. Use `wandb login --relogin` to force relogin
wandb: Tracking run with wandb version 0.19.8
wandb: Run data is saved locally in logs/wandb/run-20250309_104020-rupjw9m3
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run confused-feather-21
wandb: ⭐️ View project at https://wandb.ai/------/finetrainers-wan
wandb: 🚀 View run at https://wandb.ai/------/finetrainers-wan/runs/------
2025-03-09 10:40:20,912 - finetrainers - INFO - WandB logging enabled
INFO:finetrainers:WandB logging enabled
2025-03-09 10:40:20,913 - finetrainers - INFO - Initializing dataset and dataloader
INFO:finetrainers:Initializing dataset and dataloader
2025-03-09 10:40:20,914 - finetrainers - INFO - Training configured to use 1 datasets
INFO:finetrainers:Training configured to use 1 datasets
Resolving data files: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 25/25 [00:00<00:00, 18832.18it/s]
2025-03-09 10:40:21,999 - finetrainers - INFO - Initialized dataset: /mnt/f/datasets/wan/elizabeth
INFO:finetrainers:Initialized dataset: /mnt/f/datasets/wan/elizabeth
2025-03-09 10:40:22,000 - finetrainers - DEBUG - PytorchDTensorParallelBackend::prepare_dataset completed!
DEBUG:finetrainers:PytorchDTensorParallelBackend::prepare_dataset completed!
2025-03-09 10:40:22,000 - finetrainers - INFO - Initializing IterableDatasetPreprocessingWrapper for the dataset with the following configuration:
  - Dataset Type: video
  - ID Token: renruthtebazile
  - Image Resolution Buckets: None
  - Video Resolution Buckets: [[27, 512, 768], [29, 512, 768], [33, 512, 768], [39, 512, 768], [41, 512, 768], [46, 512, 768], [49, 512, 768], [53, 512, 768], [58, 512, 768], [62, 512, 768], [83, 512, 768], [96, 512, 768], [116, 512, 768], [168, 512, 768], [257, 512, 768]]
  - Reshape Mode: bicubic
  - Remove Common LLM Caption Prefixes: False

INFO:finetrainers:Initializing IterableDatasetPreprocessingWrapper for the dataset with the following configuration:
  - Dataset Type: video
  - ID Token: renruthtebazile
  - Image Resolution Buckets: None
  - Video Resolution Buckets: [[27, 512, 768], [29, 512, 768], [33, 512, 768], [39, 512, 768], [41, 512, 768], [46, 512, 768], [49, 512, 768], [53, 512, 768], [58, 512, 768], [62, 512, 768], [83, 512, 768], [96, 512, 768], [116, 512, 768], [168, 512, 768], [257, 512, 768]]
  - Reshape Mode: bicubic
  - Remove Common LLM Caption Prefixes: False

2025-03-09 10:40:22,000 - finetrainers - INFO - Initializing IterableCombinedDataset with the following configuration:
  - Number of Datasets: 1
  - Buffer Size: 24
  - Shuffle: True

INFO:finetrainers:Initializing IterableCombinedDataset with the following configuration:
  - Number of Datasets: 1
  - Buffer Size: 24
  - Shuffle: True

2025-03-09 10:40:22,000 - finetrainers - DEBUG - PytorchDTensorParallelBackend::prepare_dataloader completed!
DEBUG:finetrainers:PytorchDTensorParallelBackend::prepare_dataloader completed!
2025-03-09 10:40:22,000 - finetrainers - INFO - Checkpointing enabled. Checkpoints will be stored in '/mnt/f/training/wan/elizabeth'
INFO:finetrainers:Checkpointing enabled. Checkpoints will be stored in '/mnt/f/training/wan/elizabeth'
2025-03-09 10:40:22,003 - finetrainers - INFO - Loading checkpoint from '/mnt/f/training/wan/elizabeth/finetrainers_step_0' at step 0
INFO:finetrainers:Loading checkpoint from '/mnt/f/training/wan/elizabeth/finetrainers_step_0' at step 0
2025-03-09 10:40:27,893 - finetrainers - INFO - Loaded checkpoint in 8.70 seconds.
INFO:finetrainers:Loaded checkpoint in 8.70 seconds.
2025-03-09 10:40:27,893 - finetrainers - INFO - Starting training
INFO:finetrainers:Starting training
2025-03-09 10:40:27,894 - finetrainers - INFO - Memory before training start: {
    "memory_allocated": 2.738,
    "memory_reserved": 2.967,
    "max_memory_allocated": 2.739,
    "max_memory_reserved": 2.967
}
INFO:finetrainers:Memory before training start: {
    "memory_allocated": 2.738,
    "memory_reserved": 2.967,
    "max_memory_allocated": 2.739,
    "max_memory_reserved": 2.967
}
2025-03-09 10:40:27,894 - finetrainers - INFO - Training configuration: {
    "trainable parameters": 23592960,
    "train steps": 2400,
    "per-replica batch size": 1,
    "global batch size": 1,
    "gradient accumulation steps": 1
}
INFO:finetrainers:Training configuration: {
    "trainable parameters": 23592960,
    "train steps": 2400,
    "per-replica batch size": 1,
    "global batch size": 1,
    "gradient accumulation steps": 1
}
Training steps:   0%|          | 0/2400 [00:00<?, ?it/s]
2025-03-09 10:40:27,947 - finetrainers - DEBUG - Deleting files: []
DEBUG:finetrainers:Deleting files: []
2025-03-09 10:40:27,947 - finetrainers - INFO - Precomputed condition & latent data exhausted. Loading & preprocessing new data.
INFO:finetrainers:Precomputed condition & latent data exhausted. Loading & preprocessing new data.
/home/dorpxam/anaconda3/envs/finetrainers/lib/python3.11/site-packages/torch/distributed/checkpoint/filesystem.py:490: UserWarning: Detected an existing checkpoint in /mnt/f/training/wan/elizabeth/finetrainers_step_0/.metadata, overwriting since self.overwrite=True. Past version 2.5 of PyTorch, `overwrite` will default to False. Set this variable to True to maintain this functionality or False to raise when an existing checkpoint is found.
  warnings.warn(
2025-03-09 10:40:56,886 - finetrainers - INFO - Saved checkpoint in 31.79 seconds at step 0. Directory: /mnt/f/training/wan/elizabeth/finetrainers_step_0
INFO:finetrainers:Saved checkpoint in 31.79 seconds at step 0. Directory: /mnt/f/training/wan/elizabeth/finetrainers_step_0
Downloading shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:00<00:00, 9226.36it/s]
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:02<00:00,  2.30it/s]
Loading checkpoint shards: 100%|██████████| 5/5 [00:02<00:00,  2.34it/s]
2025-03-09 10:41:18,582 - finetrainers - INFO - Starting IterableCombinedDataset with 1 datasets
INFO:finetrainers:Starting IterableCombinedDataset with 1 datasets
2025-03-09 10:41:18,583 - finetrainers - INFO - Starting IterableDatasetPreprocessingWrapper for the dataset
INFO:finetrainers:Starting IterableDatasetPreprocessingWrapper for the dataset
Filling buffer from data iterator 0: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 24/24 [00:05<00:00,  4.18it/s]
Processing data on rank 0: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 24/24 [00:11<00:00,  2.01it/s]
Processing data on rank 0: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 24/24 [25:26<00:00, 63.60s/it]
2025-03-09 11:06:57,670 - finetrainers - INFO - Loading checkpoint from '/mnt/f/training/wan/elizabeth/finetrainers_step_0' at step 0
INFO:finetrainers:Loading checkpoint from '/mnt/f/training/wan/elizabeth/finetrainers_step_0' at step 0
2025-03-09 11:07:03,759 - finetrainers - INFO - Loaded checkpoint in 8.79 seconds.
INFO:finetrainers:Loaded checkpoint in 8.79 seconds.
2025-03-09 11:07:03,802 - finetrainers - DEBUG - Starting training step (1/2400)
DEBUG:finetrainers:Starting training step (1/2400)
2025-03-09 11:07:03,857 - finetrainers - ERROR - Error during training: 'NoneType' object is not callable
ERROR:finetrainers:Error during training: 'NoneType' object is not callable
wandb:
wandb: 🚀 View run confused-feather-21 at: https://wandb.ai/------/finetrainers-wan/runs/------
wandb: ⭐️ View project at: https://wandb.ai/------/finetrainers-wan
wandb: Synced 5 W&B file(s), 0 media file(s), 0 artifact file(s) and 0 other file(s)
wandb: Find logs at: logs/wandb/run-20250309_104020-rupjw9m3/logs
2025-03-09 11:07:04,962 - finetrainers - ERROR - An error occurred during training: 'NoneType' object is not callable
ERROR:finetrainers:An error occurred during training: 'NoneType' object is not callable
2025-03-09 11:07:04,962 - finetrainers - ERROR - Traceback (most recent call last):
  File "/home/dorpxam/ai/finetrainers/train.py", line 70, in main
    trainer.run()
  File "/home/dorpxam/ai/finetrainers/finetrainers/trainer/sft_trainer/trainer.py", line 97, in run
    raise e
  File "/home/dorpxam/ai/finetrainers/finetrainers/trainer/sft_trainer/trainer.py", line 92, in run
    self._train()
  File "/home/dorpxam/ai/finetrainers/finetrainers/trainer/sft_trainer/trainer.py", line 467, in _train
    pred, target, sigmas = self.model_specification.forward(
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dorpxam/ai/finetrainers/finetrainers/models/wan/base_specification.py", line 301, in forward
    pred = transformer(
           ^^^^^^^^^^^^
TypeError: 'NoneType' object is not callable

ERROR:finetrainers:Traceback (most recent call last):
  File "/home/dorpxam/ai/finetrainers/train.py", line 70, in main
    trainer.run()
  File "/home/dorpxam/ai/finetrainers/finetrainers/trainer/sft_trainer/trainer.py", line 97, in run
    raise e
  File "/home/dorpxam/ai/finetrainers/finetrainers/trainer/sft_trainer/trainer.py", line 92, in run
    self._train()
  File "/home/dorpxam/ai/finetrainers/finetrainers/trainer/sft_trainer/trainer.py", line 467, in _train
    pred, target, sigmas = self.model_specification.forward(
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dorpxam/ai/finetrainers/finetrainers/models/wan/base_specification.py", line 301, in forward
    pred = transformer(
           ^^^^^^^^^^^^
TypeError: 'NoneType' object is not callable

Training steps:   0%|                                                                                                                                                                                                                                                                               | 0/2400 [26:37<?, ?it/s]
+ echo -ne '-------------------- Finished executing script --------------------\n\n'
-------------------- Finished executing script --------------------
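
For what it's worth, the error itself is the standard Python failure when the variable that should hold the model holds `None` instead. Below is a purely illustrative, assumed minimal sketch of that failure mode; it is not the actual finetrainers code, and the real cause is whatever leaves the transformer unset before the first training step:

```python
# Hypothetical minimal reproduction -- NOT the actual finetrainers code.
# If the `transformer` reference is None when the first training step runs
# (for example, released during precomputation and never restored), calling
# it raises exactly the TypeError shown in the traceback above.
transformer = None

try:
    pred = transformer(hidden_states=None)
except TypeError as err:
    print(err)  # 'NoneType' object is not callable
```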
@a-r-r-o-w
Owner

Hey, sorry for the inconvenience. I just fixed the issue. It wasn't caught by the unit tests because of a bug in the tests that caused everything to pass by default. I'll make some more improvements addressing the actual cause of this problem soon.
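
For illustration, one way a test suite can end up passing by default (a hypothetical sketch, not the actual finetrainers test code):

```python
# Hypothetical sketch -- not the real finetrainers tests. Any exception raised
# by the code under test (including the AssertionError) is caught and ignored,
# so the test is reported as passing no matter what happens.
def test_training_step_runs():
    try:
        result = run_training_step()  # hypothetical helper under test
        assert result is not None
    except Exception:
        pass  # swallows failures -> the test can never fail
```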

BTW, please hold off a bit longer on training with Wan. A change was made upstream in diffusers that breaks current training: huggingface/diffusers#10998. I'll work on the fix ASAP.

@a-r-r-o-w
Owner

Opened #308 to fix the scaling-related changes from upstream. I've queued a run to verify it's correct and produces the same results as before; I'll update here once that's done.

@dorpxam
Author

dorpxam commented Mar 9, 2025

No problem, I understand perfectly. Take your time, man. I want to check training on the Wan model, but I'm also very interested in SkyReels training, if the I2V model fits under 16GB ;)
