
[Bug] TypeError: object of type 'int' has no len() #247

Closed
WangRongsheng opened this issue Mar 6, 2025 · 2 comments

@WangRongsheng

Environment

finetune_hunyuan.sh

export WANDB_BASE_URL="https://api.wandb.ai"
export WANDB_MODE=online

torchrun --nnodes 1 --nproc_per_node 4 \
    fastvideo/train.py \
    --seed 42 \
    --pretrained_model_name_or_path data/hunyuan \
    --dit_model_name_or_path data/hunyuan/hunyuan-video-t2v-720p/transformers/mp_rank_00_model_states.pt \
    --model_type "hunyuan" \
    --cache_dir data/.cache \
    --data_json_path /sds_wangby/models/cjy/med_vid/code-wrs/dataset/Image-Vid-Finetune-HunYuan/videos2caption.json \
    --validation_prompt_dir /sds_wangby/models/cjy/med_vid/code-wrs/dataset/Image-Vid-Finetune-HunYuan/validation \
    --gradient_checkpointing \
    --train_batch_size=1 \
    --num_latent_t 32 \
    --sp_size 4 \
    --train_sp_batch_size 1 \
    --dataloader_num_workers 4 \
    --gradient_accumulation_steps=1 \
    --max_train_steps=2000 \
    --learning_rate=1e-5 \
    --mixed_precision=bf16 \
    --checkpointing_steps=200 \
    --validation_steps 100 \
    --validation_sampling_steps 50 \
    --checkpoints_total_limit 3 \
    --allow_tf32 \
    --ema_start_step 0 \
    --cfg 0.0 \
    --ema_decay 0.999 \
    --log_validation \
    --output_dir=outputs/Full-Finetune-Hunyuan \
    --tracker_project_name Finetune-Hunyuan \
    --num_frames 93 \
    --num_height 480 \
    --num_width 720 \
    --shift 7 \
    --validation_guidance_scale "1.0"
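The parallel layout implied by these flags can be sanity-checked quickly. This is a back-of-the-envelope sketch, assuming the common convention that the effective batch is per-device batch × data-parallel degree × accumulation steps; it matches the "Total train batch size ... = 1.0" line in the log below.

```python
# Back-of-the-envelope check of the parallel layout implied by the
# torchrun flags above (assumes the common convention that the effective
# batch is per-device batch * data-parallel degree * accumulation steps).
nnodes = 1
nproc_per_node = 4
world_size = nnodes * nproc_per_node        # 4 GPUs in total

sp_size = 4                                 # sequence-parallel group size
assert world_size % sp_size == 0, "sp_size must divide world_size"
dp_degree = world_size // sp_size           # 1 data-parallel replica

train_batch_size = 1                        # per device
gradient_accumulation_steps = 1
effective_batch = train_batch_size * dp_degree * gradient_accumulation_steps
print(effective_batch)  # 1
```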

Describe the bug

W0305 17:32:23.653000 36217 site-packages/torch/distributed/run.py:793] 
W0305 17:32:23.653000 36217 site-packages/torch/distributed/run.py:793] *****************************************
W0305 17:32:23.653000 36217 site-packages/torch/distributed/run.py:793] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
W0305 17:32:23.653000 36217 site-packages/torch/distributed/run.py:793] *****************************************
Could not load Sliding Tile Attention.
--> loading model from data/hunyuan
Could not load Sliding Tile Attention.
Could not load Sliding Tile Attention.
Could not load Sliding Tile Attention.
  Total training parameters = 12821.012544 M
--> Initializing FSDP with sharding strategy: full
--> applying fdsp activation checkpointing...
--> model loaded
--> applying fdsp activation checkpointing...
optimizer: AdamW (
Parameter Group 0
    amsgrad: False
    betas: (0.9, 0.999)
    capturable: False
    differentiable: False
    eps: 1e-08
    foreach: None
    fused: None
    lr: 1e-05
    maximize: False
    weight_decay: 0.01
)
wandb: Using wandb-core as the SDK backend. Please refer to https://wandb.me/wandb-core for more information.
--> applying fdsp activation checkpointing...
wandb: Currently logged in as: 774320139. Use `wandb login --relogin` to force relogin
--> applying fdsp activation checkpointing...
wandb: Tracking run with wandb version 0.18.5
wandb: Run data is saved locally in /sds_wangby/models/cjy/med_vid/code-wrs/FastVideo/wandb/run-20250305_173416-0xlwolf1
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run fresh-dust-2
wandb: ⭐️ View project at xxx
wandb: 🚀 View run at xxx
***** Running training *****
  Num examples = 8212
  Dataloader size = 2053
  Num Epochs = 1
  Resume training from step 0
  Instantaneous batch size per device = 1
  Total train batch size (w. data & sequence parallel, accumulation) = 1.0
  Gradient Accumulation steps = 1
  Total optimization steps = 2000
  Total training parameters per FSDP shard = 3.205253136 B
  Master weight dtype: torch.float32
Steps:   0%|                                                                                                                                                 | 0/2000 [00:00<?, ?it/s]
Traceback (most recent call last):
[rank1]: Traceback (most recent call last):
[rank1]:   File "/sds_wangby/models/cjy/med_vid/code-wrs/FastVideo/fastvideo/train.py", line 663, in <module>
[rank1]:     main(args)
[rank1]:   File "/sds_wangby/models/cjy/med_vid/code-wrs/FastVideo/fastvideo/train.py", line 368, in main
[rank1]:     loss, grad_norm = train_one_step(
[rank1]:                       ^^^^^^^^^^^^^^^
[rank1]:   File "/sds_wangby/models/cjy/med_vid/code-wrs/FastVideo/fastvideo/train.py", line 146, in train_one_step
[rank1]:     model_pred = transformer(**input_kwargs)[0]
[rank1]:                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/opt/conda/envs/fastvideo/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank1]:     return self._call_impl(*args, **kwargs)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/opt/conda/envs/fastvideo/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
[rank1]:     return forward_call(*args, **kwargs)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/opt/conda/envs/fastvideo/lib/python3.12/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 864, in forward
[rank1]:     output = self._fsdp_wrapped_module(*args, **kwargs)
[rank1]:              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/opt/conda/envs/fastvideo/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank1]:     return self._call_impl(*args, **kwargs)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/opt/conda/envs/fastvideo/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
[rank1]:     return forward_call(*args, **kwargs)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/sds_wangby/models/cjy/med_vid/code-wrs/FastVideo/fastvideo/models/hunyuan/modules/models.py", line 527, in forward
[rank1]:     mask_strategy = [[None] * len(self.heads_num)
[rank1]:                               ^^^^^^^^^^^^^^^^^^^
[rank1]: TypeError: object of type 'int' has no len()
[ranks 2 and 3 print the identical traceback]
  File "/sds_wangby/models/cjy/med_vid/code-wrs/FastVideo/fastvideo/train.py", line 663, in <module>
    main(args)
  File "/sds_wangby/models/cjy/med_vid/code-wrs/FastVideo/fastvideo/train.py", line 368, in main
    loss, grad_norm = train_one_step(
                      ^^^^^^^^^^^^^^^
  File "/sds_wangby/models/cjy/med_vid/code-wrs/FastVideo/fastvideo/train.py", line 146, in train_one_step
    model_pred = transformer(**input_kwargs)[0]
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/fastvideo/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/fastvideo/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/fastvideo/lib/python3.12/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 864, in forward
    output = self._fsdp_wrapped_module(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/fastvideo/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/fastvideo/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/sds_wangby/models/cjy/med_vid/code-wrs/FastVideo/fastvideo/models/hunyuan/modules/models.py", line 527, in forward
    mask_strategy = [[None] * len(self.heads_num)
                              ^^^^^^^^^^^^^^^^^^^
TypeError: object of type 'int' has no len()
[rank 0 prints the identical traceback]
W0305 17:34:20.453000 36217 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 36297 closing signal SIGTERM
W0305 17:34:20.458000 36217 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 36299 closing signal SIGTERM
W0305 17:34:20.460000 36217 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 36300 closing signal SIGTERM
E0305 17:34:21.191000 36217 site-packages/torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: 1) local_rank: 1 (pid: 36298) of binary: /opt/conda/envs/fastvideo/bin/python
Traceback (most recent call last):
  File "/opt/conda/envs/fastvideo/bin/torchrun", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/opt/conda/envs/fastvideo/lib/python3.12/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/fastvideo/lib/python3.12/site-packages/torch/distributed/run.py", line 919, in main
    run(args)
  File "/opt/conda/envs/fastvideo/lib/python3.12/site-packages/torch/distributed/run.py", line 910, in run
    elastic_launch(
  File "/opt/conda/envs/fastvideo/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 138, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/fastvideo/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
fastvideo/train.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2025-03-05_17:34:20
  host      : esj6lnmgd7osq-0
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 36298)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
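As a cross-check, the two parameter counts printed in the log above are mutually consistent: under FSDP's `full` sharding strategy, each of the 4 ranks holds 1/world_size of the weights. A quick arithmetic sketch (not FastVideo code):

```python
# Cross-check the two parameter counts printed in the log: under FSDP
# "full" sharding, each rank holds 1/world_size of the parameters.
total_params_M = 12821.012544   # "Total training parameters = 12821.012544 M"
world_size = 4                  # torchrun: --nnodes 1 --nproc_per_node 4

per_shard_B = total_params_M / world_size / 1000.0  # millions -> billions
print(per_shard_B)  # ~3.205253136, matching "3.205253136 B" per FSDP shard
```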

Reproduction

preprocess_hunyuan_data.sh

# export WANDB_MODE="offline"
# export MASTER_PORT=30000

GPU_NUM=2 # 2,4,8
MODEL_PATH="data/hunyuan"
MODEL_TYPE="hunyuan"
DATA_MERGE_PATH="/sds_wangby/models/cjy/med_vid/code-wrs/dataset/merge.txt"
OUTPUT_DIR="/sds_wangby/models/cjy/med_vid/code-wrs/dataset/Image-Vid-Finetune-HunYuan"
VALIDATION_PATH="assets/prompt.txt"

if [ ! -d "$OUTPUT_DIR" ]; then
    mkdir -p "$OUTPUT_DIR"
fi

touch "${OUTPUT_DIR}/videos2caption_temp.json"

torchrun --nproc_per_node=$GPU_NUM \
    fastvideo/data_preprocess/preprocess_vae_latents.py \
    --model_path $MODEL_PATH \
    --data_merge_path $DATA_MERGE_PATH \
    --train_batch_size=4 \
    --max_height=480 \
    --max_width=720 \
    --num_frames=93 \
    --dataloader_num_workers 32 \
    --output_dir=$OUTPUT_DIR \
    --model_type $MODEL_TYPE \
    --train_fps 24 

torchrun --nproc_per_node=$GPU_NUM \
    fastvideo/data_preprocess/preprocess_text_embeddings.py \
    --model_type $MODEL_TYPE \
    --model_path $MODEL_PATH \
    --output_dir=$OUTPUT_DIR 

torchrun --nproc_per_node=1 \
    fastvideo/data_preprocess/preprocess_validation_text_embeddings.py \
    --model_type $MODEL_TYPE \
    --model_path $MODEL_PATH \
    --output_dir=$OUTPUT_DIR \
    --validation_prompt_txt $VALIDATION_PATH
@BrianChen1129
Collaborator

BrianChen1129 commented Mar 6, 2025

It should be fixed now!
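For anyone hitting the same trace: the failure is plain Python. `heads_num` is an int, and calling `len()` on an int raises exactly this TypeError. A minimal sketch of the buggy pattern and the likely shape of the fix (the variable names and layer count here are illustrative, not FastVideo's actual code):

```python
# Minimal reproduction of the failure mode in the traceback:
# calling len() on an int raises TypeError.
heads_num = 24    # number of attention heads: an int, not a sequence
num_layers = 2    # illustrative; not FastVideo's real layer count

try:
    # Buggy pattern: len() applied to the int head count.
    bad = [[None] * len(heads_num) for _ in range(num_layers)]
except TypeError as err:
    print(err)  # object of type 'int' has no len()

# Likely fix: use the int directly as the repetition count.
mask_strategy = [[None] * heads_num for _ in range(num_layers)]
print(len(mask_strategy), len(mask_strategy[0]))  # 2 24
```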

@WangRongsheng
Author

Solved it! Thank you!
