W0305 17:32:23.653000 36217 site-packages/torch/distributed/run.py:793]
W0305 17:32:23.653000 36217 site-packages/torch/distributed/run.py:793] *****************************************
W0305 17:32:23.653000 36217 site-packages/torch/distributed/run.py:793] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0305 17:32:23.653000 36217 site-packages/torch/distributed/run.py:793] *****************************************
Could not load Sliding Tile Attention.
--> loading model from data/hunyuan
Could not load Sliding Tile Attention.
Could not load Sliding Tile Attention.
Could not load Sliding Tile Attention.
Total training parameters = 12821.012544 M
--> Initializing FSDP with sharding strategy: full
--> applying fdsp activation checkpointing...
--> model loaded
--> applying fdsp activation checkpointing...
optimizer: AdamW (
Parameter Group 0
amsgrad: False
betas: (0.9, 0.999)
capturable: False
differentiable: False
eps: 1e-08
foreach: None
fused: None
lr: 1e-05
maximize: False
weight_decay: 0.01
)
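For reference, the optimizer block above is the repr of a standard `torch.optim.AdamW` instance; a minimal sketch constructing one with the same hyperparameters is shown below (the `Linear` module is a placeholder standing in for the actual transformer parameters, not the repository's code).

```python
import torch

# Sketch: AdamW configured with the hyperparameters printed above.
# The Linear module is only a stand-in for the real model's parameters.
model = torch.nn.Linear(8, 8)
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=1e-5,
    betas=(0.9, 0.999),
    eps=1e-8,
    weight_decay=0.01,
)
print(optimizer)  # prints a parameter-group summary like the block above
```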
wandb: Using wandb-core as the SDK backend. Please refer to https://wandb.me/wandb-core for more information.
--> applying fdsp activation checkpointing...
wandb: Currently logged in as: 774320139. Use `wandb login --relogin` to force relogin
--> applying fdsp activation checkpointing...
wandb: Tracking run with wandb version 0.18.5
wandb: Run data is saved locally in /sds_wangby/models/cjy/med_vid/code-wrs/FastVideo/wandb/run-20250305_173416-0xlwolf1
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run fresh-dust-2
wandb: ⭐️ View project at xxx
wandb: 🚀 View run at xxx
***** Running training *****
Num examples = 8212
Dataloader size = 2053
Num Epochs = 1
Resume training from step 0
Instantaneous batch size per device = 1
Total train batch size (w. data & sequence parallel, accumulation) = 1.0
Gradient Accumulation steps = 1
Total optimization steps = 2000
Total training parameters per FSDP shard = 3.205253136 B
Master weight dtype: torch.float32
Steps: 0%| | 0/2000 [00:00<?, ?it/s]
[rank1]: Traceback (most recent call last):
[rank1]: File "/sds_wangby/models/cjy/med_vid/code-wrs/FastVideo/fastvideo/train.py", line 663, in <module>
[rank1]: main(args)
[rank1]: File "/sds_wangby/models/cjy/med_vid/code-wrs/FastVideo/fastvideo/train.py", line 368, in main
[rank1]: loss, grad_norm = train_one_step(
[rank1]: ^^^^^^^^^^^^^^^
[rank1]: File "/sds_wangby/models/cjy/med_vid/code-wrs/FastVideo/fastvideo/train.py", line 146, in train_one_step
[rank1]: model_pred = transformer(**input_kwargs)[0]
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/opt/conda/envs/fastvideo/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank1]: return self._call_impl(*args, **kwargs)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/opt/conda/envs/fastvideo/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
[rank1]: return forward_call(*args, **kwargs)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/opt/conda/envs/fastvideo/lib/python3.12/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 864, in forward
[rank1]: output = self._fsdp_wrapped_module(*args, **kwargs)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/opt/conda/envs/fastvideo/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank1]: return self._call_impl(*args, **kwargs)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/opt/conda/envs/fastvideo/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
[rank1]: return forward_call(*args, **kwargs)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/sds_wangby/models/cjy/med_vid/code-wrs/FastVideo/fastvideo/models/hunyuan/modules/models.py", line 527, in forward
[rank1]: mask_strategy = [[None] * len(self.heads_num)
[rank1]: ^^^^^^^^^^^^^^^^^^^
[rank1]: TypeError: object of type 'int' has no len()
[rank0], [rank2], [rank3]: identical traceback, ending in TypeError: object of type 'int' has no len() at fastvideo/models/hunyuan/modules/models.py, line 527, in forward
W0305 17:34:20.453000 36217 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 36297 closing signal SIGTERM
W0305 17:34:20.458000 36217 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 36299 closing signal SIGTERM
W0305 17:34:20.460000 36217 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 36300 closing signal SIGTERM
E0305 17:34:21.191000 36217 site-packages/torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: 1) local_rank: 1 (pid: 36298) of binary: /opt/conda/envs/fastvideo/bin/python
Traceback (most recent call last):
File "/opt/conda/envs/fastvideo/bin/torchrun", line 8, in <module>
sys.exit(main())
^^^^^^
File "/opt/conda/envs/fastvideo/lib/python3.12/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
return f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/opt/conda/envs/fastvideo/lib/python3.12/site-packages/torch/distributed/run.py", line 919, in main
run(args)
File "/opt/conda/envs/fastvideo/lib/python3.12/site-packages/torch/distributed/run.py", line 910, in run
elastic_launch(
File "/opt/conda/envs/fastvideo/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 138, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/envs/fastvideo/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
fastvideo/train.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2025-03-05_17:34:20
host : esj6lnmgd7osq-0
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 36298)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
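As a side note, the `traceback : To enable traceback see ...` hint in the summary above refers to PyTorch's elastic error recording. A minimal sketch of how a training entry point can opt in is shown below; the bare `main()` is a stand-in for the script's real entry point, not the actual `fastvideo/train.py` code.

```python
# Sketch: opt in to torch.distributed.elastic error recording so the launcher
# summary reports the child traceback instead of "error_file: <N/A>".
from torch.distributed.elastic.multiprocessing.errors import record

@record
def main():
    ...  # training loop would go here

if __name__ == "__main__":
    main()
```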
Describe the bug
Training crashes at the first optimization step: `mask_strategy = [[None] * len(self.heads_num)` in fastvideo/models/hunyuan/modules/models.py (line 527) raises `TypeError: object of type 'int' has no len()` on every rank. The full launcher log is above.

Reproduction
preprocess_hunyuan_data.sh
finetune_hunyuan.sh

Environment
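As additional context for the error above: `len()` is being called on `self.heads_num`, which in this configuration appears to be a plain integer, so the call raises exactly this TypeError. A minimal, self-contained sketch reproducing the failure and one possible workaround follows; the concrete values (24 heads, 2 blocks) are illustrative assumptions, and the workaround is not necessarily the upstream fix.

```python
# Minimal sketch reproducing the failure and one possible workaround.
# Assumption for illustration only: heads_num is a plain int (e.g. 24 heads)
# and mask_strategy should hold one placeholder per head per block.
heads_num = 24   # hypothetical value, not read from the repository's config
num_blocks = 2   # hypothetical block count

try:
    # Failing pattern from the traceback: len() on an int raises TypeError.
    mask_strategy = [[None] * len(heads_num) for _ in range(num_blocks)]
except TypeError as exc:
    print(f"reproduces the error: {exc}")

# One possible workaround (not necessarily the upstream fix):
# use the int directly so each block gets one placeholder per head.
mask_strategy = [[None] * heads_num for _ in range(num_blocks)]
print(len(mask_strategy), len(mask_strategy[0]))  # -> 2 24
```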