[Bug] fail to run finetune_hunyuan_hf_lora.sh #242

Open
gobigrassland opened this issue Mar 5, 2025 · 4 comments

Environment

None

Describe the bug

Following data_preprocess.md, I preprocessed a small dataset (Black-Myth-Wukong). Then I ran finetune_hunyuan_hf_lora.sh, and the program freezes.

The log output is as follows:

W0305 15:56:54.179762 140358333335360 torch/distributed/run.py:779]
W0305 15:56:54.179762 140358333335360 torch/distributed/run.py:779] *****************************************
W0305 15:56:54.179762 140358333335360 torch/distributed/run.py:779] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0305 15:56:54.179762 140358333335360 torch/distributed/run.py:779] *****************************************
Could not load Sliding Tile Attention.
Could not load Sliding Tile Attention.
Could not load Sliding Tile Attention.
Could not load Sliding Tile Attention.
--> loading model from /export/App/training_platform/PinoModel/cbfs/xx/FastVideo/data/hunyuan_diffusers
  Total training parameters = 40.894464 M
--> Initializing FSDP with sharding strategy: full
--> applying fdsp activation checkpointing...
--> applying fdsp activation checkpointing...
--> applying fdsp activation checkpointing...
--> model loaded
--> applying fdsp activation checkpointing...
optimizer: AdamW (
Parameter Group 0
    amsgrad: False
    betas: (0.9, 0.999)
    capturable: False
    differentiable: False
    eps: 1e-08
    foreach: None
    fused: None
    lr: 8e-05
    maximize: False
    weight_decay: 0.01
)
***** Running training *****
  Num examples = 88
  Dataloader size = 22
  Num Epochs = 273
  Resume training from step 0
  Instantaneous batch size per device = 1
  Total train batch size (w. data & sequence parallel, accumulation) = 4.0
  Gradient Accumulation steps = 4
  Total optimization steps = 6000
  Total training parameters per FSDP shard = 0.010223616 B
  Master weight dtype: torch.float32
Steps:   0%|                                                                                                                                                         | 0/6000 [00:00<?, ?it/s]
/opt/conda/envs/hunyuan/lib/python3.10/site-packages/torch/utils/checkpoint.py:1399: FutureWarning: `torch.cpu.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cpu', args...)` instead.
  with device_autocast_ctx, torch.cpu.amp.autocast(**cpu_autocast_kwargs), recompute_context:  # type: ignore[attr-defined]
Steps:   0%|                                                                                          | 1/6000 [36:55<3691:37:01, 2215.34s/it, loss=0.2984, step_time=2215.34s, grad_norm=nan]
train_loss: 0.29840828012675047, learning_rate: 8e-05, step_time: 2215.335999250412, avg_step_time: 2215.335999250412, grad_norm: nan, step: 1
[rank0]:[E305 16:45:40.958950730 ProcessGroupNCCL.cpp:607] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=3369, OpType=ALLTOALL, NumelIn=610560, NumelOut=610560, Timeout(ms)=600000) ran for 600000 milliseconds before timing out.
[rank0]:[E305 16:45:40.959708714 ProcessGroupNCCL.cpp:1664] [PG 1 Rank 0] Exception (either an error or timeout) detected by watchdog at work: 3369, last enqueued NCCL work: 3372, last completed NCCL work: 3368.
[rank1]:[E305 16:45:40.019785199 ProcessGroupNCCL.cpp:607] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=3369, OpType=ALLTOALL, NumelIn=610560, NumelOut=610560, Timeout(ms)=600000) ran for 600065 milliseconds before timing out.
[rank1]:[E305 16:45:40.020167215 ProcessGroupNCCL.cpp:1664] [PG 1 Rank 1] Exception (either an error or timeout) detected by watchdog at work: 3369, last enqueued NCCL work: 3372, last completed NCCL work: 3368.
[rank3]:[E305 16:45:40.044371672 ProcessGroupNCCL.cpp:607] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=3369, OpType=BROADCAST, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600090 milliseconds before timing out.
[rank3]:[E305 16:45:40.044665646 ProcessGroupNCCL.cpp:1664] [PG 1 Rank 3] Exception (either an error or timeout) detected by watchdog at work: 3369, last enqueued NCCL work: 3369, last completed NCCL work: 3368.
[rank2]:[E305 16:45:40.051277250 ProcessGroupNCCL.cpp:607] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=3369, OpType=ALLTOALL, NumelIn=610560, NumelOut=610560, Timeout(ms)=600000) ran for 600097 milliseconds before timing out.
[rank2]:[E305 16:45:40.051536676 ProcessGroupNCCL.cpp:1664] [PG 1 Rank 2] Exception (either an error or timeout) detected by watchdog at work: 3369, last enqueued NCCL work: 3372, last completed NCCL work: 3368.
[rank3]: Traceback (most recent call last):
[rank3]:   File "/home/xx/Works/VideoGen/FastVideo-0303/fastvideo/train.py", line 665, in <module>
[rank3]:     main(args)
[rank3]:   File "/home/xx/Works/VideoGen/FastVideo-0303/fastvideo/train.py", line 368, in main
[rank3]:     loss, grad_norm = train_one_step(
[rank3]:   File "/home/xx/Works/VideoGen/FastVideo-0303/fastvideo/train.py", line 128, in train_one_step
[rank3]:     sigmas = get_sigmas(
[rank3]:   File "/home/xx/Works/VideoGen/FastVideo-0303/fastvideo/train.py", line 78, in get_sigmas
[rank3]:     step_indices = [(schedule_timesteps == t).nonzero().item() for t in timesteps]
[rank3]:   File "/home/xx/Works/VideoGen/FastVideo-0303/fastvideo/train.py", line 78, in <listcomp>
[rank3]:     step_indices = [(schedule_timesteps == t).nonzero().item() for t in timesteps]
[rank3]: RuntimeError: a Tensor with 0 elements cannot be converted to Scalar
[rank0]:[E305 16:45:41.086968448 ProcessGroupNCCL.cpp:1709] [PG 1 Rank 0] Timeout at NCCL work: 3369, last enqueued NCCL work: 3372, last completed NCCL work: 3368.
[rank0]:[E305 16:45:41.086991975 ProcessGroupNCCL.cpp:621] [Rank 0] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank0]:[E305 16:45:41.087001402 ProcessGroupNCCL.cpp:627] [Rank 0] To avoid data inconsistency, we are taking the entire process down.
[rank0]:[E305 16:45:41.088377452 ProcessGroupNCCL.cpp:1515] [PG 1 Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=3369, OpType=ALLTOALL, NumelIn=610560, NumelOut=610560, Timeout(ms)=600000) ran for 600000 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:609 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7fbd46b6bf86 in /opt/conda/envs/hunyuan/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7fbd47e688d2 in /opt/conda/envs/hunyuan/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7fbd47e6f313 in /opt/conda/envs/hunyuan/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fbd47e716fc in /opt/conda/envs/hunyuan/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xdbbf4 (0x7fbd95616bf4 in /opt/conda/envs/hunyuan/bin/../lib/libstdc++.so.6)
frame #5: <unknown function> + 0x7ea5 (0x7fbd96c10ea5 in /lib64/libpthread.so.0)
frame #6: clone + 0x6d (0x7fbd96939b0d in /lib64/libc.so.6)
[rank1]:[E305 16:45:41.093941702 ProcessGroupNCCL.cpp:1709] [PG 1 Rank 1] Timeout at NCCL work: 3369, last enqueued NCCL work: 3372, last completed NCCL work: 3368.
[rank1]:[E305 16:45:41.093961690 ProcessGroupNCCL.cpp:621] [Rank 1] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank1]:[E305 16:45:41.093969192 ProcessGroupNCCL.cpp:627] [Rank 1] To avoid data inconsistency, we are taking the entire process down.
[rank1]:[E305 16:45:41.095218020 ProcessGroupNCCL.cpp:1515] [PG 1 Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=3369, OpType=ALLTOALL, NumelIn=610560, NumelOut=610560, Timeout(ms)=600000) ran for 600065 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:609 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f623bceff86 in /opt/conda/envs/hunyuan/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f623cfec8d2 in /opt/conda/envs/hunyuan/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7f623cff3313 in /opt/conda/envs/hunyuan/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f623cff56fc in /opt/conda/envs/hunyuan/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xdbbf4 (0x7f628a79abf4 in /opt/conda/envs/hunyuan/bin/../lib/libstdc++.so.6)
frame #5: <unknown function> + 0x7ea5 (0x7f628bd94ea5 in /lib64/libpthread.so.0)
frame #6: clone + 0x6d (0x7f628babdb0d in /lib64/libc.so.6)

[rank3]:[E305 16:45:41.136804475 ProcessGroupNCCL.cpp:1709] [PG 1 Rank 3] Timeout at NCCL work: 3369, last enqueued NCCL work: 3369, last completed NCCL work: 3368.
[rank3]:[E305 16:45:41.136826927 ProcessGroupNCCL.cpp:621] [Rank 3] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank3]:[E305 16:45:41.136835310 ProcessGroupNCCL.cpp:627] [Rank 3] To avoid data inconsistency, we are taking the entire process down.
[rank3]:[E305 16:45:41.138044900 ProcessGroupNCCL.cpp:1515] [PG 1 Rank 3] Process group watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=3369, OpType=BROADCAST, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600090 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:609 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f326dbaff86 in /opt/conda/envs/hunyuan/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f326eeac8d2 in /opt/conda/envs/hunyuan/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7f326eeb3313 in /opt/conda/envs/hunyuan/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f326eeb56fc in /opt/conda/envs/hunyuan/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xdbbf4 (0x7f32bc65abf4 in /opt/conda/envs/hunyuan/bin/../lib/libstdc++.so.6)
frame #5: <unknown function> + 0x7ea5 (0x7f32bdc54ea5 in /lib64/libpthread.so.0)
frame #6: clone + 0x6d (0x7f32bd97db0d in /lib64/libc.so.6)

[rank2]:[E305 16:45:42.221061511 ProcessGroupNCCL.cpp:1709] [PG 1 Rank 2] Timeout at NCCL work: 3369, last enqueued NCCL work: 3372, last completed NCCL work: 3368.
[rank2]:[E305 16:45:42.221085006 ProcessGroupNCCL.cpp:621] [Rank 2] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank2]:[E305 16:45:42.221092038 ProcessGroupNCCL.cpp:627] [Rank 2] To avoid data inconsistency, we are taking the entire process down.
[rank2]:[E305 16:45:42.222303548 ProcessGroupNCCL.cpp:1515] [PG 1 Rank 2] Process group watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=3369, OpType=ALLTOALL, NumelIn=610560, NumelOut=610560, Timeout(ms)=600000) ran for 600097 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:609 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7fcec86e7f86 in /opt/conda/envs/hunyuan/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7fcec99e48d2 in /opt/conda/envs/hunyuan/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7fcec99eb313 in /opt/conda/envs/hunyuan/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fcec99ed6fc in /opt/conda/envs/hunyuan/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xdbbf4 (0x7fcf17192bf4 in /opt/conda/envs/hunyuan/bin/../lib/libstdc++.so.6)
frame #5: <unknown function> + 0x7ea5 (0x7fcf1878cea5 in /lib64/libpthread.so.0)
frame #6: clone + 0x6d (0x7fcf184b5b0d in /lib64/libc.so.6)

W0305 16:47:44.728043 140358333335360 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 60036 closing signal SIGTERM
W0305 16:47:44.728902 140358333335360 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 60037 closing signal SIGTERM
W0305 16:47:44.729504 140358333335360 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 60038 closing signal SIGTERM
E0305 16:47:54.182450 140358333335360 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: -6) local_rank: 0 (pid: 60035) of binary: /opt/conda/envs/hunyuan/bin/python
Traceback (most recent call last):
  File "/opt/conda/envs/hunyuan/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/opt/conda/envs/hunyuan/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
    return f(*args, **kwargs)
  File "/opt/conda/envs/hunyuan/lib/python3.10/site-packages/torch/distributed/run.py", line 901, in main
    run(args)
  File "/opt/conda/envs/hunyuan/lib/python3.10/site-packages/torch/distributed/run.py", line 892, in run
    elastic_launch(
  File "/opt/conda/envs/hunyuan/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/envs/hunyuan/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
fastvideo/train.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2025-03-05_16:47:44
  host      : xxx
  rank      : 0 (local_rank: 0)
  exitcode  : -6 (pid: 60035)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 60035
============================================================

Reproduction

None
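For reference, the per-rank failure buried in the log is the RuntimeError raised in get_sigmas (fastvideo/train.py, line 78): (schedule_timesteps == t).nonzero() comes back empty for some sampled timestep, so .item() fails with "a Tensor with 0 elements cannot be converted to Scalar". Once rank 3 dies there, the other ranks sit in their ALLTOALL/BROADCAST collectives until the 600000 ms NCCL watchdog timeout, which is why the job looks frozen. A minimal standalone sketch of that failure mode (the schedule values below are made up for illustration, not taken from the scheduler the script actually uses):

import torch

# Illustrative stand-in for the scheduler's timestep table that get_sigmas
# searches in fastvideo/train.py; the real values come from the training
# scheduler, these are only for demonstration.
schedule_timesteps = torch.linspace(1000.0, 1.0, steps=50)

# A sampled timestep that exists in the table resolves to an index cleanly.
t_present = schedule_timesteps[10]
print((schedule_timesteps == t_present).nonzero().item())  # 10

# A timestep that is NOT in the table (for example after a dtype/device
# mismatch, or a shift applied on one side only) matches nothing, so
# nonzero() is empty and .item() raises the exact error seen on rank 3.
t_missing = torch.tensor(123.456)
try:
    (schedule_timesteps == t_missing).nonzero().item()
except RuntimeError as e:
    print(e)  # a Tensor with 0 elements cannot be converted to Scalar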

@gobigrassland (Author)

The program freezes regardless of whether num_frames=129 or num_frames=125. (For how to set the value of num_frames, refer to #239.)

CFS_DIR=/export/App/training_platform/PinoModel/cbfs/xx/FastVideo
torchrun --nnodes 1 --nproc_per_node 4 --master_port 29903 \
    fastvideo/train.py \
    --seed 1024 \
    --pretrained_model_name_or_path ${CFS_DIR}/data/hunyuan_diffusers \
    --model_type hunyuan_hf \
    --cache_dir ${CFS_DIR}/data/.cache \
    --data_json_path ${CFS_DIR}/data/Image-Vid-Finetune-HunYuan/videos2caption.json \
    --validation_prompt_dir ${CFS_DIR}/data/Image-Vid-Finetune-HunYuan/validation \
    --gradient_checkpointing \
    --train_batch_size 1 \
    --num_latent_t 32 \
    --sp_size 4 \
    --train_sp_batch_size 1 \
    --dataloader_num_workers 4 \
    --gradient_accumulation_steps 4 \
    --max_train_steps 6000 \
    --learning_rate 8e-5 \
    --mixed_precision bf16 \
    --checkpointing_steps 500 \
    --validation_steps 100 \
    --validation_sampling_steps 50 \
    --checkpoints_total_limit 3 \
    --allow_tf32 \
    --ema_start_step 0 \
    --cfg 0.0 \
    --ema_decay 0.999 \
    --log_validation \
    --output_dir ${CFS_DIR}/checkpoints/Hunyuan-lora-finetuning-Black-Myth-Wukong \
    --tracker_project_name Hunyuan-lora-finetuning-Black-Myth-Wukong \
    --num_frames 129 \
    --validation_guidance_scale "1.0" \
    --shift 7 \
    --use_lora \
    --lora_rank 32 \
    --lora_alpha 32 
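As a debugging aside, the Timeout(ms)=600000 in the watchdog messages is the 10-minute NCCL collective timeout that was in effect for this run, so the job only aborts about ten minutes after the first rank has already failed. A minimal sketch of raising that timeout while investigating (this assumes the process group is created directly via torch.distributed.init_process_group; FastVideo may wrap the initialization differently, and a longer timeout only delays the watchdog abort rather than fixing the underlying error):

from datetime import timedelta

import torch.distributed as dist

# Hypothetical initialization call for illustration; adjust to wherever the
# training script actually creates its process group. A longer timeout gives
# the failing rank's traceback time to surface before the NCCL watchdog
# tears all ranks down.
dist.init_process_group(
    backend="nccl",
    timeout=timedelta(hours=1),
)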

@BrianChen1129 (Collaborator)

We will try to figure out what happened. Could you share your environment for running this script?

@gobigrassland (Author)

@BrianChen1129

  1. H100 80G
  2. pip packages:
absl-py==2.1.0
accelerate==1.0.1
aiofiles==23.2.1
aiohappyeyeballs==2.4.6
aiohttp==3.11.13
aiosignal==1.3.2
aniso8601==10.0.0
annotated-types==0.7.0
anyio==4.8.0
async-timeout==5.0.1
attrs==25.1.0
av==14.2.0
bitsandbytes==0.41.1
blessed==1.20.0
blinker==1.9.0
certifi==2025.1.31
charset-normalizer==3.4.1
click==8.1.8
codespell==2.3.0
decorator==4.4.2
decord==0.6.0
diffusers==0.32.0
docker-pycreds==0.4.0
einops==0.8.1
exceptiongroup==1.2.2
fastapi==0.115.11
-e git+https://github.com/hao-ai-lab/FastVideo.git@554ee17de54b95432edd4465a65e75d809b4564f#egg=fastvideo
ffmpy==0.5.0
filelock==3.17.0
flash_attn==2.7.0.post2
Flask==3.1.0
Flask-RESTful==0.3.10
frozenlist==1.5.0
fsspec==2025.2.0
future==1.0.0
gitdb==4.0.12
GitPython==3.1.44
gpustat==1.1.1
gradio==5.3.0
gradio_client==1.4.2
grpcio==1.70.0
h11==0.14.0
h5py==3.12.1
httpcore==1.0.7
httpx==0.28.1
huggingface-hub==0.26.1
idna==3.6
imageio==2.36.0
imageio-ffmpeg==0.5.1
importlib_metadata==8.6.1
isort==5.13.2
itsdangerous==2.2.0
Jinja2==3.1.5
liger_kernel==0.5.4
loguru==0.7.3
Markdown==3.7
markdown-it-py==3.0.0
MarkupSafe==2.1.5
mdurl==0.1.2
moviepy==1.0.3
mpmath==1.3.0
multidict==6.1.0
mypy==1.11.1
mypy-extensions==1.0.0
networkx==3.4.2
ninja==1.11.1.3
numpy==1.26.4
nvidia-cublas-cu12==12.1.3.1
nvidia-cuda-cupti-cu12==12.1.105
nvidia-cuda-nvrtc-cu12==12.1.105
nvidia-cuda-runtime-cu12==12.1.105
nvidia-cudnn-cu12==9.1.0.70
PyYAML==6.0.1
regex==2024.11.6
requests==2.32.3
rich==13.9.4
ruff==0.6.5
safetensors==0.5.3
scipy==1.14.1
semantic-version==2.10.0
sentencepiece==0.2.0
sentry-sdk==2.22.0
setproctitle==1.3.5
shellingham==1.5.4
six==1.16.0
smmap==5.0.2
sniffio==1.3.1
sphinx-lint==1.0.0
starlette==0.46.0
sympy==1.13.1
tensorboard==2.19.0
tensorboard-data-server==0.7.2
test_tube==0.7.5
timm==1.0.11
tokenizers==0.20.1
toml==0.10.2
tomli==2.0.2
tomlkit==0.12.0
torch==2.4.0
torchvision==0.19.0
tqdm==4.66.5
transformers==4.46.1
triton==3.0.0
typer==0.15.2
types-PyYAML==6.0.12.20241230
types-requests==2.32.0.20250301
types-setuptools==75.8.2.20250301
typing_extensions==4.12.2
tzdata==2025.1
urllib3==2.3.0
uvicorn==0.34.0
wandb==0.18.5
watch==0.2.7
wcwidth==0.2.13
websockets==12.0
Werkzeug==3.1.3
yapf==0.32.0
yarl==1.18.3
zipp==3.21.0

@yinian-lw

same problem
