
CogVideoX lora finetuning error #269

Open · 1 of 2 tasks
B-Soul opened this issue Feb 24, 2025 · 5 comments

@B-Soul commented Feb 24, 2025

System Info / 系統信息

I am using my own script to fine-tune CogVideoX-2B on an image dataset, but it reports a shape error.

File "/home/gaorundong/anaconda3/envs/finetrainers/lib/python3.10/site-packages/diffusers/models/transformers/cogvideox_transformer_3d.py", line 476, in forward
image_embeds = self.proj(image_embeds)
File "/home/gaorundong/anaconda3/envs/finetrainers/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/gaorundong/anaconda3/envs/finetrainers/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
File "/home/gaorundong/anaconda3/envs/finetrainers/lib/python3.10/site-packages/torch/nn/modules/conv.py", line 458, in forward
return self._conv_forward(input, self.weight, self.bias)
File "/home/gaorundong/anaconda3/envs/finetrainers/lib/python3.10/site-packages/torch/nn/modules/conv.py", line 454, in _conv_forward
return F.conv2d(input, weight, bias, self.stride,
RuntimeError: Given groups=1, weight of size [1920, 16, 2, 2], expected input[16, 1, 60, 90] to have 16 channels, but got 1 channels instead
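
For context on the numbers in the error: the weight of size [1920, 16, 2, 2] belongs to the patch-embedding Conv2d, which expects 16 latent channels, while the input [16, 1, 60, 90] carries only 1. Below is a minimal sketch reproducing the mismatch using the shapes from the traceback; the reading that the channel and frame axes of a [1, 16, 1, 60, 90] latent were swapped before flattening is an assumption, not confirmed:

import torch

# Patch-embedding conv matching the weight shape in the traceback:
# weight [1920, 16, 2, 2] -> Conv2d(in=16, out=1920, kernel=2, stride=2).
proj = torch.nn.Conv2d(16, 1920, kernel_size=2, stride=2)

# Expected layout (batch * frames, channels=16, height, width) works:
proj(torch.randn(1, 16, 60, 90))   # -> [1, 1920, 30, 45]

# Layout from the error message: 16 where batch*frames should be, and a
# single channel, i.e. channels and frames look swapped. This raises the
# same RuntimeError as above.
proj(torch.randn(16, 1, 60, 90))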

Information / 问题信息

  • [ ] The official example scripts / 官方的示例脚本
  • [x] My own modified scripts / 我自己修改的脚本和任务

Reproduction / 复现过程

#!/bin/bash
export WANDB_MODE="offline"
export NCCL_P2P_DISABLE=1
export TORCH_NCCL_ENABLE_MONITORING=0
export FINETRAINERS_LOG_LEVEL=DEBUG

GPU_IDS="0"

DATA_ROOT="/home/gaorundong/MoveIt/data/flux-retrostyle-dataset-mini"
CAPTION_COLUMN="prompts.txt"
VIDEO_COLUMN="images.txt"
OUTPUT_DIR="/home/gaorundong/MoveIt/output/CogVideo-2B-lora/"

# Model arguments

model_cmd="--model_name cogvideox
--pretrained_model_name_or_path /home/lishicheng/ckpts/THUDM/CogVideoX-2B"

# Dataset arguments

dataset_cmd="--data_root $DATA_ROOT
--video_column $VIDEO_COLUMN
--caption_column $CAPTION_COLUMN
--video_resolution_buckets 1x480x720
--caption_dropout_p 0.05"

# Dataloader arguments

dataloader_cmd="--dataloader_num_workers 4"

# Training arguments

training_cmd="--training_type lora
--seed 42
--batch_size 1
--train_steps 1000
--rank 128
--lora_alpha 128
--target_modules to_q to_k to_v
--gradient_accumulation_steps 1
--gradient_checkpointing
--checkpointing_steps 200
--checkpointing_limit 2
--resume_from_checkpoint=latest
--enable_slicing
--enable_tiling"

# Optimizer arguments

optimizer_cmd="--optimizer adamw
--use_8bit_bnb
--lr 3e-5
--lr_scheduler constant_with_warmup
--lr_warmup_steps 100
--lr_num_cycles 1
--beta1 0.9
--beta2 0.95
--weight_decay 0.0
--epsilon 1e-8
--max_grad_norm 1.0"

# Miscellaneous arguments

miscellaneous_cmd="--tracker_name finetrainers-cogvideox-2b
--output_dir $OUTPUT_DIR
--nccl_timeout 1800
--report_to wandb"

cmd="accelerate launch --config_file accelerate_configs/uncompiled_1.yaml --gpu_ids $GPU_IDS train.py
$model_cmd
$dataset_cmd
$dataloader_cmd
$training_cmd
$optimizer_cmd
$miscellaneous_cmd"

echo "Running command: $cmd"
eval $cmd
echo -ne "-------------------- Finished executing script --------------------\n\n"
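
One way to narrow this down is to check the latent layout the VAE produces for a single 480x720 frame before it reaches the transformer. A sketch, assuming the diffusers AutoencoderKLCogVideoX API and the (batch, channels, frames, height, width) video layout; a healthy latent should have 16 channels:

import torch
from diffusers import AutoencoderKLCogVideoX

vae = AutoencoderKLCogVideoX.from_pretrained(
    "THUDM/CogVideoX-2b", subfolder="vae", torch_dtype=torch.float16
).to("cuda")

# One 480x720 frame, as selected by the 1x480x720 resolution bucket.
frame = torch.randn(1, 3, 1, 480, 720, dtype=torch.float16, device="cuda")
with torch.no_grad():
    latents = vae.encode(frame).latent_dist.sample()

# Expected: [1, 16, 1, 60, 90] -> 16 channels, 1 frame, 8x spatial downscale.
print(latents.shape)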

Expected behavior / 期待表现

Training finishes successfully.

@a-r-r-o-w (Owner) commented

Just trying to rule out an environment-related error: are you able to run inference with CogVideoX-2b without errors?

import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video

pipe = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-2b", torch_dtype=torch.float16).to("cuda")
prompt = (
    "A panda, dressed in a small, red jacket and a tiny hat, sits on a wooden stool in a serene bamboo forest. "
    "The panda's fluffy paws strum a miniature acoustic guitar, producing soft, melodic tunes. Nearby, a few other "
    "pandas gather, watching curiously and some clapping in rhythm. Sunlight filters through the tall bamboo, "
    "casting a gentle glow on the scene. The panda's face is expressive, showing concentration and joy as it plays. "
    "The background includes a small, flowing stream and vibrant green foliage, enhancing the peaceful and magical "
    "atmosphere of this unique musical performance."
)
video = pipe(prompt=prompt, guidance_scale=6, num_inference_steps=50).frames[0]
export_to_video(video, "output.mp4", fps=8)

@B-Soul (Author) commented Feb 24, 2025

Yes, I can run inference with CogVideoX-2B successfully. I can also train LTXVideo without issues.

@a-r-r-o-w (Owner) commented

Thanks for the confirmation! I will take a look and try to debug Cog-2B. IIRC, we did not test it when adding support, and only checked Cog-5B and Cog-1.5-5B.

@FlyingCan commented

An error occurred during training when fine-tuning the CogVideoX T2V models (2B and 5B): 'FrozenDict' object has no attribute 'invert_scale_latents'. It is raised at this code:

if not vae.config.invert_scale_latents:
    latents = latents * vae.config.scaling_factor
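
Until a fix lands, a guard along these lines avoids the AttributeError (a sketch of a possible workaround, not the actual patch; presumably the 2B VAE config simply predates the invert_scale_latents key):

# Older CogVideoX VAE configs may not define `invert_scale_latents`,
# so fall back to False instead of a bare attribute access.
if not getattr(vae.config, "invert_scale_latents", False):
    latents = latents * vae.config.scaling_factor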

@a-r-r-o-w (Owner) commented

@FlyingCan Fixed in #309
