Out of Memory Error for cogvideox lora finetuning #273

Open

smktech9 opened this issue Mar 1, 2025 · 3 comments

smktech9 commented Mar 1, 2025

Even after using 2 GPUs with 32 GB of VRAM each, I am still getting an out-of-memory error. It should run on a 24 GB VRAM GPU. If anyone has run LoRA finetuning under 24 GB of VRAM, please help. I am attaching a screenshot of the error. I have already tried all the optimizations mentioned in the repo.

[screenshot of the out-of-memory error]

a-r-r-o-w (Owner) commented

  • What PyTorch version are you using?
  • Is gradient checkpointing enabled?
  • What size images/videos are you training with? And are they specified correctly with the image/video bucket parameters?

We've had many folks report successful training of CogVideoX in under 16-24 GB, so I don't believe the issue is on our end. To help debug, I would first try training with a single video and overfitting on it, to make sure the training runs at all.
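A minimal sketch of one way to check this: PyTorch's built-in peak-memory counters can be reset before a single training step and read afterwards, which shows whether the forward pass, backward pass, or optimizer step is where the allocation blows up. The tiny linear model below is only a stand-in for the real transformer and dataloader; the two torch.cuda calls are the relevant part and can be dropped around a real training step.

import torch
import torch.nn as nn

# Stand-in model and batch; in the real run this would be the CogVideoX
# transformer and a batch produced by the finetrainers dataloader.
model = nn.Linear(4096, 4096).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)
batch = torch.randn(8, 4096, device="cuda")

torch.cuda.reset_peak_memory_stats()   # reset the peak counter before the step

loss = model(batch).pow(2).mean()      # dummy forward pass + loss
loss.backward()                        # backward is typically where OOM occurs
optimizer.step()
optimizer.zero_grad()

peak_gib = torch.cuda.max_memory_allocated() / 1024**3
print(f"peak allocated during this step: {peak_gib:.2f} GiB")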

smktech9 commented Mar 2, 2025

#!/bin/bash
export WANDB_MODE="offline"
export NCCL_P2P_DISABLE=1
export TORCH_NCCL_ENABLE_MONITORING=0
export FINETRAINERS_LOG_LEVEL=DEBUG

GPU_IDS="0"

DATA_ROOT="data"
CAPTION_COLUMN="prompt.txt"
VIDEO_COLUMN="videos.txt"
OUTPUT_DIR="output"
ID_TOKEN="BW_STYLE"

# Model arguments

model_cmd="--model_name cogvideox
--pretrained_model_name_or_path THUDM/CogVideoX-5b"

# Dataset arguments

dataset_cmd="--data_root $DATA_ROOT
--video_column $VIDEO_COLUMN
--caption_column $CAPTION_COLUMN
--id_token $ID_TOKEN
--video_resolution_buckets 49x240x360
--caption_dropout_p 0.05"

# Dataloader arguments

dataloader_cmd="--dataloader_num_workers 4"

# Training arguments

training_cmd="--training_type lora
--seed 42
--batch_size 1
--precompute_conditions
--train_steps 1
--rank 64
--lora_alpha 64
--target_modules to_q to_k to_v to_out.0
--gradient_accumulation_steps 1
--gradient_checkpointing
--checkpointing_steps 1
--checkpointing_limit 1
--resume_from_checkpoint=latest
--enable_slicing
--enable_tiling
--layerwise_upcasting_modules transformer
--transformer_dtype bf16"

# Optimizer arguments

optimizer_cmd="--optimizer adamw
--use_8bit_bnb
--lr 3e-5
--lr_scheduler constant_with_warmup
--lr_warmup_steps 100
--lr_num_cycles 1
--beta1 0.9
--beta2 0.95
--weight_decay 1e-4
--epsilon 1e-8
--max_grad_norm 1.0"

# Miscellaneous arguments

miscellaneous_cmd="--tracker_name finetrainers-cog
--output_dir $OUTPUT_DIR
--nccl_timeout 1800
--report_to wandb"

cmd="accelerate launch --config_file accelerate_configs/deepspeed.yaml --gpu_ids $GPU_IDS train.py
$model_cmd
$dataset_cmd
$dataloader_cmd
$training_cmd
$optimizer_cmd
$miscellaneous_cmd"

echo "Running command: $cmd"
eval $cmd
echo -ne "-------------------- Finished executing script --------------------\n\n"

deepspeed.yaml:
compute_environment: LOCAL_MACHINE
debug: false
deepspeed_config:
  gradient_accumulation_steps: 1
  gradient_clipping: 1.0
  offload_optimizer_device: cpu
  offload_param_device: cpu
  zero3_init_flag: false
  zero_stage: 2
distributed_type: DEEPSPEED
downcast_bf16: 'no'
enable_cpu_affinity: false
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 1
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

Is there something I should change in these? I am only using a single video.

smktech9 commented Mar 2, 2025

I am using PyTorch version 2.6.0+cu124.
My GPU consumption for LTX-Video is also 24 GB to complete training, while it should be under 8 GB. What is increasing my consumption?
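As a rough back-of-the-envelope sketch, assuming the CogVideoX-5b transformer holds roughly 5 billion parameters stored in bf16 (as requested by --transformer_dtype bf16), the frozen weights alone take close to 10 GiB before activations, LoRA gradients, and optimizer state are counted:

# Rough estimate of transformer weight memory only; the VAE, text encoder,
# activations, LoRA gradients, and optimizer state all add on top of this.
params = 5e9          # approximate parameter count of the CogVideoX-5b transformer
bytes_per_param = 2   # bf16
print(f"~{params * bytes_per_param / 1024**3:.1f} GiB for transformer weights alone")
# prints roughly 9.3 GiB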
