Out of Memory Error for cogvideox lora finetuning #273

Open

smktech9 opened this issue Mar 1, 2025 · 3 comments

smktech9 commented Mar 1, 2025

Even after using 2 GPUs with 32 GB of VRAM each, I am still getting an out-of-memory error. It should run on a 24 GB VRAM GPU. If anyone has run LoRA finetuning under 24 GB of VRAM, please help. I am attaching a screenshot of the error. I have already tried all the optimizations mentioned in the repo.

[screenshot of the out-of-memory error]

a-r-r-o-w (Owner) commented

  • What PyTorch version are you using?
  • Is gradient checkpointing enabled?
  • What size images/videos are you training with? And are they specified correctly with the image/video bucket parameters?

We've had many folks report successful training of CogVideoX in under 16-24 GB, so I don't believe the issue is on our end. To help debug, I would first try training with a single video and overfitting on it, to make sure the training runs at all.
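A minimal sketch of one way to check this: PyTorch's built-in peak-memory counters can be reset before a single training step and read afterwards, which shows whether the forward pass, backward pass, or optimizer step is where the allocation blows up. The tiny linear model below is only a stand-in for the real transformer and dataloader; the two torch.cuda calls are the relevant part and can be dropped around a real training step.

import torch
import torch.nn as nn

# Stand-in model and batch; in the real run this would be the CogVideoX
# transformer and a batch produced by the finetrainers dataloader.
model = nn.Linear(4096, 4096).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)
batch = torch.randn(8, 4096, device="cuda")

torch.cuda.reset_peak_memory_stats()   # reset the peak counter before the step

loss = model(batch).pow(2).mean()      # dummy forward pass + loss
loss.backward()                        # backward is typically where OOM occurs
optimizer.step()
optimizer.zero_grad()

peak_gib = torch.cuda.max_memory_allocated() / 1024**3
print(f"peak allocated during this step: {peak_gib:.2f} GiB")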

smktech9 commented Mar 2, 2025

#!/bin/bash
export WANDB_MODE="offline"
export NCCL_P2P_DISABLE=1
export TORCH_NCCL_ENABLE_MONITORING=0
export FINETRAINERS_LOG_LEVEL=DEBUG

GPU_IDS="0"

DATA_ROOT="data"
CAPTION_COLUMN="prompt.txt"
VIDEO_COLUMN="videos.txt"
OUTPUT_DIR="output"
ID_TOKEN="BW_STYLE"

# Model arguments

model_cmd="--model_name cogvideox
--pretrained_model_name_or_path THUDM/CogVideoX-5b"

# Dataset arguments

dataset_cmd="--data_root $DATA_ROOT
--video_column $VIDEO_COLUMN
--caption_column $CAPTION_COLUMN
--id_token $ID_TOKEN
--video_resolution_buckets 49x240x360
--caption_dropout_p 0.05"

# Dataloader arguments

dataloader_cmd="--dataloader_num_workers 4"

# Training arguments

training_cmd="--training_type lora
--seed 42
--batch_size 1
--precompute_conditions
--train_steps 1
--rank 64
--lora_alpha 64
--target_modules to_q to_k to_v to_out.0
--gradient_accumulation_steps 1
--gradient_checkpointing
--checkpointing_steps 1
--checkpointing_limit 1
--resume_from_checkpoint=latest
--enable_slicing
--enable_tiling
--layerwise_upcasting_modules transformer
--transformer_dtype bf16"

# Optimizer arguments

optimizer_cmd="--optimizer adamw
--use_8bit_bnb
--lr 3e-5
--lr_scheduler constant_with_warmup
--lr_warmup_steps 100
--lr_num_cycles 1
--beta1 0.9
--beta2 0.95
--weight_decay 1e-4
--epsilon 1e-8
--max_grad_norm 1.0"

# Miscellaneous arguments

miscellaneous_cmd="--tracker_name finetrainers-cog
--output_dir $OUTPUT_DIR
--nccl_timeout 1800
--report_to wandb"

cmd="accelerate launch --config_file accelerate_configs/deepspeed.yaml --gpu_ids $GPU_IDS train.py
$model_cmd
$dataset_cmd
$dataloader_cmd
$training_cmd
$optimizer_cmd
$miscellaneous_cmd"

echo "Running command: $cmd"
eval $cmd
echo -ne "-------------------- Finished executing script --------------------\n\n"

deepspeed.yaml:
compute_environment: LOCAL_MACHINE
debug: false
deepspeed_config:
  gradient_accumulation_steps: 1
  gradient_clipping: 1.0
  offload_optimizer_device: cpu
  offload_param_device: cpu
  zero3_init_flag: false
  zero_stage: 2
distributed_type: DEEPSPEED
downcast_bf16: 'no'
enable_cpu_affinity: false
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 1
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

Is there something I should change in these? I am only using a single video.

smktech9 commented Mar 2, 2025

I am using PyTorch version 2.6.0+cu124.
My GPU consumption for LTX-Video is also 24 GB to complete training, while it should be under 8 GB. What is increasing my consumption?
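As a rough back-of-the-envelope sketch, assuming the CogVideoX-5b transformer holds roughly 5 billion parameters stored in bf16 (as requested by --transformer_dtype bf16), the frozen weights alone take close to 10 GiB before activations, LoRA gradients, and optimizer state are counted:

# Rough estimate of transformer weight memory only; the VAE, text encoder,
# activations, LoRA gradients, and optimizer state all add on top of this.
params = 5e9          # approximate parameter count of the CogVideoX-5b transformer
bytes_per_param = 2   # bf16
print(f"~{params * bytes_per_param / 1024**3:.1f} GiB for transformer weights alone")
# prints roughly 9.3 GiB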
