Commit: update docs

a-r-r-o-w committed Mar 5, 2025
1 parent b8d9a52 commit d27f1ec
Showing 4 changed files with 31 additions and 439 deletions.
141 changes: 7 additions & 134 deletions docs/models/cogvideox.md
@@ -4,144 +4,17 @@

For LoRA training, specify `--training_type lora`. For full finetuning, specify `--training_type full-finetune`.

```bash
#!/bin/bash
export WANDB_MODE="offline"
export NCCL_P2P_DISABLE=1
export TORCH_NCCL_ENABLE_MONITORING=0
export FINETRAINERS_LOG_LEVEL=DEBUG

GPU_IDS="0,1"

DATA_ROOT="/path/to/dataset"
CAPTION_COLUMN="prompt.txt"
VIDEO_COLUMN="videos.txt"
OUTPUT_DIR="/path/to/models/cog/"
ID_TOKEN="BW_STYLE"

# Model arguments
model_cmd="--model_name cogvideox \
--pretrained_model_name_or_path THUDM/CogVideoX-5b"

# Dataset arguments
dataset_cmd="--data_root $DATA_ROOT \
--video_column $VIDEO_COLUMN \
--caption_column $CAPTION_COLUMN \
--id_token $ID_TOKEN \
--video_resolution_buckets 49x480x720 \
--caption_dropout_p 0.05"

# Dataloader arguments
dataloader_cmd="--dataloader_num_workers 4"

# Training arguments
training_cmd="--training_type lora \
--seed 42 \
--batch_size 1 \
--precompute_conditions \
--train_steps 1000 \
--rank 128 \
--lora_alpha 128 \
--target_modules to_q to_k to_v to_out.0 \
--gradient_accumulation_steps 1 \
--gradient_checkpointing \
--checkpointing_steps 200 \
--checkpointing_limit 2 \
--resume_from_checkpoint=latest \
--enable_slicing \
--enable_tiling"

# Optimizer arguments
optimizer_cmd="--optimizer adamw \
--use_8bit_bnb \
--lr 3e-5 \
--lr_scheduler constant_with_warmup \
--lr_warmup_steps 100 \
--lr_num_cycles 1 \
--beta1 0.9 \
--beta2 0.95 \
--weight_decay 1e-4 \
--epsilon 1e-8 \
--max_grad_norm 1.0"

# Miscellaneous arguments
miscellaneous_cmd="--tracker_name finetrainers-cog \
--output_dir $OUTPUT_DIR \
--nccl_timeout 1800 \
--report_to wandb"

cmd="accelerate launch --config_file accelerate_configs/deepspeed.yaml --gpu_ids $GPU_IDS train.py \
$model_cmd \
$dataset_cmd \
$dataloader_cmd \
$training_cmd \
$optimizer_cmd \
$miscellaneous_cmd"

echo "Running command: $cmd"
eval $cmd
echo -ne "-------------------- Finished executing script --------------------\n\n"
```
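
The script above uses `--training_type lora`. Below is a minimal sketch of the full-finetuning variant, assuming the training arguments otherwise stay the same and that the LoRA-specific flags (`--rank`, `--lora_alpha`, `--target_modules`) are dropped; verify against the current CLI:

```bash
# Sketch only: full-finetuning variant of the training arguments above.
# Assumption: LoRA-specific flags (--rank, --lora_alpha, --target_modules) do not apply here.
training_cmd="--training_type full-finetune \
  --seed 42 \
  --batch_size 1 \
  --precompute_conditions \
  --train_steps 1000 \
  --gradient_accumulation_steps 1 \
  --gradient_checkpointing \
  --checkpointing_steps 200 \
  --checkpointing_limit 2 \
  --resume_from_checkpoint=latest \
  --enable_slicing \
  --enable_tiling"
```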

## Memory Usage

### LoRA
Examples available:
- [PIKA crush effect](../../examples/training/sft/cogvideox/crush_smol_lora/)

<!-- TODO(aryan): Update these numbers for 49x512x768 -->
To run an example, run the following from the root directory of the repository (assuming you have installed the requirements and are using Linux/WSL):

> [!NOTE]
>
> The measurements below are done in `torch.bfloat16` precision. Memory usage can be reduced further by passing `--layerwise_upcasting_modules transformer` to the training script. This will cast the model weights to `torch.float8_e4m3fn` or `torch.float8_e5m2`, which halves the memory requirement for model weights. Computation is performed in the dtype set by `--transformer_dtype` (which defaults to `bf16`).
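
As a minimal sketch of the note above (where the flags are appended is an assumption; the values are the ones mentioned in the note):

```bash
# Sketch: store transformer weights in fp8 while computing in bf16.
# Assumption: appended to the training_cmd defined in the script above.
training_cmd="$training_cmd \
  --layerwise_upcasting_modules transformer \
  --transformer_dtype bf16"
```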

LoRA with rank 128, batch size 1, gradient checkpointing, optimizer adamw, `49x480x720` resolutions, **with precomputation**:

```
Training configuration: {
"trainable parameters": 132120576,
"total samples": 69,
"train epochs": 1,
"train steps": 10,
"batches per device": 1,
"total batches observed per epoch": 69,
"train batch size": 1,
"gradient accumulation steps": 1
}
```

| stage | memory_allocated (GB) | max_memory_reserved (GB) |
|:-----------------------------:|:---------------------:|:------------------------:|
| after precomputing conditions | 8.880 | 8.941 |
| after precomputing latents | 9.300 | 12.441 |
| before training start | 10.622 | 20.701 |
| after epoch 1 | 11.145 | 20.701 |
| before validation start | 11.145 | 20.702 |
| after validation end | 11.145 | 28.324 |
| after training end | 11.144 | 11.592 |

### Full finetuning

```
Training configuration: {
"trainable parameters": 5570283072,
"total samples": 1,
"train epochs": 2,
"train steps": 2,
"batches per device": 1,
"total batches observed per epoch": 1,
"train batch size": 1,
"gradient accumulation steps": 1
}
```

```bash
chmod +x ./examples/training/sft/cogvideox/crush_smol_lora/train.sh
./examples/training/sft/cogvideox/crush_smol_lora/train.sh
```

| stage | memory_allocated (GB) | max_memory_reserved (GB) |
|:-----------------------------:|:---------------------:|:------------------------:|
| after precomputing conditions | 8.880 | 8.941 |
| after precomputing latents | 9.300 | 12.441 |
| before training start | 10.376 | 10.387 |
| after epoch 1 | 31.160 | 52.939 |
| before validation start | 31.161 | 52.939 |
| after validation end | 31.161 | 52.939 |
| after training end | 31.160 | 34.295 |
On Windows, you will need to adapt the script to a compatible format before running it. [TODO(aryan): improve instructions for Windows]

## Supported checkpoints

148 changes: 7 additions & 141 deletions docs/models/hunyuan_video.md
@@ -4,151 +4,17 @@

For LoRA training, specify `--training_type lora`. For full finetuning, specify `--training_type full-finetune`.

```bash
#!/bin/bash

export WANDB_MODE="offline"
export NCCL_P2P_DISABLE=1
export TORCH_NCCL_ENABLE_MONITORING=0
export FINETRAINERS_LOG_LEVEL=DEBUG

GPU_IDS="0,1"

DATA_ROOT="/path/to/dataset"
CAPTION_COLUMN="prompts.txt"
VIDEO_COLUMN="videos.txt"
OUTPUT_DIR="/path/to/models/hunyuan-video/"

ID_TOKEN="afkx"

# Model arguments
model_cmd="--model_name hunyuan_video \
--pretrained_model_name_or_path hunyuanvideo-community/HunyuanVideo"

# Dataset arguments
dataset_cmd="--data_root $DATA_ROOT \
--video_column $VIDEO_COLUMN \
--caption_column $CAPTION_COLUMN \
--id_token $ID_TOKEN \
--video_resolution_buckets 17x512x768 49x512x768 61x512x768 \
--caption_dropout_p 0.05"

# Dataloader arguments
dataloader_cmd="--dataloader_num_workers 0"

# Diffusion arguments
diffusion_cmd=""

# Training arguments
training_cmd="--training_type lora \
--seed 42 \
--batch_size 1 \
--train_steps 500 \
--rank 128 \
--lora_alpha 128 \
--target_modules to_q to_k to_v to_out.0 \
--gradient_accumulation_steps 1 \
--gradient_checkpointing \
--checkpointing_steps 500 \
--checkpointing_limit 2 \
--enable_slicing \
--enable_tiling"

# Optimizer arguments
optimizer_cmd="--optimizer adamw \
--lr 2e-5 \
--lr_scheduler constant_with_warmup \
--lr_warmup_steps 100 \
--lr_num_cycles 1 \
--beta1 0.9 \
--beta2 0.95 \
--weight_decay 1e-4 \
--epsilon 1e-8 \
--max_grad_norm 1.0"

# Miscellaneous arguments
miscellaneous_cmd="--tracker_name finetrainers-hunyuan-video \
--output_dir $OUTPUT_DIR \
--nccl_timeout 1800 \
--report_to wandb"

cmd="accelerate launch --config_file accelerate_configs/uncompiled_8.yaml --gpu_ids $GPU_IDS train.py \
$model_cmd \
$dataset_cmd \
$dataloader_cmd \
$diffusion_cmd \
$training_cmd \
$optimizer_cmd \
$miscellaneous_cmd"

echo "Running command: $cmd"
eval $cmd
echo -ne "-------------------- Finished executing script --------------------\n\n"
```
Examples available:
- [PIKA Dissolve effect](../../examples/training/sft/hunyuan_video/modal_labs_dissolve/)

## Memory Usage
To run an example, run the following from the root directory of the repository (assuming you have installed the requirements and are using Linux/WSL):

### LoRA

> [!NOTE]
>
> The measurements below are done in `torch.bfloat16` precision. Memory usage can be reduced further by passing `--layerwise_upcasting_modules transformer` to the training script. This will cast the model weights to `torch.float8_e4m3fn` or `torch.float8_e5m2`, which halves the memory requirement for model weights. Computation is performed in the dtype set by `--transformer_dtype` (which defaults to `bf16`).

LoRA with rank 128, batch size 1, gradient checkpointing, optimizer adamw, `49x512x768` resolutions, **without precomputation**:

```
Training configuration: {
"trainable parameters": 163577856,
"total samples": 69,
"train epochs": 1,
"train steps": 10,
"batches per device": 1,
"total batches observed per epoch": 69,
"train batch size": 1,
"gradient accumulation steps": 1
}
```

| stage | memory_allocated (GB) | max_memory_reserved (GB) |
|:-----------------------:|:---------------------:|:------------------------:|
| before training start | 38.889 | 39.020 |
| before validation start | 39.747 | 56.266 |
| after validation end | 39.748 | 58.385 |
| after epoch 1 | 39.748 | 40.910 |
| after training end | 25.288 | 40.910 |

Note: requires about `59` GB of VRAM when validation is performed.
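
The "with precomputation" numbers below presumably correspond to enabling the `--precompute_conditions` flag used in the CogVideoX script earlier in this commit; a minimal sketch, assuming the flag applies unchanged to HunyuanVideo:

```bash
# Sketch: precompute text conditions and latents before training starts.
# Assumption: the same --precompute_conditions flag from the CogVideoX script applies here.
training_cmd="$training_cmd \
  --precompute_conditions"
```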

LoRA with rank 128, batch size 1, gradient checkpointing, optimizer adamw, `49x512x768` resolutions, **with precomputation**:

```
Training configuration: {
"trainable parameters": 163577856,
"total samples": 1,
"train epochs": 10,
"train steps": 10,
"batches per device": 1,
"total batches observed per epoch": 1,
"train batch size": 1,
"gradient accumulation steps": 1
}
```

```bash
chmod +x ./examples/training/sft/hunyuan_video/modal_labs_dissolve/train.sh
./examples/training/sft/hunyuan_video/modal_labs_dissolve/train.sh
```

| stage | memory_allocated (GB) | max_memory_reserved (GB) |
|:-----------------------------:|:---------------------:|:------------------------:|
| after precomputing conditions | 14.232 | 14.461 |
| after precomputing latents | 14.717 | 17.244 |
| before training start | 24.195 | 26.039 |
| after epoch 1 | 24.830 | 42.387 |
| before validation start | 24.842 | 42.387 |
| after validation end | 39.558 | 46.947 |
| after training end | 24.842 | 41.039 |

Note: requires about `47` GB of VRAM with validation. If validation is not performed, the memory usage is reduced to about `42` GB.

### Full finetuning

Currently, full finetuning is not supported for HunyuanVideo. It goes out of memory (OOM) for `49x512x768` resolutions.
On Windows, you will need to adapt the script to a compatible format before running it. [TODO(aryan): improve instructions for Windows]

## Inference

(The remaining 2 changed files in this commit are not shown.)
