update docs #294

Merged · 1 commit · Mar 5, 2025

141 changes: 7 additions & 134 deletions docs/models/cogvideox.md
@@ -4,144 +4,17 @@

For LoRA training, specify `--training_type lora`. For full finetuning, specify `--training_type full-finetune`.

```bash
#!/bin/bash
export WANDB_MODE="offline"
export NCCL_P2P_DISABLE=1
export TORCH_NCCL_ENABLE_MONITORING=0
export FINETRAINERS_LOG_LEVEL=DEBUG

GPU_IDS="0,1"

DATA_ROOT="/path/to/dataset"
CAPTION_COLUMN="prompt.txt"
VIDEO_COLUMN="videos.txt"
OUTPUT_DIR="/path/to/models/cog/"
ID_TOKEN="BW_STYLE"

# Model arguments
model_cmd="--model_name cogvideox \
--pretrained_model_name_or_path THUDM/CogVideoX-5b"

# Dataset arguments
dataset_cmd="--data_root $DATA_ROOT \
--video_column $VIDEO_COLUMN \
--caption_column $CAPTION_COLUMN \
--id_token $ID_TOKEN \
--video_resolution_buckets 49x480x720 \
--caption_dropout_p 0.05"

# Dataloader arguments
dataloader_cmd="--dataloader_num_workers 4"

# Training arguments
training_cmd="--training_type lora \
--seed 42 \
--batch_size 1 \
--precompute_conditions \
--train_steps 1000 \
--rank 128 \
--lora_alpha 128 \
--target_modules to_q to_k to_v to_out.0 \
--gradient_accumulation_steps 1 \
--gradient_checkpointing \
--checkpointing_steps 200 \
--checkpointing_limit 2 \
--resume_from_checkpoint=latest \
--enable_slicing \
--enable_tiling"

# Optimizer arguments
optimizer_cmd="--optimizer adamw \
--use_8bit_bnb \
--lr 3e-5 \
--lr_scheduler constant_with_warmup \
--lr_warmup_steps 100 \
--lr_num_cycles 1 \
--beta1 0.9 \
--beta2 0.95 \
--weight_decay 1e-4 \
--epsilon 1e-8 \
--max_grad_norm 1.0"

# Miscellaneous arguments
miscellaneous_cmd="--tracker_name finetrainers-cog \
--output_dir $OUTPUT_DIR \
--nccl_timeout 1800 \
--report_to wandb"

cmd="accelerate launch --config_file accelerate_configs/deepspeed.yaml --gpu_ids $GPU_IDS train.py \
$model_cmd \
$dataset_cmd \
$dataloader_cmd \
$training_cmd \
$optimizer_cmd \
$miscellaneous_cmd"

echo "Running command: $cmd"
eval $cmd
echo -ne "-------------------- Finished executing script --------------------\n\n"
```
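
For full finetuning, the same launcher can be reused with the training type switched. The sketch below is an untested adaptation of the script above; it assumes the LoRA-specific flags (`--rank`, `--lora_alpha`, `--target_modules`) are simply dropped.

```bash
# Untested sketch: swap the LoRA training arguments above for full finetuning.
# LoRA-specific flags (--rank, --lora_alpha, --target_modules) are removed.
training_cmd="--training_type full-finetune \
  --seed 42 \
  --batch_size 1 \
  --precompute_conditions \
  --train_steps 1000 \
  --gradient_accumulation_steps 1 \
  --gradient_checkpointing \
  --checkpointing_steps 200 \
  --checkpointing_limit 2 \
  --resume_from_checkpoint=latest \
  --enable_slicing \
  --enable_tiling"
```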

## Memory Usage

### LoRA
Examples available:
- [PIKA crush effect](../../examples/training/sft/cogvideox/crush_smol_lora/)

<!-- TODO(aryan): Update these numbers for 49x512x768 -->
To run an example, run the following from the root directory of the repository (assuming you have installed the requirements and are using Linux/WSL):

> [!NOTE]
>
> The measurements below were taken in `torch.bfloat16` precision. Memory usage can be further reduced by passing `--layerwise_upcasting_modules transformer` to the training script. This casts the model weights to `torch.float8_e4m3fn` or `torch.float8_e5m2`, which halves the memory requirement for model weights. Computation is performed in the dtype set by `--transformer_dtype` (which defaults to `bf16`).
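
As a minimal sketch of the flags mentioned in the note above (placement within your own script may differ), layerwise upcasting is enabled by extending the training arguments:

```bash
# Sketch: store transformer weights in fp8 while computing in bf16,
# using the flags referenced in the note above.
training_cmd="$training_cmd --layerwise_upcasting_modules transformer --transformer_dtype bf16"
```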

LoRA with rank 128, batch size 1, gradient checkpointing, optimizer adamw, `49x480x720` resolutions, **with precomputation**:

```
Training configuration: {
"trainable parameters": 132120576,
"total samples": 69,
"train epochs": 1,
"train steps": 10,
"batches per device": 1,
"total batches observed per epoch": 69,
"train batch size": 1,
"gradient accumulation steps": 1
}
```

| stage | memory_allocated (GB) | max_memory_reserved (GB) |
|:-----------------------------:|:---------------------:|:------------------------:|
| after precomputing conditions | 8.880 | 8.941 |
| after precomputing latents | 9.300 | 12.441 |
| before training start | 10.622 | 20.701 |
| after epoch 1 | 11.145 | 20.701 |
| before validation start | 11.145 | 20.702 |
| after validation end | 11.145 | 28.324 |
| after training end | 11.144 | 11.592 |

### Full finetuning

```
Training configuration: {
"trainable parameters": 5570283072,
"total samples": 1,
"train epochs": 2,
"train steps": 2,
"batches per device": 1,
"total batches observed per epoch": 1,
"train batch size": 1,
"gradient accumulation steps": 1
}
```

```bash
chmod +x ./examples/training/sft/cogvideox/crush_smol_lora/train.sh
./examples/training/sft/cogvideox/crush_smol_lora/train.sh
```

| stage | memory_allocated (GB) | max_memory_reserved (GB) |
|:-----------------------------:|:---------------------:|:------------------------:|
| after precomputing conditions | 8.880 | 8.941 |
| after precomputing latents | 9.300 | 12.441 |
| before training start | 10.376 | 10.387 |
| after epoch 1 | 31.160 | 52.939 |
| before validation start | 31.161 | 52.939 |
| after validation end | 31.161 | 52.939 |
| after training end | 31.160 | 34.295 |
On Windows, you will need to adapt the script to a format your shell can run (for example, a batch or PowerShell script). [TODO(aryan): improve instructions for Windows]
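
As an interim, untested workaround (assuming WSL is installed with the repository requirements set up inside it, as noted earlier), the Linux script can be invoked from a Windows shell through WSL:

```bash
# Untested sketch: run the Linux training script via WSL from the repository root.
wsl bash ./examples/training/sft/cogvideox/crush_smol_lora/train.sh
```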

## Supported checkpoints

148 changes: 7 additions & 141 deletions docs/models/hunyuan_video.md
@@ -4,151 +4,17 @@

For LoRA training, specify `--training_type lora`. For full finetuning, specify `--training_type full-finetune`.

```bash
#!/bin/bash

export WANDB_MODE="offline"
export NCCL_P2P_DISABLE=1
export TORCH_NCCL_ENABLE_MONITORING=0
export FINETRAINERS_LOG_LEVEL=DEBUG

GPU_IDS="0,1"

DATA_ROOT="/path/to/dataset"
CAPTION_COLUMN="prompts.txt"
VIDEO_COLUMN="videos.txt"
OUTPUT_DIR="/path/to/models/hunyuan-video/"

ID_TOKEN="afkx"

# Model arguments
model_cmd="--model_name hunyuan_video \
--pretrained_model_name_or_path hunyuanvideo-community/HunyuanVideo"

# Dataset arguments
dataset_cmd="--data_root $DATA_ROOT \
--video_column $VIDEO_COLUMN \
--caption_column $CAPTION_COLUMN \
--id_token $ID_TOKEN \
--video_resolution_buckets 17x512x768 49x512x768 61x512x768 \
--caption_dropout_p 0.05"

# Dataloader arguments
dataloader_cmd="--dataloader_num_workers 0"

# Diffusion arguments
diffusion_cmd=""

# Training arguments
training_cmd="--training_type lora \
--seed 42 \
--batch_size 1 \
--train_steps 500 \
--rank 128 \
--lora_alpha 128 \
--target_modules to_q to_k to_v to_out.0 \
--gradient_accumulation_steps 1 \
--gradient_checkpointing \
--checkpointing_steps 500 \
--checkpointing_limit 2 \
--enable_slicing \
--enable_tiling"

# Optimizer arguments
optimizer_cmd="--optimizer adamw \
--lr 2e-5 \
--lr_scheduler constant_with_warmup \
--lr_warmup_steps 100 \
--lr_num_cycles 1 \
--beta1 0.9 \
--beta2 0.95 \
--weight_decay 1e-4 \
--epsilon 1e-8 \
--max_grad_norm 1.0"

# Miscellaneous arguments
miscellaneous_cmd="--tracker_name finetrainers-hunyuan-video \
--output_dir $OUTPUT_DIR \
--nccl_timeout 1800 \
--report_to wandb"

cmd="accelerate launch --config_file accelerate_configs/uncompiled_8.yaml --gpu_ids $GPU_IDS train.py \
$model_cmd \
$dataset_cmd \
$dataloader_cmd \
$diffusion_cmd \
$training_cmd \
$optimizer_cmd \
$miscellaneous_cmd"

echo "Running command: $cmd"
eval $cmd
echo -ne "-------------------- Finished executing script --------------------\n\n"
```
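
Each entry in `--video_resolution_buckets` follows a `frames x height x width` convention (inferred from the values above, e.g. 49 frames at 512x768). The sketch below restricts training to a single, shorter bucket to lower peak memory; the bucket value is illustrative.

```bash
# Sketch: train on one shorter bucket (frames x height x width) to reduce memory.
dataset_cmd="--data_root $DATA_ROOT \
  --video_column $VIDEO_COLUMN \
  --caption_column $CAPTION_COLUMN \
  --id_token $ID_TOKEN \
  --video_resolution_buckets 17x512x768 \
  --caption_dropout_p 0.05"
```
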
Examples available:
- [PIKA Dissolve effect](../../examples/training/sft/hunyuan_video/modal_labs_dissolve/)

## Memory Usage
To run an example, run the following from the root directory of the repository (assuming you have installed the requirements and are using Linux/WSL):

### LoRA

> [!NOTE]
>
> The measurements below were taken in `torch.bfloat16` precision. Memory usage can be further reduced by passing `--layerwise_upcasting_modules transformer` to the training script. This casts the model weights to `torch.float8_e4m3fn` or `torch.float8_e5m2`, which halves the memory requirement for model weights. Computation is performed in the dtype set by `--transformer_dtype` (which defaults to `bf16`).

LoRA with rank 128, batch size 1, gradient checkpointing, optimizer adamw, `49x512x768` resolutions, **without precomputation**:

```
Training configuration: {
"trainable parameters": 163577856,
"total samples": 69,
"train epochs": 1,
"train steps": 10,
"batches per device": 1,
"total batches observed per epoch": 69,
"train batch size": 1,
"gradient accumulation steps": 1
}
```

| stage | memory_allocated (GB) | max_memory_reserved (GB) |
|:-----------------------:|:---------------------:|:------------------------:|
| before training start | 38.889 | 39.020 |
| before validation start | 39.747 | 56.266 |
| after validation end | 39.748 | 58.385 |
| after epoch 1 | 39.748 | 40.910 |
| after training end | 25.288 | 40.910 |

Note: requires about `59` GB of VRAM when validation is performed.

LoRA with rank 128, batch size 1, gradient checkpointing, optimizer adamw, `49x512x768` resolutions, **with precomputation**:

```
Training configuration: {
"trainable parameters": 163577856,
"total samples": 1,
"train epochs": 10,
"train steps": 10,
"batches per device": 1,
"total batches observed per epoch": 1,
"train batch size": 1,
"gradient accumulation steps": 1
}
```

```bash
chmod +x ./examples/training/sft/hunyuan_video/modal_labs_dissolve/train.sh
./examples/training/sft/hunyuan_video/modal_labs_dissolve/train.sh
```

| stage | memory_allocated (GB) | max_memory_reserved (GB) |
|:-----------------------------:|:---------------------:|:------------------------:|
| after precomputing conditions | 14.232 | 14.461 |
| after precomputing latents | 14.717 | 17.244 |
| before training start | 24.195 | 26.039 |
| after epoch 1 | 24.83 | 42.387 |
| before validation start | 24.842 | 42.387 |
| after validation end | 39.558 | 46.947 |
| after training end | 24.842 | 41.039 |

Note: requires about `47` GB of VRAM with validation. If validation is not performed, the memory usage is reduced to about `42` GB.
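
The "with precomputation" numbers correspond to runs where conditions and latents are cached before training. The sketch below shows one way to enable this; the flag is taken from the CogVideoX script above and is assumed to apply here as well.

```bash
# Sketch: precompute text-encoder conditions and VAE latents before training,
# matching the "with precomputation" measurements above.
training_cmd="$training_cmd --precompute_conditions"
```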

### Full finetuning

Currently, full finetuning is not supported for HunyuanVideo; it goes out of memory (OOM) at `49x512x768` resolution.
On Windows, you will need to adapt the script to a format your shell can run (for example, a batch or PowerShell script). [TODO(aryan): improve instructions for Windows]

## Inference
