Commit: update docs

a-r-r-o-w committed Mar 5, 2025
1 parent b8d9a52 commit d27f1ec
Showing 4 changed files with 31 additions and 439 deletions.
141 changes: 7 additions & 134 deletions docs/models/cogvideox.md
@@ -4,144 +4,17 @@

For LoRA training, specify `--training_type lora`. For full finetuning, specify `--training_type full-finetune`.

```bash
#!/bin/bash
export WANDB_MODE="offline"
export NCCL_P2P_DISABLE=1
export TORCH_NCCL_ENABLE_MONITORING=0
export FINETRAINERS_LOG_LEVEL=DEBUG

GPU_IDS="0,1"

DATA_ROOT="/path/to/dataset"
CAPTION_COLUMN="prompt.txt"
VIDEO_COLUMN="videos.txt"
OUTPUT_DIR="/path/to/models/cog/"
ID_TOKEN="BW_STYLE"

# Model arguments
model_cmd="--model_name cogvideox \
--pretrained_model_name_or_path THUDM/CogVideoX-5b"

# Dataset arguments
dataset_cmd="--data_root $DATA_ROOT \
--video_column $VIDEO_COLUMN \
--caption_column $CAPTION_COLUMN \
--id_token $ID_TOKEN \
--video_resolution_buckets 49x480x720 \
--caption_dropout_p 0.05"

# Dataloader arguments
dataloader_cmd="--dataloader_num_workers 4"

# Training arguments
training_cmd="--training_type lora \
--seed 42 \
--batch_size 1 \
--precompute_conditions \
--train_steps 1000 \
--rank 128 \
--lora_alpha 128 \
--target_modules to_q to_k to_v to_out.0 \
--gradient_accumulation_steps 1 \
--gradient_checkpointing \
--checkpointing_steps 200 \
--checkpointing_limit 2 \
--resume_from_checkpoint=latest \
--enable_slicing \
--enable_tiling"

# Optimizer arguments
optimizer_cmd="--optimizer adamw \
--use_8bit_bnb \
--lr 3e-5 \
--lr_scheduler constant_with_warmup \
--lr_warmup_steps 100 \
--lr_num_cycles 1 \
--beta1 0.9 \
--beta2 0.95 \
--weight_decay 1e-4 \
--epsilon 1e-8 \
--max_grad_norm 1.0"

# Miscellaneous arguments
miscellaneous_cmd="--tracker_name finetrainers-cog \
--output_dir $OUTPUT_DIR \
--nccl_timeout 1800 \
--report_to wandb"

cmd="accelerate launch --config_file accelerate_configs/deepspeed.yaml --gpu_ids $GPU_IDS train.py \
$model_cmd \
$dataset_cmd \
$dataloader_cmd \
$training_cmd \
$optimizer_cmd \
$miscellaneous_cmd"

echo "Running command: $cmd"
eval $cmd
echo -ne "-------------------- Finished executing script --------------------\n\n"
```
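
The script above uses `--training_type lora`. Below is a minimal sketch of the full-finetuning variant, assuming the training arguments otherwise stay the same and that the LoRA-specific flags (`--rank`, `--lora_alpha`, `--target_modules`) are dropped; verify against the current CLI:

```bash
# Sketch only: full-finetuning variant of the training arguments above.
# Assumption: LoRA-specific flags (--rank, --lora_alpha, --target_modules) do not apply here.
training_cmd="--training_type full-finetune \
  --seed 42 \
  --batch_size 1 \
  --precompute_conditions \
  --train_steps 1000 \
  --gradient_accumulation_steps 1 \
  --gradient_checkpointing \
  --checkpointing_steps 200 \
  --checkpointing_limit 2 \
  --resume_from_checkpoint=latest \
  --enable_slicing \
  --enable_tiling"
```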

## Memory Usage

### LoRA
Examples available:
- [PIKA crush effect](../../examples/training/sft/cogvideox/crush_smol_lora/)

<!-- TODO(aryan): Update these numbers for 49x512x768 -->
To run an example, run the following from the root directory of the repository (assuming you have installed the requirements and are using Linux/WSL):

> [!NOTE]
>
> The measurements below are done in `torch.bfloat16` precision. Memory usage can be reduced further by passing `--layerwise_upcasting_modules transformer` to the training script. This will cast the model weights to `torch.float8_e4m3fn` or `torch.float8_e5m2`, which halves the memory requirement for model weights. Computation is performed in the dtype set by `--transformer_dtype` (which defaults to `bf16`).
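
As a minimal sketch of the note above (where the flags are appended is an assumption; the values are the ones mentioned in the note):

```bash
# Sketch: store transformer weights in fp8 while computing in bf16.
# Assumption: appended to the training_cmd defined in the script above.
training_cmd="$training_cmd \
  --layerwise_upcasting_modules transformer \
  --transformer_dtype bf16"
```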

LoRA with rank 128, batch size 1, gradient checkpointing, optimizer adamw, `49x480x720` resolutions, **with precomputation**:

```
Training configuration: {
"trainable parameters": 132120576,
"total samples": 69,
"train epochs": 1,
"train steps": 10,
"batches per device": 1,
"total batches observed per epoch": 69,
"train batch size": 1,
"gradient accumulation steps": 1
}
```

| stage | memory_allocated (GB) | max_memory_reserved (GB) |
|:-----------------------------:|:---------------------:|:------------------------:|
| after precomputing conditions | 8.880 | 8.941 |
| after precomputing latents | 9.300 | 12.441 |
| before training start | 10.622 | 20.701 |
| after epoch 1 | 11.145 | 20.701 |
| before validation start | 11.145 | 20.702 |
| after validation end | 11.145 | 28.324 |
| after training end | 11.144 | 11.592 |

### Full finetuning

```
Training configuration: {
"trainable parameters": 5570283072,
"total samples": 1,
"train epochs": 2,
"train steps": 2,
"batches per device": 1,
"total batches observed per epoch": 1,
"train batch size": 1,
"gradient accumulation steps": 1
}
```

```bash
chmod +x ./examples/training/sft/cogvideox/crush_smol_lora/train.sh
./examples/training/sft/cogvideox/crush_smol_lora/train.sh
```

| stage | memory_allocated (GB) | max_memory_reserved (GB) |
|:-----------------------------:|:---------------------:|:------------------------:|
| after precomputing conditions | 8.880 | 8.941 |
| after precomputing latents | 9.300 | 12.441 |
| before training start | 10.376 | 10.387 |
| after epoch 1 | 31.160 | 52.939 |
| before validation start | 31.161 | 52.939 |
| after validation end | 31.161 | 52.939 |
| after training end | 31.160 | 34.295 |
On Windows, you will need to adapt the script to a compatible format before running it. [TODO(aryan): improve instructions for Windows]

## Supported checkpoints

148 changes: 7 additions & 141 deletions docs/models/hunyuan_video.md
@@ -4,151 +4,17 @@

For LoRA training, specify `--training_type lora`. For full finetuning, specify `--training_type full-finetune`.

```bash
#!/bin/bash

export WANDB_MODE="offline"
export NCCL_P2P_DISABLE=1
export TORCH_NCCL_ENABLE_MONITORING=0
export FINETRAINERS_LOG_LEVEL=DEBUG

GPU_IDS="0,1"

DATA_ROOT="/path/to/dataset"
CAPTION_COLUMN="prompts.txt"
VIDEO_COLUMN="videos.txt"
OUTPUT_DIR="/path/to/models/hunyuan-video/"

ID_TOKEN="afkx"

# Model arguments
model_cmd="--model_name hunyuan_video \
--pretrained_model_name_or_path hunyuanvideo-community/HunyuanVideo"

# Dataset arguments
dataset_cmd="--data_root $DATA_ROOT \
--video_column $VIDEO_COLUMN \
--caption_column $CAPTION_COLUMN \
--id_token $ID_TOKEN \
--video_resolution_buckets 17x512x768 49x512x768 61x512x768 \
--caption_dropout_p 0.05"

# Dataloader arguments
dataloader_cmd="--dataloader_num_workers 0"

# Diffusion arguments
diffusion_cmd=""

# Training arguments
training_cmd="--training_type lora \
--seed 42 \
--batch_size 1 \
--train_steps 500 \
--rank 128 \
--lora_alpha 128 \
--target_modules to_q to_k to_v to_out.0 \
--gradient_accumulation_steps 1 \
--gradient_checkpointing \
--checkpointing_steps 500 \
--checkpointing_limit 2 \
--enable_slicing \
--enable_tiling"

# Optimizer arguments
optimizer_cmd="--optimizer adamw \
--lr 2e-5 \
--lr_scheduler constant_with_warmup \
--lr_warmup_steps 100 \
--lr_num_cycles 1 \
--beta1 0.9 \
--beta2 0.95 \
--weight_decay 1e-4 \
--epsilon 1e-8 \
--max_grad_norm 1.0"

# Miscellaneous arguments
miscellaneous_cmd="--tracker_name finetrainers-hunyuan-video \
--output_dir $OUTPUT_DIR \
--nccl_timeout 1800 \
--report_to wandb"

cmd="accelerate launch --config_file accelerate_configs/uncompiled_8.yaml --gpu_ids $GPU_IDS train.py \
$model_cmd \
$dataset_cmd \
$dataloader_cmd \
$diffusion_cmd \
$training_cmd \
$optimizer_cmd \
$miscellaneous_cmd"

echo "Running command: $cmd"
eval $cmd
echo -ne "-------------------- Finished executing script --------------------\n\n"
```
Examples available:
- [PIKA Dissolve effect](../../examples/training/sft/hunyuan_video/modal_labs_dissolve/)

## Memory Usage
To run an example, run the following from the root directory of the repository (assuming you have installed the requirements and are using Linux/WSL):

### LoRA

> [!NOTE]
>
> The measurements below are done in `torch.bfloat16` precision. Memory usage can be reduced further by passing `--layerwise_upcasting_modules transformer` to the training script. This will cast the model weights to `torch.float8_e4m3fn` or `torch.float8_e5m2`, which halves the memory requirement for model weights. Computation is performed in the dtype set by `--transformer_dtype` (which defaults to `bf16`).

LoRA with rank 128, batch size 1, gradient checkpointing, optimizer adamw, `49x512x768` resolutions, **without precomputation**:

```
Training configuration: {
"trainable parameters": 163577856,
"total samples": 69,
"train epochs": 1,
"train steps": 10,
"batches per device": 1,
"total batches observed per epoch": 69,
"train batch size": 1,
"gradient accumulation steps": 1
}
```

| stage | memory_allocated (GB) | max_memory_reserved (GB) |
|:-----------------------:|:---------------------:|:------------------------:|
| before training start | 38.889 | 39.020 |
| before validation start | 39.747 | 56.266 |
| after validation end | 39.748 | 58.385 |
| after epoch 1 | 39.748 | 40.910 |
| after training end | 25.288 | 40.910 |

Note: requires about `59` GB of VRAM when validation is performed.
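
The "with precomputation" numbers below presumably correspond to enabling the `--precompute_conditions` flag used in the CogVideoX script earlier in this commit; a minimal sketch, assuming the flag applies unchanged to HunyuanVideo:

```bash
# Sketch: precompute text conditions and latents before training starts.
# Assumption: the same --precompute_conditions flag from the CogVideoX script applies here.
training_cmd="$training_cmd \
  --precompute_conditions"
```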

LoRA with rank 128, batch size 1, gradient checkpointing, optimizer adamw, `49x512x768` resolutions, **with precomputation**:

```
Training configuration: {
"trainable parameters": 163577856,
"total samples": 1,
"train epochs": 10,
"train steps": 10,
"batches per device": 1,
"total batches observed per epoch": 1,
"train batch size": 1,
"gradient accumulation steps": 1
}
```

```bash
chmod +x ./examples/training/sft/hunyuan_video/modal_labs_dissolve/train.sh
./examples/training/sft/hunyuan_video/modal_labs_dissolve/train.sh
```

| stage | memory_allocated (GB) | max_memory_reserved (GB) |
|:-----------------------------:|:---------------------:|:------------------------:|
| after precomputing conditions | 14.232 | 14.461 |
| after precomputing latents | 14.717 | 17.244 |
| before training start | 24.195 | 26.039 |
| after epoch 1 | 24.830 | 42.387 |
| before validation start | 24.842 | 42.387 |
| after validation end | 39.558 | 46.947 |
| after training end | 24.842 | 41.039 |

Note: requires about `47` GB of VRAM with validation. If validation is not performed, the memory usage is reduced to about `42` GB.

### Full finetuning

Currently, full finetuning is not supported for HunyuanVideo. It goes out of memory (OOM) for `49x512x768` resolutions.
On Windows, you will need to adapt the script to a compatible format before running it. [TODO(aryan): improve instructions for Windows]

## Inference

(The remaining 2 changed files in this commit are not shown.)
