
Allow images; Remove LLM generated prefixes; Allow JSON/JSONL; Fix bugs #158

Merged
merged 16 commits into main from dataset-improvements
Jan 6, 2025

Conversation

@a-r-r-o-w (Owner)

No description provided.

@a-r-r-o-w a-r-r-o-w requested a review from sayakpaul December 27, 2024 20:57
# - Using a CSV: caption_column and video_column must be some column in the CSV. One could
# make use of other columns too, such as a motion score or aesthetic score, by modifying the
# logic in CSV processing.
# - Using two files containing line-separated captions and relative paths to videos.
# - Using a JSON file containing a list of dictionaries, where each dictionary has a `caption_column` and `video_column` key.
@sayakpaul (Collaborator)

Like this?

[{"prompt": ..., "video": ...}, ...]

@a-r-r-o-w (Owner, Author)

Yep. An example dataset would be this: https://huggingface.co/datasets/omni-research/DREAM-1K
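For reference, a minimal sketch of how such a JSON/JSONL metadata file could be parsed into prompts and video paths (the filename and the default column names are illustrative, not the trainer's actual API):

import json
from pathlib import Path

def load_metadata(path: str, caption_column: str = "prompt", video_column: str = "video"):
    # JSON files hold a list of dicts; JSONL files hold one dict per line.
    text = Path(path).read_text()
    if path.endswith(".jsonl"):
        entries = [json.loads(line) for line in text.splitlines() if line.strip()]
    else:
        entries = json.loads(text)
    prompts = [entry[caption_column] for entry in entries]
    videos = [entry[video_column] for entry in entries]
    return prompts, videos

prompts, videos = load_metadata("metadata.jsonl")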

Comment on lines 86 to 91
# Clean LLM start phrases
for i in range(len(self.prompts)):
    self.prompts[i] = self.prompts[i].strip()
    for phrase in COMMON_LLM_START_PHRASES:
        if self.prompts[i].startswith(phrase):
            self.prompts[i] = self.prompts[i].removeprefix(phrase).strip()
@sayakpaul (Collaborator)

Should this be user-configurable?

And maybe we should also note this in our data-prep guide?
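One possible shape for making it user-configurable, as a minimal sketch (the `remove_llm_prefixes` flag and the phrase list below are illustrative, not the repo's actual interface):

# Illustrative entries only; the real list lives in the repo's constants.
COMMON_LLM_START_PHRASES = (
    "In the video,",
    "This video showcases",
)

def clean_prompts(prompts, remove_llm_prefixes=False):
    prompts = [prompt.strip() for prompt in prompts]
    if remove_llm_prefixes:
        for i in range(len(prompts)):
            for phrase in COMMON_LLM_START_PHRASES:
                if prompts[i].startswith(phrase):
                    prompts[i] = prompts[i].removeprefix(phrase).strip()
    return prompts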

@sayakpaul (Collaborator) left a comment

Thanks!

@a-r-r-o-w a-r-r-o-w changed the title Remove LLM generated prefixes and allowing loading from JSON Allow images to be loaded; Remove LLM generated prefixes; Allow loading from JSON Dec 30, 2024
@a-r-r-o-w a-r-r-o-w changed the title Allow images to be loaded; Remove LLM generated prefixes; Allow loading from JSON Allow images; Remove LLM generated prefixes; Allow JSON/JSONL; Fix bugs Dec 31, 2024
@a-r-r-o-w (Owner, Author)

After the HunyuanVideo fixes, the precomputed vs non-precomputed runs match almost exactly when starting with the same parameters. The weights converge to very similar values, and the validation videos demonstrate this as well: https://api.wandb.ai/links/aryanvs/3aixk4xk

@sayakpaul (Collaborator)

Is this ready for another review?

@a-r-r-o-w (Owner, Author)

No.

@sayakpaul sayakpaul mentioned this pull request Jan 4, 2025
@a-r-r-o-w (Owner, Author)

Since this makes some changes to the README, I'd prefer to merge after #175. I'll also move the dataset.md file into the docs/ folder here, because it's currently in assets/.

@sayakpaul (Collaborator)

SG!

LMK if you would like me to review, too.

@@ -955,7 +956,9 @@ def validate(self, step: int, final_validation: bool = False) -> None:
     width=width,
     num_frames=num_frames,
     num_videos_per_prompt=self.args.num_validation_videos_per_prompt,
-    generator=self.state.generator,
+    generator=torch.Generator(device=accelerator.device).manual_seed(
@a-r-r-o-w (Owner, Author)

Here, if we use the state generator, we get different validation images/videos every time. That is not very indicative of whether training is working, so we want to ensure each validation generation starts from the same seed.
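To illustrate the difference, a minimal sketch (the seed 42 is arbitrary): a reused generator advances its internal state between validations, while a freshly seeded one reproduces the same noise every time.

import torch

# Reusing a stateful generator: successive validations sample different noise.
gen = torch.Generator().manual_seed(42)
a = torch.randn(4, generator=gen)
b = torch.randn(4, generator=gen)
assert not torch.equal(a, b)

# Fresh generator with a fixed seed: identical noise each validation, so
# differences between outputs reflect training progress, not sampling noise.
c = torch.randn(4, generator=torch.Generator().manual_seed(42))
d = torch.randn(4, generator=torch.Generator().manual_seed(42))
assert torch.equal(c, d)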

     revision: Optional[str] = None,
     cache_dir: Optional[str] = None,
     **kwargs,
 ) -> Dict[str, Union[nn.Module, FlowMatchEulerDiscreteScheduler]]:
     transformer = HunyuanVideoTransformer3DModel.from_pretrained(
         model_id, subfolder="transformer", torch_dtype=transformer_dtype, revision=revision, cache_dir=cache_dir
     )
-    scheduler = FlowMatchEulerDiscreteScheduler()
+    scheduler = FlowMatchEulerDiscreteScheduler(shift=shift)
@a-r-r-o-w (Owner, Author)

HunyuanVideo uses 7.0 as the flow shift for inference. By default this value is 1.0, which corresponds to the original flow matching objective, but there have been reports of success and even better results with varying values of shift, so I think it makes sense to support it.
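For context, the shift warps the sigma schedule toward higher noise levels. A minimal sketch of the transform (this mirrors how flow-matching schedulers typically apply a static shift, not the full scheduler logic):

import numpy as np

def shift_sigmas(sigmas: np.ndarray, shift: float) -> np.ndarray:
    # shift == 1.0 leaves the schedule unchanged (original flow matching);
    # larger values spend more steps at high noise, which reportedly helps
    # large video models such as HunyuanVideo (shift == 7.0 at inference).
    return shift * sigmas / (1 + (shift - 1) * sigmas)

sigmas = np.linspace(1.0, 1e-3, num=30)
assert np.allclose(shift_sigmas(sigmas, 1.0), sigmas)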



logger = get_logger(__name__)


class VideoDataset(Dataset):
# TODO(aryan): This needs a refactor with separation of concerns.
@a-r-r-o-w (Owner, Author)

Dataset part needs a complete rewrite. Will take it up in follow-up PRs

            },
            "dataloader_arguments": {
                "dataloader_num_workers": self.dataloader_num_workers,
                "pin_memory": self.pin_memory,
            },
            "diffusion_arguments": {
                "flow_resolution_shifting": self.flow_resolution_shifting,
@a-r-r-o-w (Owner, Author)

I still haven't had success with flow_resolution_shifting (it needs to be added for LTX, I believe), so we still don't handle adjusting the sigmas when this is specified. I'll add the actual logic that makes use of it in the image-to-video PR after further iterations.
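For reference, a sketch of the kind of resolution-dependent shifting used by SD3/Flux-style pipelines, which is roughly what flow_resolution_shifting would need to do (the constants are commonly cited defaults from those pipelines, used here purely for illustration):

import math

def calculate_mu(image_seq_len: int, base_seq_len: int = 256, max_seq_len: int = 4096,
                 base_shift: float = 0.5, max_shift: float = 1.15) -> float:
    # Interpolate the shift linearly in the latent sequence length, so
    # larger resolutions receive a stronger shift.
    m = (max_shift - base_shift) / (max_seq_len - base_seq_len)
    return image_seq_len * m + (base_shift - m * base_seq_len)

def shift_sigma(sigma: float, mu: float) -> float:
    # Time-shift a single sigma in (0, 1] according to mu.
    return math.exp(mu) / (math.exp(mu) + (1 / sigma - 1))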

@a-r-r-o-w a-r-r-o-w requested a review from sayakpaul January 5, 2025 16:23
@a-r-r-o-w (Owner, Author)

@sayakpaul Ready for another review. Doing a small run to check that the validation generator changes work as expected.

         artifact_type = value["type"]
         artifact_value = value["value"]
         if artifact_type not in ["image", "video"] or artifact_value is None:
             continue

         extension = "png" if artifact_type == "image" else "mp4"
         filename = "validation-" if not final_validation else "final-"
-        filename += f"{step}-{accelerator.process_index}-{prompt_filename}.{extension}"
+        filename += f"{step}-{accelerator.process_index}-{index}-{prompt_filename}.{extension}"
@a-r-r-o-w (Owner, Author)

Just caught my eye: if we use the same prompt multiple times but want to validate at different resolutions, it might end up using the same filename. Including the index should be safer, imo.
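A quick illustration of the collision (values made up): without the index, two generations of the same prompt map to one filename.

step, process_index, prompt_filename = 500, 0, "a-cat-playing-piano"

# Old scheme: both resolutions of the same prompt collide on one name.
old = f"validation-{step}-{process_index}-{prompt_filename}.mp4"

# New scheme: the per-artifact index keeps every generation distinct.
new = [f"validation-{step}-{process_index}-{index}-{prompt_filename}.mp4" for index in range(2)]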

@a-r-r-o-w a-r-r-o-w force-pushed the dataset-improvements branch from 87b848d to 33a8f6b Compare January 5, 2025 21:24
@a-r-r-o-w (Owner, Author)

I can confirm that the validation generator changes work as expected. Here's a demo video using the same starting generator across different training steps:

output.mp4

@a-r-r-o-w (Owner, Author)

@sayakpaul Proceeding with the merge here, as two folks who've helped with initial feedback wanted to try this with FP8 training. Currently, creating a common branch with these changes and the FP8 changes produces a merge conflict, so I will resolve it for them in the FP8 branch. If you have any suggestions on things to improve here, I'm happy to iterate in future PRs.

@a-r-r-o-w a-r-r-o-w merged commit 38413aa into main Jan 6, 2025
1 check passed
@a-r-r-o-w a-r-r-o-w deleted the dataset-improvements branch January 6, 2025 13:22