
[do not merge] Rebased refactor #270
Draft · wants to merge 41 commits into main
Conversation

@jzhang38 (Collaborator) commented Mar 16, 2025

[ ] Attn backend (PY)
[ ] Wan text encoder (PY)
[ ] Wan VAE (Wei)
[ ] Wan pipeline (PY & Wei)
[ ] Merge Wan DiT code to refactor (Will)
[ ] Clean up code & loader directory (Will)
[ ] Hunyuan text encoder (Will)

SolitaryThinker and others added 30 commits March 14, 2025 19:07
Signed-off-by: <>
Co-authored-by: Will Lin <[email protected]>
Co-authored-by: Ubuntu <ubuntu@awesome-gpu-name-8-inst-2tbsnfodvpomxv4tukw2dkfgyvz.c.nv-brev-20240723.internal>
Co-authored-by: Ubuntu <ubuntu@awesome-gpu-name-9-inst-2tpydiudxfu1jg9xvpflm7oexie.c.nv-brev-20240723.internal>
This was referenced Mar 16, 2025
@jzhang38 changed the title from "Rebased refactor" to "[do not merge] Rebased refactor" on Mar 16, 2025
@jzhang38 marked this pull request as draft on March 16, 2025 23:50
Comment on lines +1 to +18
# Copyright 2024 Stability AI, Katherine Crowson and The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
#
# Modified from diffusers==0.29.2
#
# ==============================================================================


Suggestion for licensing:

  1. All files need Apache license headers. It is recommended to use the simplified one-liner:
# SPDX-License-Identifier: Apache-2.0
  2. For files copied from other projects, if that project is Apache-2.0 as well (e.g. vLLM/SGLang), then just:
# SPDX-License-Identifier: Apache-2.0
# Adapted from: <link to the original file>
  3. For files copied from projects under other licenses, you need to read their copyright and make sure you are allowed to copy the code. If so, the header could be:
# Copyright 2024 Stability AI, Katherine Crowson and The HuggingFace Team. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
# Adapted from: <link to the original file>

One way to avoid carrying the full license text in every copied file is to keep a license file per upstream project, such as https://github.com/awslabs/raf/tree/main/licenses

Comment on lines 120 to 121
modules["text_encoder"] = text_encoder
modules["text_encoder_2"] = text_encoder_2


Where are these modules used, and how are the keys (e.g., text_encoder_2) determined? A more general question: if a model provider wants to implement a pipeline, is it easy for them to know how to implement this function?


Same question as the one above, I think.

self.add_stage("decoding_stage",
               DecodingStage())

def initialize_pipeline(self, inference_args: InferenceArgs):


Ditto. It's not clear what should be initialized in this function.

Comment on lines 170 to 171
# for stage in self._stages:
# batch = stage(batch, inference_args)


Definitely prefer this programming model; otherwise it's meaningless to have stages.

If users want custom logic between stages, you should ask them to implement a custom stage (so of course the API for wrapping custom logic as a pipeline stage has to be clean and simple) and add that stage to the pipeline; otherwise it's hard to apply pipeline-level optimizations because you may not have the full picture.
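A minimal sketch of that model (class and field names here are hypothetical, not the PR's actual API): custom logic becomes its own stage, and the pipeline stays a plain loop over stages.

class PipelineStage:
    """Hypothetical base class: one unit of work in the pipeline."""

    def __call__(self, batch, inference_args):
        raise NotImplementedError


class MyCustomStage(PipelineStage):
    """User-provided logic that would otherwise sit between stages."""

    def __call__(self, batch, inference_args):
        # e.g. rescale latents, swap embeddings, log intermediate tensors, ...
        return batch


class ComposedPipeline:
    def __init__(self, stages):
        self._stages = list(stages)

    def forward(self, batch, inference_args):
        # The whole pipeline is just this loop -- no ad-hoc code in between,
        # so pipeline-level optimizations can see the full picture.
        for stage in self._stages:
            batch = stage(batch, inference_args)
        return batch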

Returns:
The updated batch information after this stage's processing.
"""
pass


  1. Should this raise NotImplementedError?
  2. Also, the name _call_implementation doesn't seem clear and concise enough. Maybe just call()? (It's better to make this a public API without the underscore prefix; see the sketch below.)
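For instance, a sketch (the base-class name is hypothetical) combining abc with an explicit raise:

from abc import ABC, abstractmethod


class PipelineStage(ABC):
    @abstractmethod
    def call(self, batch, inference_args):
        """Process the batch and return the updated batch."""
        # abstractmethod prevents instantiating the base class; the explicit
        # raise also catches subclasses that call super() without overriding.
        raise NotImplementedError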

Comment on lines 91 to 92
# Just call the implementation directly if logging is disabled
return self._call_implementation(batch, inference_args)


For a better user experience, it's better to validate the input and output batch (e.g., check for expected keys) before and after calling the implementation.
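A sketch of what that wrapper could look like; the expected_input_keys / expected_output_keys attributes are assumptions, and the batch is assumed to expose its fields as attributes:

class VerifiedStage:
    expected_input_keys: set = set()
    expected_output_keys: set = set()

    def _call_implementation(self, batch, inference_args):
        raise NotImplementedError

    def __call__(self, batch, inference_args):
        # Validate inputs before running the stage ...
        missing = self.expected_input_keys - set(vars(batch))
        if missing:
            raise ValueError(f"{type(self).__name__}: missing inputs {missing}")
        batch = self._call_implementation(batch, inference_args)
        # ... and outputs afterwards, so errors surface at the right stage.
        missing = self.expected_output_keys - set(vars(batch))
        if missing:
            raise ValueError(f"{type(self).__name__}: missing outputs {missing}")
        return batch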

prompt_embeds_2 = batch.prompt_embeds_2

# Run denoising loop
with self.progress_bar(total=num_inference_steps) as progress_bar:


We should not handle display concerns in this function.
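One option (sketch; stage and field names are illustrative) is for the stage to report progress through an optional callback and leave rendering to the caller:

class DenoisingStage:
    def __call__(self, batch, inference_args, on_step=None):
        num_steps = inference_args.num_inference_steps
        for i in range(num_steps):
            # ... one denoising step updating `batch` ...
            if on_step is not None:
                on_step(i, num_steps)  # caller decides how (or whether) to display
        return batch

# Caller side, e.g. with tqdm:
#   with tqdm(total=args.num_inference_steps) as bar:
#       batch = stage(batch, args, on_step=lambda i, n: bar.update(1))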

# TODO(will): finalize the interface
with set_forward_context(num_step=i, inference_args=inference_args):
    # Run transformer
    noise_pred = self.transformer(


This pattern is confusing, because it means this stage cannot be used without the pipeline. Ideally every stage should be self-contained. For example:

def __init__(self, ...):
    self.transformers = ...

def _call_implementation(self, ...):
    batch = self.transformers(...)


inference_hunyuan_v1 instead of v1_inference_hunyuan?

predict.py Outdated
@@ -12,7 +12,7 @@
import torchvision
from cog import BasePredictor, Input, Path
from einops import rearrange

from transformers import LlamaForCausalLM

nit: I think you do need a blank line here to keep the imports in three sections?


please run linting and code style formatter


I think we can delete this file? @SolitaryThinker


do we still need a separate predict.py?

  • First, this file feels a bit out of place.
  • Second, remember we are pursuing a CLI-like interface, e.g. fastvideo generate --args (rough sketch below)
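A rough sketch of that kind of entry point (argparse-based; the subcommand and flag names here are placeholders, not an agreed interface):

import argparse


def main():
    parser = argparse.ArgumentParser(prog="fastvideo")
    subcommands = parser.add_subparsers(dest="command", required=True)

    generate = subcommands.add_parser("generate", help="generate a video")
    generate.add_argument("--prompt", required=True)
    generate.add_argument("--num-inference-steps", type=int, default=50)
    generate.add_argument("--output", default="output.mp4")

    args = parser.parse_args()
    if args.command == "generate":
        # build InferenceArgs from the parsed flags and run the pipeline
        ...


if __name__ == "__main__":
    main()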

env_setup.sh Outdated

nit: can we just put the dependencies in the toml and remove this file?

@@ -0,0 +1,213 @@
# SPDX-License-Identifier: Apache-2.0

not sure if you have done this, but if this is copied/adapted from vLLM, let's mark the source file in the header?

def load(self, model_path: str, architecture: str, inference_args: InferenceArgs):
    """Load the scheduler based on the model path, architecture, and inference args."""

    scheduler = get_scheduler(

just educate me: what do we need to load for the scheduler?


I am confused about the encoder design here. We have a models/encoder folder where we put different encoder models, yet we have a separate text_encoder.py. What are their boundaries?


so, why do we need a composed package? This package is rather thin; it seems to only provide an abstraction over the ComposedPipeline object?


maybe a simple structure is just like:

pipelines/
-- composed_pipeline_base.py
-- hunyuan_pipeline.py
-- pipeline_registry.py
-- stages/
...?

# prompt_template
prompt_template = PROMPT_TEMPLATE["image"]

# prompt_template_video

I feel constant.py is unnecessary; we could just put everything into one hunyuan_pipeline.py


+1, I think it's not good to split


# or

batch = self.input_validation_stage(batch, inference_args)

agree with @comaniac. I am confused here -- above you already have clearly defined stages, so the block below could just be discarded?

@richardliaw left a comment


Agree with others' comments about avoiding the add_stage API in favor of something more fluent.
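e.g. something like the following (sketch; with_stage and the chaining style are hypothetical, not an existing API):

class ComposedPipelineBase:
    def __init__(self):
        self._stages = []

    def with_stage(self, stage):
        self._stages.append(stage)
        return self  # returning self allows fluent chaining

# pipeline = (HunyuanVideoPipeline()
#             .with_stage(TextEncodingStage())
#             .with_stage(DenoisingStage())
#             .with_stage(DecodingStage()))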


class HunyuanVideoPipeline(ComposedPipelineBase):

def required_config_modules(self):


Why is this a func and not a property or a class var?
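e.g. as a class variable (sketch; the exact module names listed here are illustrative):

class HunyuanVideoPipeline(ComposedPipelineBase):
    # Declarative class variable instead of a method:
    required_config_modules = [
        "text_encoder", "text_encoder_2", "vae", "transformer", "scheduler",
    ]

    # Or a property, if a subclass ever needs to compute it:
    # @property
    # def required_config_modules(self):
    #     return [...]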

Comment on lines +51 to +52
def initialize_encoders(self, inference_args: InferenceArgs):
self.initialize_encoders_v1(inference_args)


Seems like boilerplate?

encoder_1.to(dtype=PRECISION_TO_TYPE[inference_args.text_encoder_precision])
encoder_1.requires_grad_(False)

print(f"keys: {self.modules.keys()}")


remove

device=inference_args.device if not inference_args.use_cpu_offload else "cpu",
)

encoder_2 = self.modules.pop("text_encoder_2")


why do you have to pop?

Comment on lines 107 to 116
text_encoder_2 = TextEncoder(
    text_encoder=encoder_2,
    tokenizer=tokenizer_2,
    # text_encoder_type="text_encoder_2",
    max_length=inference_args.text_len_2,
    # text_encoder_precision=inference_args.text_encoder_precision,
    device=inference_args.device if not inference_args.use_cpu_offload else "cpu",
)
self.modules["text_encoder"] = text_encoder
self.modules["text_encoder_2"] = text_encoder_2


Why do you do this pop-and-set dance? Feels like you should have something more functional that doesn't make the user write all of this boilerplate -- you set the device in two places, and requires_grad_ and finding the tokenizer are basically boilerplate.
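For example, a small factory could hide the shared device/dtype/requires_grad/tokenizer setup (sketch; build_text_encoder, load_encoder, and load_tokenizer are hypothetical helpers, not existing functions):

def build_text_encoder(name, inference_args, max_length):
    encoder = load_encoder(name, inference_args)      # hypothetical loader call
    tokenizer = load_tokenizer(name, inference_args)  # hypothetical loader call
    encoder.to(dtype=PRECISION_TO_TYPE[inference_args.text_encoder_precision])
    encoder.requires_grad_(False)
    device = "cpu" if inference_args.use_cpu_offload else inference_args.device
    return TextEncoder(text_encoder=encoder, tokenizer=tokenizer,
                       max_length=max_length, device=device)

# modules["text_encoder"] = build_text_encoder("text_encoder", inference_args, inference_args.text_len)
# modules["text_encoder_2"] = build_text_encoder("text_encoder_2", inference_args, inference_args.text_len_2)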
