Trying to reproduce this portion of the paper for Avatar Driving Pose Video #169

ArEnSc commented Jan 7, 2025

(screenshot attached: Screenshot 2025-01-06 234755)

Paper:
https://arxiv.org/pdf/2412.03603
7.3 Avatar animation

We encode the reference image using 3DVAE, obtaining `z_ref ∈ R^{1×c×h×w}`, where c = 16. Then we repeat it t times along the temporal dimension and concatenate it with `z_t` in the channel dimension to get the modified noise input `ẑ_t ∈ R^{t×2c×h×w}`. We can control the digital character's body movements explicitly using pose templates. We use DWPose [92] to detect a skeletal video from any source video, and use 3DVAE to transform it into latent space as `z_pose`. We argue that this eases the fine-tuning process because both input and driving videos are in image representation and are encoded with a shared VAE, resulting in the same latent space. We then inject the driving signal into the model by element-wise addition as `ẑ_t + z_pose`. Note that `ẑ_t` contains the appearance information of the reference image. We use full-parameter fine-tuning with the pretrained T2V weights as initialization.
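
To pin down where my confusion is, here is a minimal PyTorch sketch of how I read the construction of `ẑ_t` (the tensor sizes and variable names are mine, not from this repo, and I've dropped the batch dimension):

```python
import torch

t, c, h, w = 33, 16, 60, 104  # hypothetical latent sizes; c = 16 per the paper

z_ref  = torch.randn(1, c, h, w)   # 3DVAE latent of the reference image
z_t    = torch.randn(t, c, h, w)   # noisy video latent at the current diffusion step
z_pose = torch.randn(t, c, h, w)   # 3DVAE latent of the DWPose skeleton video

# repeat the reference latent t times along the temporal dimension
z_ref_rep = z_ref.repeat(t, 1, 1, 1)          # (t, c, h, w)

# concatenate with the noise latent in the channel dimension
z_hat_t = torch.cat([z_t, z_ref_rep], dim=1)  # (t, 2c, h, w)

print(z_hat_t.shape)  # torch.Size([33, 32, 60, 104])
```

With this reading, `z_hat_t` has 2c channels while `z_pose` only has c, so the `ẑ_t + z_pose` from the paper does not work as a plain element-wise add, and that is what I am trying to resolve.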

How is `z_pose` element-wise added as conditioning? Given `ẑ_t ∈ R^{t×2c×h×w}` while `z_pose` presumably has only c channels, is it added only to the noise latent `z_t`, i.e. before the concatenation with the repeated reference latent?
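
Concretely, the two readings I can imagine are sketched below, continuing the snippet above (both are guesses on my part, not something I have verified against any code):

```python
# (a) add the pose latent to the noise latent only, before concatenating the reference
z_hat_t_a = torch.cat([z_t + z_pose, z_ref_rep], dim=1)  # (t, 2c, h, w)

# (b) add after concatenation, duplicating z_pose across both channel halves
#     so it also perturbs the reference channels
z_pose_2c = z_pose.repeat(1, 2, 1, 1)                    # (t, 2c, h, w)
z_hat_t_b = z_hat_t + z_pose_2c
```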

And should the sequence enter the transformer as
[ img reference ] [ noise ] (I suspect it is this)
or
[ noise ] [ img reference ]?

thanks