Trying to reproduce this portion of the paper for Avatar Driving Pose Video #169

ArEnSc commented Jan 7, 2025

(screenshot attached: Screenshot 2025-01-06 234755)

Paper:
https://arxiv.org/pdf/2412.03603
7.3 Avatar animation

We encode the reference image using 3DVAE, obtaining `z_ref ∈ R^{1×c×h×w}`, where c = 16. Then we repeat it t times along the temporal dimension and concatenate it with `z_t` in the channel dimension to get the modified noise input `ẑ_t ∈ R^{t×2c×h×w}`. We can control the digital character's body movements explicitly using pose templates. We use DWPose [92] to detect a skeletal video from any source video, and use 3DVAE to transform it into latent space as `z_pose`. We argue that this eases the fine-tuning process because both input and driving videos are in image representation and are encoded with a shared VAE, resulting in the same latent space. We then inject the driving signal into the model by element-wise addition as `ẑ_t + z_pose`. Note that `ẑ_t` contains the appearance information of the reference image. We use full-parameter fine-tuning with the pretrained T2V weights as initialization.
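
To pin down where my confusion is, here is a minimal PyTorch sketch of how I read the construction of `ẑ_t` (the tensor sizes and variable names are mine, not from this repo, and I've dropped the batch dimension):

```python
import torch

t, c, h, w = 33, 16, 60, 104  # hypothetical latent sizes; c = 16 per the paper

z_ref  = torch.randn(1, c, h, w)   # 3DVAE latent of the reference image
z_t    = torch.randn(t, c, h, w)   # noisy video latent at the current diffusion step
z_pose = torch.randn(t, c, h, w)   # 3DVAE latent of the DWPose skeleton video

# repeat the reference latent t times along the temporal dimension
z_ref_rep = z_ref.repeat(t, 1, 1, 1)          # (t, c, h, w)

# concatenate with the noise latent in the channel dimension
z_hat_t = torch.cat([z_t, z_ref_rep], dim=1)  # (t, 2c, h, w)

print(z_hat_t.shape)  # torch.Size([33, 32, 60, 104])
```

With this reading, `z_hat_t` has 2c channels while `z_pose` only has c, so the `ẑ_t + z_pose` from the paper does not work as a plain element-wise add, and that is what I am trying to resolve.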

How is `z_pose` element-wise added as conditioning? Given `ẑ_t ∈ R^{t×2c×h×w}` while `z_pose` presumably has only c channels, is it added only to the noise latent `z_t`, i.e. before the concatenation with the repeated reference latent?
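
Concretely, the two readings I can imagine are sketched below, continuing the snippet above (both are guesses on my part, not something I have verified against any code):

```python
# (a) add the pose latent to the noise latent only, before concatenating the reference
z_hat_t_a = torch.cat([z_t + z_pose, z_ref_rep], dim=1)  # (t, 2c, h, w)

# (b) add after concatenation, duplicating z_pose across both channel halves
#     so it also perturbs the reference channels
z_pose_2c = z_pose.repeat(1, 2, 1, 1)                    # (t, 2c, h, w)
z_hat_t_b = z_hat_t + z_pose_2c
```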

And should the sequence enter the transformer as
[ img reference ] [ noise ] (I suspect it is this)
or
[ noise ] [ img reference ]?

thanks