From the paper (https://arxiv.org/pdf/2412.03603, §7.3 Avatar animation):

> We encode the reference image using the 3DVAE, obtaining `z_ref ∈ R^{1×c×h×w}`, where `c = 16`. Then we repeat it `t` times along the temporal dimension and concatenate it with `z_t` in the channel dimension to get the modified noise input `ẑ_t ∈ R^{t×2c×h×w}`.
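As a shape check, here is my reading of that concat step in PyTorch (all variable names and sizes are mine, not from the released code):

```python
import torch

t, c, h, w = 8, 16, 32, 32  # c = 16 latent channels, t latent frames (example sizes)

z_ref = torch.randn(1, c, h, w)  # reference image latent from the 3DVAE
z_t = torch.randn(t, c, h, w)    # noisy video latent at diffusion step t

# Repeat the reference latent t times along the temporal (frame) dimension,
# then concatenate with the noise along the channel dimension.
z_ref_rep = z_ref.repeat(t, 1, 1, 1)          # (t, c, h, w)
z_hat_t = torch.cat([z_t, z_ref_rep], dim=1)  # (t, 2c, h, w)

assert z_hat_t.shape == (t, 2 * c, h, w)
```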
> We can control the digital character's body movements explicitly using pose templates. We use DWPose [92] to detect a skeletal video from any source video, and use the 3DVAE to transform it into latent space as `z_pose`. We argue that this eases the fine-tuning process because both the input and driving videos are in image representation and are encoded with a shared VAE, resulting in the same latent space. We then inject the driving signals into the model by element-wise addition as `ẑ_t + z_pose`. Note that `ẑ_t` contains the appearance information of the reference image. We use full-parameter fine-tuning with the pretrained T2V weights as initialization.
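This is where my confusion starts. If `z_pose` comes from the same 3DVAE, it should be `(t, c, h, w)`, while `ẑ_t` is `(t, 2c, h, w)`, so `ẑ_t + z_pose` does not type-check as written. One reading, which is purely my assumption and not something the paper states, is that the pose latent is added only to the noise half before the concatenation:

```python
import torch

t, c, h, w = 8, 16, 32, 32

z_t = torch.randn(t, c, h, w)        # noisy video latent
z_ref_rep = torch.randn(t, c, h, w)  # reference latent, already repeated t times
z_pose = torch.randn(t, c, h, w)     # pose-skeleton video latent from the 3DVAE

# ẑ_t is (t, 2c, h, w) while z_pose is (t, c, h, w), so a literal
# `ẑ_t + z_pose` would not broadcast in PyTorch. Hypothesis (mine,
# unconfirmed): add z_pose to the noise latent only, then concat.
z_hat_t = torch.cat([z_t + z_pose, z_ref_rep], dim=1)  # (t, 2c, h, w)
```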
So, concretely: how is `z_pose` element-wise added as conditioning? Given `ẑ_t ∈ R^{t×2c×h×w}`, is `z_pose` added only to the noise latent, before concatenating? And in which order should the two halves enter the transformer:

`[img reference][noise]` (I suspect it is this), or
`[noise][img reference]`?
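To make the ordering question concrete, these are the two options I mean (names again mine; which one matches the fine-tuned checkpoint is exactly what I'm asking):

```python
import torch

t, c, h, w = 8, 16, 32, 32
noise = torch.randn(t, c, h, w)  # (pose-conditioned) noise latent
ref = torch.randn(t, c, h, w)    # repeated reference latent

# The patch-embedding weights fix which half comes first, so the order
# has to match whatever the fine-tuned checkpoint was trained with.
option_a = torch.cat([ref, noise], dim=1)  # [img reference][noise]
option_b = torch.cat([noise, ref], dim=1)  # [noise][img reference]
```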
thanks