Explore the Full Program of SIGGRAPH Asia 2025!

Presentation

X-UniMotion: Animating Human Images with Expressive, Unified and Identity-Agnostic Motion Latents
Description

We present X-UniMotion, a unified and expressive implicit latent representation for whole-body human motion, encompassing facial expressions, body poses, and hand gestures. Unlike prior motion transfer methods that rely on explicit skeletal poses and heuristic cross-identity adjustments, our approach encodes multi-granularity human motion directly from a single image into a compact set of four disentangled latent tokens: one for the facial expression, one for the body pose, and one per hand. These motion latents are both highly expressive and identity-agnostic, enabling high-fidelity, detailed cross-identity motion transfer across subjects with distinct identity attributes and pose configurations.
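The four-token layout and the cross-identity pairing can be sketched as follows. This is an illustrative structural sketch only: the token dimension, the random stand-in encoder, and all function names (`extract_motion`, `animate`) are assumptions, not the paper's actual components.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class MotionLatents:
    """The four disentangled, identity-agnostic motion tokens
    described in the abstract (dimensions are illustrative)."""
    face: np.ndarray        # facial-expression token
    body: np.ndarray        # body-pose token
    left_hand: np.ndarray   # left-hand gesture token
    right_hand: np.ndarray  # right-hand gesture token

    def tokens(self):
        return [self.face, self.body, self.left_hand, self.right_hand]

def extract_motion(driving_image, dim=64, rng=None):
    """Stand-in for the learned motion encoder: maps a driving image
    to four latent tokens (random vectors here, for illustration)."""
    rng = rng or np.random.default_rng(0)
    return MotionLatents(*(rng.standard_normal(dim) for _ in range(4)))

def animate(identity_image, driving_image):
    """Cross-identity transfer: the video generator is conditioned on
    the target subject's identity image plus the driving subject's
    motion tokens, so appearance and motion come from different people."""
    motion = extract_motion(driving_image)
    return {"identity": identity_image, "motion": motion}

out = animate("subject_A.png", "subject_B.png")
assert len(out["motion"].tokens()) == 4
```

Because the tokens are separated by granularity, motion from different sources could in principle be mixed per token (e.g. body from one clip, hands from another).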
To achieve this, we introduce a self-supervised, end-to-end training framework that jointly learns the motion encoder and latent representation alongside a DiT-based video generative model, trained on large-scale video datasets spanning diverse human motions. Motion-identity disentanglement is enforced via spatial and color augmentations, as well as synthetic 3D renderings of cross-identity subject pairs. We further guide the learning of motion tokens using auxiliary spatial decoders to promote fine-grained, semantically aligned, and depth-aware motion embeddings.
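The spatial and color augmentations used to enforce motion-identity disentanglement can be sketched as below: the driving frame is perturbed so that latents which still reconstruct the unperturbed target cannot be encoding appearance. The specific transforms (per-channel jitter, horizontal shift) and their ranges are assumptions for illustration.

```python
import numpy as np

def augment_driving_frame(frame, rng):
    """Perturb a driving frame (H, W, 3 floats in [0, 1]) so the motion
    encoder cannot rely on color or exact spatial alignment."""
    # Color jitter: random per-channel gain and offset.
    gain = rng.uniform(0.8, 1.2, size=(1, 1, 3))
    offset = rng.uniform(-0.1, 0.1, size=(1, 1, 3))
    jittered = np.clip(frame * gain + offset, 0.0, 1.0)
    # Spatial augmentation: small random horizontal shift
    # (a circular roll as a simple stand-in for a crop/translate).
    shift = int(rng.integers(-4, 5))
    return np.roll(jittered, shift, axis=1)

rng = np.random.default_rng(0)
frame = rng.random((64, 64, 3))
aug = augment_driving_frame(frame, rng)
assert aug.shape == frame.shape
```

The synthetic 3D renderings of cross-identity pairs mentioned in the abstract serve the same goal more directly: the same motion rendered on two different bodies gives supervision where the motion latents must match while identity differs.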
Extensive experiments demonstrate that X-UniMotion outperforms state-of-the-art methods, producing animations with superior motion expressiveness and identity preservation.