Explore the Full Program of SIGGRAPH Asia 2025!

Presentation

X-Actor: Emotional and Expressive Long-Range Portrait Acting from Audio
Description

We present X-Actor, a novel audio-driven portrait animation framework that generates lifelike, emotionally expressive talking head videos from a single reference image and an input audio clip. Unlike prior approaches that primarily focus on lip synchronization and visual fidelity in constrained speaking scenarios, X-Actor enables actor-quality, long-range portrait acting that captures diverse, fine-grained, and dynamically evolving yet temporally consistent human emotions, flowing and transitioning in sync with the audio dynamics. Central to our approach is a two-stage decoupled generation pipeline: an audio-conditioned autoregressive model that predicts expressive yet identity-agnostic facial motion latent tokens within a long temporal context window, followed by a diffusion-based video synthesis module that translates these motions into high-fidelity video animations. By operating in a compact facial motion latent space decoupled from visual and identity cues, our autoregressive model effectively learns long-range correlations between audio and facial dynamics through a diffusion-forcing training paradigm, enabling infinite-length motion prediction without error accumulation. Extensive experiments demonstrate that X-Actor produces compelling, cinematic-style performances that go beyond standard talking head animations and achieves state-of-the-art results in long-range, audio-driven emotional portrait acting.
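The control flow of the two-stage decoupled pipeline described above can be illustrated with a toy sketch. This is not the authors' implementation: every name here (`predict_motion_tokens`, `render_video`, `toy_predict_chunk`, `toy_render_frame`, `chunk_size`, `context_size`) is hypothetical, the "models" are trivial stand-in functions rather than neural networks, and the diffusion-forcing training procedure is omitted entirely. The sketch only shows how stage 1 could predict identity-agnostic motion tokens chunk by chunk from audio within a bounded context window, and how stage 2 could combine those tokens with a single reference image.

```python
from typing import Callable, List, Tuple

MotionToken = float              # stand-in for a facial motion latent token
Frame = Tuple[str, MotionToken]  # stand-in for a rendered video frame


def predict_motion_tokens(
    audio_feats: List[float],
    predict_chunk: Callable[[List[MotionToken], List[float]], List[MotionToken]],
    chunk_size: int = 4,
    context_size: int = 8,
) -> List[MotionToken]:
    """Stage 1 (sketch): autoregressive, audio-conditioned motion prediction.

    Each step sees only the last `context_size` motion tokens plus the next
    chunk of audio features, so the loop runs for any audio length.
    """
    motion: List[MotionToken] = []
    for start in range(0, len(audio_feats), chunk_size):
        audio_chunk = audio_feats[start:start + chunk_size]
        context = motion[-context_size:]  # bounded temporal context window
        motion.extend(predict_chunk(context, audio_chunk))
    return motion


def render_video(
    reference_image: str,
    motion: List[MotionToken],
    render_frame: Callable[[str, MotionToken], Frame],
) -> List[Frame]:
    """Stage 2 (sketch): identity comes only from the reference image,
    while the motion tokens drive each output frame."""
    return [render_frame(reference_image, m) for m in motion]


def toy_predict_chunk(context: List[MotionToken], audio_chunk: List[float]) -> List[MotionToken]:
    """Hypothetical stand-in for the autoregressive model: smooths audio over time."""
    prev = context[-1] if context else 0.0
    out: List[MotionToken] = []
    for a in audio_chunk:
        prev = 0.5 * prev + 0.5 * a
        out.append(prev)
    return out


def toy_render_frame(reference_image: str, m: MotionToken) -> Frame:
    """Hypothetical stand-in for the diffusion-based video synthesis module."""
    return (reference_image, m)


audio = [1.0] * 10  # ten placeholder audio feature values
motion = predict_motion_tokens(audio, toy_predict_chunk)
frames = render_video("reference.png", motion, toy_render_frame)
print(len(motion), len(frames))
```

Because stage 1 never touches the reference image and stage 2 never touches the audio, either stand-in can be swapped without changing the other, which is the decoupling the description emphasizes.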