
Presentation

Input-Aware Sparse Attention for Real-Time Co-Speech Video Generation
Description

Recent works use diffusion models to synthesize realistic co-speech video from audio, with applications such as video creation and virtual agents. However, existing diffusion-based methods are too slow for real-time deployment because they require many denoising steps and costly attention computation. In this work, we aim to distill a many-step diffusion video model into a few-step student model. Unfortunately, directly applying recent model distillation methods degrades video quality and falls short of real-time performance. To address these issues, we introduce a new video distillation method that leverages input human pose conditioning in both the attention and loss functions. For attention, we propose sparse attention across frames that uses accurate correspondences from input human pose keypoints to guide attention toward salient dynamic regions such as the speaker's face, hands, and upper body. This input-aware sparse attention reduces redundant computation and strengthens temporal correspondences of body parts, improving inference efficiency and motion coherence. To further improve visual quality, we introduce an input-aware region loss that enhances lip synchronization and hand motion realism. Integrating our input-aware sparse attention and region loss, our method achieves real-time performance with better visual quality than recent audio-driven and input-driven methods. We also conduct an extensive ablation study demonstrating the effectiveness of our designs.
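To make the two ideas concrete, the sketch below shows one plausible reading of them; it is not the authors' implementation, and all function names, tensor shapes, and the `radius`/`weight` parameters are illustrative assumptions. Cross-frame attention is restricted to tokens whose patches lie near input pose keypoints (face, hands, upper body), while intra-frame attention stays dense, and a region-weighted reconstruction loss up-weights those same pose-derived regions.

import torch
import torch.nn.functional as F

def keypoint_region_mask(keypoints, grid_h, grid_w, radius=2):
    """Boolean mask of shape (frames, grid_h*grid_w): True where a token's
    patch lies within `radius` patches of any input pose keypoint.
    `keypoints`: (frames, num_keypoints, 2) in patch coordinates (x, y)."""
    frames = keypoints.shape[0]
    ys = torch.arange(grid_h).view(1, grid_h, 1, 1)
    xs = torch.arange(grid_w).view(1, 1, grid_w, 1)
    kx = keypoints[..., 0].view(frames, 1, 1, -1)
    ky = keypoints[..., 1].view(frames, 1, 1, -1)
    near = ((ys - ky).abs() <= radius) & ((xs - kx).abs() <= radius)
    return near.any(dim=-1).reshape(frames, grid_h * grid_w)

def sparse_cross_frame_attention(q, k, v, token_mask):
    """q, k, v: (batch, heads, frames*tokens, dim); token_mask: (frames, tokens).
    Cross-frame attention is allowed only between keypoint-region tokens;
    attention within the same frame is kept dense."""
    frames, tokens = token_mask.shape
    flat = token_mask.reshape(-1)                                # (N,)
    allowed = flat.unsqueeze(0) & flat.unsqueeze(1)              # both tokens salient
    frame_id = torch.arange(frames).repeat_interleave(tokens)
    same_frame = frame_id.unsqueeze(0) == frame_id.unsqueeze(1)  # dense intra-frame
    attn_mask = allowed | same_frame                             # (N, N) boolean
    return F.scaled_dot_product_attention(q, k, v, attn_mask=attn_mask)

def input_aware_region_loss(pred, target, region_mask, weight=5.0):
    """Reconstruction loss that up-weights pixels inside pose-derived regions
    (e.g. lips, hands) by `weight`; plain squared error elsewhere."""
    w = torch.ones_like(pred)
    w[region_mask.expand_as(pred)] = weight
    return (w * (pred - target) ** 2).mean()

In this reading, sparsity comes from skipping cross-frame attention for background tokens, and the keypoint mask is what makes the pattern input-aware rather than a fixed sliding window.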