BEGIN:VCALENDAR
VERSION:2.0
PRODID:Linklings LLC
BEGIN:VTIMEZONE
TZID:Asia/Hong_Kong
X-LIC-LOCATION:Asia/Hong_Kong
BEGIN:STANDARD
TZOFFSETFROM:+0800
TZOFFSETTO:+0800
TZNAME:HKT
DTSTART:19911015T033000
END:STANDARD
END:VTIMEZONE
BEGIN:VEVENT
DTSTAMP:20251218T030656Z
LOCATION:Meeting Room S221\, Level 2
DTSTART;TZID=Asia/Hong_Kong:20251216T104000
DTEND;TZID=Asia/Hong_Kong:20251216T105000
UID:siggraphasia_SIGGRAPH Asia 2025_sess119_papers_1264@linklings.com
SUMMARY:Audio Driven Real-Time Facial Animation for Social Telepresence
DESCRIPTION:Jiye Lee (Seoul National University)\; Chenghui Li\, Linh
  Tran\, Shih-En Wei\, Jason Saragih\, and Alexander Richard (Codec
  Avatars Lab\, Meta)\; Hanbyul Joo (Seoul National University)\; and
  Shaojie Bai (Codec Avatars Lab\, Meta)\n\nWe present an audio-driven
  real-time system for animating photorealistic 3D facial avatars with
  minimal latency\, designed for social interactions in virtual reality
  for anyone. Central to our approach is an encoder model that transforms
  audio signals into latent facial expression sequences in real time\,
  which are then decoded as photorealistic 3D facial avatars. Leveraging
  the generative capabilities of diffusion models\, we capture the rich
  spectrum of facial expressions necessary for natural communication
  while achieving real-time performance (<15ms GPU time). Our novel
  architecture minimizes latency through two key innovations: an online
  transformer that eliminates dependency on future inputs and a
  distillation pipeline that accelerates iterative denoising into a
  single step. We further address critical design challenges in live
  scenarios for processing continuous audio signals frame-by-frame while
  maintaining consistent animation quality. The versatility of our
  framework extends to multimodal applications\, including semantic
  modalities such as emotion conditions and multimodal sensors with
  head-mounted eye cameras on VR headsets. Experimental results
  demonstrate significant improvements in facial animation accuracy over
  existing offline state-of-the-art baselines\, achieving 100 to 1000
  times faster inference speed. We validate our approach through live VR
  demonstrations and across various scenarios such as multilingual
  speeches.\n\nRegistration Category: Full Access\, Full Access
  Supporter\n\nSession Chair: Feng Xu (Tsinghua University)
END:VEVENT
END:VCALENDAR
