BEGIN:VCALENDAR
VERSION:2.0
PRODID:Linklings LLC
BEGIN:VTIMEZONE
TZID:Asia/Hong_Kong
X-LIC-LOCATION:Asia/Hong_Kong
BEGIN:STANDARD
TZOFFSETFROM:+0800
TZOFFSETTO:+0800
TZNAME:HKT
DTSTART:19911015T033000
END:STANDARD
END:VTIMEZONE
BEGIN:VEVENT
DTSTAMP:20251218T030657Z
LOCATION:Meeting Room S426+S427\, Level 4
DTSTART;TZID=Asia/Hong_Kong:20251217T110100
DTEND;TZID=Asia/Hong_Kong:20251217T111200
UID:siggraphasia_SIGGRAPH Asia 2025_sess136_papers_2382@linklings.com
SUMMARY:Echo: Enhancing Conversational Behavior Generation via Hierarchica
 l Semantic Comprehension with Large Language Models
DESCRIPTION:Haiwei Xue (Tsinghua University), Yanbo Fan (Nanjing Universit
 y), Xuan Wang (Ant Group), and Zhiyong Wu (Tsinghua University)\n\nConvers
 ational behavior generation, being a crucial capability of embodied agents
 , is a significant factor influencing human-computer interaction. Generati
 ng high-quality conversational motions requires not only appropriate audio
 -motion mapping but also interactive responses to interlocutor behaviors a
 nd comprehensive understanding of conversational semantics. Existing metho
 ds primarily rely on audio signals and interlocutor motions for main agent
  motion generation, lacking high-level semantic understanding of the conve
 rsational content, leading to moderate quality motions that are not approp
 riate for the dialogue. To address these limitations, we leverage the powe
 rful semantic understanding capabilities of large language models to comp
 rehend complex conversational contexts. Inspired by the observation from h
 uman conversation that conversational motions are highly related to both g
 lobal and l
 ocal semantic factors, including the conversational context, and the inten
 tions, emotions, and passive or active states of the participants, we prop
 ose an agentic system named Echo that analyzes such information. To achiev
 e comprehensive conversational understanding, Echo leverages multiple prom
 pts and test-time recipes to guide large language models in decomposing co
 nversational structures and extracting fine-grained semantic information. 
 Furthermore, we design a hierarchical feature fusion network that systemat
 ically integrates from frame-level audio-motion features to sentence-level
  semantic understanding and finally to conversation-level contextual compr
 ehension, organically combining fine-grained semantic features from large 
 language models with audio and motion characteristics. Experimental result
 s demonstrate that our framework can be effectively integrated with severa
 l state-of-the-art motion generation models to enhance their performance i
 n generating high-quality conversational behaviors. Code and data for this
  paper are at https://github.com/Echo-Motion/Echo.\n\nRegistration
  Category: Full Access, Full Access Supporter\n\nSession Chair: Yinghao Xu
  (Stanford University)\n\n
END:VEVENT
END:VCALENDAR
