Explore the Full Program of SIGGRAPH Asia 2025!

Presentation

Echo: Enhancing Conversational Behavior Generation via Hierarchical Semantic Comprehension with Large Language Models
Description

Conversational behavior generation is a crucial capability of embodied agents and a significant factor in human-computer interaction. Generating high-quality conversational motions requires not only an appropriate audio-motion mapping but also interactive responses to the interlocutor's behavior and a comprehensive understanding of conversational semantics. Existing methods rely primarily on audio signals and interlocutor motions to drive the main agent's motion; lacking a high-level semantic understanding of the conversational content, they produce motions of moderate quality that are often inappropriate for the dialogue. To address these limitations, we leverage the powerful semantic understanding capabilities of large language models to comprehend complex conversational contexts. Inspired by the observation that human conversational motions are closely tied to both global and local semantic factors (including the conversational context and the intentions, emotions, and passive or active states of the participants), we propose an agentic system named Echo that analyzes this information. To achieve comprehensive conversational understanding, Echo uses multiple prompts and test-time recipes to guide large language models in decomposing conversational structure and extracting fine-grained semantic information. Furthermore, we design a hierarchical feature fusion network that progressively integrates frame-level audio-motion features, sentence-level semantic understanding, and conversation-level contextual comprehension, organically combining the fine-grained semantic features from large language models with audio and motion characteristics. Experimental results demonstrate that our framework can be effectively integrated with several state-of-the-art motion generation models to enhance the quality of their generated conversational behaviors. Code and data for this paper are available at https://github.com/Echo-Motion/Echo.
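The hierarchical fusion described above (frame-level audio-motion features, enriched with sentence-level semantics and conversation-level context) can be illustrated with a minimal sketch. This is not the paper's implementation; the function name, additive fusion, and mean pooling are all assumptions chosen only to make the three-level structure concrete.

```python
import numpy as np

def hierarchical_fuse(frame_feats, sent_sem, conv_ctx):
    """Hypothetical three-level fusion sketch (not the paper's network).

    frame_feats: (n_sentences, n_frames, d) frame-level audio-motion features
    sent_sem:    (n_sentences, d) sentence-level LLM semantic features
    conv_ctx:    (d,) conversation-level context embedding
    Returns fused frame-level features of the same shape as frame_feats.
    """
    # Sentence level: pool frames, then add the LLM semantic feature per sentence
    sent_level = frame_feats.mean(axis=1) + sent_sem
    # Conversation level: pool sentences, then add the global context embedding
    conv_level = sent_level.mean(axis=0) + conv_ctx
    # Broadcast the higher-level summaries back onto every frame
    return frame_feats + sent_level[:, None, :] + conv_level[None, None, :]

# Tiny shape check with dummy data
frames = np.ones((2, 4, 8))          # 2 sentences, 4 frames each, dim 8
semantics = np.zeros((2, 8))
context = np.zeros(8)
fused = hierarchical_fuse(frames, semantics, context)
print(fused.shape)  # (2, 4, 8)
```

In a real model each addition would be a learned fusion layer (e.g. attention or an MLP), but the key design point survives the simplification: information flows from frames up to the conversation level and is then redistributed to every frame.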