Explore the Full Program of SIGGRAPH Asia 2025!

Presentation

THADT: Temporal Hybrid Attention Diffusion Transformer for Human Pose Prediction
Description
Human pose prediction is a key technology for virtual environmental choreography in dance education. However, traditional deterministic prediction methods are unable to capture the diverse distribution of human poses, which limits their practical applicability. Thus, we propose a Temporal Hybrid Attention Diffusion Transformer (THADT) model for 3D-to-3D prediction, which consists of forward diffusion and reverse generation processes. During forward diffusion, the discrete cosine transform converts human poses into frequency-domain features, noise is gradually added to these features, and a denoising network is trained to learn the noise distribution. In reverse generation, the model progressively removes noise while using historical pose data as conditional input, ultimately reconstructing future pose sequences through the inverse transform. Experimental results on the Human3.6M and HumanEva-I datasets demonstrate that THADT outperforms existing state-of-the-art methods across key metrics such as ADE, FDE, and MMADE.
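
The forward diffusion step described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes an orthonormal DCT-II applied along the time axis, a standard DDPM-style closed-form noising step, and a linear noise schedule, none of which the abstract specifies. All function names (`dct_matrix`, `alpha_bar`, `forward_diffuse`) and the schedule parameters are hypothetical and chosen only for illustration.

```python
import math
import random


def dct_matrix(n):
    """Orthonormal DCT-II basis as an n x n list of rows (assumed transform)."""
    rows = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        rows.append([scale * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                     for i in range(n)])
    return rows


def matmul(a, b):
    """Plain matrix product of two nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(m):
    return [list(col) for col in zip(*m)]


def alpha_bar(t, num_steps, beta_start=1e-4, beta_end=0.02):
    """Product of (1 - beta_s) over the first t steps of a linear schedule.

    The linear schedule and its endpoints are assumptions, not from the source.
    """
    prod = 1.0
    for s in range(t):
        beta = beta_start + (beta_end - beta_start) * s / (num_steps - 1)
        prod *= 1.0 - beta
    return prod


def forward_diffuse(coeffs, t, num_steps, rng):
    """Noise DCT coefficients in one shot:
    x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps.
    Returns the noisy coefficients and the noise a denoiser would be trained to predict.
    """
    a = alpha_bar(t, num_steps)
    noise = [[rng.gauss(0.0, 1.0) for _ in row] for row in coeffs]
    noisy = [[math.sqrt(a) * c + math.sqrt(1.0 - a) * e
              for c, e in zip(c_row, e_row)]
             for c_row, e_row in zip(coeffs, noise)]
    return noisy, noise


# Toy pose sequence: 4 frames, 2 coordinates per frame (real data has many joints).
poses = [[0.0, 1.0], [0.5, 1.5], [1.0, 2.0], [1.5, 2.5]]
basis = dct_matrix(len(poses))

# Frequency-domain features: DCT along the time axis.
coeffs = matmul(basis, poses)

# Noised coefficients at an intermediate diffusion step.
noisy_coeffs, eps = forward_diffuse(coeffs, t=500, num_steps=1000,
                                    rng=random.Random(0))

# The inverse transform is the transpose, because the basis is orthonormal.
reconstructed = matmul(transpose(basis), coeffs)
```

In this sketch, reverse generation would start from noise in the coefficient space, repeatedly apply a trained denoiser conditioned on the observed history, and finish with the same inverse transform used for `reconstructed`. The denoising network itself is omitted because the abstract gives no details of its architecture beyond the model name.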