BEGIN:VCALENDAR
VERSION:2.0
PRODID:Linklings LLC
BEGIN:VTIMEZONE
TZID:Asia/Hong_Kong
X-LIC-LOCATION:Asia/Hong_Kong
BEGIN:STANDARD
TZOFFSETFROM:+0800
TZOFFSETTO:+0800
TZNAME:HKT
DTSTART:19911015T033000
END:STANDARD
END:VTIMEZONE
BEGIN:VEVENT
DTSTAMP:20251218T030657Z
LOCATION:Meeting Room S426+S427\, Level 4
DTSTART;TZID=Asia/Hong_Kong:20251217T112300
DTEND;TZID=Asia/Hong_Kong:20251217T113400
UID:siggraphasia_SIGGRAPH Asia 2025_sess136_papers_2031@linklings.com
SUMMARY:Unifying Latent Action and Latent State Pre-training for Policy Le
 arning from Videos
DESCRIPTION:Guangyan Chen, Meiling Wang, Te Cui, Luojie Yang, and Qi Shao 
 (Beijing Institute of Technology); Lin Zhao, Tianle Zhang, and Yihang Li (
 JD); and Yi Yang and Yufeng Yue (Beijing Institute of Technology)\n\nVideo
  data provides an accessible and rich source beyond expensive action-label
 ed robot data for advancing robotic learning paradigms. Motivated by this 
 potential, researchers investigate methods to exploit video data in roboti
 c learning. Recent approaches can be primarily divided into two categories
 : Action-based approaches tokenize latent actions from videos for policy p
 re-training. State-based approaches pre-train models to predict subsequent
  states. The former establishes rich motion priors, while the latter empow
 ers the robot to anticipate future events. These complementary capabilitie
 s suggest significant potential for integration into a unified framework. 
 In this paper, we propose UniMimic, a novel approach unifying latent actio
 n and latent state pre-training from videos. We first train a unified toke
 nizer to learn latent states from video frames while deriving latent actio
 ns between state tokens. Subsequently, the policy is pre-trained on videos
  to predict these latent actions and subsequent latent states. Finally, th
 e policy is fine-tuned on an action-labeled robot dataset to transfer the 
 learned priors to precise robot execution. Experiments show that our pr
 e-training stage improves performance by 19% on the Libero benchmark an
 d raises the average number of consecutively completed tasks (out of 5)
  from 2.50 and 2.35 to 3.89 and 3.73 on the CALVIN benchmark. In the
  real-world exper
 iments, our method still delivers improvements exceeding 36%.\n\nRegistrat
 ion Category: Full Access, Full Access Supporter\n\nSession Chair: Yinghao
  Xu (Stanford University)\n\n
END:VEVENT
END:VCALENDAR
