Explore the Full Program of SIGGRAPH Asia 2025!

Presentation

Unifying Latent Action and Latent State Pre-training for Policy Learning from Videos
Description
Video data provides an accessible and rich source, beyond expensive action-labeled robot data, for advancing robotic learning. Motivated by this potential, researchers have investigated methods to exploit video data in robotic learning. Recent approaches fall into two main categories. Action-based approaches tokenize latent actions from videos for policy pre-training, whereas state-based approaches pre-train models to predict subsequent states. The former establishes rich motion priors, while the latter enables the robot to anticipate future events. These complementary capabilities suggest significant potential for integration into a unified framework. In this paper, we propose UniMimic, a novel approach that unifies latent action and latent state pre-training from videos. We first train a unified tokenizer to learn latent states from video frames while deriving latent actions between state tokens. Subsequently, the policy is pre-trained on videos to predict these latent actions and subsequent latent states. Finally, the policy is fine-tuned on an action-labeled robot dataset to transfer the learned priors to precise robot execution. Experiments show that our pre-training stage improves performance by 19% on the Libero benchmark and raises the average number of tasks completed in a row (out of 5) from 2.50 and 2.35 to 3.89 and 3.73 on the CALVIN benchmark. In real-world experiments, our method still delivers improvements exceeding 36%.
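The tokenization and pre-training steps described above can be illustrated with a toy sketch. This is not the UniMimic implementation: the abstract does not specify the tokenizer architecture, so the sketch assumes fixed nearest-neighbor codebooks in place of learned networks, and every name in it (`nearest_code`, `tokenize_video`, `pretraining_targets`, both codebooks, the 2-D "frames") is hypothetical. It only shows the data flow: frames become latent state tokens, latent actions are derived between consecutive state tokens, and the policy's pre-training targets pair each state with the latent action and the next latent state.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def nearest_code(vec: Vector, codebook: Sequence[Vector]) -> int:
    """Index of the codebook entry closest to vec (squared Euclidean distance)."""
    def sqdist(a: Vector, b: Vector) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda i: sqdist(vec, codebook[i]))


def tokenize_video(
    frames: Sequence[Vector],
    state_codebook: Sequence[Vector],
    action_codebook: Sequence[Vector],
) -> Tuple[List[int], List[int]]:
    """Assumed stand-in for a unified tokenizer.

    Latent states: each frame is quantized to a state token.
    Latent actions: the change between consecutive quantized states is
    quantized to an action token.
    """
    states = [nearest_code(f, state_codebook) for f in frames]
    actions = []
    for s_t, s_next in zip(states, states[1:]):
        delta = [b - a for a, b in zip(state_codebook[s_t], state_codebook[s_next])]
        actions.append(nearest_code(delta, action_codebook))
    return states, actions


def pretraining_targets(
    states: Sequence[int], actions: Sequence[int]
) -> List[Tuple[int, int, int]]:
    """(current state, latent action, next state) triples a policy would be
    pre-trained to predict from the current state."""
    return [(states[t], actions[t], states[t + 1]) for t in range(len(actions))]


# Hypothetical toy data: 2-D "frames" and hand-picked codebooks.
state_codebook = [[0.0, 0.0], [1.0, 0.0], [1.0, 1.0]]
action_codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]  # stay, move right, move up
frames = [[0.1, -0.1], [0.9, 0.2], [1.1, 0.8], [1.0, 1.1]]

states, actions = tokenize_video(frames, state_codebook, action_codebook)
targets = pretraining_targets(states, actions)
print(states)   # [0, 1, 2, 2]
print(actions)  # [1, 2, 0]
print(targets)  # [(0, 1, 1), (1, 2, 2), (2, 0, 2)]
```

No robot action labels appear anywhere in this sketch, which is the reason such targets can be built from videos alone; under the pipeline in the abstract, labeled robot data would enter only in the later fine-tuning stage.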