MAPLE: Encoding Dexterous Robotic Manipulation Priors Learned From Egocentric Videos

1 ETH Zürich, 2 Mimic Robotics, 3 Microsoft Research

MAPLE extracts visual features that generalize well to downstream dexterous robotic manipulation tasks.

Abstract

Large-scale egocentric video datasets capture diverse human activities across a wide range of scenarios, offering rich and detailed insights into how humans interact with objects, especially in interactions that require fine-grained dexterous control. Such complex dexterous skills requiring precise control are crucial for many robotic manipulation tasks, yet are often insufficiently addressed by traditional data-driven approaches to robotic manipulation.

To address this gap, we leverage manipulation priors learned from large-scale egocentric videos to improve policy learning for dexterous robotic manipulation tasks. We present MAPLE, a novel method for dexterous robotic manipulation that exploits rich manipulation priors to enable efficient policy learning and better performance on diverse, complex manipulation tasks. Specifically, we predict hand-object contact points and detailed hand poses at the moment of contact and use the learned features to train policies for downstream manipulation tasks.

Experimental results demonstrate the effectiveness of MAPLE on existing simulation benchmarks, as well as on a newly designed set of challenging simulation tasks that require fine-grained object control and complex dexterous skills. The benefits of MAPLE are further highlighted in real-world experiments using a dexterous robotic hand; such joint evaluation across both simulation and the real world has often been underexplored in prior work.

Method

We train a visual encoder on large-scale egocentric videos capturing diverse hand-object interactions and evaluate its effectiveness on downstream dexterous robotic manipulation tasks.

Given a single input frame, the encoder is trained to reason about hand-object interactions, specifically predicting contact points and grasping hand poses. This training infuses a manipulation prior into the learned feature representation, making it well-suited for downstream robotic manipulation.
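To make this pretraining objective concrete, the sketch below pairs a single-frame encoder with two auxiliary prediction heads, one for contact points and one for the grasping hand pose at contact. It is a minimal illustration, not the exact MAPLE architecture: the backbone, the 2D contact-point parameterization, the MANO-style pose dimensionality, and the equal loss weighting are all assumptions.

import torch
import torch.nn as nn

class ManipulationPriorEncoder(nn.Module):
    def __init__(self, feat_dim=768, num_hand_params=48):
        super().__init__()
        # Placeholder image backbone mapping a single frame to a global
        # feature vector; the actual backbone choice is an assumption.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=7, stride=4), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim),
        )
        # Head 1: where the hand will contact the object, here a coarse
        # 2D image point (a heatmap would be another plausible choice).
        self.contact_head = nn.Linear(feat_dim, 2)
        # Head 2: the grasping hand pose at the moment of contact, here
        # MANO-like joint-angle parameters (an assumption).
        self.pose_head = nn.Linear(feat_dim, num_hand_params)

    def forward(self, frame):
        feat = self.backbone(frame)
        return feat, self.contact_head(feat), self.pose_head(feat)

def pretraining_loss(model, frame, contact_gt, pose_gt):
    # Supervising both heads infuses the manipulation prior into `feat`,
    # which is what the downstream policy later consumes.
    _, contact_pred, pose_pred = model(frame)
    return nn.functional.mse_loss(contact_pred, contact_gt) + \
           nn.functional.mse_loss(pose_pred, pose_gt)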

Features extracted from the frozen visual encoder, combined with robotic hand positions, are fed into policy networks to predict dexterous hand action sequences. We find that MAPLE enables efficient policy learning across a range of simulated and real-world dexterous manipulation tasks.
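The following sketch shows what this policy interface could look like, assuming the frozen encoder yields one feature vector per frame and the policy is a small MLP regressing a fixed-horizon action sequence; all dimensions and the network design are hypothetical, not the paper's exact configuration.

import torch
import torch.nn as nn

class DexterousPolicy(nn.Module):
    def __init__(self, feat_dim=768, proprio_dim=24, action_dim=24, horizon=8):
        super().__init__()
        self.horizon, self.action_dim = horizon, action_dim
        self.net = nn.Sequential(
            nn.Linear(feat_dim + proprio_dim, 512), nn.ReLU(),
            nn.Linear(512, 512), nn.ReLU(),
            nn.Linear(512, horizon * action_dim),
        )

    def forward(self, visual_feat, hand_pos):
        # `visual_feat` comes from the frozen pretrained encoder (hence
        # the detach); only this policy head is trained on the task.
        x = torch.cat([visual_feat.detach(), hand_pos], dim=-1)
        return self.net(x).view(-1, self.horizon, self.action_dim)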

New Dexterous Simulation Environments

We introduce new dexterous simulation environments that facilitate benchmarking and foster further research by alleviating the shortage of environments focused on dexterous manipulation.

Acknowledgements

The authors wish to express their gratitude to Mimic Robotics for providing the robotic hand used in the real-world experiments.

This work was supported as part of the Swiss AI Initiative by a grant from the Swiss National Supercomputing Centre (CSCS) under project ID a03 on Alps, as well as the Swiss National Science Foundation Advanced Grant 216260: “Beyond Frozen Worlds: Capturing Functional 3D Digital Twins from the Real World”.

BibTeX

If you found our work useful, please consider citing it:
@article{gavryushin2025maple,
  title={{MAPLE: Encoding Dexterous Robotic Manipulation Priors Learned From Egocentric Videos}}, 
  author={Gavryushin, Alexey and Wang, Xi and Malate, Robert J. S. and Yang, Chenyu and Jia, Xiangyi and Goel, Shubh and Liconti, Davide and Zurbr{\"u}gg, Ren{\'e} and Katzschmann, Robert K. and Pollefeys, Marc},
  eprint={2504.06084},
  archivePrefix={arXiv},
  primaryClass={cs.RO},
  journal={arXiv preprint arXiv:2504.06084},
  year={2025},
  url={https://arxiv.org/abs/2504.06084}
}