Method
Large-scale egocentric video datasets capture diverse human activities across a wide range of scenarios, offering rich and detailed insights into how humans interact with objects, especially those that require fine-grained dexterous control. Such complex, dexterous skills with precise control are crucial for many robotic manipulation tasks, yet are often insufficiently addressed by traditional data-driven approaches.
To address this gap, we leverage manipulation priors learned from large-scale egocentric video datasets to improve policy learning for dexterous robotic manipulation. We present MAPLE, a novel method for dexterous robotic manipulation that learns, from egocentric images, features for predicting object contact points and detailed hand poses at the moment of contact. We then use the learned features to train policies for downstream manipulation tasks.
Experimental results demonstrate the effectiveness of MAPLE across 4 existing simulation benchmarks, as well as a newly designed set of 4 challenging simulation tasks requiring fine-grained object control and complex dexterous skills. The benefits of MAPLE are further highlighted in real-world experiments using a 17-DoF dexterous robotic hand; such joint evaluation across both simulation and real-world settings has remained underexplored in prior work. We additionally showcase the efficacy of our model on an egocentric contact point prediction task, validating its usefulness beyond dexterous manipulation policy learning.
We train a visual encoder on large-scale egocentric videos capturing diverse hand-object interactions and evaluate its effectiveness on downstream dexterous robotic manipulation tasks.
Given a single input frame, the encoder is trained to reason about hand-object interactions, specifically predicting contact points and grasping hand poses. This training infuses a manipulation prior into the learned feature representation, making it well-suited for downstream robotic manipulation.
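To make this concrete, below is a minimal PyTorch sketch of such a two-head encoder, trained to regress contact points and a grasping hand pose from a single frame. The backbone, head dimensionalities, label formats, and loss weighting are illustrative assumptions, not the exact MAPLE architecture.

```python
import torch
import torch.nn as nn

class ManipulationPriorEncoder(nn.Module):
    """Visual encoder with auxiliary hand-object interaction heads.

    NOTE: the backbone, head sizes, and losses here are illustrative
    assumptions, not the exact MAPLE architecture.
    """

    def __init__(self, feat_dim=768, num_contact_points=5, hand_pose_dim=48):
        super().__init__()
        self.num_contact_points = num_contact_points
        # Placeholder backbone; any image encoder producing a global
        # feature vector (e.g. a ViT) could be substituted here.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=7, stride=4), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim),
        )
        # Head 1: 2D contact points on the image plane.
        self.contact_head = nn.Linear(feat_dim, num_contact_points * 2)
        # Head 2: hand pose at the moment of contact
        # (48-D, e.g. MANO-style joint angles -- an assumption).
        self.pose_head = nn.Linear(feat_dim, hand_pose_dim)

    def forward(self, frame):
        feat = self.backbone(frame)
        contacts = self.contact_head(feat).view(-1, self.num_contact_points, 2)
        pose = self.pose_head(feat)
        return feat, contacts, pose

# One illustrative training step on dummy data.
encoder = ManipulationPriorEncoder()
opt = torch.optim.AdamW(encoder.parameters(), lr=1e-4)

frame = torch.rand(8, 3, 224, 224)   # batch of egocentric frames
gt_contacts = torch.rand(8, 5, 2)    # contact point labels (normalized coords)
gt_pose = torch.randn(8, 48)         # grasping hand pose labels

opt.zero_grad()
_, pred_contacts, pred_pose = encoder(frame)
loss = nn.functional.mse_loss(pred_contacts, gt_contacts) \
     + nn.functional.mse_loss(pred_pose, gt_pose)
loss.backward()
opt.step()
```

Supervising both heads from the same shared feature is what infuses the manipulation prior into the representation itself, which is the part reused downstream.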
Features extracted from the frozen visual encoder, combined with robotic hand positions, are fed into policy networks to predict dexterous hand action sequences. We find that MAPLE enables efficient policy learning across a range of simulated and real-world dexterous manipulation tasks.
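Continuing the sketch above, the snippet below illustrates how frozen encoder features can be concatenated with hand joint positions and mapped to a short action sequence. The 17-DoF action space matches the real-world hand used in our experiments, while the MLP design and action horizon are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class DexterousPolicy(nn.Module):
    """Policy head on top of frozen visual features plus proprioception.

    NOTE: the MLP layout and action horizon are illustrative assumptions;
    only the 17-DoF action space is taken from the paper's setup.
    """

    def __init__(self, feat_dim=768, hand_dof=17, horizon=8):
        super().__init__()
        self.horizon = horizon
        self.hand_dof = hand_dof
        self.net = nn.Sequential(
            nn.Linear(feat_dim + hand_dof, 512), nn.ReLU(),
            nn.Linear(512, 512), nn.ReLU(),
            nn.Linear(512, horizon * hand_dof),
        )

    def forward(self, visual_feat, hand_pos):
        # Concatenate image features with current hand joint positions,
        # then predict a short sequence of future joint targets.
        x = torch.cat([visual_feat, hand_pos], dim=-1)
        return self.net(x).view(-1, self.horizon, self.hand_dof)

# The encoder from the previous sketch stays frozen during policy learning;
# `encoder` and `frame` are reused from that snippet.
encoder.eval()
for p in encoder.parameters():
    p.requires_grad_(False)

policy = DexterousPolicy()
with torch.no_grad():
    visual_feat, _, _ = encoder(frame)   # (8, 768) frozen features
hand_pos = torch.rand(8, 17)             # current hand joint positions
actions = policy(visual_feat, hand_pos)  # (8, 8, 17) action sequence
```

Keeping the encoder frozen means only the lightweight policy head is optimized per task, which is what makes policy learning with the pretrained prior efficient.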
We introduce new dexterous simulation environments that facilitate benchmarking and foster further research, alleviating the current shortage of environments targeting dexterous manipulation.
This work was supported as part of the Swiss AI Initiative by a grant from the Swiss National Supercomputing Centre (CSCS) under project ID a03 on Alps, as well as the Swiss National Science Foundation Advanced Grant 216260: “Beyond Frozen Worlds: Capturing Functional 3D Digital Twins from the Real World”.
@article{gavryushin2025maple,
  title={{MAPLE: Encoding Dexterous Robotic Manipulation Priors Learned From Egocentric Videos}},
  author={Gavryushin, Alexey and Wang, Xi and Malate, Robert J. S. and Yang, Chenyu and Liconti, Davide and Zurbr{\"u}gg, Ren{\'e} and Katzschmann, Robert K. and Pollefeys, Marc},
  eprint={2504.06084},
  archivePrefix={arXiv},
  primaryClass={cs.RO},
  journal={arXiv preprint arXiv:2504.06084},
  year={2025},
  url={https://arxiv.org/abs/2504.06084}
}