MAPLE: Encoding Dexterous Robotic Manipulation Priors Learned From Egocentric Videos

1 ETH Zürich, 2 Mimic Robotics, 3 Microsoft Research

We present MAPLE, a framework that learns dexterous manipulation priors from egocentric videos and produces features well-suited for downstream dexterous robotic manipulation tasks. Experiments in both simulation and real-world settings demonstrate that MAPLE enables efficient policy learning and improves generalization across various tasks.

  • Validated in reality and simulation: To ensure reproducibility, we report results in both simulated and real-world settings, achieving state-of-the-art success rates against numerous baselines.
  • Generalizes across tasks: MAPLE shows strong results beyond dexterous manipulation, in tasks such as contact point prediction.
  • Strong zero-shot performance: MAPLE outperforms the baselines for experiments involving unseen objects as well as distractor objects.
  • New evaluation benchmarks: To address the shortage of dexterous evaluation benchmarks in the literature, we introduce 4 new challenging dexterous simulation environments involving the manipulation of everyday objects.

Abstract

Large-scale egocentric video datasets capture diverse human activities across a wide range of scenarios, offering rich and detailed insights into how humans interact with objects, especially in interactions that require fine-grained dexterous control. Such complex, dexterous skills with precise control are crucial for many robotic manipulation tasks, yet are often insufficiently addressed by traditional data-driven approaches to robotic manipulation.

To address this gap, we leverage manipulation priors learned from large-scale egocentric video datasets to improve policy learning for dexterous robotic manipulation tasks. We present MAPLE, a novel method for dexterous robotic manipulation that learns features to predict object contact points and detailed hand poses at the moment of contact from egocentric images. We then use the learned features to train policies for downstream manipulation tasks.

Experimental results demonstrate the effectiveness of MAPLE across 4 existing simulation benchmarks, as well as on a newly designed set of 4 challenging simulation tasks requiring fine-grained object control and complex dexterous skills. The benefits of MAPLE are further highlighted in real-world experiments using a 17-DoF dexterous robotic hand; such joint evaluation across both simulation and the real world has remained underexplored in prior work. We additionally showcase the efficacy of our model on an egocentric contact point prediction task, validating its usefulness beyond dexterous manipulation policy learning.

Method

We train a visual encoder on large-scale egocentric videos capturing diverse hand-object interactions and evaluate its effectiveness on downstream dexterous robotic manipulation tasks.

Given a single input frame, the encoder is trained to reason about hand-object interactions, specifically predicting contact points and grasping hand poses. This training infuses a manipulation prior into the learned feature representation, making it well-suited for downstream robotic manipulation.
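To make the pretraining objective concrete, here is a minimal sketch (not the authors' implementation) of how such a manipulation prior could be supervised: a cross-entropy term against a contact-point heatmap plus a regression term on the hand pose at contact. The heatmap/MSE formulation, the shapes, and the 0.1 task weight are all assumptions for illustration.

```python
import numpy as np

def contact_heatmap_loss(pred_logits, target_heatmap):
    """Cross-entropy between predicted pixel logits and a target
    contact-point distribution (normalized to sum to 1)."""
    logits = pred_logits.reshape(-1)
    target = target_heatmap.reshape(-1) / target_heatmap.sum()
    m = logits.max()  # numerically stable log-softmax
    log_probs = logits - (m + np.log(np.exp(logits - m).sum()))
    return -(target * log_probs).sum()

def hand_pose_loss(pred_pose, target_pose):
    """MSE on grasping hand-pose parameters (e.g., joint angles) at contact."""
    return np.mean((pred_pose - target_pose) ** 2)

def pretraining_loss(pred_logits, target_heatmap, pred_pose, target_pose, w_pose=0.1):
    """Weighted multi-task objective: contact points + hand pose."""
    return (contact_heatmap_loss(pred_logits, target_heatmap)
            + w_pose * hand_pose_loss(pred_pose, target_pose))

# Uniform logits over a 2x2 heatmap with a one-hot target give CE = log(4).
one_hot = np.array([[1.0, 0.0], [0.0, 0.0]])
print(contact_heatmap_loss(np.zeros((2, 2)), one_hot))  # ≈ 1.3863 (= log 4)
```

Supervising both outputs from a single frame is what bakes the hand-object interaction prior into the encoder's features.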

Features extracted from the frozen visual encoder, combined with robotic hand positions, are fed into policy networks to predict dexterous hand action sequences. We find that MAPLE enables efficient policy learning across a range of simulated and real-world dexterous manipulation tasks.
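The policy-learning stage described above can be sketched as follows (again an illustrative stand-in, not the released code): the pretrained encoder is frozen, its image features are concatenated with the robot hand's proprioceptive state, and a small trainable head maps the result to hand actions. The feature dimension, hidden size, and random weights are placeholders; only the 17-DoF hand matches the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
FEAT_DIM, STATE_DIM, ACT_DIM, HIDDEN = 384, 17, 17, 128

# Frozen encoder stand-in: maps a (3, 32, 32) image to a feature vector.
W_enc = rng.standard_normal((FEAT_DIM, 3 * 32 * 32)) * 0.01
def frozen_encoder(image):
    return np.tanh(W_enc @ image.reshape(-1))

# Two-layer MLP policy head (the only part that would be trained).
W1 = rng.standard_normal((HIDDEN, FEAT_DIM + STATE_DIM)) * 0.01
W2 = rng.standard_normal((ACT_DIM, HIDDEN)) * 0.01
def policy(image, hand_state):
    feat = frozen_encoder(image)        # no gradients flow into the encoder
    x = np.concatenate([feat, hand_state])
    return W2 @ np.tanh(W1 @ x)         # predicted joint targets

image = rng.standard_normal((3, 32, 32))
hand_state = rng.standard_normal(STATE_DIM)
action = policy(image, hand_state)
print(action.shape)  # (17,)
```

Keeping the encoder frozen means the policy head is small and cheap to train, which is consistent with the efficient policy learning reported above.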

Imitation Learning Experiments

Success rates on real-world experiments: our encoder consistently outperforms alternative approaches, while competing encoder-based methods exhibit failure modes that make them less suitable for our tasks.
Failure case analysis for each method, aggregated across all real-world experiments: MAPLE is the only method that never caused a safety abort, and it also showed the fewest localization errors.

Zero-Shot Experiments

We modify the “Wash the Dish” task to use a sponge of a different color and shape than seen during training (“Unseen Object”) and report success rates.
We further conduct evaluations on the “Place the Pan” task when an additional pan is present in the image as a “Distractor Object”.
The policies using our features outperform those using competing encoders, demonstrating the promising zero-shot capabilities and robustness of MAPLE.

Contact Point Prediction

Our encoder allows for high-quality contact point predictions on egocentric videos. MAPLE achieves the best performance on both the SIM (similarity) and NSS (normalized scanpath saliency) metrics compared to the baselines.
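For reference, SIM and NSS are the two standard saliency-style metrics; a minimal sketch of how they are typically computed on heatmaps is below (the paper's exact evaluation protocol may differ in details such as map resolution and blurring).

```python
import numpy as np

def sim(pred, gt):
    """Similarity (histogram intersection): normalize both maps to sum
    to 1, then sum the pixel-wise minimum. 1.0 means identical maps."""
    p = pred / pred.sum()
    q = gt / gt.sum()
    return np.minimum(p, q).sum()

def nss(pred, fixations):
    """Normalized Scanpath Saliency: z-score the predicted map, then
    average its values at the ground-truth (contact) locations."""
    z = (pred - pred.mean()) / pred.std()
    return z[fixations.astype(bool)].mean()

pred = np.zeros((4, 4)); pred[1, 1] = 1.0; pred[2, 2] = 0.5
gt = np.zeros((4, 4));   gt[1, 1] = 1.0
fix = np.zeros((4, 4));  fix[1, 1] = 1
print(round(sim(pred, gt), 4))  # 0.6667
print(nss(pred, fix) > 0)       # True (prediction peaks at the contact point)
```

Higher is better for both: SIM rewards matching the overall shape of the ground-truth heatmap, while NSS rewards high predicted saliency exactly at the annotated contact locations.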

New Dexterous Simulation Environments

We introduce new dexterous simulation environments, facilitating benchmarking and fostering further research by alleviating the shortage of environments focusing on dexterous settings.

Acknowledgements

This work was supported as part of the Swiss AI Initiative by a grant from the Swiss National Supercomputing Centre (CSCS) under project ID a03 on Alps, as well as the Swiss National Science Foundation Advanced Grant 216260: “Beyond Frozen Worlds: Capturing Functional 3D Digital Twins from the Real World”.

BibTeX

If you found our work useful, please consider citing it:
@article{gavryushin2025maple,
  title={{MAPLE: Encoding Dexterous Robotic Manipulation Priors Learned From Egocentric Videos}}, 
  author={Gavryushin, Alexey and Wang, Xi and Malate, Robert J. S. and Yang, Chenyu and Liconti, Davide and Zurbr{\"u}gg, Ren{\'e} and Katzschmann, Robert K. and Pollefeys, Marc},
  eprint={2504.06084},
  archivePrefix={arXiv},
  primaryClass={cs.RO},
  journal={arXiv preprint arXiv:2504.06084},
  year={2025},
  url={https://arxiv.org/abs/2504.06084}
}