EBench Introduction
What is EBench?
EBench is an indoor VLA manipulation benchmark built on NVIDIA Isaac Sim. Instead of compressing a model's behavior into a single overall success rate, it produces a multi-axis capability profile that makes a model's strengths and weaknesses readable, comparable, and diagnosable.
This repository serves as the project entry point, hosting baseline implementations and utility scripts. The simulation runtime, gmp CLI, and datasets are maintained in their own dedicated repositories (see Project Components below).
What Makes EBench Different
- Comprehensive Coverage of Three Manipulation Regimes — simultaneously evaluates Long-Horizon, Dexterous & Precise, and Mobile manipulation tasks, rather than focusing on only one regime.
- 5-Axis Atomic Diagnostics — every task is labeled by Scene · Atomic Skill · Horizon · Precision · Mobility, turning a black-box score into an interpretable strength-and-weakness map.
- 4-Axis Generalization Evaluation — controlled perturbations along Object · Background · Instruction · Mixed allow performance degradation to be attributed to specific failure modes.
- Strict Train–Test Isolation — val_train and val_unseen are publicly available for tuning, while the held-out test (Test-Mini) split drives the leaderboard, ensuring scores reflect genuine generalization rather than adaptation to the evaluation distribution.
The benchmark provides two evaluation tracks: Specialist (Tabletop or Mobile Manipulation) and Generalist (covering both regimes simultaneously).
For the full methodology, task taxonomy, and design rationale behind each evaluation axis, please refer to the project documentation.
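To make the idea of a multi-axis capability profile concrete, here is a minimal sketch of how per-episode outcomes could be grouped by the five diagnostic axes. It is not EBench's implementation: the `episode` and `capability_profile` helpers, the label values (`kitchen`, `insert`, `coarse`, and so on), and the episode outcomes are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict

# The five diagnostic axes named above; the label values used below are made up.
AXES = ("scene", "atomic_skill", "horizon", "precision", "mobility")


def episode(success, *labels):
    """Bundle one episode's outcome with its five axis labels (hypothetical helper)."""
    if len(labels) != len(AXES):
        raise ValueError(f"expected {len(AXES)} labels, got {len(labels)}")
    return {"success": success, "labels": dict(zip(AXES, labels))}


def capability_profile(episodes):
    """Return {axis: {label: success_rate}} aggregated over all episodes."""
    # tallies[axis][label] holds [number of successes, number of episodes].
    tallies = defaultdict(lambda: defaultdict(lambda: [0, 0]))
    for ep in episodes:
        for axis, label in ep["labels"].items():
            tally = tallies[axis][label]
            tally[0] += int(ep["success"])
            tally[1] += 1
    return {
        axis: {label: wins / total for label, (wins, total) in by_label.items()}
        for axis, by_label in tallies.items()
    }


# Illustrative outcomes only; these are not EBench results.
episodes = [
    episode(True, "kitchen", "pick", "short", "coarse", "static"),
    episode(False, "kitchen", "insert", "long", "fine", "static"),
    episode(True, "office", "pick", "short", "coarse", "mobile"),
    episode(True, "office", "insert", "long", "fine", "mobile"),
]

overall = sum(ep["success"] for ep in episodes) / len(episodes)
profile = capability_profile(episodes)
```

With this toy data the overall success rate is 0.75, while `profile["atomic_skill"]` separates `pick` (1.0) from `insert` (0.5). A single aggregate score would hide that difference.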
Project Components
| Component | Repository | Description |
|---|---|---|
| EBench | InternRobotics/EBench | Reference baselines, scripts, and project entry point |
| GenManip | InternRobotics/GenManip | Isaac Sim evaluation server and task configurations |
| genmanip-client | InternRobotics/genmanip-client | gmp CLI and EvalClient Python API |
| EBench-Assets | 🤗 EBench-Assets | Scenes, objects, and task assets |
| EBench-Dataset | 🤗 EBench-Dataset | Training trajectories (LeRobot format) |
| Documentation Site | internrobotics.github.io/EBench-doc | Environment setup, evaluation workflow, and CLI reference |
| Online Challenge | internrobotics.shlab.org.cn/eval | Remote evaluation, leaderboard, and diagnostic reports |
Quick Start
EBench follows a client–server architecture. The server runs Isaac Sim, while the client (gmp CLI) is a lightweight package that can be installed directly into the Python environment where the model runs.
```bash
# 1. Start the evaluation server → see Environment Setup
#    https://internrobotics.github.io/EBench-doc/zh-cn/getting-started/environment/

# 2. Install the client in your model environment
git clone https://github.com/InternRobotics/genmanip-client.git
cd genmanip-client && pip install -e .

# 3. Run an evaluation
gmp submit ebench/generalist/test --run_id my_first_run
gmp eval -a r5a -g lift2 --worker_ids 0
gmp status
```
A full validation run takes approximately 30 minutes on 8× RTX 4090 GPUs. For detailed environment setup, asset downloads, and the complete gmp command reference, please refer to the documentation site.
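If you prefer to drive step 3 from Python, for example from the script that launches your model, the sketch below wraps the same three commands with the standard library. It uses only the command lines shown in the Quick Start and adds no gmp options. The `run_quick_start` helper and its `dry_run` flag are hypothetical, and running the commands strictly one after another is an assumption; check the CLI reference for how `gmp eval` and `gmp status` are meant to be sequenced.

```python
import shlex
import subprocess

# Command lines copied verbatim from the Quick Start; nothing here adds gmp flags.
QUICK_START = [
    "gmp submit ebench/generalist/test --run_id my_first_run",
    "gmp eval -a r5a -g lift2 --worker_ids 0",
    "gmp status",
]


def run_quick_start(commands=QUICK_START, dry_run=True):
    """Parse each command and, unless dry_run is set, run them in order.

    Returns the parsed argument lists. With dry_run=False, execution stops
    at the first command that exits with a non-zero status.
    """
    parsed = [shlex.split(command) for command in commands]
    if not dry_run:
        for argv in parsed:
            # check=True raises subprocess.CalledProcessError on failure.
            subprocess.run(argv, check=True)
    return parsed


# Dry run: inspect what would be executed without calling gmp.
plan = run_quick_start()
for argv in plan:
    print(" ".join(argv))
```

Calling `run_quick_start(dry_run=False)` executes the commands and therefore requires the genmanip-client package to be installed and an evaluation server to be reachable.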
Task Overview
EBench organizes 26 task types into three task families: Long-Horizon, Pick-and-Place, and Dexterous & Precise. Combined with the four generalization axes and three dataset splits, these yield a total of 794 evaluation task instances. Video demonstrations of the tasks are available in the Task Showcase.
Baselines
Reference policies are located under baselines/<name>/, with each baseline providing its own README and a gmp eval-compatible entry point. EBench has completed its first-round validation on π0, π0.5, X-VLA, and InternVLA-A1. Current results and per-axis diagnostic reports can be found on the leaderboard.
To integrate your own model, please refer to Integrate Your Own Model.
Online Challenge
The online evaluation platform at internrobotics.shlab.org.cn/eval is available 24/7. All submissions are executed on the held-out Test-Mini split using a standardized and reproducible evaluation protocol, and automatically generate diagnostic reports, including capability radar charts, validation-to-test transfer curves, generalization radars, and task-level heatmaps. For submission instructions, see the Challenge page.
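As a rough illustration of what a generalization radar summarizes, the sketch below computes the relative drop in success rate along each of the four perturbation axes. It is not the platform's reporting code: the `relative_drop` helper, the unperturbed baseline it compares against, and every number are assumptions made for this example.

```python
# Hypothetical success rates, chosen for illustration; not EBench results.
baseline_rate = 0.80  # assumed success rate without any perturbation
perturbed_rates = {
    "object": 0.72,
    "background": 0.60,
    "instruction": 0.76,
    "mixed": 0.48,
}


def relative_drop(baseline, perturbed):
    """Return {axis: fraction of the baseline success rate lost under that perturbation}."""
    if baseline <= 0:
        raise ValueError("baseline success rate must be positive")
    return {axis: (baseline - rate) / baseline for axis, rate in perturbed.items()}


drops = relative_drop(baseline_rate, perturbed_rates)
worst_axis = max(drops, key=drops.get)

# Print axes from largest to smallest degradation.
for axis, drop in sorted(drops.items(), key=lambda item: item[1], reverse=True):
    print(f"{axis:<12} {drop:.0%}")
```

In this toy example the mixed perturbation loses about 40% of the baseline success rate and the background perturbation about 25%, so those two axes would be the first places to look for failure modes.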
Citation
The paper is currently in preparation. Until it is officially released, please cite EBench as:
```bibtex
@misc{ebench2026,
  title  = {EBench: Elemental Mobile Manipulation Benchmark},
  author = {{Shanghai AI Laboratory}},
  year   = {2026},
  note   = {Preprint coming soon},
  url    = {https://internrobotics.github.io/EBench-doc/}
}
```
License
Released under the MIT License; see LICENSE for details. EBench is built on top of NVIDIA Isaac Sim, cuRobo, and the LeRobot data format. Issues and pull requests are welcome.
My Contributions
Within the EBench project, I primarily focused on dexterous manipulation data collection and evaluation infrastructure development.
- Contributed to the development of the EBench evaluation framework, continuously improving task configurations, evaluation workflows, and overall engineering quality.
- Participated in the design and implementation of a VR + teleoperation data collection system covering seven dexterous manipulation tasks, enabling large-scale demonstration data acquisition.
- Built a complete data pipeline spanning teleoperation collection, trajectory processing, and quality control, improving both data production efficiency and consistency.
- Established a closed-loop workflow connecting benchmark evaluation, data analysis, model training, and re-evaluation, collaborating closely with the training team to accelerate model iteration.
- Supported the development of EBench's task taxonomy and diagnostic capabilities, helping identify key bottlenecks in dexterous manipulation and generalization performance.