Project Overview

We organized the IROS 2025 Challenge, which comprises two tracks: Manipulation and Navigation. The challenge was also featured in the IROS 2025 Workshop. I helped maintain the Manipulation track, Vision-Language Manipulation in Open Tabletop Environments. The online submission deadline is September 30, and everyone is welcome to participate!

Update: Congratulations on the successful conclusion of the IROS 2025 on-site final in Hangzhou! I took part in on-site debugging of the real robots and served as one of the chief judges. After comprehensive evaluation, we are pleased to congratulate Team HonorEmbodiment on winning the Manipulation track championship.

Contributions

As the evaluation lead for the Manipulation track, I primarily worked on the following:

Built the GenManip benchmark evaluation environment
Developed the system framework from scratch based on InternUtopia
Implemented the evaluation pipeline and protocol in InternManip
Conducted real-robot debugging during the on-site finals and served as one of the chief judges and technical advisors

Environment Composition

Tasks

The benchmark comprises 10 manipulation tasks, each provided in two categories:
- Seen (objects that appear in the training set)
- Unseen (novel objects not present in the training set)

Each task in each category contains 10 data samples, each with a USD scene file and metadata.

Dataset link: Hugging Face dataset

validation
├── IROS_C_V3_Aloha_seen
│   ├── collect_three_glues/
│   │   ├── 000
│   │   │   ├── meta_info.pkl
│   │   │   ├── scene.usd
│   │   │   └── SubUSDs -> ../SubUSDs
│   │   ├── 001/
│   │   ├── ...
│   │   └── 009/
│   ├── collect_two_alarm_clocks/
│   ├── collect_two_shoes/
│   ├── gather_three_teaboxes/
│   ├── make_sandwich/
│   ├── oil_painting_recognition/
│   ├── organize_colorful_cups/
│   ├── purchase_gift_box/
│   ├── put_drink_on_basket/
│   └── sort_waste/
└── IROS_C_V3_Aloha_unseen
    └── ...
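
The layout above can be walked programmatically. The following is a minimal sketch, assuming the dataset has been downloaded locally and follows the `split/task/episode` nesting shown in the tree. The `Episode` tuple and `iter_episodes` helper are hypothetical names introduced for illustration and are not part of the released tooling.

```python
from pathlib import Path
from typing import Iterator, NamedTuple


class Episode(NamedTuple):
    """One data sample (hypothetical container, not an official type)."""
    split: str    # e.g. "IROS_C_V3_Aloha_seen"
    task: str     # e.g. "collect_three_glues"
    episode: str  # e.g. "000"
    scene: Path   # path to scene.usd
    meta: Path    # path to meta_info.pkl


def iter_episodes(root: Path) -> Iterator[Episode]:
    """Yield every episode directory that holds both scene.usd and meta_info.pkl."""
    for scene in sorted(root.glob("*/*/*/scene.usd")):
        episode_dir = scene.parent
        meta = episode_dir / "meta_info.pkl"
        if not meta.is_file():
            continue  # skip incomplete samples
        task_dir = episode_dir.parent
        yield Episode(task_dir.parent.name, task_dir.name,
                      episode_dir.name, scene, meta)
```

Calling `list(iter_episodes(Path("validation")))` on a complete download would be expected to return one entry per task, category, and sample index.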

Robots

Franka robotic arm + Panda gripper (figure: gesture-based teleoperation system architecture overview)

Controllers

Joint position control
Inverse kinematics solver

Observation & Action Space

The following is an example using the Franka robot. For detailed specifications, please refer to the I/O specification documentation.

Observation Space (Franka)

observations: List[Dict] = [
    {
        "robot": {
            "robot_pose": (position, orientation),
            "joints_state": {
                "positions": array,
                "velocities": array
            },
            "eef_pose": (position, orientation),
            "sensors": {
                "realsense": {
                    "rgb": (480, 640, 3),
                    "depth": (480, 640)
                },
                "obs_camera": {...},
                "obs_camera_2": {...}
            },
            "instruction": str,
            "metric": {
                "task_name": str,
                "episode_name": str,
                "episode_sr": int,
                "first_success_step": int,
                "episode_step": int
            },
            "step": int,
            "render": bool
        }
    }
]
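
To show how a policy might consume this structure, here is a minimal sketch that pulls out the fields most agents need. It assumes that each list entry corresponds to one environment instance and that images expose a `.shape` attribute (as NumPy arrays do). Both points are assumptions rather than statements from the I/O specification, and `summarize_observation` is a hypothetical helper.

```python
from typing import Any, Dict, List


def summarize_observation(observations: List[Dict[str, Any]],
                          env_index: int = 0) -> Dict[str, Any]:
    """Extract commonly used fields from one entry of the observation list."""
    robot = observations[env_index]["robot"]
    camera = robot["sensors"]["realsense"]
    position, orientation = robot["eef_pose"]
    return {
        "instruction": robot["instruction"],
        "step": robot["step"],
        "joint_positions": robot["joints_state"]["positions"],
        "eef_position": position,
        "eef_orientation": orientation,
        # Shapes are read defensively in case an image is not array-like.
        "rgb_shape": getattr(camera["rgb"], "shape", None),
        "depth_shape": getattr(camera["depth"], "shape", None),
    }
```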

Action Space (Franka)

Three formats are supported:

List[float]

{
    'arm_action': List[float],
    'gripper_action': Union[List[float], int]
}

{
    'eef_position': List[float],
    'eef_orientation': List[float],
    'gripper_action': Union[List[float], int]
}
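
The three formats can be told apart by their type and keys. The sketch below builds the two dictionary formats and classifies an arbitrary action. The helper names and the format labels (`flat_list`, `joint_dict`, `eef_dict`) are hypothetical. The vector lengths used in the usage note are also assumptions, since the text above does not state them.

```python
from typing import Any, Dict, List, Sequence, Union

GripperAction = Union[List[float], int]


def joint_action(arm: Sequence[float], gripper: GripperAction) -> Dict[str, Any]:
    """Build the arm/gripper dictionary format."""
    return {"arm_action": list(arm), "gripper_action": gripper}


def eef_action(position: Sequence[float], orientation: Sequence[float],
               gripper: GripperAction) -> Dict[str, Any]:
    """Build the end-effector pose dictionary format."""
    return {
        "eef_position": list(position),
        "eef_orientation": list(orientation),
        "gripper_action": gripper,
    }


def action_format(action: Any) -> str:
    """Return a label naming which of the three supported formats was passed."""
    if isinstance(action, dict):
        keys = set(action)
        if keys == {"arm_action", "gripper_action"}:
            return "joint_dict"
        if keys == {"eef_position", "eef_orientation", "gripper_action"}:
            return "eef_dict"
        raise ValueError(f"unrecognized action keys: {sorted(keys)}")
    if isinstance(action, (list, tuple)) and all(
            isinstance(v, (int, float)) for v in action):
        return "flat_list"
    raise ValueError(f"unsupported action type: {type(action).__name__}")
```

For example, `action_format(joint_action([0.0] * 7, 1))` returns `"joint_dict"`, where seven values match the seven joints of the Franka arm. The meaning of an integer `gripper_action` is defined in the I/O specification and is not assumed here.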

Sensors

Franka: front-facing view / gripper first-person view / rear-side view
Aloha: head view / left gripper first-person view / right gripper first-person view

Example: Franka & Panda front-facing view

Example: Franka & Panda gripper first-person view

Example: Franka & Panda rear-side view

Metrics

The primary metric is success rate:

Soft success: partial task completion counts as partial success (used in the competition)
Hard success: only full completion of all subtasks is considered success
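
The difference between the two definitions can be made concrete with a small sketch. It assumes that a task is decomposed into subtasks with boolean completion flags and that soft success is the fraction of subtasks completed. The exact weighting used in the competition is not stated here, so the formula is illustrative only.

```python
from typing import Sequence


def hard_success(subtasks_done: Sequence[bool]) -> float:
    """1.0 only when every subtask is complete, otherwise 0.0."""
    return 1.0 if len(subtasks_done) > 0 and all(subtasks_done) else 0.0


def soft_success(subtasks_done: Sequence[bool]) -> float:
    """Fraction of subtasks completed (assumed equal weighting)."""
    if len(subtasks_done) == 0:
        return 0.0
    return sum(subtasks_done) / len(subtasks_done)
```

Under these assumptions, an episode that completes two of three subtasks scores 2/3 under soft success and 0 under hard success.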

Additional Components

Developed a custom recorder for asynchronously capturing image frames and logging both state and image data at each timestep, improving runtime efficiency.
Implemented support for batch evaluation across multiple environments or parallel instances of Isaac Sim.
A full list of configurable parameters and additional features can be found in the official documentation.
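
The idea behind the asynchronous recorder can be sketched with a queue and a background writer thread, so that the simulation loop only enqueues data and never waits on disk I/O. This is a minimal illustration and not the recorder shipped with the benchmark. The class name, file layout, and the assumption that frames arrive as already-encoded bytes are all hypothetical.

```python
import json
import queue
import threading
from pathlib import Path
from typing import Any, Dict


class AsyncRecorder:
    """Queue frames from the simulation loop and write them on a background thread."""

    def __init__(self, out_dir: Path, max_pending: int = 256) -> None:
        self.out_dir = Path(out_dir)
        self.out_dir.mkdir(parents=True, exist_ok=True)
        self._queue: queue.Queue = queue.Queue(maxsize=max_pending)
        self._worker = threading.Thread(target=self._drain, daemon=True)
        self._worker.start()

    def record(self, step: int, state: Dict[str, Any], image: bytes) -> None:
        """Called once per timestep; blocks only when the queue is full."""
        self._queue.put((step, state, image))

    def _drain(self) -> None:
        """Writer loop: one image file per step plus one JSON line of state."""
        while True:
            item = self._queue.get()
            if item is None:  # sentinel pushed by close()
                break
            step, state, image = item
            (self.out_dir / f"{step:06d}.bin").write_bytes(image)
            with (self.out_dir / "states.jsonl").open("a") as f:
                f.write(json.dumps({"step": step, **state}) + "\n")

    def close(self) -> None:
        """Flush everything still queued, then stop the writer thread."""
        self._queue.put(None)
        self._worker.join()
```

A bounded queue is used so that a slow disk applies back-pressure instead of letting memory grow without limit.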

InternManip Integration

The above evaluation environment is integrated as a benchmark module in InternManip.

Main implementations include:

wrapper environment
evaluator
Ray-based parallel evaluation
agent / model integration interface
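
The parallel evaluation pattern, sharding episodes across workers and merging per-episode results, can be sketched as follows. To stay dependency-free, this stand-in uses the standard library's `concurrent.futures` rather than Ray, so it illustrates only the fan-out and aggregation logic and is not InternManip's implementation. `run_episode` is a hypothetical callable that evaluates one episode and returns its score.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Sequence


def shard(items: Sequence[str], num_workers: int) -> List[List[str]]:
    """Split items round-robin into num_workers shards."""
    return [list(items[i::num_workers]) for i in range(num_workers)]


def evaluate_parallel(episodes: Sequence[str],
                      run_episode: Callable[[str], float],
                      num_workers: int = 4) -> Dict[str, float]:
    """Evaluate each shard on its own worker and merge the per-episode scores."""

    def run_shard(shard_episodes: List[str]) -> Dict[str, float]:
        return {ep: run_episode(ep) for ep in shard_episodes}

    results: Dict[str, float] = {}
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        for partial in pool.map(run_shard, shard(episodes, num_workers)):
            results.update(partial)
    return results
```

The mean of the returned values gives the overall success rate across all evaluated episodes.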

The evaluation functionality in InternManip has since been refactored into InternManip-Eval, which is described on a separate project page.