
IROS 2025 Challenge

Project Overview

We organized the IROS 2025 Challenge, which has two tracks: Manipulation and Navigation. The challenge was also featured in the IROS 2025 Workshop. I helped maintain the Manipulation track, Vision-Language Manipulation in Open Tabletop Environments. The online submission deadline is September 30, and everyone is welcome to participate!

Update: The IROS 2025 on-site final in Hangzhou has concluded successfully. I took part in real-world robot debugging on-site and served as one of the chief judges. After comprehensive evaluation, we congratulate Team HonorEmbodiment on winning the Manipulation track championship.


Contributions

As the evaluation lead for the Manipulation track, I completed the following work:

  • Built the GenManip benchmark evaluation environment
  • Developed the system framework from scratch based on InternUtopia
  • Implemented the evaluation pipeline and protocol in InternManip
  • Conducted real-robot debugging during the on-site finals and served as one of the chief judges and technical advisors

Environment Composition

Tasks

  • A total of 10 manipulation tasks, divided into two categories:

    • Seen (objects appearing in the training set)
    • Unseen (previously unseen novel objects)
  • Each task in each category contains 10 data samples (episodes 000 to 009), each with a USD scene file and a metadata file

Dataset link: Hugging Face dataset

validation
├── IROS_C_V3_Aloha_seen
│   ├── collect_three_glues/
│   │   ├── 000
│   │   │   ├── meta_info.pkl
│   │   │   ├── scene.usd
│   │   │   └── SubUSDs -> ../SubUSDs
│   │   ├── 001/
│   │   ├── ...
│   │   └── 009/
│   ├── collect_two_alarm_clocks/
│   ├── collect_two_shoes/
│   ├── gather_three_teaboxes/
│   ├── make_sandwich/
│   ├── oil_painting_recognition/
│   ├── organize_colorful_cups/
│   ├── purchase_gift_box/
│   ├── put_drink_on_basket/
│   └── sort_waste/
└── IROS_C_V3_Aloha_unseen
    └── ...
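
The layout above can be walked with a few lines of Python. This is a minimal sketch, not part of the official tooling: the helper name `list_episodes` is hypothetical, and it assumes every episode folder contains both `scene.usd` and `meta_info.pkl` as shown in the tree.

```python
from pathlib import Path
from typing import Dict, List


def list_episodes(split_root: Path) -> Dict[str, List[Path]]:
    """Map each task name to its sorted episode directories.

    An episode directory is any subfolder of a task folder that holds
    both scene.usd and meta_info.pkl (assumption based on the tree above).
    Folders without these files, such as a shared SubUSDs folder, are skipped.
    """
    episodes: Dict[str, List[Path]] = {}
    for task_dir in sorted(p for p in split_root.iterdir() if p.is_dir()):
        found = [
            ep
            for ep in sorted(task_dir.iterdir())
            if (ep / "scene.usd").is_file() and (ep / "meta_info.pkl").is_file()
        ]
        if found:
            episodes[task_dir.name] = found
    return episodes
```

For example, `list_episodes(Path("validation/IROS_C_V3_Aloha_seen"))` would return one entry per task folder, each listing the episode directories found under it.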

Robots

Franka robotic arm + Panda gripper
Franka robotic arm + Robotiq gripper
Aloha dual-arm robot (used in competition)

Controllers

  • Joint position control
  • Inverse kinematics solver

Observation & Action Space

The following is an example using the Franka robot. For detailed specifications, please refer to the I/O specification documentation.

Observation Space (Franka)

observations: List[Dict] = [
    {
        "robot": {
            "robot_pose": (position, orientation),
            "joints_state": {
                "positions": array,
                "velocities": array
            },
            "eef_pose": (position, orientation),
            "sensors": {
                "realsense": {
                    "rgb": (480, 640, 3),
                    "depth": (480, 640)
                },
                "obs_camera": {...},
                "obs_camera_2": {...}
            },
            "instruction": str,
            "metric": {
                "task_name": str,
                "episode_name": str,
                "episode_sr": int,
                "first_success_step": int,
                "episode_step": int
            },
            "step": int,
            "render": bool
        }
    }
]
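
As an illustration of how an agent might read this structure, here is a minimal sketch. The function name `summarize_observation` is hypothetical, and the code assumes the key nesting is exactly as listed above (with `instruction`, `metric`, and `step` under `"robot"`); the I/O specification remains the authoritative reference.

```python
from typing import Any, Dict


def summarize_observation(obs: Dict[str, Any]) -> Dict[str, Any]:
    """Pull a few commonly used fields out of one environment's observation.

    `obs` is one element of the `observations` list, assumed to follow
    the nesting shown above.
    """
    robot = obs["robot"]
    return {
        "instruction": robot["instruction"],
        "step": robot["step"],
        "num_joints": len(robot["joints_state"]["positions"]),
        "cameras": sorted(robot["sensors"].keys()),
        "episode_sr": robot["metric"]["episode_sr"],
    }
```

Since `observations` is a list with one entry per environment, a batched agent would call this once per element.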

Action Space (Franka)

Three formats are supported:

List[float]

{
    'arm_action': List[float],
    'gripper_action': Union[List[float], int]
}

{
    'eef_position': List[float],
    'eef_orientation': List[float],
    'gripper_action': Union[List[float], int]
}
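
To make the three formats concrete, the sketch below classifies an action by its shape. The function name `action_format` and the labels it returns are hypothetical names chosen for illustration; only the dictionary keys come from the formats listed above.

```python
from typing import Any


def action_format(action: Any) -> str:
    """Return a label for which of the three action formats `action` uses.

    The labels ("joint_list", "arm_gripper", "eef_pose") are illustrative
    names, not identifiers from the evaluation environment.
    """
    if isinstance(action, (list, tuple)):
        return "joint_list"
    if isinstance(action, dict):
        keys = set(action)
        if keys == {"arm_action", "gripper_action"}:
            return "arm_gripper"
        if keys == {"eef_position", "eef_orientation", "gripper_action"}:
            return "eef_pose"
    raise ValueError(f"unrecognized action format: {action!r}")
```

A check like this can be useful in an agent wrapper to fail early, before a malformed action reaches the simulator.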

Sensors

  • Franka: front-facing view / gripper first-person view / rear-side view
  • Aloha: head view / left gripper first-person view / right gripper first-person view

Example: Franka & Panda front-facing view
Example: Franka & Panda gripper first-person view
Example: Franka & Panda rear-side view

Metrics

The primary metric is success rate:

  • Soft success: partial task completion counts as partial success (used in the competition)
  • Hard success: only full completion of all subtasks is considered success

GR00T N1.5 baseline (seen)
GR00T N1.5 baseline (unseen)
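
The difference between the two success metrics can be sketched as follows. This assumes, for illustration only, that soft success is the fraction of subtasks completed in an episode; the exact scoring used in the competition may differ. All function names here are hypothetical.

```python
from typing import Sequence


def soft_success(subtasks_done: Sequence[bool]) -> float:
    """Fraction of subtasks completed (assumed definition of soft success)."""
    if not subtasks_done:
        raise ValueError("an episode needs at least one subtask")
    return sum(subtasks_done) / len(subtasks_done)


def hard_success(subtasks_done: Sequence[bool]) -> float:
    """1.0 only if every subtask was completed, otherwise 0.0."""
    if not subtasks_done:
        raise ValueError("an episode needs at least one subtask")
    return 1.0 if all(subtasks_done) else 0.0


def success_rate(episodes: Sequence[Sequence[bool]], soft: bool = True) -> float:
    """Average the per-episode score over all episodes."""
    if not episodes:
        raise ValueError("no episodes to score")
    score = soft_success if soft else hard_success
    return sum(score(ep) for ep in episodes) / len(episodes)
```

With two episodes, one completing two of three subtasks and one completing all three, the soft rate averages 2/3 and 1, while the hard rate counts only the second episode as a success.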

Additional Components

  • Developed a custom recorder for asynchronously capturing image frames and logging both state and image data at each timestep, improving runtime efficiency.
  • Implemented support for batch evaluation across multiple environments or parallel instances of Isaac Sim.
  • A full list of configurable parameters and additional features can be found in the official documentation.
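
The asynchronous recorder idea can be sketched with a queue and a background thread. This is not the recorder used in the benchmark: the class name `AsyncRecorder`, its methods, and the `write_fn` callback are all hypothetical, and real image encoding or disk I/O is left to the callback.

```python
import queue
import threading
from typing import Any, Callable, Dict


class AsyncRecorder:
    """Hand per-step data to a background thread so the sim loop is not blocked."""

    def __init__(
        self,
        write_fn: Callable[[int, Dict[str, Any]], None],
        max_pending: int = 64,
    ) -> None:
        self._write_fn = write_fn
        # Bounded queue: record() blocks if the writer falls too far behind.
        self._queue: queue.Queue = queue.Queue(maxsize=max_pending)
        self._worker = threading.Thread(target=self._drain, daemon=True)
        self._worker.start()

    def record(self, step: int, data: Dict[str, Any]) -> None:
        """Enqueue one timestep of state and image data."""
        self._queue.put((step, data))

    def _drain(self) -> None:
        """Write queued items in order until the stop marker (None) arrives."""
        while True:
            item = self._queue.get()
            if item is None:
                break
            self._write_fn(*item)

    def close(self) -> None:
        """Flush everything still queued, then stop the worker thread."""
        self._queue.put(None)
        self._worker.join()
```

A single worker keeps writes in step order; calling `close()` at the end of an episode ensures nothing is lost before the next one starts.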

InternManip Integration

The above evaluation environment is integrated as a benchmark module in InternManip.

Main implementations include:

  • wrapper environment
  • evaluator
  • Ray-based parallel evaluation
  • agent / model integration interface
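
The general pattern behind parallel evaluation is to split episodes into shards, evaluate each shard in its own worker, and merge the results. The sketch below illustrates that pattern with the standard library's `ThreadPoolExecutor` as a stand-in for Ray, so it does not show InternManip's actual implementation; `shard`, `evaluate_parallel`, and the `evaluate_episode` callback are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Sequence


def shard(items: Sequence[str], num_workers: int) -> List[List[str]]:
    """Split items round-robin into num_workers shards."""
    return [list(items[i::num_workers]) for i in range(num_workers)]


def evaluate_parallel(
    episodes: Sequence[str],
    evaluate_episode: Callable[[str], float],
    num_workers: int = 4,
) -> Dict[str, float]:
    """Evaluate episode shards concurrently and merge the per-episode scores."""

    def run_shard(chunk: List[str]) -> Dict[str, float]:
        return {name: evaluate_episode(name) for name in chunk}

    results: Dict[str, float] = {}
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        for partial in pool.map(run_shard, shard(episodes, num_workers)):
            results.update(partial)
    return results
```

In a Ray-based setup each shard would typically run in a separate process with its own simulator instance; threads are used here only to keep the example self-contained.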

The evaluation functionality in InternManip has since been refactored into InternManip-Eval, which is described on a separate project page.