
Gesture-Based Teleoperation for Manipulation

Project Overview

Introduction

This system is a core component of the InternUtopia project. It enables real-time teleoperation of a Franka robotic arm using hand gestures recognized from a single monocular RGB camera, and it records trajectory data during operation.

Core Objective: A low-cost solution for collecting robotic arm trajectory data in complex or long-horizon tasks

Tutorial Documentation: the tutorial provides detailed setup instructions and gesture definitions.

Key Highlights:

  • Low-cost solution: Requires only a single RGB camera without additional specialized hardware
  • Intuitive control: Supports special gestures for:
    • Third-person view adjustment
    • Coordinate system recalibration
    • Motion precision tuning
  • Optimized performance: Improves the semantic consistency between human hand motion and robot motion, with higher precision and real-time responsiveness
Right-hand-based robotic arm control demonstration
Left-hand-based viewpoint control demonstration
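
One way to picture the recalibration and precision-tuning gestures is as operations on a relative mapping from hand motion to end-effector motion. The sketch below is illustrative only and rests on two assumptions that do not come from the project: recalibration re-anchors the hand and end-effector reference points, and precision tuning changes a scalar gain. `HandToArmMapper`, `scale`, `recalibrate`, and `target` are hypothetical names, not InternUtopia APIs.

```python
from dataclasses import dataclass

Vec3 = tuple[float, float, float]


@dataclass
class HandToArmMapper:
    """Hypothetical helper: maps hand displacement to an end-effector target."""

    scale: float = 1.0                      # assumed precision gain; <1 gives finer motion
    hand_origin: Vec3 = (0.0, 0.0, 0.0)     # hand position at the last recalibration
    ee_origin: Vec3 = (0.0, 0.0, 0.0)       # end-effector position at the last recalibration

    def recalibrate(self, hand_pos: Vec3, ee_pos: Vec3) -> None:
        """Re-anchor the mapping so the current hand pose maps to the current arm pose."""
        self.hand_origin = hand_pos
        self.ee_origin = ee_pos

    def target(self, hand_pos: Vec3) -> Vec3:
        """Return ee_origin + scale * (hand_pos - hand_origin), component-wise."""
        return tuple(
            e + self.scale * (h - o)
            for e, h, o in zip(self.ee_origin, hand_pos, self.hand_origin)
        )


if __name__ == "__main__":
    mapper = HandToArmMapper(scale=0.5)
    mapper.recalibrate(hand_pos=(0.0, 0.0, 0.0), ee_pos=(0.25, 0.0, 0.5))
    # A 0.5 m hand move along x becomes a 0.25 m end-effector move.
    print(mapper.target((0.5, 0.0, 0.0)))
```

Working with displacements rather than absolute positions means the operator can reposition the hand in the camera frame, recalibrate, and continue without the arm jumping.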

System Architecture

Overview of the gesture-based teleoperation system architecture

Hardware Requirements

Recommended configuration:

  • 2× NVIDIA RTX 4060 Ti GPUs (one for HaMeR hand reconstruction, one for the GRUtopia (InternUtopia) simulation)
  • 1× RGB camera

Notes:

  • Single-GPU operation is supported, but at a reduced frame rate
  • No strict requirement on camera type; USB webcams or built-in laptop cameras are both acceptable
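
A common way to give each process its own GPU is the `CUDA_VISIBLE_DEVICES` environment variable, which restricts the devices a CUDA process can see. The sketch below only builds the per-process environments. `env_for_gpu` is a hypothetical helper, and the actual launch commands are left to the tutorial.

```python
import os
from typing import Mapping, Optional


def env_for_gpu(gpu_index: int, base_env: Optional[Mapping[str, str]] = None) -> dict:
    """Return a copy of the environment that exposes only one GPU to a child process."""
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_index)
    return env


# Two-GPU setup: one device per process.
hamer_env = env_for_gpu(0)       # gesture recognition process
simulator_env = env_for_gpu(1)   # simulation / arm control process

# Single-GPU setup: both processes share device 0 (expect a lower frame rate).
shared_env = env_for_gpu(0)

# The environments would then be passed to the real launch commands, e.g.
# subprocess.Popen([...], env=hamer_env); the commands themselves are not shown here.
```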

Implementation Workflow (as in tutorial)

  1. Launch real-time video streaming server
  2. Initialize hand gesture recognition service
  3. Start Franka robotic arm control program
Launching real-time video streaming server
Initializing gesture recognition service and starting robotic arm control program
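
Because each step depends on the previous one, a launcher script may want to wait until the streaming server is reachable before starting the gesture recognition service. The sketch below assumes the server listens on a TCP port, which the source does not state. `wait_for_port`, the host, and the port number are hypothetical, and the real startup commands are in the tutorial.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout_s: float = 30.0, poll_s: float = 0.2) -> bool:
    """Poll until a TCP connection to (host, port) succeeds or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            # create_connection raises OSError while nothing is listening yet.
            with socket.create_connection((host, port), timeout=poll_s):
                return True
        except OSError:
            time.sleep(poll_s)
    return False


if __name__ == "__main__":
    # Hypothetical address; replace with wherever the streaming server actually listens.
    if wait_for_port("127.0.0.1", 8000, timeout_s=5.0):
        print("stream reachable; start the gesture recognition service next")
    else:
        print("stream not reachable; check step 1")
```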

Gesture Description (as in tutorial)

  • Right hand: direct end-effector control
    • Thumb–index pinch/release: close/open gripper
  • Left hand: auxiliary control functions
    • Thumb–index pinch with motion: adjust third-person view
    • (See the full gesture definition in the tutorial)
Pick-and-place task demonstration
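
The thumb–index pinch used by both hands could be detected roughly as follows. This is a minimal sketch, not the project's implementation: it assumes 21 hand keypoints in the common OpenPose-style ordering (wrist 0, thumb tip 4, index tip 8, middle-finger base 9), which should be checked against the actual recognizer output. The thresholds and the names `pinch_ratio` and `PinchDetector` are illustrative.

```python
import math
from typing import Sequence

# Assumed keypoint indices (OpenPose-style 21-point hand layout).
WRIST, THUMB_TIP, INDEX_TIP, MIDDLE_MCP = 0, 4, 8, 9

Point = Sequence[float]


def pinch_ratio(keypoints: Sequence[Point]) -> float:
    """Thumb-index tip distance divided by hand size, so it is independent of camera distance."""
    hand_size = math.dist(keypoints[WRIST], keypoints[MIDDLE_MCP])
    if hand_size == 0:
        raise ValueError("degenerate hand: wrist and middle-finger base coincide")
    return math.dist(keypoints[THUMB_TIP], keypoints[INDEX_TIP]) / hand_size


class PinchDetector:
    """Pinch state with hysteresis, so noise near the threshold does not toggle the gripper."""

    def __init__(self, close_below: float = 0.25, open_above: float = 0.40) -> None:
        self.close_below = close_below   # illustrative threshold: enter the pinched state
        self.open_above = open_above     # illustrative threshold: leave the pinched state
        self.pinched = False

    def update(self, keypoints: Sequence[Point]) -> bool:
        """Feed one frame of keypoints; returns True while the pinch is held."""
        ratio = pinch_ratio(keypoints)
        if self.pinched and ratio > self.open_above:
            self.pinched = False
        elif not self.pinched and ratio < self.close_below:
            self.pinched = True
        return self.pinched
```

For the right hand, the returned state would map to closing (True) or opening (False) the gripper. For the left hand, the same state could gate whether hand motion is applied to the third-person view.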