Project Overview

This is an exploratory hobby project in which we attempt to build an “intelligent toy robot” with basic cognitive and behavioral capabilities that work in real-world environments, including:

Autonomous task planning and behavior control
Environmental perception and semantic understanding
Memory-based personalized interaction capabilities
Additional components: multimodal knowledge base, safety QA module, motion control system, indoor navigation, etc.

Since this project shares a significant portion of its underlying modules with previous work, we focus here only on the newly introduced and enhanced components.

System architecture diagram: early design

Autonomous Task Planning

Building on the existing system, we redesigned the task planning module to make it more autonomous.

During planning, multiple SLM (small language model) agents collaboratively analyze and reason about the current state. They keep generating the next action step until the system determines that the task is complete or that no further progress can be made.

This mechanism enables the robot to exhibit more stable goal-directed behavior in open environments.
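The loop described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the names `PlanState`, `run_planner`, and `scripted_agent` are hypothetical, each agent is reduced to a plain function that proposes the next action, and a `max_steps` cap is assumed as a safety stop.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class PlanState:
    """Hypothetical planner state: the goal plus the actions taken so far."""
    goal: str
    history: List[str] = field(default_factory=list)
    done: bool = False
    stuck: bool = False


# Assumption: an agent maps the current state to a proposed action, or None.
Agent = Callable[[PlanState], Optional[str]]


def run_planner(goal: str,
                agents: List[Agent],
                is_complete: Callable[[PlanState], bool],
                max_steps: int = 20) -> PlanState:
    """Generate actions until the task is complete or no progress is possible."""
    state = PlanState(goal=goal)
    for _ in range(max_steps):
        if is_complete(state):
            state.done = True
            return state
        # Every agent inspects the state; keep only the concrete proposals.
        proposals = [p for p in (agent(state) for agent in agents) if p is not None]
        if not proposals:
            break  # no agent can suggest a next step
        # Simplest possible arbitration: take the first proposal.
        state.history.append(proposals[0])
    state.done = is_complete(state)
    state.stuck = not state.done
    return state


# Stand-in for a model-backed agent: replays a fixed script of steps.
STEPS = ["locate ball", "move to ball", "grab ball"]


def scripted_agent(state: PlanState) -> Optional[str]:
    i = len(state.history)
    return STEPS[i] if i < len(STEPS) else None


result = run_planner("fetch the ball", [scripted_agent],
                     lambda s: "grab ball" in s.history)
```

In a real system the arbitration step would likely be more involved (for example, one agent critiquing another's proposal); taking the first proposal only keeps the sketch short.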

Environmental Perception

We extended the system with visual understanding, allowing the robot to perform semantic-level analysis of its surroundings. The visual outputs serve as important inputs for subsequent task planning and tool selection.

At this stage, VQA is treated as a “tool module” that can be invoked by the robot’s brain as needed.
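One way to picture VQA as an on-demand tool is a small registry that the planner calls by name. The sketch below is an assumption about the general shape, not the project's implementation: `ToolRegistry`, the `"vqa"` tool name, and the stub `answer_visual_question` are all hypothetical, and the stub returns a canned string where a real system would run a VQA model.

```python
from typing import Callable, Dict


class ToolRegistry:
    """Hypothetical registry mapping tool names to callables."""

    def __init__(self) -> None:
        self._tools: Dict[str, Callable[..., str]] = {}

    def register(self, name: str) -> Callable[[Callable[..., str]], Callable[..., str]]:
        """Decorator that stores a function under the given tool name."""
        def decorator(fn: Callable[..., str]) -> Callable[..., str]:
            self._tools[name] = fn
            return fn
        return decorator

    def call(self, name: str, **kwargs) -> str:
        """Invoke a registered tool by name with keyword arguments."""
        if name not in self._tools:
            raise KeyError(f"unknown tool: {name}")
        return self._tools[name](**kwargs)


tools = ToolRegistry()


@tools.register("vqa")
def answer_visual_question(image: bytes, question: str) -> str:
    # Placeholder only: a real tool would pass the image and question to a model.
    return f"(stub answer to: {question})"


# The planner decides when a visual question is worth asking.
answer = tools.call("vqa", image=b"", question="Is there a ball on the floor?")
```

Keeping perception behind a named tool means the planner only looks at the camera when it chooses to, which is the main contrast with a model that takes the image as direct input.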

Autonomous behavior driven by vision and instructions

In later iterations, we had a VLM (vision-language model) participate directly in decision-making, folding the perception-to-decision pipeline into a single model. This resulted in more consistent and robust behavior.

Memory System

We introduced a long-term memory mechanism for the robot, enabling event-level recall and information integration, thereby supporting more proactive interaction and planning behaviors.

A key design choice is a “human-centered” memory model: the system can recognize individual users and maintain separate memory spaces for each of them, including:

Long-term attribute memory: relatively stable information such as names and preferences
Short-term episodic memory: time-sensitive interaction context and conversational history

With this structure, the system gradually builds a personalized understanding of each user over multi-turn interactions, achieving a more natural “the more we talk, the better it understands you” experience.
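The two memory types and the per-user separation can be illustrated with a small data structure. This is a sketch under stated assumptions rather than the project's design: `UserMemory` and `MemoryStore` are hypothetical names, users are keyed by an already-recognized ID string, and short-term memory is modeled as a bounded queue of the five most recent entries, which is an arbitrary choice for the example.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, Dict


@dataclass
class UserMemory:
    """Hypothetical memory space for a single recognized user."""
    # Long-term attribute memory: stable facts such as name and preferences.
    attributes: Dict[str, str] = field(default_factory=dict)
    # Short-term episodic memory: only the most recent entries are kept
    # (the limit of 5 is an assumption for this sketch).
    episodes: Deque[str] = field(default_factory=lambda: deque(maxlen=5))


class MemoryStore:
    """Keeps one separate UserMemory per user ID."""

    def __init__(self) -> None:
        self._users: Dict[str, UserMemory] = {}

    def for_user(self, user_id: str) -> UserMemory:
        return self._users.setdefault(user_id, UserMemory())

    def remember_attribute(self, user_id: str, key: str, value: str) -> None:
        self.for_user(user_id).attributes[key] = value

    def remember_episode(self, user_id: str, text: str) -> None:
        self.for_user(user_id).episodes.append(text)

    def context(self, user_id: str) -> str:
        """Render a user's memory as text that could be added to a prompt."""
        mem = self.for_user(user_id)
        attrs = ", ".join(f"{k}={v}" for k, v in sorted(mem.attributes.items()))
        recent = " | ".join(mem.episodes)
        return f"attributes: {attrs}; recent: {recent}"


store = MemoryStore()
store.remember_attribute("alice", "name", "Alice")
store.remember_attribute("alice", "favorite_color", "green")
store.remember_episode("alice", "asked the robot to find her keys")
store.remember_episode("bob", "asked about the weather")
```

Here `store.context("alice")` contains only Alice's attributes and episodes, and nothing recorded for Bob leaks into it.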

Identity-based dialogue and memory demonstration

Other Modules

The system also includes several standard components, such as a multimodal knowledge base, safety QA module, motion control system, navigation capabilities, and personalized configuration features.

These modules are largely reused from prior projects and have been adapted and enhanced for this system; therefore, we do not elaborate on them here.

Giving a Toy Robot a Brain
