Project Overview
This is an exploratory hobby project in which we attempt to build an “intelligent toy robot” that shows basic cognitive and behavioral capabilities in real-world environments, including:
- Autonomous task planning and behavior control
- Environmental perception and semantic understanding
- Memory-based personalized interaction capabilities
- Additional components: multimodal knowledge base, safety QA module, motion control system, indoor navigation, etc.
Because this project shares many of its underlying modules with previous work, we focus here only on the newly introduced and enhanced components.
Autonomous Task Planning
Building on the existing system, we reworked the task planning module to make it more autonomous.
During planning, multiple small language model (SLM) agents jointly analyze and reason about the current state. They keep generating the next action step until the system determines that the task is complete or that no further progress can be made.
This mechanism enables the robot to exhibit more stable goal-directed behavior in open environments.
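The planning loop described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the names `PlanState`, `run_planner`, `locate_agent` and `fetch_agent` are hypothetical, the agents are plain Python functions standing in for SLM calls, and the arbitration rule (first agent with a proposal wins) is a placeholder for whatever collaboration scheme the real module uses.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class PlanState:
    """Hypothetical planner state: the goal plus the actions taken so far."""
    goal: str
    history: List[str] = field(default_factory=list)


# An "agent" is any callable that proposes the next action,
# or returns None when it has nothing to suggest.
Agent = Callable[[PlanState], Optional[str]]


def run_planner(
    state: PlanState,
    agents: List[Agent],
    execute: Callable[[PlanState, str], None],
    is_done: Callable[[PlanState], bool],
    max_steps: int = 20,
) -> str:
    """Loop until the task is done or no agent can propose a next step."""
    for _ in range(max_steps):
        if is_done(state):
            return "completed"
        proposals = [p for p in (agent(state) for agent in agents) if p is not None]
        if not proposals:
            return "stuck"  # no agent can make further progress
        action = proposals[0]  # assumed arbitration: first proposal wins
        execute(state, action)
        state.history.append(action)
    return "stuck"  # step budget exhausted


# Stub agents standing in for SLM calls (assumptions for illustration only).
def locate_agent(state: PlanState) -> Optional[str]:
    return None if "locate ball" in state.history else "locate ball"


def fetch_agent(state: PlanState) -> Optional[str]:
    if "locate ball" in state.history and "pick up ball" not in state.history:
        return "pick up ball"
    return None


state = PlanState(goal="fetch the ball")
outcome = run_planner(
    state,
    agents=[locate_agent, fetch_agent],
    execute=lambda s, a: None,  # a real robot would act here
    is_done=lambda s: "pick up ball" in s.history,
)
print(outcome, state.history)
```

The `max_steps` budget is an added safeguard in this sketch, so that the loop also terminates when the agents keep proposing actions that never complete the task.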
Environmental Perception
We extended the system with visual understanding so that the robot can analyze its surroundings at the semantic level. The visual outputs then serve as inputs for subsequent task planning and tool selection.
At this stage, visual question answering (VQA) is treated as a “tool module” that the robot’s brain can invoke as needed.
In later iterations, we adopted a vision-language model (VLM) that participates directly in decision-making, folding the perception-to-decision pipeline into a single model; this resulted in more consistent and robust behavior.
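The earlier tool-module arrangement, in which the planner calls VQA on demand, might look something like the sketch below. Everything named here is an assumption made for illustration: `ToolRegistry`, `fake_vqa` and `choose_action` are hypothetical, and `fake_vqa` returns canned answers where a real module would run a VQA model on the current camera frame.

```python
from typing import Callable, Dict

# A tool takes a text argument and returns a text result (assumed interface).
Tool = Callable[[str], str]


class ToolRegistry:
    """Hypothetical registry through which the planner invokes tools by name."""

    def __init__(self) -> None:
        self._tools: Dict[str, Tool] = {}

    def register(self, name: str, tool: Tool) -> None:
        self._tools[name] = tool

    def call(self, name: str, argument: str) -> str:
        if name not in self._tools:
            raise KeyError(f"unknown tool: {name}")
        return self._tools[name](argument)


def fake_vqa(question: str) -> str:
    """Placeholder for a VQA model; answers come from a canned table."""
    canned = {"what is on the table?": "a red cup"}
    return canned.get(question.lower(), "unknown")


def choose_action(registry: ToolRegistry, target: str) -> str:
    """Ask the VQA tool about the scene, then pick the next action."""
    answer = registry.call("vqa", "What is on the table?")
    if target in answer:
        return f"grasp {target}"
    return "search elsewhere"


registry = ToolRegistry()
registry.register("vqa", fake_vqa)
print(choose_action(registry, "cup"))   # grasp cup
print(choose_action(registry, "ball"))  # search elsewhere
```

With a VLM in the decision loop, as in the later iterations, the separate `call("vqa", ...)` step would disappear, because the model that picks the action sees the image directly.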
Memory System
We introduced a long-term memory mechanism for the robot, enabling event-level recall and information integration, thereby supporting more proactive interaction and planning behaviors.
A key design choice is a “human-centered” memory model: the system can recognize individual users and maintain separate memory spaces for each of them, including:
- Long-term attribute memory: relatively stable information such as names and preferences
- Short-term episodic memory: time-sensitive interaction context and conversational history
With this structure, the system gradually builds a personalized understanding of each user over multi-turn interactions, giving a more natural “the more we talk, the better it understands you” experience.
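One way to picture the per-user memory spaces is the minimal sketch below. The class names `UserMemory` and `MemoryStore`, the 50-episode cap, and the age-based filter are all assumptions made for illustration. The sketch also assumes that the user has already been identified and is passed in as a `user_id` string.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, Dict, List, Tuple


@dataclass
class UserMemory:
    """Hypothetical memory space for a single user."""
    # Long-term attribute memory: stable facts such as name or preferences.
    attributes: Dict[str, str] = field(default_factory=dict)
    # Short-term episodic memory: (timestamp, text) pairs, oldest dropped first.
    episodes: Deque[Tuple[float, str]] = field(
        default_factory=lambda: deque(maxlen=50)
    )

    def remember_attribute(self, key: str, value: str) -> None:
        self.attributes[key] = value

    def add_episode(self, timestamp: float, text: str) -> None:
        self.episodes.append((timestamp, text))

    def recent_episodes(self, now: float, max_age_s: float) -> List[str]:
        """Return only the episodes that are still fresh enough to matter."""
        return [text for ts, text in self.episodes if now - ts <= max_age_s]


class MemoryStore:
    """Keeps one separate UserMemory per recognized user."""

    def __init__(self) -> None:
        self._users: Dict[str, UserMemory] = {}

    def for_user(self, user_id: str) -> UserMemory:
        if user_id not in self._users:
            self._users[user_id] = UserMemory()
        return self._users[user_id]


store = MemoryStore()
alice = store.for_user("alice")
alice.remember_attribute("favorite_color", "blue")
alice.add_episode(0.0, "asked for a story")
alice.add_episode(500.0, "asked to play fetch")
store.for_user("bob").remember_attribute("favorite_color", "green")

recent = alice.recent_episodes(now=600.0, max_age_s=300.0)
print(recent)  # ['asked to play fetch']
```

Keeping the two kinds of memory in separate fields lets the time-sensitive context expire without touching the stable attributes.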
Other Modules
The system also includes several standard components, such as a multimodal knowledge base, safety QA module, motion control system, navigation capabilities, and personalized configuration features.
These modules are largely reused from prior projects and have been adapted and enhanced for this system; therefore, we do not elaborate on them here.