Embodied Intelligence Theory originates from “Embodied Cognition,” which posits that the cognitive abilities of all intelligent agents, including humans, are determined by their physical structure. On this basis, agents construct their own world models, and this cognition directly influences their high-level psychological activities, such as reasoning and decision-making.
Embodied Intelligence: The “Exploration-Exploitation” Learning Paradigm
From the mechanism by which cognition is generated to the world model that agents rely on for decision-making, everything is constrained by the agent’s specific physical form. Embodied Intelligence Theory challenges various cognitive theories, including Cartesian dualism, and establishes a unified theoretical framework that integrates the “body” and “intelligence.” It views the intelligent agent and its surroundings as a single system: the agent interacts with the external environment through its “body” and obtains information from the feedback that the environment gives in response to its actions. The entire cognitive process follows the “exploration-exploitation” paradigm.
Psychologist Eleanor Gibson explained infants’ spatial cognition and understanding process through Embodied Cognition Theory.
Embodied Intelligence Theory emphasizes a strong correlation between the intelligent agent and the environment, with “intelligence” essentially being the sum of these two entities. To achieve this summation, embodied intelligent agents must possess some fundamental, universal capabilities, including:
Spatial Cognition:
Spatial cognition is one of the basic abilities of intelligent agents in this world. This process involves the agent first “deconstructing” the physical entities of the external world and then “constructing” an abstract geometric model of the external world at the psychological level.
Mobile Navigation:
If spatial cognition represents the agent’s abstraction of the macro world, mobile navigation reflects the agent’s adaptation to its immediate micro-environment. Through the “exploration-exploitation” learning paradigm, agents discover knowledge, accumulate experience, enhance intelligence, and succeed in natural evolution.
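The “exploration-exploitation” paradigm mentioned above can be illustrated with a minimal sketch. The example below uses an epsilon-greedy agent on a toy multi-armed bandit, a standard textbook setting chosen here for illustration; it is not a method described in this article. The function name epsilon_greedy, the reward model (Gaussian noise around a hidden mean), and all parameter values are assumptions.

```python
import random


def epsilon_greedy(true_means, steps=2000, epsilon=0.1, seed=0):
    """Run an epsilon-greedy agent on a toy multi-armed bandit.

    true_means: hidden mean reward of each action (the "environment").
    Returns the agent's learned value estimates and per-action pull counts.
    All names and values here are illustrative assumptions.
    """
    rng = random.Random(seed)
    n_actions = len(true_means)
    estimates = [0.0] * n_actions  # the agent's current beliefs
    counts = [0] * n_actions

    for _ in range(steps):
        if rng.random() < epsilon:
            # Exploration: try a random action to gather information.
            action = rng.randrange(n_actions)
        else:
            # Exploitation: act on the best current estimate.
            action = max(range(n_actions), key=lambda a: estimates[a])

        # Feedback from the environment: a noisy reward for the action.
        reward = rng.gauss(true_means[action], 1.0)

        # Incremental mean update for the chosen action only.
        counts[action] += 1
        estimates[action] += (reward - estimates[action]) / counts[action]

    return estimates, counts
```

Calling epsilon_greedy([0.2, 0.5, 0.9]) returns the learned estimates and the number of times each action was tried. The agent only learns about an action by taking it, which is the sense in which knowledge comes from feedback on its own behavior.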
Hardware for “Intelligence” in Embodied Intelligence
For a long time, the development of artificial intelligence has focused mainly on achieving better intelligence on general-purpose hardware, without adequately considering how the needs of intelligence should drive the development of new hardware. It is perhaps no exaggeration to treat “hardware born for intelligence” as a guiding principle in the development of embodied intelligence. Recent industry developments suggest that this shift has already begun. The development of embodied intelligence can be expected to extend the principle to the design and production of all hardware, so that its application shifts from being “spontaneous” to being “conscious.”
The evolution of computing chips, from CPU to ASIC, exemplifies the spontaneous phenomenon under the principle of “hardware born for intelligence.”
Hardware based on Embodied Intelligence shares a common set of basic functional requirements: multi-modal environmental cognition and interaction through language and vision; intelligent task learning and understanding, in which tasks are converted into structured internal representations based on the agent’s world model; highly autonomous decision-making that can handle new and unexpected situations within the agent’s hardware and software system; efficient execution of single tasks with little or only occasional human intervention; and intelligent switching between multiple tasks.
Embodied Intelligence Theory clarifies the coupling between intelligence and the body, discussing the relationship between the intelligent agent and the environment. Therefore, when designing embodied intelligent products, it is crucial not to isolate them from their task environment.
Embodied Intelligent Industrial Robots (EIIR)
Early industrial robots were primarily used for repetitive, singular tasks. With the advancement of technology, they gradually became automated, capable of performing more complex and delicate tasks. However, industrial robots are now encountering technical bottlenecks, facing challenges such as cost, complexity, flexibility, and human-robot collaboration.
Meanwhile, artificial intelligence technology has developed rapidly, with deep learning techniques achieving breakthroughs in speech and image recognition, natural language processing, and other tasks. The recent development of multi-modal large model technology has laid the foundation for realizing natural human-machine interaction. “Artificial Intelligence +” has become an actively explored field.
On one side is a mature industry that has hit bottlenecks and is eagerly seeking new directions. On the other is an emerging technology, carrying an aura of change and disruption, that is eager to find applications. Intuition tells us that this moment calls for fusing the two, with a striking result: Embodied Intelligent Industrial Robots (EIIR)!
Guided by Embodied Intelligence Theory, the integration of mature industrial robots with emerging artificial intelligence technologies has given birth to “Embodied Intelligent Industrial Robots” (EIIR).
The Essence of EIIR: Liberating and Surpassing “Humans”
The survival environment of EIIR is the industrial production environment, and examining this environment closely helps us identify the appropriate form for EIIR. The conclusion is that the humanoid robot is not the ideal form, for two reasons. First, compared to the natural environment, the production environment is closed and simple; the “humanoid” form, which evolved in an open environment, is not naturally the best bodily form for a closed one. Second, the production environment is artificially designed and manufactured. If machines can complete tasks autonomously without human involvement, the production environment can be designed to be more machine-friendly, which removes the constraints imposed by the human form and makes the production process more efficient and reliable.
What EIIR needs to replace is the alienated role that humans play in the production process, not the essence of humans or their physical appearance. Furthermore, EIIR needs to amplify and enhance the human abilities it replaces in the production environment, leveraging its machine attributes to achieve performance beyond that of humans. This makes it both possible and necessary to liberate humans from production activities. Compared to precise automated control, EIIR is better placed to achieve truly unmanned production, for several reasons:
Uncertainty in Production Scenarios:
From a qualitative perspective, industrial scenarios are macroscopically closed and bounded. However, at the micro level, industrial scenarios also present many uncertainties, making them quantitatively open environments. This requires agile intelligence to address such uncertainties.
Varying Boundaries of Production Environments:
Different production tasks have corresponding and specific production environments, and the boundaries of these closed environments vary. The possibility of switching between production tasks is open and almost unlimited, requiring a sufficiently high level of intelligence or minimal human assistance to facilitate this environmental switching and adaptation.
Standard Products with Standard Intelligence:
Standard products possess a certain level of standard intelligence, enabling them to learn specific production tasks with low time and labor costs when deployed in specific production scenarios. This adaptability to different production scenarios makes large-scale applications possible.
The Appearance of EIIR: Concrete Embodied Intelligence
If there are no significant doubts about the essence of EIIR, let’s imagine its appearance. Embodied Intelligence Theory suggests that intelligent agents consist of three components: a perception system, a motor system, and a world model. This framework remains applicable to EIIR.
Perception System – Multi-modal Pan-sensor System:
With appropriately selected and configured sensors, supplemented by efficient and intelligent data algorithms, EIIR can have a perception system significantly stronger than that of humans. This system continuously perceives the state of both the surrounding environment and EIIR itself, providing precise information for decision-making. In the challenging field of industrial inspection, for example appearance defect detection, advanced machine perception can identify and analyze object poses and features, autonomously generate inspection sequences, and use high-precision image sensors to track and detect indefinite, varying defects. The result is defect detection that is more flexible than, and superior to, human inspection. On this basis, EIIR also models itself using the principles of dynamics, “recognizing” its own capabilities through information feedback and updating that model in real time.
Motor System – Closed-loop Control System:
By integrating the upper-level and lower-level systems, state feedback and control can be jointly processed, computed, optimized, and coordinated to meet the requirements of flexibility, precision, and speed. Take the joint motor as an example: its “visual servo” system consists of multiple controllers nested hierarchically, each level with its own control targets and objects to optimize, so that closed-loop control is refined step by step from the overall level down to the local level. For instance, a proprietary multi-axis real-time control system, combined with dynamics and kinematics algorithms, calculates the motion trajectory that is optimal in terms of time and state. It runs velocity closed-loop motion control at the millisecond level and completes closed-loop motion planning with an image model on a 10-millisecond cycle.
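The idea of hierarchically nested closed loops can be sketched as a two-level cascaded controller. The example below is a toy simulation under stated assumptions, not the proprietary control system mentioned above: a slow outer loop (every 10 ms, standing in for the image-based loop) converts position error into a velocity setpoint, and a fast inner loop (every 1 ms) tracks that setpoint. The plant model (commanded acceleration applied directly to a single axis), the proportional gains kp and kv, and the velocity limit v_max are all hypothetical.

```python
def simulate_cascade(target=1.0, duration_s=3.0, dt=0.001, outer_every=10,
                     kp=5.0, kv=200.0, v_max=2.0):
    """Simulate a two-level cascaded controller on a toy single-axis joint.

    The gains and the plant model (acceleration equals the command) are
    illustrative assumptions, not values from the article.
    Returns the final (position, velocity).
    """
    position, velocity = 0.0, 0.0
    velocity_ref = 0.0
    steps = int(round(duration_s / dt))

    for k in range(steps):
        # Outer loop, every `outer_every` ticks (10 ms with the defaults):
        # turn the position error into a clamped velocity setpoint.
        if k % outer_every == 0:
            velocity_ref = kp * (target - position)
            velocity_ref = max(-v_max, min(v_max, velocity_ref))

        # Inner loop, every tick (1 ms with the defaults):
        # track the velocity setpoint with a proportional controller.
        accel_cmd = kv * (velocity_ref - velocity)

        # Toy plant: integrate acceleration, then velocity.
        velocity += accel_cmd * dt
        position += velocity * dt

    return position, velocity
```

With the default values, simulate_cascade() drives the joint from position 0 toward the target of 1.0. Each level optimizes its own quantity (position for the outer loop, velocity for the inner loop), and the inner loop runs ten times for every outer update.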
World Model – Summary and Abstraction of the Production Environment:
The world model is a cognitive framework that intelligent agents construct, based on their structural characteristics, to explain the world. It changes dynamically, and every interaction between the agent and the environment affects it to some extent. Large model technology, combined with industrial data, provides a shared basic version of the world model, tentatively called the “basic world model.” This “basic world model” endows EIIR with powerful comprehension abilities, enabling it to exchange information with humans in ways familiar to them. The way humans train EIIR also changes fundamentally. Using natural language, images, videos, action demonstrations, and so on, humans can establish an interactive “demonstration-learning-feedback” mode with EIIR and transfer knowledge through multiple rounds of dialogue. This learning process continues throughout the entire lifecycle of EIIR.
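The two properties described above, a shared basic version and a model that shifts a little with every interaction, can be sketched in a deliberately simplified form. The example below is a lookup table updated with an exponential moving average, not a large model; the class name WorldModel, its methods, and the state and action labels are hypothetical and serve only to illustrate the update pattern.

```python
from dataclasses import dataclass, field


@dataclass
class WorldModel:
    """Toy world model: expected outcome per (state, action) pair.

    A stand-in for illustration only; a real world model would be far
    richer than a table of numbers.
    """
    learning_rate: float = 0.2
    beliefs: dict = field(default_factory=dict)

    @classmethod
    def from_base(cls, base_beliefs, learning_rate=0.2):
        # Copy the shared base so each deployed agent adapts independently.
        return cls(learning_rate=learning_rate, beliefs=dict(base_beliefs))

    def predict(self, state, action, default=0.0):
        return self.beliefs.get((state, action), default)

    def update(self, state, action, observed):
        """Nudge the belief toward the outcome of one interaction."""
        old = self.predict(state, action)
        new = old + self.learning_rate * (observed - old)
        self.beliefs[(state, action)] = new
        return new


# Hypothetical usage: a shared prior, then feedback from repeated trials.
base = {("part_on_tray", "grasp"): 0.5}
model = WorldModel.from_base(base)
for _ in range(20):
    model.update("part_on_tray", "grasp", observed=1.0)
```

After the loop, the deployed model's belief for this pair has moved close to the observed outcome, while the shared base dictionary is unchanged.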
The Key to EIIR: Intelligent Flexible Adaptation
How can standard, general-purpose EIIR products quickly acquire the ability to perform specific production tasks, and how can human professional skills be transferred to EIIR easily? The core lies in achieving “intelligent flexible adaptation” through human-machine interaction. ChatGPT is an example: it established an efficient way for humans and machines to communicate, fundamentally breaking the barriers of human-machine communication and changing the paradigm of human-machine interaction.
EIIR supported by large models will completely reverse the relationship between humans and machines. Humans can communicate with EIIR in their own customary ways, such as natural language, body language, movements, and behavioral demonstrations, fundamentally breaking the semantic isolation between humans and machines. In terms of software, the support of large models gives EIIR the ability to learn quickly and ensures intelligent flexibility. In terms of computing hardware, as chip technology develops, the functional boundary between software and hardware will blur, and the trend of “software hardening” will become increasingly obvious; with greater computing power and integration density, EIIR’s computing power density will improve qualitatively. In terms of mechanical configuration, the wide application of new materials and new technologies will give EIIR a greater variety of external forms and may even let it adjust its mechanical structure in real time according to the requirements of the task. This ability most faithfully fulfills the fundamental requirement of Embodied Intelligence Theory and achieves the deepest integration of intelligence and the body.
EIIR: The Future Is Here
The historical mission of EIIR, from its birth, is to take over the production of material goods in human society and provide continuous material support for human development. This is also its only historical destination. Because EIIR is a machine that advances with technology, its development is bound to be gradual. In the early stage, it will coexist with humans in the same production environment for a long time. As technology develops, its level of intelligence will rise, and more and more tasks will be completed independently, without human collaboration. In the advanced stage of development, a true “unmanned factory” will be realized. At that stage, the organizational form of factories and production lines will be completely different from today’s, and human beings will be completely liberated from the material production that alienates them. The role this will play in the development of human society is immeasurable, and it will greatly accelerate the pace of human self-liberation. Admittedly, this will be a long process, but it is worth looking forward to and worth our efforts, because it will eventually come!