The Era of Embodied AI: From Modular Design to End-to-End Systems
A deep conversation with Dr. Benjamin Eisner, an ML/robotics researcher at Carnegie Mellon University (CMU), a world-class institution for computer science. We explored the evolution of building robot brains, from the era driven by hand-written mathematical equations to the age where robots learn via End-to-End systems, and investigated why a task as simple for a human as folding clothes remains a monumental challenge for robots.
"This is, like, you know, I've been doing this for, I guess, 7 or 8 years, maybe research in general for 10, and this is, by far the most exciting time for robotics, that I've seen, seen so far, in terms of both the way that technology is progressing, but also the way in which, like, the rest of the world is starting to pay attention. We're starting to see really cool results, and, you know, the kind of excitement around industry and commercial applications is really exciting, too."
The Evolution of Robotics
Dr. Benjamin began by tracing the roots of robotics. In the past, building a robot wasn't about AI; it was purely engineering and physics. In the early days, or what we might call the Modular Design era, robots were driven by Control Theory and rigid physics equations written by humans. The system was clearly divided into perception, planning, and actuation. This method was highly successful in controlled, predictable environments, such as robotic arms on car assembly lines repeating tasks with millimeter precision, but it had a fundamental limitation: it couldn't handle change. If the environment shifted even slightly, the system could fail immediately.
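To make the contrast with later approaches concrete, here is a minimal sketch of a modular perception-planning-actuation pipeline. All names and stub implementations are hypothetical illustrations, not code from the conversation:

```python
from dataclasses import dataclass

@dataclass
class Pose:
    x: float
    y: float

def perceive(image) -> Pose:
    # Perception: estimate where the target is. Classic systems used
    # geometric vision or fiducial markers; stubbed here.
    return Pose(0.0, 0.0)

def plan(start: Pose, goal: Pose, steps: int = 10) -> list[Pose]:
    # Planning: a straight-line interpolation stands in for a real
    # motion planner.
    return [
        Pose(start.x + (goal.x - start.x) * t / steps,
             start.y + (goal.y - start.y) * t / steps)
        for t in range(1, steps + 1)
    ]

def actuate(path: list[Pose]) -> None:
    # Actuation: a real controller (e.g. PID) would track each waypoint.
    for wp in path:
        print(f"move to ({wp.x:.2f}, {wp.y:.2f})")

# A fixed chain: each stage trusts the one before it, which is why a
# small environmental change at any stage can break the whole pipeline.
actuate(plan(perceive(image=None), goal=Pose(1.0, 0.5)))
```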
Over the past 10–15 years, we began seeing a shift toward a hybrid era that integrated Machine Learning. A clear example is the evolution of self-driving cars. Leading tech companies like Waymo started using Deep Learning to solve the perception problem, enabling cars to distinguish trees, people, and other vehicles. However, the mechanical brain responsible for decision-making and control still relied largely on traditional Model-based systems.
The real turning point occurred in the last 5 years, as AI technology leaped forward to the concept of End-to-End Systems: using massive Neural Networks to control the entire process, from receiving camera images to commanding the motors, without relying on step-by-step physics equations written by humans.
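As a rough illustration of what "pixels to motor commands" means in code, here is a minimal PyTorch-style sketch; the architecture and sizes are invented for this example, not taken from any specific system:

```python
import torch
import torch.nn as nn

class PixelsToTorques(nn.Module):
    """One network maps camera images directly to motor commands;
    no hand-written perception or planning stages in between."""

    def __init__(self, num_joints: int = 7):
        super().__init__()
        self.encoder = nn.Sequential(           # visual feature extractor
            nn.Conv2d(3, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.policy = nn.Sequential(            # action head
            nn.Linear(64, 128), nn.ReLU(),
            nn.Linear(128, num_joints),         # one command per joint
        )

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        return self.policy(self.encoder(image))

policy = PixelsToTorques()
frame = torch.rand(1, 3, 128, 128)              # a dummy camera frame
torques = policy(frame)                         # shape: (1, 7)
```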
Dr. Benjamin explains that the robotics world is shifting to End-to-End systems because "the real world is too complex for physics equations to fully explain." This is especially true for friction and contact. Imagine asking a robot to fold clothes or handle granular materials like sand. Writing code to calculate the forces on objects that constantly change shape is practically impossible. End-to-End systems remove this limitation by letting the AI learn these physical properties from data instead of programmed commands.
However, Dr. Benjamin insisted that Modular Design isn't gone. If a job has a clear structure and demands 100% accuracy, like making 100 bars of soap a minute, the old approach is still better: it is verifiable and precise, unlike a learned system, where we sometimes don't know what it is thinking.
Imitation and Trial-and-Error in Robots
When robots have to learn on their own, how do we teach them? Dr. Benjamin compares the two main approaches currently in use:
- Imitation Learning: This is like showing a robot a demonstration video of human actions and having it try to copy them. The advantage is that the robot learns basic movements quickly. The downside is a lack of deep understanding, like someone who watches tennis tutorials and memorizes the moves but fails on the court, because they can't judge timing or force without the feedback of actually hitting the ball.
- Reinforcement Learning (RL): This is comparable to letting the robot learn through trial and error. Success brings a reward; failure brings a penalty. This method helps robots achieve true mastery, but it comes at a very high cost in time and resources: a robot might fail millions of times before getting good, which is unacceptable when training on expensive physical hardware.
The current solution is a Hybrid Approach, similar to how ChatGPT was trained: start with Imitation Learning to give the robot a foundational understanding, then apply Reinforcement Learning to fine-tune those skills for precision.
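A heavily simplified sketch of that two-stage recipe is below. The losses and names are illustrative (a minimal behavior-cloning stage followed by a REINFORCE-style fine-tuning step), not the actual training code of any system discussed:

```python
import torch
import torch.nn as nn

# A tiny stand-in policy: 32-dim state in, 7 joint commands out.
policy = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 7))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

def imitation_step(state: torch.Tensor, expert_action: torch.Tensor) -> float:
    """Stage 1 (behavior cloning): regress the expert's action."""
    loss = nn.functional.mse_loss(policy(state), expert_action)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

def rl_finetune_step(state: torch.Tensor, reward: float) -> None:
    """Stage 2 (REINFORCE-style): reinforce actions that earned reward."""
    mean = policy(state)
    dist = torch.distributions.Normal(mean, 0.1)   # exploration noise
    action = dist.sample()
    loss = -dist.log_prob(action).sum() * reward   # policy gradient
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Usage: pretrain on demonstrations, then fine-tune with rewards.
demo_state, demo_action = torch.rand(1, 32), torch.rand(1, 7)
imitation_step(demo_state, demo_action)
rl_finetune_step(torch.rand(1, 32), reward=1.0)
```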
When Robots Begin to Have Imagination and Understanding
A key highlight involves Dr. Benjamin's research: FlowBot3D and TAX-Pose, which attempt to address understanding and common sense in a robotic context.
FlowBot3D: Teaching Possibility
Normally, a robot might be taught "this is a door." FlowBot3D goes further by teaching the robot to understand the mechanism of Articulated Objects through vision. The research team trains robots to analyze geometry: seeing a hinge or a handle, for instance, triggers a "physical imagination" of how the object moves and in what direction. This allows the robot to open cupboards, ovens, or doors in unfamiliar locations immediately, without pre-programming.
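In code, the core idea might look roughly like the sketch below: predict, for every point on a scanned object, the direction it would move if the part were actuated, then act where that predicted motion is largest. The network and shapes here are stand-ins (real systems use a proper point-cloud backbone), not the paper's actual implementation:

```python
import torch
import torch.nn as nn

class PointFlowNet(nn.Module):
    """For each 3D point on an object, predict the direction it would
    move if the articulated part were actuated."""

    def __init__(self):
        super().__init__()
        # A per-point MLP stands in for a real point-cloud backbone.
        self.mlp = nn.Sequential(
            nn.Linear(3, 64), nn.ReLU(),
            nn.Linear(64, 64), nn.ReLU(),
            nn.Linear(64, 3),  # one 3D flow vector per point
        )

    def forward(self, points: torch.Tensor) -> torch.Tensor:
        return self.mlp(points)  # (N, 3) points -> (N, 3) flows

net = PointFlowNet()
cloud = torch.rand(2048, 3)        # e.g. a scanned cupboard door
flow = net(cloud)                  # predicted motion per point

# Act where the predicted motion is largest: grasp that point and
# pull along its flow direction to open the part.
best = flow.norm(dim=1).argmax()
grasp_point = cloud[best]
pull_direction = flow[best] / flow[best].norm()
```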
TAX-Pose: Teaching Spatial Relationships
If FlowBot3D is about understanding how an object works, TAX-Pose answers what to do with it by learning from observing humans. For example, when clearing a table, humans don't just place plates down randomly; they stack them. Robots must learn these Spatial Relationships to work alongside humans naturally and meaningfully.
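As a toy illustration of what a learned spatial relationship buys you (all values invented for this example), a skill like "the next plate goes on top of the stack" can be stored as a relative transform between objects and applied wherever the stack happens to be:

```python
import numpy as np

def make_pose(x: float, y: float, z: float) -> np.ndarray:
    """A 4x4 homogeneous transform with identity rotation (toy example)."""
    T = np.eye(4)
    T[:3, 3] = [x, y, z]
    return T

# A learned spatial relationship: "the next plate goes 2 cm above the
# top of the stack" (values invented for illustration).
plate_relative_to_stack = make_pose(0.0, 0.0, 0.02)

# Wherever the stack actually is in the world, composing the transforms
# gives the placement target -- the relationship itself generalizes.
stack_in_world = make_pose(0.60, -0.10, 0.15)
target_in_world = stack_in_world @ plate_relative_to_stack
print(target_in_world[:3, 3])  # -> [ 0.6  -0.1   0.17]
```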
Sim-to-Real Gap and the Data Scarcity Crisis
No matter how smart the AI, the biggest challenge when it is embodied in a robot is the Sim-to-Real Gap: the disparity between the simulated world and the real world. Dr. Benjamin explains that while simulators have improved significantly, simulating micro-physics, like contact forces, friction, or even the delay between a command and the motor's response, remains imperfect. A robot that is a genius in simulation often becomes clumsy in the real world.
We also face Data Scarcity, which Dr. Benjamin illustrates with a data triangle:
- The Base: Internet data (YouTube videos, photos). Massive in quantity, but unusable for direct robot control.
- The Middle: Simulation data. Can be generated in bulk and adds visual diversity, but lacks physical realism.
- The Peak: Real-world data. The most critical part for finishing the job, but the scarcest and hardest to collect.
Currently, many companies are addressing this by building "robot farms," where humans remotely operate robots 24/7 to collect accurate movement data for teaching the AI.
Where Will Robots Be in 3–5 Years?
When asked about real-world adoption, Dr. Benjamin believes we won't see a The Jetsons-style maid robot doing everything just yet. However, Embodied AI will permeate two main areas:
- Industrial & Logistics: Robots won't just lift heavy loads; they will be more dexterous. They will handle objects that stumped older robots, like grabbing floppy plastic bags, unpacking boxes, or picking agricultural produce. This will fill gaps in production processes that previously required human hands.
- Consumer Market: Robots will enter homes as high-tech toys or specialized assistants such as robots that pick up children's toys or simple robotic arms on mobile bases. While they won't perfectly perform complex chores like washing dishes or folding clothes yet, they will be the starting point for people to get used to having robots moving around their homes.
The conversation with Dr. Benjamin reveals that the era of Embodied AI is not just a software upgrade; it is a revolution in how computers interact with the physical world. We are moving from systems that strictly follow commands to systems that learn, adapt, and understand the physics of the world through their own experience. Although challenges lie ahead, the direction of innovation clearly indicates that robots in the near future will possess a "human-ness" in terms of learning and movement far beyond what we previously imagined.
Watch the full episode here: https://youtu.be/SXFZc5d0bGs?si=7d0KIKX0MVA3hGyT