Beyond Words: LLMs and the Dawn of Physical AI

Two Worlds About to Collide

April 9, 2025

Join 200+ CIOs, CDOs, Directors, and decision-makers as we unpack how AI is truly transforming our industries—not through buzzwords or headlines, but through hard science.

As an engineering-led consultancy, we go deep into the research and bring it back to real-world applications we see today in Oil & Gas, Utilities, Supply Chain and other critical industries.

This week, Bilel Said (Senior ML Engineer) explores the pitfalls and limitations of large language models (LLMs) — and makes the case for why we should invest in, and stay curious about, systems that learn within contextual environments. Heads up: this article touches on some technical concepts. If you get lost, no worries, we've got you covered in the 'Simply put' sections.



This is Bilel:

In February 2024, a robot named Figure 01 made itself a cup of coffee.

It wasn't following pre-programmed instructions. Instead, it was having a conversation with a human, asking questions when uncertain, and adapting to the specific environment in real-time.

What made this possible? The integration of a large language model (GPT-4) with a physical robot—a glimpse into what happens when AI steps out of the digital realm and into our physical world.

The Power and Limits of Language Models

Large Language Models (LLMs) like ChatGPT, Claude, and Gemini have transformed how we interact with technology. These systems process and generate human language with remarkable fluency, having been trained on trillions of words’ worth of human knowledge.

But there's a fundamental limitation to what I call "digital-only AI":

LLMs are like brilliant authors who have read every book ever written but never left their study. They can write eloquently about the beach—the sound of waves, the feel of sand between your toes—without ever having experienced it.

This disconnection from physical reality creates surprising blindspots:

  • An LLM can write a detailed recipe for baking bread, but it cannot taste, smell, or feel the dough
  • It can describe how to balance a broom, but it has no intuitive understanding of gravity or momentum
  • It can define "soft" without ever experiencing softness

As Rodney Brooks, former director of MIT's Computer Science and Artificial Intelligence Laboratory, puts it: "Language models give us a glimmer of intelligence, but true AI will need to be embodied in the physical world to understand it the way we do."

Simply put: LLMs are powerful pattern recognizers, but they lack real-world grounding and genuine reasoning, which leads to blind spots.

The Rise of Physical AI

While LLMs have dominated headlines, remarkable progress has been happening in robotics and physical AI:

  • Boston Dynamics' Atlas has demonstrated parkour, gymnastics, and complex manipulation
  • Tesla's Optimus prototype has shown promising dexterity at a fraction of competitor costs
  • Figure AI secured $675M in funding to develop commercially viable humanoid robots

Yet these physical systems face their own limitations:

  • They require enormous energy (hundreds to thousands of watts versus the ~20 watts of a human brain)
  • Their dexterity remains far below human capability—achieving only 30–60% of human performance in fine manipulation.
  • They struggle to generalize from one task to similar tasks

As one roboticist told me: "Current robots are like toddlers with the strength of adults – physically capable but still learning basic coordination and common sense."

Simply put: Robots are becoming physically capable but still lack the cognitive flexibility and adaptability of humans.

The Integration Has Begun

What happens when you combine the reasoning capabilities of LLMs with robots that can physically interact with the world?

We're beginning to find out:

  • Figure AI has integrated GPT-4 with their humanoid robot to enable natural language instruction-following
  • Tesla is combining their Optimus platform with language understanding for more intuitive control
  • Sanctuary AI's Phoenix robot uses language models to translate human commands into physical actions

This integration addresses fundamental limitations of both technologies:

When combined, large language models and physical robots create a powerful synergy: LLMs contribute high-level reasoning, contextual understanding, task planning, and broad knowledge, while robots bring real-world interaction, sensory perception, physical manipulation, and environmental feedback.
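
To make that division of labour concrete, here is a deliberately simplified sketch of the pattern: the language model turns an instruction into named steps, and the robot's control stack executes the steps it actually has skills for. The plan_with_llm function and the skill names are hypothetical placeholders for illustration, not any vendor's real API.

```python
# Illustrative only: a toy "LLM plans, robot acts" loop.
# plan_with_llm() and the robot skill functions are hypothetical stand-ins,
# not the API of Figure, Tesla, or Sanctuary AI.

def plan_with_llm(instruction: str) -> list[str]:
    """Pretend LLM call: turn a natural-language instruction into named steps."""
    # A real system would prompt a language model here.
    return ["locate_cup", "grasp_cup", "place_under_machine", "press_brew_button"]

# Low-level skills the robot actually knows how to execute.
ROBOT_SKILLS = {
    "locate_cup":          lambda: print("vision: cup found at (0.4, 0.1, 0.0)"),
    "grasp_cup":           lambda: print("arm: closing gripper on cup"),
    "place_under_machine": lambda: print("arm: moving cup under coffee machine"),
    "press_brew_button":   lambda: print("arm: pressing brew button"),
}

def execute(instruction: str) -> None:
    for step in plan_with_llm(instruction):
        skill = ROBOT_SKILLS.get(step)
        if skill is None:
            # The LLM proposed something the body cannot do -- a key failure mode.
            print(f"no skill for '{step}', asking a human for help")
            continue
        skill()

execute("Make me a cup of coffee")
```

The hard engineering lives in the parts this sketch hides: grounding each named skill in perception and control, and handling steps the planner proposes that the hardware cannot safely perform.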

Simply put: LLMs and robots are being merged to create AI systems that can both think and act, addressing the limitations of each.

The Economic Impact Is Coming

The numbers tell a compelling story about what's at stake:

  • McKinsey estimates that generative AI could add $2.6–4.4 trillion annually to the global economy
  • The industrial robotics market alone is projected to reach $116.8 billion by 2030
  • Oxford Economics projects 20 million manufacturing jobs could be replaced by robots by 2030
  • Gartner predicts that by 2026, the shortage of skilled robot technicians will reach 85%

Don't Be Fooled by the Videos

A word of caution: viral videos of robots often showcase carefully orchestrated demonstrations. The reality is more nuanced:

  • Most demonstrations show robots performing specific pre-programmed routines
  • Tasks that seem simple to humans (like folding laundry) remain enormously challenging for robots
  • A child can learn to tie shoes after a few attempts; robots struggle after thousands of trials

Simply put: Despite impressive demos, real-world robot intelligence and adaptability remain at an early stage.

The Road Ahead

Researchers are actively working to bridge the gap between language understanding and physical capability:

  1. Grounded language learning: Teaching language models to understand words by connecting them to physical sensations and interactions
  2. Foundation models for robotics: Creating pre-trained models for robotics similar to how LLMs function for language
  3. Sim-to-real transfer: Training robots in simulations before deploying them on physical hardware (see the sketch after this list)
  4. Common sense physics: Developing AI that understands intuitive physics as humans do
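
As a small illustration of sim-to-real transfer, the sketch below shows domain randomization, one common technique: every training episode samples different physical parameters so the learned behaviour cannot overfit to one idealised simulator. The one-line "physics engine" here is a made-up stand-in for demonstration, not MuJoCo or Isaac Sim.

```python
# Illustrative sketch of domain randomization for sim-to-real transfer.
# The "simulator" is a hypothetical one-line physics stub; the point is that
# each episode samples different physical parameters, so the chosen behaviour
# has to work across many plausible worlds, not just one idealised one.

import random

def simulate_push(force: float, friction: float, mass: float) -> float:
    """Toy stand-in for a physics engine: distance a block slides when pushed."""
    return max(0.0, (force - friction * mass * 9.81) / mass)

def sample_randomized_world() -> dict:
    """Each episode gets slightly different physics, like the messy real world."""
    return {
        "friction": random.uniform(0.2, 0.8),   # unknown surface
        "mass":     random.uniform(0.5, 2.0),   # unknown payload
    }

# "Train" a very simple policy: find one push force that keeps the block
# within the target zone (0.5 to 1.5 m) across many randomized worlds.
best_force, best_score = None, -1.0
for force in [5, 10, 15, 20, 25]:
    score = sum(
        1.0 for _ in range(1000)
        if 0.5 <= simulate_push(force, **sample_randomized_world()) <= 1.5
    )
    if score > best_score:
        best_force, best_score = force, score

print(f"most robust push force across randomized worlds: {best_force} N")
```

A policy tuned to one exact friction value would score poorly here; the same idea, scaled up, is what helps simulation-trained controllers survive contact with real hardware.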

Simply put: Next-gen AI will learn from real-world interactions to achieve human-like understanding and adaptability.

Why This Matters for Everyone

The integration of LLMs with physical AI isn't just another technological advancement—it represents a fundamental shift in what AI can do in our world.

Near-term applications include:

  • Warehouse automation (2024-2026): 50-70% reduction in manual labor for picking and packing
  • Elderly care assistance (2025-2028): Support for aging in place, medication management
  • Construction automation (2025-2030): 15-30% increase in efficiency, addressing labor shortages
  • Home assistance (2026-2032): General-purpose robots for household tasks

As Fei-Fei Li, co-director of Stanford's Human-Centered AI Institute, notes: "The future of AI is not just large language models, but systems that can see, hear, touch, and interact with the world as humans do."

Simply put: This shift to physical AI will have tangible benefits in daily life, reshaping work, caregiving, and household tasks.

Mathematical Illustration

"If you are interested in human-level AI, don't work on LLMs." Yann LeCun, Chief AI Scientist at Meta

The core of a language model is the ability to predict the probability of the next token given a history of previous tokens.

Formally:

p(w_i | w_0, w_1, ..., w_{i-1})

Unlike humans, a large language model (LLM) doesn’t “think” about what to say next. Instead, it calculates the most statistically likely next token based on patterns it has learned from trillions of word sequences during training.

This is pattern recognition at scale—not consciousness, and not reasoning.
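
Here is a toy version of that prediction step, written out in code: a hand-built bigram model that estimates the probability of the next word from counts. Real LLMs condition on the entire history with a transformer rather than a single previous word, but the "pick the statistically likely continuation" mechanic is the same.

```python
# Toy next-token predictor: p(w_i | w_{i-1}) estimated from counts.
# Real LLMs condition on the full history with a transformer; this bigram
# model only looks one token back, but the prediction step is the same idea.

from collections import Counter, defaultdict

corpus = "the robot picked up the cup and the robot poured the coffee".split()

# Count how often each word follows each other word.
following = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    following[prev][nxt] += 1

def next_token_distribution(prev: str) -> dict[str, float]:
    """p(w_i | w_{i-1}) as relative frequencies."""
    counts = following[prev]
    total = sum(counts.values())
    return {word: c / total for word, c in counts.items()}

print(next_token_distribution("the"))
# {'robot': 0.5, 'cup': 0.25, 'coffee': 0.25} -- the model "writes" by
# repeatedly choosing a statistically likely continuation, nothing more.
```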

Meanwhile, research is progressing beyond transformer-based architectures, aiming to achieve human-level intelligence through systems that can learn by interacting with their environment. These new approaches are fundamentally different from the attention mechanisms that revolutionised natural language processing.

Take the JEPA (Joint Embedding Predictive Architecture), introduced by Yann LeCun, for example. It focuses on forecasting future states in an abstract representation space, not the raw data space. This allows the model to capture underlying structure and relationships in the data, enabling better generalisation and a deeper form of understanding.

At the heart of this approach is the idea of minimising an energy function:

E(x, y, z; θ)

Here, x is the observed context, y is the outcome to be predicted, z is a latent variable capturing details that cannot be inferred from x alone, and θ represents the model's parameters. A low energy means the prediction and the observation are compatible.
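
As a rough, illustrative sketch (not LeCun's published architecture), the energy can be read as a distance in representation space: encode x and y, predict y's embedding from x's embedding and the latent z, and measure how far off the prediction is. The module sizes and the single training step below are arbitrary choices made for demonstration.

```python
# Minimal JEPA-flavoured sketch (illustrative only, not the published architecture):
# the "energy" E(x, y, z; theta) is the distance, in embedding space, between the
# predicted representation of y and its actual representation.

import torch
import torch.nn as nn

DIM = 32

class Encoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(64, DIM), nn.ReLU(), nn.Linear(DIM, DIM))

    def forward(self, obs):
        return self.net(obs)

encode_x = Encoder()                  # encodes the observed context x
encode_y = Encoder()                  # encodes the target y (e.g. a future observation)
predictor = nn.Linear(DIM + 8, DIM)   # predicts y's embedding from x's embedding and a latent z

def energy(x, y, z):
    """E(x, y, z; theta): low when the prediction matches the target's embedding."""
    s_x, s_y = encode_x(x), encode_y(y)
    s_y_hat = predictor(torch.cat([s_x, z], dim=-1))
    return ((s_y_hat - s_y) ** 2).mean()

# One illustrative training step: push the energy of an observed (x, y) pair down.
x = torch.randn(16, 64)               # e.g. current sensor readings
y = torch.randn(16, 64)               # e.g. what the world looked like a moment later
z = torch.randn(16, 8)                # latent variable for unpredictable details

opt = torch.optim.Adam(
    list(encode_x.parameters()) + list(encode_y.parameters()) + list(predictor.parameters()),
    lr=1e-3,
)
loss = energy(x, y, z)
loss.backward()
opt.step()
print(f"energy of observed pair: {loss.item():.3f}")
```

A real JEPA also needs a mechanism to stop the encoders collapsing to a constant representation (for example, variance and covariance regularisation); that detail is omitted here for brevity.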

The goal is to make the network intelligent enough to interact with its environment and learn from its experiences, much like a human.

This kind of adaptive intelligence is essential for the next generation of smart humanoids and autonomous robots, which will need to understand context, anticipate changes, and make decisions in real time. Unlike current systems that rely on massive datasets and rigid programming, these future agents must learn continuously, reason abstractly, and operate safely in dynamic, unpredictable environments.

Simply put: LLMs predict tokens; newer architectures aim for real world understanding through modelling environments and relationships abstractly.

The Historical Context

To appreciate where we're headed, consider how we got here:

Era                              | Focus                  | Physical Capabilities
Early AI (1950s-1980s)           | Rule-based systems     | Fixed-program industrial robots
Statistical AI (1990s-2010s)     | Machine learning       | More flexible industrial robots
Deep Learning (2010s-2020)       | Neural networks        | Better perception, still limited
Foundation Models (2020-Present) | LLMs, multimodal AI    | Beginning integration with robotics
Embodied AI (Emerging)           | Physical intelligence  | General-purpose physical AI

Simply put: AI is moving from abstract problem-solving to systems that perceive, move, and act in the physical world.

What Comes Next

The transition from LLMs to embodied AI will be like the evolution from books to interactive experiences—moving from passive information to active engagement with the world.

This won't happen overnight. The technical challenges are immense, from energy efficiency to mechanical design to software integration.

But the direction is clear: AI is stepping out of the digital realm and into the physical world.

The question isn’t whether this transformation will happen, but how quickly—and whether we’re prepared for a world where AI doesn’t just talk, but acts.

Let’s start a conversation

Get In Touch

Interested in finding a solution with Serious AI?
