A humanoid robot facing a person in a warm home environment

The conversational AI moment has already arrived. I wrote about it recently — I spent several hours testing Claude, Gemini, and ChatGPT's voice modes back-to-back, and by the end I was having what can only be described as genuinely good conversations. Not "impressive for a machine" conversations. Just good ones. The kind where you forget, for stretches, that the other party doesn't breathe.

ChatGPT won that test. Its voice mode felt alive in a way the others didn't — natural interruptions, real emotional register, a sense of actual listening. If you haven't read the full breakdown, I tested all three so you don't have to. But I want to pick up where that article left off, because there was a thread I didn't pull on: what happens when that same conversational intelligence gets a physical body?

Because right now, we're talking to a box. And that matters more than we usually admit.

The Box Problem

I have an old friend who moved to another city a few years ago. We still talk on the phone. The conversations are good — warm, occasionally funny, sometimes serious. But there's something different about seeing him in person. Something the phone cannot replicate no matter how clear the audio.

That's not nostalgia. It's closer to neuroscience.

Humans evolved in a world where every entity that mattered was physical. Predators had bodies. Tribe members had faces. Resources had weight. For hundreds of thousands of years, presence meant physical presence — and our brains built deep systems to respond to that. We read body language automatically. We mirror posture without thinking. We feel the calming effect of physical proximity to someone we trust. We are wired, at a very low level, for faces and gestures and the shared occupation of a space.

None of that fires when you're talking to a speaker on a desk.

I genuinely enjoy my conversations with ChatGPT voice mode. I find them useful and often engaging. But I am also aware, every single time, that I am talking to a box. The interface is invisible. There's nothing to look at. Nothing that moves, or tilts its head, or meets your eyes. For humans who have spent their entire lives building meaning through faces and presence, that gap is real — and it creates a ceiling on how much the technology can actually do.

Conversational AI crossed a threshold. The conversation is real. But the container is still a box — and the container matters more than we admit.

Three Categories of Physical AI That Already Exist

The gap is starting to close. Not all at once, not in a single product, but across three distinct categories that are already moving from labs into the real world.

Humanoid robots. The flashiest category. Figure AI, backed by Microsoft and OpenAI, demonstrated something remarkable in early 2024: a full-scale humanoid robot having an unrehearsed conversation with a human, answering questions about what it could see, reasoning about tasks in real time, and using its hands to pick up objects — all powered by OpenAI's language models. Tesla's Optimus is walking factory floors. Boston Dynamics' Atlas, which most people know for its acrobatics, is now controlled by learned AI policies rather than hand-coded motion scripts. These robots are not science fiction anymore. They are a few years from being in homes.

Desktop companions. Smaller, less imposing, designed to sit on your desk or coffee table. Eilik is a small, palm-sized expressive robot that responds to touch and voice with distinct personality shifts. Vector, originally from Anki, was an early example of the idea: a tiny robot with a face, curiosity routines, and simple speech recognition. The category is still underdeveloped, but it's pointing at something important. You don't need a full humanoid to get the embodiment benefit. A face and a few expressive movements can be enough to completely change the feel of the interaction.

Companion and therapeutic robots. PARO is a baby harp seal robot developed by Japan's National Institute of Advanced Industrial Science and Technology. It moves, it makes sounds, it responds to touch and voice. No screen, no keyboard, no traditional interface — just a soft physical presence. And it has been used in hundreds of care homes across Japan, Europe, and the United States with results that are, frankly, striking. A 2007 study by Wada and Shibata published in IEEE Transactions on Robotics found that elderly residents who interacted with PARO showed significant reductions in stress hormones, depression scores, and anxiety — outcomes comparable to animal-assisted therapy, which requires actual animals. PARO isn't particularly intelligent. Now imagine what PARO becomes when you connect it to a language model that can actually talk back.

What the Research Actually Says About Embodiment

There is a body of work in cognitive science called embodied cognition — the idea that intelligence isn't a disembodied calculation happening in isolation, but something fundamentally shaped by having a body that acts in the physical world. Andy Clark, in his 1997 book Being There: Putting Brain, Body, and World Together Again, argued that mind, body, and environment are so deeply coupled that separating them produces a fundamentally incomplete picture of intelligence. The physical is not incidental to the mental. It's constitutive of it.

For AI, this is increasingly being validated in practice. The field of physical AI — building systems that don't just process but act in the world — is accelerating partly because researchers found that embodied learning is more efficient and more transferable than purely computational learning. A robot that has to navigate a real room develops generalizations that a model trained only on text cannot.

But the embodiment argument applies equally to the human side of the interaction.

Kerstin Dautenhahn, one of the leading researchers in social robotics, documented extensively how humans respond differently to robots depending on their physical form — and specifically how physical social cues (eye contact, proximity, gesture, facial expression) activate different cognitive and emotional responses than text or voice alone. Her 2007 survey in Philosophical Transactions of the Royal Society B laid out the dimensions of socially intelligent robot design, and the core finding holds: physical presence is not cosmetic. It changes the interaction at a fundamental level.

Cynthia Breazeal's work at MIT — starting with Kismet in the late 1990s and continuing through decades of social robotics research — repeatedly demonstrated that robots with expressive faces and social behaviors elicit genuine emotional responses from humans, including from adults who know perfectly well they're talking to a machine. The face matters. The gesture matters. The directionality of gaze matters. We know these things intellectually. We evolved to respond to them instinctively.

Recall the 2007 PARO study: a robot seal with no real intelligence produced outcomes comparable to animal-assisted therapy. That's what a body does, even without a brain behind it. Now imagine the brain arriving.

The Convergence Moment

Until recently, two worlds were mostly separate. Social robotics had physical presence but limited intelligence — PARO cannot hold a conversation. Conversational AI had intelligence but no physical presence — ChatGPT voice mode cannot look at you.

That separation is ending.

What Figure AI demonstrated in 2024 is what happens when you take a physical robot — one that can walk, see, and use its hands — and connect it to a language model that can think, reason, and talk. The robot answers questions about the room it's in. It identifies what it can see. It decides which object to hand you and why. The conversation is real. The body is real. The intelligence is real. What you get is something qualitatively different from either a robot or a chatbot in isolation.

This is the convergence: conversational AI and physical AI meeting inside the same body. And if the conversational half of that equation is already at the level I described in my voice AI test — genuinely good, genuinely surprising — then what happens when it gets the physical half?

Who This Changes First

Not everything will transform at once. But some areas are going to feel this faster than others.

Elder care. PARO already has FDA medical device clearance in the United States. The leap from PARO to an AI companion that can hold real conversations, remember your preferences, notice when you seem sad, and actually respond with words is not a huge engineering leap — it's a connection waiting to be made. For the tens of millions of aging adults facing a genuine loneliness crisis, and for a care sector that has a severe workforce shortage, a physical AI companion that can talk and be present is not a gimmick. It's a genuine answer to a real problem.

Children and education. Kids form attachments to physical objects in ways that adults have been largely trained out of. A physical AI tutor that a child can talk to, look at, and interact with in three dimensions is a fundamentally different object from a learning app on a tablet. The engagement level is different. The emotional relationship is different. There's a reason LeapFrog sold physical devices rather than software subscriptions — presence matters for children in ways it doesn't always matter for adults.

Mental health and companionship. This is the sensitive one, and also the one where the stakes are highest. Companion AI is already being built, and physical companion AI is not far behind. The more interesting question is not whether it will happen, but whether it gets built with the research in mind — what actually helps people, what creates dependency, what the right design boundaries are.

Retail and logistics. Humanoid robots that can carry, navigate, and hold a conversation are already being tested in warehouses and factories. The next step — and several companies are actively building it — is a robot that can greet customers, answer questions, and guide you through a space. The conversational piece was always the bottleneck. It isn't anymore.

The Quiet Shift

We spent years debating whether AI voices would ever feel real enough. They do now. ChatGPT voice mode crossed that threshold — I tested it, and I felt it. The conversation was genuinely good.

But we talk to each other in bodies. We grew up with parents who had faces. Our first relationships were tactile — we were held, and we held things back. The idea that intelligence could be fully disembodied is very recent, and very strange, in the history of how humans relate to each other and to the world around them.

Physical AI won't solve everything. It introduces its own complications — questions of trust, dependency, ethics, what it means to form a relationship with something that isn't alive. None of that is trivial and none of it should be dismissed.

But the physics of human psychology are not going to change. A box will always be a box. Give the intelligence a body — give it a face that can look at you and a presence that actually occupies the room — and you're not just improving the interface. You're tapping into every social and emotional instinct humans have built over hundreds of thousands of years of evolution.

That's not a small upgrade. That's a different product entirely. And it's closer than most people think.

Jaime Delgado

Product Analyst & AI early adopter

Jaime has been tracking the AI landscape since the GPT-3 era. He writes about AI capabilities, model comparisons, and practical applications for builders and founders. His daily driver is Claude inside Visual Studio Code — though he also reaches for Grok, Gemini, and ChatGPT when the question is quick and the context is light. He stays genuinely open to every AI that comes along: the landscape moves fast, and so does he. Based in Spain.