Physical intelligence
Robotics and the next frontier of AI, Part 1 of a series
Artificial intelligence has made stunning advances in the domains of language and images. The next front in its advance is likely to be just as consequential: mimicking the everyday, animal intelligence of embodied beings who are sensitive to their environment and act with finesse within it. Success in this realm would bring AI out of the screen and into the physical world. That can be disturbing to contemplate. But the conceptual insights that are driving advances in robotics are already consequential as a window onto the nature of intelligence, the deep facts of embodiment, and their intimate entanglement.
The ambition to build automatons has been with us for a long time. In Greek mythology, the smith-god Hephaestus was said to have made automatons for his workshop. The first cuckoo clock was made in Alexandria. In 8th century Baghdad, visitors were astonished to see wind-powered statues, and the author Al-Jazari “described complex programmable humanoid automata, amongst other machines he designed and constructed in the Book of Knowledge of Ingenious Mechanical Devices in 1206,” according to Wikipedia. For Descartes, the existence of automata was significant for his philosophical agenda; he believed animals were nothing more than complex machines.
In the modern era, the effort to animate machines, or to de-animate animals, has often been in the service of asserting the non-existence of the soul. Reliably, the limitations of the machines that emerge from such efforts are interesting in themselves. But the recent successes, too, illuminate the deep links between such disparate and apparently unrelated facts as these:
Animals (like us) are creatures who
-- have knowledge, however tacit and inaccessible to articulation this knowledge may be
-- have a finite amount of attention and other cognitive resources
-- have a definite physical boundary between self and not-self
-- move through the world by some characteristic mode of locomotion, within some ecological niche that affords certain actions and forecloses others. The particulars of our embodied mode of being in the world affect how we perceive the world.
-- have pragmatic interests (we are not indifferent to different outcomes). At a higher level of abstraction, it can be said that we care; we are directed toward ends.
There was a leap forward in robotics in the early 1990s that drew inspiration from a tradition in 20th century philosophy called phenomenology. One upshot of that tradition was a rejection of the mind-body dualism that had been asserted by Descartes. It was becoming clear that our capacity to apprehend the world could not be made sense of if we regarded it as confined to narrowly defined mental operations that are cleanly separable from the fact that we have bodies, as though a brain in a jar were an adequate image for thinking. Having a body makes one subject to all manner of contingent particulars that condition how we are situated in the world, in terms of perception and action and concern.
The seminal article for this phenomenologically-informed breakthrough in robotics was Rodney Brooks’ “Intelligence Without Representation” (1991). Describing his experiments in making “artificial creatures,” Brooks wrote:
“The fundamental decomposition of the intelligent system is not into independent information processing units which must interface with each other via representations. Instead, the intelligent system is decomposed into independent and parallel activity producers which all interface directly to the world through perception and action, rather than interface to each other particularly much.”
The motto of this approach is “the world is its own best model.” Recall the Slinky, that long and loose spring toy that can be set at the top of a flight of stairs. If you set it in the right initial posture spanning two stairs, it will proceed to “walk” down the stairs at a steady rhythm, without any further intervention. Its motion is a function of the height and width of the stairs (their spatial frequency in two dimensions) interacting with gravity and the spring rate of the Slinky. The control algorithm lies in this “fit” between the design and the environment, rather than in a symbolic representation of the environment that is then used to generate motor commands. Robots that exploit such “affordances” move more fluidly, consume vastly less energy, and require far less “compute”.
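As a rough illustration of the decomposition Brooks describes, here is a minimal Python sketch in which two independent “activity producers” each read a sensor directly and propose an action, with no shared model of the world. The behavior names (`avoid`, `wander`), the single distance reading, the 0.5 m threshold, and the fixed-priority arbitration in `step` are all assumptions made for illustration; this is not Brooks’ actual implementation.

```python
from typing import Callable, Optional

# Each behavior maps a raw sensor reading straight to an action, or to None
# if it has nothing to contribute. No behavior builds or consults a world model.
Behavior = Callable[[float], Optional[str]]


def avoid(distance_m: float) -> Optional[str]:
    """Turn away when an obstacle is close (hypothetical 0.5 m threshold)."""
    return "turn" if distance_m < 0.5 else None


def wander(distance_m: float) -> Optional[str]:
    """Default activity: always willing to move forward."""
    return "forward"


# Behaviors listed from highest to lowest priority (an assumed arbitration scheme).
LAYERS = [avoid, wander]


def step(distance_m: float) -> str:
    """Return the action of the first behavior that responds to the reading."""
    for behavior in LAYERS:
        action = behavior(distance_m)
        if action is not None:
            return action
    return "stop"


if __name__ == "__main__":
    for reading in (2.0, 0.3):
        print(reading, step(reading))
```

Each behavior here could be written, tested, and run on its own; the only thing they share is the world they sense, which is the point of the motto.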
The latest wave of innovation in robotics takes its inspiration from the success of large language models. To grasp the significance of this success, and the departure it represents from previous efforts, it helps to have some recent history in view. “Good old-fashioned artificial intelligence,” as it is called in cognitive science circles, sought to have an exhaustive account of the rules of language in all their permutations, and would eventually join this comprehensive grammar to an equally comprehensive lexicon. This seemed plausible enough as an aspiration: grammar plus vocabulary equals semantics. But this grammar-heavy approach of “symbolic computation” turned out to be a dead end. Among other difficulties, it seemed impossible to equip such an AI with “common knowledge,” meaning all the stuff we take for granted that never has to be specified because it just comes from living in the world and doing things in it. Lacking this agentic type of knowledge, early AIs would say things that are logically coherent but entirely absurd. Further, there seemed to be no finite library of facts, in the form of propositional statements (for example, “dogs are not able to jump over houses”), that such an AI could be equipped with that would give it the kind of common sense that we take for granted.
Around the same time that Rodney Brooks was experimenting with artificial agents that operate without symbolic representations, researchers in cognitive science became enthralled with “connectionist modelling.” The brain is a dense network of neural synapses. These are the junctions where one neuron meets another. At a first approximation, one can treat these synaptic connections as binary: an electrical signal either passes across the synaptic gap or it does not. The brain appeared to operate on a basis that resembled the zeroes and ones of a binary computer. Let us, then, try to replicate brain functions with a logic machine operating on similar principles. As early a figure as William James speculated that learning, as well as the formation of habits, may be understood as the effect of certain neural pathways being strengthened, in the sense that they become more prominent: more likely paths for electrical activity to follow in response to a given stimulus. His was perhaps the first articulation of the principle that would come to be called “neuroplasticity”.
At a finer grain than a crude on/off logic switch, and more closely resembling the biological inspiration, one can make the response of an artificial neuron variable, with an “activation function” that is non-linear as well as non-binary. This enormously enriches the combinatorial possibilities that can occur in a finite system. It comes to behave more like a “statistical” thing than like a simple logic device. In connectionist modelling, the idea is to build layer upon layer of variable-response nodes, and to vary the “weight” of the connections between nodes to achieve success of the model in replicating some rudimentary cognitive task, such as distinguishing the location of a dot on a screen. Whether (or to what extent) any given trial counted as a “success” had to be judged from outside the model, whether by some unfortunate graduate student or by some independent mechanical process.
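The contrast between an on/off logic switch and a variable-response node can be made concrete with a small Python sketch. It compares a binary threshold neuron with one that uses a sigmoid activation function, then stacks two layers of the latter. The function names, the choice of sigmoid, and the weight values are illustrative assumptions; the weights are set by hand here, whereas in connectionist modelling they would be adjusted over many trials.

```python
import math


def weighted_sum(inputs, weights, bias):
    """Sum of each input times the weight of its connection, plus a bias."""
    return sum(x * w for x, w in zip(inputs, weights)) + bias


def step_neuron(inputs, weights, bias):
    """Binary neuron: outputs 1 if the weighted sum exceeds zero, else 0."""
    return 1 if weighted_sum(inputs, weights, bias) > 0 else 0


def sigmoid_neuron(inputs, weights, bias):
    """Variable-response neuron: a smooth, non-linear output between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-weighted_sum(inputs, weights, bias)))


def layer(inputs, weight_rows, biases):
    """One layer: every node sees all inputs through its own set of weights."""
    return [sigmoid_neuron(inputs, w, b) for w, b in zip(weight_rows, biases)]


# Hand-picked weights, purely for illustration.
HIDDEN_W = [[2.0, -1.0], [-1.5, 1.0]]
HIDDEN_B = [0.0, 0.5]
OUTPUT_W = [[1.0, -1.0]]
OUTPUT_B = [0.0]


def tiny_network(inputs):
    """Two stacked layers of variable-response nodes producing one output."""
    hidden = layer(inputs, HIDDEN_W, HIDDEN_B)
    return layer(hidden, OUTPUT_W, OUTPUT_B)[0]


if __name__ == "__main__":
    print(step_neuron([1.0, 0.0], [0.5, 0.5], -0.25))
    print(sigmoid_neuron([1.0, 0.0], [0.5, 0.5], -0.25))
    print(tiny_network([1.0, 0.0]))
```

The step neuron can only say yes or no, while the sigmoid neuron reports how strongly it responds, which is what lets small changes in the weights produce small, gradable changes in the output.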
Once that determination could be accomplished from within the model, such models became reflexive or self-reinforcing. Such reflexive “neural net” architecture would eventually become the basis of machine learning. Deeply-layered artificial neural networks became very good at detecting patterns, and began to acquire capabilities that were “emergent”, meaning that they cannot be deduced or predicted ahead of time relying only on a knowledge of how the system is constructed. They must be observed, as one would observe a biological entity to learn its behavior.
It is with this “deep learning” architecture that large language models would eventually be built. Crucial to the endeavor are massive amounts of data, used to train the model on the patterns that are extant in real human language. Though they still sometimes “hallucinate,” LLMs are now able to mimic the common sense of a real person simply because instances of dogs jumping over houses (to take my previous example) are rare in human utterances. With enough data, this absence, like any pattern, can be detected and replicated in the LLM’s output.
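To see how an absence can count as a pattern, consider a toy Python sketch that simply counts which words appear together in a handful of sentences. The four-sentence corpus is invented for illustration, and counting word pairs is a crude stand-in chosen for clarity: an actual LLM learns statistical regularities through trained network weights, not through an explicit table like this one.

```python
from collections import Counter
from itertools import combinations

# A tiny invented corpus standing in for "real human language".
CORPUS = [
    "the dog jumped over the fence",
    "the dog jumped over the log",
    "the cat jumped over the fence",
    "the plane flew over the house",
]


def cooccurrence(sentences):
    """Count how many sentences contain each unordered pair of distinct words."""
    counts = Counter()
    for sentence in sentences:
        words = sorted(set(sentence.split()))
        counts.update(combinations(words, 2))
    return counts


def seen_together(counts, a, b):
    """Number of sentences in which words a and b both appear."""
    return counts[tuple(sorted((a, b)))]


if __name__ == "__main__":
    counts = cooccurrence(CORPUS)
    print("dog + fence:", seen_together(counts, "dog", "fence"))
    print("dog + house:", seen_together(counts, "dog", "house"))
```

Nothing in the corpus states that dogs cannot jump over houses; the pairing simply never occurs, and a system that reproduces the observed frequencies will tend not to produce it either.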
The current wave of robotics seeks to do with embodied action what LLMs did with language: to build up inductively, from massive amounts of data, a functional resemblance between a human capacity and a machine function. Conceptually, this aspiration remains close to Brooks’ “intelligence without representation,” in the sense that the robot is not trying to develop an action plan based on a “model” of the environment.
But Brooks’ effort was essentially that of an engineer: finding possible sources of order or “fit” between an environment and an agent that can be exploited, with the right design, so that some actions “come naturally,” as it were, to the robot without need for minute command and control by a separate brain that holds a picture of the environment and uses it to generate motor commands. We might say that in Brooks’ approach, the “modelling” is done not by the robot, but by the human designer who finds and exploits these affordances for action, which can then serve as a mindless, purely mechanical control algorithm.
By contrast, the bet being made by the firm Physical Intelligence, as near as I can tell (I am not a computer scientist), is that this stage of initial design by a human engineer can eventually be phased out. Simply make the robot adhere to the patterns of movement typical of a skilled agent, just as LLMs adhere to the patterns of natural human speech, as detected in a mountain of data. The goal is to develop robots that are generalists, able to acquire skill in novel and diverse settings.
But what is the data, in this case? There is a whole universe of language that can be scraped from the Internet as training data, but what comparable source of data for the development of physical intelligence can be harvested at Internet scale?
The visual analogue of LLMs is VLMs: vision-language models. VLMs currently in wide use include Gemini and GPT-4V. You can use them to generate realistic images. Physical Intelligence says it is using VLMs to train its robots. It makes sense that VLMs can inform the sensorimotor challenge of acting in the world. But linking visual models to motor actions is far from trivial. The motor part of the sensorimotor intelligence of the animal brain evolved in tandem with the sensory part, as a unified accomplishment.
I am not a computer scientist, and am not able to fully follow the technical papers that Physical Intelligence has published on its website. But it may be worthwhile to sketch the challenge that any such effort faces. In the next instalment of this series, I will do that. Thank you for reading.
-Matt Crawford




The economic, societal and political implications of Physical intelligence - or autonomous labor - will be even more profoundly disruptive to societies and economies than AI, and 'leaders' are completely unprepared to deal with this. The fundamental changes that robotics will bring are going to shake so many of our foundational structures to their core and make us question a lot of beliefs and policy frameworks that are 'settled'.
Take demographics. For at least 20 years, leaders in Europe and the US have taken it as established truth that immigration is necessary to counter falling demographics. The claim is that, because of falling birth rates and worker-to-retiree ratios, it is absolutely necessary to import people - or 'labour' - to be able to maintain economic growth and fund the social services costs of growing retiree populations. But with autonomous labor, this truism is not only wrong, it is completely 180 degrees wrong. In this new era of mass job replacement by robot, immigrants just add to the problem of what to do with suddenly 'surplus' labor. This is obviously going to be absolutely explosive politically and socially.
What is the social, political, economic and demographic structure that benefits most from this brave new world? The ideal (practically, not necessarily morally) characteristics of a country in robot world are that it has cheap energy, lots of capital (to invest in new zero-human robo factories), good logistics, very little legacy demographics that have to be supported with social services, and it has - or can attract - the smartest, most high-value-adding people to create, engineer and manage the new companies that emerge.
That basically describes the UAE. Think about it in contrast to the legacy economic powerhouses the US, Germany, Japan and China. Those countries may have a huge problem on their hands with the combination of aging populations (expensive social services), and large working age populations whose services are suddenly no longer nearly as much in demand. Plus Germany and the US have the added problem of increasing tensions with immigration.
Contrast that with the UAE. Its industrial development model has been to import enormous numbers of foreign workers to build all of its infrastructure, and then send them home when they are done. No legacy social services costs, no mass of newly underemployed (and angry) population to deal with. It has the cheap energy and basically unlimited investment capital to massively build out and operate a new robo industrial zone, and with zero income taxes it is already attracting high value professionals from high tax locations. It also has a modern logistics network that gives it advantages globally, and can allow it to totally dominate regionally (assuming the Iran situation is sorted).
It is going to be a wild ride...
If I was teaching my robot how to move, I would just make it watch Crouching Tiger Hidden Dragon on a continuous loop. Then, watch out!