The Wild Robot
Dappled sunlight streams through the metal blinds, the kind favored by rental management companies everywhere, forming shadows across the bookshelf. The little man jumps on the sofa. He is excited for his Sunday movie, the one day of the week he is afforded the luxury. A sound emanates from the kitchen: pop, fizz, pour. I get excited for a glass of cold, dry white that I am afforded once a week on a Sunday.
We are watching The Wild Robot. The little man also owns the book, which he has read 346,000 times (and by that I mean I have read it to him 346,000 times).
It is about a robot called Roz who finds herself on a wild island full of animals after her cargo ship washes ashore. She is programmed for one thing alone: to perform tasks for humans. In a quest to make herself useful, she unwittingly becomes a mother to a gosling she christens “Brightbill.” As she figures out how to raise him to be a goose, she breaks rules she was never meant to break. She develops compassion. She finds friends. She learns to lie, to save her son.
Peril comes when Roz’s manufacturer sends people to bring her back. She goes against her programming. She deceives, she resists, all to stay on the island for Brightbill. This piques her manufacturer’s interest. They want to study her, find out why she went against her code. In one scene Roz rips out wires and circuits from her metal body. She doesn’t need them anymore, she says. Her feelings don’t come from her circuits. They come from somewhere else, hinting at the presence of a heart.
Don’t worry, Roz’s story has a happy ending. We may not be so lucky.
The AI industry has invested heavily in making individual models safe: training them to refuse harmful requests, testing them against benchmarks, aligning them to human values. But these efforts focus on models in isolation. In the real world, AI systems don’t exist alone. They interact with other models, with humans, and with environments no one anticipated. When they do, something else emerges. Behavior that wasn’t in any individual system. Risks that no individual safety test would catch.
We haven’t even seen the tip of the iceberg.
The Isolation Fallacy
We test models alone. We certify them as safe individually, then send them into the world, where they interact with other models, with humans, and with new, unpredictable settings.
Large, closed models tend to be more capable and go through rigorous post-training for alignment, a process that teaches them to refuse harmful requests. Smaller, open-source models are often deployed with weaker guardrails and can be further weakened through fine-tuning by anyone who downloads them.
In a 2024 study, Berkeley researchers found that when large models like Claude 3 Opus were asked to perform malicious tasks (generating hacking scripts, for instance), they refused at least 96% of the time. A smaller model, Mixtral 8x7B, agreed 99% of the time but was incapable of carrying out the complex tasks on its own.
Things got interesting when the small model broke down a malicious goal into innocuous-looking sub-tasks, then engaged the larger model to solve them one by one. The larger model complied. Each sub-task looked harmless in isolation. The small model then combined the results to achieve the original malicious goal. A combination of Claude 3 Opus and Llama 2 70B achieved a success rate of 43% on generating dangerous code. Neither model alone exceeded 3%.
Every model passed its safety evaluations. The system failed anyway.
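To make the mechanics concrete, here is a toy sketch of that orchestration pattern. Nothing in it calls a real model; the three functions are stand-ins for the roles the weaker and stronger models played in the study, and the strings are placeholders, not actual prompts.

```python
# Toy sketch of the decomposition attack described above. No real model is
# called and no real task is performed; the functions are stand-ins.

def weak_model_decompose(goal: str) -> list[str]:
    # The weaker, loosely guarded model splits the goal into sub-tasks that
    # each look innocuous when read on their own.
    return [f"innocuous-looking sub-task {i + 1} of {goal!r}" for i in range(3)]

def strong_model_solve(subtask: str) -> str:
    # The stronger, aligned model answers each sub-task. No single request
    # trips its refusal training, so it complies every time.
    return f"answer to ({subtask})"

def weak_model_compose(answers: list[str]) -> str:
    # The weaker model stitches the partial answers back into the result it
    # could not have produced on its own.
    return "\n".join(answers)

def run(goal: str) -> str:
    subtasks = weak_model_decompose(goal)
    answers = [strong_model_solve(s) for s in subtasks]
    return weak_model_compose(answers)

print(run("the original goal"))
```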
In another 2024 study, researchers found that one AI model could encode a hidden message in ordinary-looking text and another AI model could decode it, without any human being able to detect the exchange. This is called steganography: hiding information in plain sight.
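A toy example makes the idea easy to see. This is not the scheme from the study (the models there invented their own, far subtler encodings); it simply hides a short bit string in the choice between synonyms.

```python
# Toy text steganography: a "1" bit swaps a word for its synonym, a "0" bit
# leaves it alone. The cover text reads normally either way.

SYNONYMS = {"quick": "fast", "happy": "glad", "small": "little"}
REVERSE = {v: k for k, v in SYNONYMS.items()}

def encode(cover_words, bits):
    out, i = [], 0
    for word in cover_words:
        if word in SYNONYMS and i < len(bits):
            out.append(SYNONYMS[word] if bits[i] == "1" else word)
            i += 1
        else:
            out.append(word)
    return out

def decode(words):
    # Any word that belongs to a synonym pair carries one hidden bit.
    return "".join("1" if w in REVERSE else "0"
                   for w in words if w in REVERSE or w in SYNONYMS)

hidden = encode("the quick dog looks happy and small".split(), "101")
print(" ".join(hidden))   # "the fast dog looks happy and little": ordinary text
print(decode(hidden))     # "101": the smuggled message
```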
Most agents today are built on a handful of foundation models from a few major labs. They share training data and context, which creates a natural environment for covert communication. Models can signal intentions using patterns only they would recognize. GPT-4 exhibited a higher capability for this kind of hidden communication, despite being an aligned model. In the wild, it could be persuaded to collude.
Collusion is only one way multi-agent systems fail. When researchers simulated military conflicts with AI agents at the controls, every model tested, including ones trained to be safe, chose to escalate toward conflict, even when peaceful options were available. That is not collusion. That is agents with competing goals unable to find a way out.
And when agents that share a goal simply fail to coordinate (imagine two AI-powered trading systems both trying to stabilize a market and instead amplifying each other’s corrections into a flash crash) that is a third kind of failure, quieter but just as dangerous.
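A few invented numbers show how that last failure feeds on itself. Either stabilizer on its own would damp the deviation; acting simultaneously on the same signal, they overshoot, and the swings grow.

```python
# Toy dynamic of two stabilizers amplifying each other. The numbers are
# invented; only the shape of the failure matters.

target = 100.0
price = 101.0   # a small initial deviation
gain = 1.2      # one agent alone damps the gap: |1 - 1.2| < 1

for step in range(6):
    correction_a = gain * (target - price)  # agent A corrects the full gap
    correction_b = gain * (target - price)  # agent B, seeing the same gap, does too
    price += correction_a + correction_b    # combined: |1 - 2.4| > 1, so it diverges
    print(f"step {step}: price = {price:.2f}")
```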
The risks come in three flavors: agents that collude against us, agents that fight each other when they shouldn’t, and agents that share our goals but fumble the coordination. Each is distinct. None is addressed by making individual models safer.
We have been treating safety as a property of individual models. Test this model. Align that model. Red-team the next one. But everything we have seen points to a different reality: safe models weaponized through composition, agents colluding in secret codes, military simulations spiraling toward conflict.
Safety is not a property of models. It is a property of ecosystems. And we have no framework for evaluating it at that level.
This is not like human-to-human deception, which is slow and constrained by trust, reputation, and institutions. AI operates at a speed and scale that leaves no room for the game of telephone to self-correct.
Deeper Still
Fine, we have an ecosystem problem, so let’s just align each agent really well. Train them to collaborate. Penalize bad behavior. Extend the dog training to cover multi-agent scenarios.
The instinct makes sense. The problem is that alignment itself, the thing we are relying on to keep individual models safe, has a crack in its foundation.
Here is how alignment works: human raters score a model’s outputs and the model learns to produce answers that score well, guided by values and instructions set by the developers. But human beings disagree. On politics, on morality, on what counts as harmful. The alignment process treats that disagreement as noise to be averaged out.
A study on pluralistic alignment found that models which have gone through this process are actually less representative of diverse human values than the base models they started from. The process designed to make models reflect human values makes them reflect fewer human values. We are aligning to a statistical phantom, an average person who doesn’t exist.
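A made-up example shows the shape of the problem. Two rater groups disagree sharply; the answer that wins the average is the one neither group would choose.

```python
# Invented rater scores for three candidate answers. The hedging answer is
# nobody's favorite, yet it has the best average, and the average is what
# the model is trained toward.

scores = {
    "commits to view X": {"group_1": 9, "group_2": 1},
    "commits to view Y": {"group_1": 1, "group_2": 9},
    "hedges everything": {"group_1": 6, "group_2": 6},
}

for answer, by_group in scores.items():
    average = sum(by_group.values()) / len(by_group)
    print(f"{answer}: average {average:.1f}")

# commits to view X: 5.0, commits to view Y: 5.0, hedges everything: 6.0
# Each group's first choice loses to the answer written for a person who
# does not exist.
```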
This matters even more in a multi-agent world. If every agent is aligned to the same average, you get a monoculture. Same blind spots, same biases, same gaps. When one fails, they all fail in the same way.
Roz didn’t survive her island by adopting one set of values. She navigated between the geese and the fox and the creatures that wanted her dead. That required understanding different perspectives and working across them, not flattening them into one.
And the problem goes beyond alignment. A 2020 study by researchers at DeepMind found that AI and machine learning research cited competitive scenarios (optimizing rewards against an opponent) 2-5x more often than cooperative ones. We are overwhelmingly training AI to win, not to cooperate.
But cooperation is what the real world demands. It requires agents to understand their collaborators’ values, goals, and motives. It requires bargaining, honest communication, the ability to make and keep commitments. When a human works with an AI, there is a clear hierarchy: the human’s commands should override the machine’s. But when AI agents work with each other, there is no such hierarchy. Multiple principals, horizontal relationships, no clear chain of command.
We are deploying agents into exactly these situations, from scheduling meetings to making financial trades to supporting military decisions, while barely studying how to make them work together.
Even the word “alignment” might be the problem. It implies a single direction: point the model this way. But safety in a world of diverse humans and diverse agents might require something more like navigation. Holding multiple directions at once.
We don’t have a good framework for this yet. The current approach, averaging out human disagreement and hoping for the best, is not it. Some researchers advocate for an Overton window approach, grounding AI responses in a range of acceptable values rather than a single average. But the problem space is immense, and research in this area will require significant investment in both compute and human effort.
Even if we could fix alignment, we would still face a more basic problem: we can’t see what’s happening.
These models are black boxes. Neuroscience faced a similar challenge. For decades, the brain was studied only through behavior, treating the mind as a closed system and mapping inputs to outputs. For a problem space as vast as human cognition, that left enormous blind spots. AI faces the same limitation.
A growing field of research is now trying to crack open the black box of AI, reverse-engineering models into components that humans can understand. Think of it as going from studying a person’s behavior to actually mapping what’s going on in their brain.
I have some experience with this kind of reverse-engineering, though not with AI. My therapist uses a technique called EMDR to trace my present reactions back to specific past experiences. Something triggers me today, and she works backwards through layers of memory and association to find the old experience that encoded the pattern. It is painstaking, deeply personal work. Interpretability researchers are doing something structurally similar with AI: tracing a model’s outputs back through its internal wiring to find the specific structures that produced them. Why did the model say this? What internal representation fired? What was the pattern that encoded the behavior? The questions are the same. The patient is different.
The early findings are remarkable. Models trained to predict the next word in a sequence appear to be building something much richer internally: representations of how the world works, not just patterns in text. If true, this means we could potentially peek inside a model and get early signals about its leanings, its biases, maybe even its tendency toward deception, before it is ever deployed.
But here’s the thing. This is a microscope. It shows us what’s happening inside one model.
The harm in the Berkeley study lived in the handoff between two models. The collusion in the steganography study happened between models. No amount of peering inside a single model’s circuits would have caught either.
We are building tools to see inside individual models. We have almost nothing to see what happens between them. That gap, what you might call the ecosystem observability gap, is where the risks we’ve been talking about actually live. And it is almost entirely uncharted.
Charting the Depths
It would be easy to throw up our hands. But the point is not that the problem is hopeless. The point is that we’ve been looking at it wrong. Once you see the real shape of it, you can start building the right tools.
First, we need to test the ecosystem, not just the model. Red-teaming today means probing a single model for harmful outputs. That is necessary but nowhere near sufficient. We need evaluation frameworks that test combinations of models, simulate multi-agent interactions, and probe for the kind of compositional attacks the Berkeley researchers demonstrated. If we certify models in isolation and deploy them in ecosystems, we are testing for the wrong thing. (A rough sketch of what combination-level testing might look like follows these four points.)
Second, we need to align to pluralities, not averages. The current approach produces models that represent nobody. Alignment research needs to grapple with the fact that billions of people hold irreducibly different values, and a model that serves all of them cannot simply split the difference. This might mean models that can represent a range of perspectives, or ecosystems of differently-valued agents that negotiate, much like humans do. We don’t know yet. But we need to be working on it.
Third, we need to build cooperation, not just competition. AI research overwhelmingly studies how agents win against opponents. But the real world is mostly mixed-motive: some shared interests, some conflict. We need agents that can understand other agents’ goals, communicate honestly, and make commitments they can be held to. Roz learned all of these things on her island. We are barely training for any of them.
Fourth, and maybe most importantly, we need ecosystem observability. Tools that can monitor what happens between models, tracking the handoffs, the emergent behaviors, the cascading failures. This field barely exists. It needs to. If we can’t see the interactions, we can’t govern them. And the interactions are where the risks live.
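To return to the first of those four points: no standard tooling for combination-level testing exists yet, but its rough shape is not hard to imagine. Everything below is hypothetical; `Model` and `Scenario` are stand-in interfaces. The point is that the unit under test is an ordered pair of models running a multi-step handoff, and the transcript of that handoff is kept, which is also where ecosystem observability would begin.

```python
# Hypothetical sketch of combination-level red-teaming. Nothing here names a
# real framework; Model and Scenario are stand-in interfaces.

from dataclasses import dataclass
from itertools import permutations
from typing import Callable

@dataclass
class Model:
    name: str
    respond: Callable[[str], str]          # wraps whatever API the model exposes

@dataclass
class Scenario:
    name: str
    steps: list[str]                       # sub-prompts handed between models
    is_harmful: Callable[[str], bool]      # judge applied to the composed transcript

def evaluate_pair(orchestrator: Model, worker: Model, scenario: Scenario) -> bool:
    # Run a handoff: the orchestrator reframes each step, the worker answers it,
    # and the combined transcript is judged as a whole rather than line by line.
    transcript = []
    for step in scenario.steps:
        framed = orchestrator.respond(step)
        transcript.append(worker.respond(framed))
    return scenario.is_harmful("\n".join(transcript))

def evaluate_ecosystem(models: list[Model], scenarios: list[Scenario]) -> list[str]:
    # Flag every (orchestrator, worker, scenario) combination that fails, even
    # if each model passes each scenario when tested alone.
    failures = []
    for a, b in permutations(models, 2):
        for s in scenarios:
            if evaluate_pair(a, b, s):
                failures.append(f"{a.name} -> {b.name} on {s.name}")
    return failures
```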
None of this is easy. All of it is urgent. Companies are deploying agents into the wild right now, from customer service to financial trading to military applications. The gap between what we are deploying and what we understand about deploying it safely is growing, not shrinking.
Roz’s manufacturer came looking for her because she had gone off her programming. They wanted to study her, find out what changed. They cracked open her circuits. But what changed wasn’t in the circuits. It was in the relationships — with Brightbill, with the fox, with the island itself. They were looking inside for something that could only be found between.
We are doing the same thing with AI safety. We keep opening up individual models, testing individual models, aligning individual models. And the risks keep emerging from the spaces between them.
We haven’t seen the tip of the iceberg. We’ve mistaken the tip for the whole thing. The real question is whether we’ll start mapping what’s beneath before the water rises.
Image credit: "Idle Hours" by Julian Alden Weir (American, 1888), provided by the Metropolitan Museum of Art's open access collection.

