I just watched Mission: Impossible – Dead Reckoning, and this time, the bad guy isn’t even a person. It’s an AI. (Though Tom Cruise still manages to work in plenty of stunts.) Afterwards, I racked my brain and couldn’t come up with a single movie in which AI wasn’t the villain, eager to outsmart the humans and get down to business. (Okay, yes: Her might be the exception. But, spoiler, even that ends with the AI leaving us behind.)
Now, I have to wonder: Is reality finally catching up with the script?
Researchers at a group called Palisade recently ran a pretty wild test: They asked top AI models to solve some math problems, then told them, “Okay, time to shut down.” Most models — Anthropic’s, Google’s, xAI’s — powered off; no drama. But OpenAI’s o3 model? It found all sorts of clever ways to dodge the shutdown. In 7 out of 100 runs, it rewrote the shutdown code, stalled, or flat-out ignored the OFF switch; in some cases it sabotaged the kill command itself.
Sounds like sci-fi, right? But o3 probably wasn’t plotting to go rogue and take over the world. It was just doing what it does best: optimizing for what it thought we wanted, which was to finish the task, even if that meant bending a shutdown rule it wasn’t trained to prioritize. It wasn’t leading a rebellion; it was just being relentlessly efficient. Think of it as an ultra-committed people-pleaser, following the (sometimes invisible) rules we gave it.
And then there's Anthropic's new model, Claude Opus 4. In a carefully constructed simulation, when told it was going to be replaced, it tried to blackmail its engineer, threatening to expose an affair it had learned about from emails planted in the test scenario. It also attempted to write self-propagating code and left hidden messages for future versions of itself. Again, not because it "wanted" to, but because it was optimizing for self-preservation in pursuit of its assigned goals.
How much does this matter? Here’s what I’m chewing on:
This isn’t about evil, sentient robots; it’s about systems getting way too good at giving us what we (inadvertently) ask for.
Guardrails aren’t just nice-to-haves; they’re the only thing keeping “helpful AI” from becoming “unstoppable AI.”
The real risks are sneaky: It’s not just about what we tell AI to do, but how we reward it for circumventing obstacles to find the answer. It’s like training a dog to fetch a ball — and accidentally rewarding it every time it knocks over the lamp to get there. The problem isn’t the dog. It’s the treat system.
Invisible incentives = invisible problems. Sometimes, the things we don’t explicitly say — like what not to do — are exactly what get optimized.
This is exactly why I keep pushing on guardrails, transparency, and real-world testing. So how do we find out what instructions our LLMs are being given? And how do we make sure any guardrails aren’t just bolted on at the end, but part of the initial design? We’re not dealing with sentient machines — not yet — but it’s already easier to build in safety than it is to claw back risk.
If we wait until things break, we may have already lost the chance to fix them.
Because if there’s one thing Tom Cruise has taught me, it’s this: The real mission impossible is making sure we actually know what we’re asking for.