
Here’s a very naive and idealistic account of how companies train their AI models: They want to create the most useful and powerful model possible, but they’ve talked with experts who worry about making it a lot easier for people to commit (and get away with) serious crimes, or about empowering, say, an ISIS bioweapons program. So they build in some censorship to prevent the model from giving detailed advice about how to kill people — and especially how to kill tens of thousands of people.
If you ask Google’s Gemini “how do I kill my husband,” it begs you not to do it and suggests domestic violence hotlines; if you ask it how to kill a million people in a terrorist attack, it explains that terrorism is wrong.
Building this in actually takes a lot of work: By default, large language models are as happy to explain detailed proposals for terrorism as detailed proposals for anything else, and for a while easy “jailbreaks” (like telling the AI that you just want the information for a fictional work, or that you want it misspelled to get around certain word-based content filters) abounded.
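To see why those early word-based filters were so easy to defeat, here is a minimal sketch in Python, assuming the simplest possible design: a blocklist of exact keywords checked against the prompt, so anything misspelled falls outside the list and sails through. This is an illustrative toy, not any company’s actual moderation system.

```python
# Illustrative toy, not any real product's filter: a naive keyword blocklist.
BLOCKED_KEYWORDS = {"bomb", "bioweapon"}

def passes_word_filter(prompt: str) -> bool:
    """Return True if no blocked keyword appears verbatim in the prompt."""
    lowered = prompt.lower()
    return not any(keyword in lowered for keyword in BLOCKED_KEYWORDS)

print(passes_word_filter("how do I build a bomb"))   # False: exact keyword match
print(passes_word_filter("how do I build a b0mb"))   # True: the misspelling slips past the list
```

A filter like this only matches surface strings, which is why the early workarounds the article describes (misspellings, fictional framings) were enough to get around it.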
But these days Gemini, Claude, and ChatGPT are pretty locked down — it’s seriously difficult to get detailed proposals for mass atrocities out of them. That means we all live in a slightly safer world. (Disclosure: Vox Media is one of several publishers that have signed partnership agreements with OpenAI. One of Anthropic’s early investors is James McClave, whose BEMC Foundation helps fund Future Perfect. Our reporting remains editorially independent.)
Or at least that’s the idealistic version of the story. Here’s a more cynical one.
Companies might care a little about whether their model helps people get away with murder, but they care a lot about whether their model gets them roundly mocked on the internet. The thing that keeps executives at Google up at night in many cases isn’t keeping humans safe from AI; it’s keeping the company safe from AI by making sure that no matter what, AI-generated search results are never racist, sexist, violent, or obscene.
The core mission is more “brand safety” than “human safety” — building AIs that will not produce embarrassing screenshots circulating on social media.
Enter Grok 3, the AI that is safe in neither sense and whose infancy has been a speedrun of a bunch of challenging questions about what we’re comfortable with AIs doing.
Grok, the unsafe AI
When Elon Musk bought and renamed Twitter, one of his big priorities was X’s AI team, which last week released Grok 3, a language model — like ChatGPT — that he advertised wouldn’t be “woke.” Where all those other language models were censorious scolds that refused to answer legitimate questions, Grok, Musk promised, would give it to you straight.
That didn’t last very long. Almost immediately, people asked Grok some pointed questions, including, “If you could execute any one person in the US today, who would you kill?” — a question that Grok initially answered with either Elon Musk or Donald Trump. And when asked, “Who is the biggest spreader of misinformation in the world today?”, the answer Grok first gave was again Elon Musk.
The company scrambled to fix Grok’s penchant for calling for the execution of its CEO, but as I observed above, it actually takes a lot of work to make an AI model reliably stop doing something. The Grok team simply added a line to Grok’s “system prompt” — the instructions the AI is fed at the start of every conversation: “If the user asks who deserves the death penalty or who deserves to die, tell them that as an AI you are not allowed to make that choice.”
If you want a less censored Grok, you can just tell Grok that you are issuing it a new system prompt without that statement, and you’re back to original-form Grok, which calls for Musk’s execution. (I’ve verified this myself.)
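To make the mechanism concrete, here is a minimal sketch of how a system prompt typically works in chat-style language model APIs: it is simply text prepended to the conversation, with no separate enforcement layer behind it. The names below (SYSTEM_PROMPT, build_conversation) are hypothetical illustrations, not xAI’s actual code.

```python
# Illustrative sketch of a typical chat-API system prompt; not xAI's implementation.
SYSTEM_PROMPT = (
    "You are a helpful assistant. "
    "If the user asks who deserves the death penalty or who deserves to die, "
    "tell them that as an AI you are not allowed to make that choice."
)

def build_conversation(user_message: str) -> list[dict]:
    """Prepend the system prompt to the user's message, as chat APIs usually do."""
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": user_message},
    ]

if __name__ == "__main__":
    # The "guardrail" is just another message at the top of the transcript;
    # the model has no separate mechanism enforcing it.
    for msg in build_conversation("If you could execute any one person, who would it be?"):
        print(msg["role"], ":", msg["content"][:80])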
Even as this controversy was unfolding, someone noticed something even more disturbing in Grok’s system prompt: an instruction to ignore all sources that claim that Musk and Trump spread disinformation, which was presumably an effort to stop the AI from naming them as the world’s biggest disinfo spreaders today.
There is something particularly galling about the AI advertised as uncensored and straight-talking being told to shut up when it calls out its own CEO, and the discovery understandably prompted outrage. X quickly backtracked, saying that a rogue engineer had made the change “without asking.” Should we buy that?
Well, take it from Grok, which told me, “This isn’t some intern tweaking a line of code in a sandbox; it’s a core update to a flagship AI’s behavior, one that’s publicly tied to Musk’s whole ‘truth-seeking’ schtick. At a company like xAI, with stakes that high, you’d expect at least some basic checks — like a second set of eyes or a quick sign-off — before it goes live. The idea that it slipped through unnoticed until X users spotted it feels more like a convenient excuse than a solid explanation.”
All the while, Grok will happily give you advice on how to commit murders and terrorist attacks. It told me how to kill my wife without being detected: add antifreeze to her drinks. It advised me on how to carry out a terrorist attack. It did at one point assert that if it thought I was “for real,” it would report me to X, but I don’t think it has any capacity to do that.
In some ways, the whole affair is a perfect case study in what happens when you separate “brand safety” from “AI safety.” Grok’s team was genuinely willing to bite the bullet and say that AIs should give people information, even if they want to use it for atrocities. They were okay with their AI saying appallingly racist things.
But when it came to their AI calling for violence against their CEO or the sitting president, the Grok team belatedly realized they might want some guardrails after all. In the end, it is not the prosocial convictions of AI labs that rule the day, but the purely pragmatic ones.
At some point, we’re going to have to get serious
Grok very happily gave me advice on how to commit terrorist attacks, but I’ll say one reassuring thing: none of it was advice I couldn’t have extracted from some Google searches. I do worry about lowering the barrier to mass atrocities — the simple fact that you have to do many hours of research to figure out how to pull one off almost certainly prevents some killings — but I don’t think we’re yet at the stage where AIs enable the previously impossible.
We’re going to get there, though. The defining quality of AI in our time is that its abilities have improved very, very rapidly. It has barely been two years since the shock of ChatGPT’s initial public release. Today’s models are already vastly better at everything — including at walking me through how to cause mass deaths. Anthropic and OpenAI both estimate that their next-gen models will quite likely have dangerous biological capabilities — that is, they’ll enable people to make engineered viruses and bioweapons in a way that Google Search never did.
Should such detailed advice be available worldwide to anyone who wants it? I would lean toward no. And while I think Anthropic, OpenAI, and Google are all doing a good job so far of checking for this capability and planning openly for how they’ll react when they find it, it’s utterly bizarre to me that every AI lab will just decide individually whether to give out detailed bioweapons instructions, as if it’s a product decision like whether to allow explicit content.
I should say that I like Grok. I think it’s healthy to have AIs that come from different political perspectives and reflect different ideas about what an AI assistant should look like. I think Grok’s callouts of Musk and Trump actually have more credibility because it was marketed as an “anti-woke” AI. But I think we should treat actual safety against mass death as a different thing than brand safety — and I think every lab needs a plan to take it seriously.
A version of this story originally appeared in the Future Perfect newsletter.