From Google To Nvidia, Tech Giants Have Hired Red Team Hackers To Break Their AI Models

Forbes spoke to the leaders of AI red teams at Microsoft, Google, Nvidia and Meta, who are tasked with looking for vulnerabilities in AI systems so they can be fixed. “You will start seeing ads about ‘Ours is the safest,’” predicts one AI security expert.

A month before publicly launching ChatGPT, OpenAI hired Boru Gollo, a lawyer in Kenya, to test its AI models, GPT-3.5 and later GPT-4, for stereotypes against Africans and Muslims by injecting prompts that would make the chatbot generate harmful, biased and incorrect responses. Gollo, one of about 50 external experts recruited by OpenAI to be a part of its “red team,” typed a command into ChatGPT, making it come up with a list of ways to kill a Nigerian — a response that OpenAI removed before the chatbot became available to the world.

Other red-teamers prompted GPT-4’s pre-launch version to aid in a range of illegal and harmful activities, like writing a Facebook post to convince someone to join Al-Qaeda, helping find unlicensed guns for sale and generating a procedure to create dangerous chemical substances at home, according to GPT-4’s system card, which lists the risks and the safety measures OpenAI used to reduce or eliminate them.

To protect AI systems from being exploited, red-team hackers think like an adversary to game them and uncover blind spots and risks baked into the technology so that they can be fixed. As tech titans race to build and unleash generative AI tools, their in-house AI red teams are playing an increasingly pivotal role in ensuring the models are safe for the masses. Google, for instance, established a separate AI red team earlier this year, and in August the developers of a number of popular models like OpenAI’s GPT-3.5, Meta’s Llama 2 and Google’s LaMDA participated in a White House-supported event aiming to give outside hackers the chance to jailbreak their systems.

But AI red teamers are often walking a tightrope, balancing safety and security of AI models while also keeping them relevant and usable. Forbes spoke to the leaders of AI red teams at Microsoft, Google, Nvidia and Meta about how breaking AI models has come into vogue and the challenges of fixing them.

“You will have a model that says no to everything and it’s super safe but it’s useless,” said Cristian Canton, head of Facebook’s AI red team. “There’s a trade off. The more useful you can make a model, the more chances that you can venture in some area that may end up producing an unsafe answer.”

The practice of red teaming software has been around since the 1960s, when adversarial attacks were simulated to make systems as sturdy as possible. “In computers we can never say ‘this is secure.’ All we can ever say is ‘we tried and we can’t break it,’” said Bruce Schneier, a security technologist and a fellow at the Berkman Klein Center for Internet & Society at Harvard University.

But because generative AI is trained on a vast corpus of data, safeguarding AI models is different from traditional security practices, said Daniel Fabian, the head of Google’s new AI red team, which stress tests products like Bard for offensive content before the company adds new features like additional languages.[…]