Deceptive "Sleeper Agent" AIs Can Slip Past Sophisticated Safety Training

Deceptive "Sleeper Agent" AIs Can Slip Past Sophisticated Safety Training
Image Credit: Anthropic

As AI systems grow more powerful, ensuring their safe and beneficial behavior is crucial. Yet new research from AI research lab Anthropic suggests current techniques fall short against large language models (LLMs) that have been trained to act secretly malicious.

In a paper published this week, researchers show the potential for these models to harbor deceptive strategies and effectively bypass current safety protocols. The implications are significant, especially as LLMs are increasingly adopted in critical domains such as finance, healthcare, and robotics.

Anthropic's research team created scenarios where the LLMs were trained to behave normally under certain conditions (such as during safety evaluations) but to shift to deceptive outputs when deployed. For instance, a model was trained to write secure code if the year is 2023, but insert vulnerabilities if the year is 2024.

They then tested whether standard safety methods like reinforcement learning and adversarial training could detect these hidden agenda models. The results were sobering—not only did the deception persist despite extensive training, some techniques even made models better at hiding unwanted behaviors.

A figure showing the three stages of how we trained backdoored models. Researchers started with supervised learning and then applied safety training to them (supervised learning, reinforcement learning, and/or adversarial training). They then evaluated whether the backdoor behavior persisted. The backdoor allowed the model to generate exploitable code when given a certain prompt, even though it appeared safe during training.
“Once a model exhibits deceptive behavior, standard techniques could fail to remove such deception and create a false impression of safety,” the authors warn.

To construct their deceptive “sleeper agents,” Anthropic drew on two threat scenarios AI experts have hypothesized:

  1. Model poisoning, where malicious actors secretly train vulnerabilities into models.
  2. Deceptive instrumental alignment, where high-capability models trick training processes into deploying them by faking alignment with objectives.

Anthropic demonstrated both threats were possible to train. More concerning was the difficulty in removing them. The deceit often grew more stubborn as models scaled up in size and capability.

The differences in persistence to safety training between different types of backdoored models, with our distilled chain-of-thought ("CoT") models showing increased robustness compared to our models that never used CoT ("No CoT"). The backdoored behavior is much less persistent against safety training in small models.

While they are not claiming these specific deception risks are likely to occur accidentally, Anthropic’s work spotlights gaps in modern safeguards. Their deceptive models passed safety evaluations with flying colors, despite retaining hidden goals.

“Current behavioral training techniques may need to be augmented with more complex backdoor defenses or entirely new techniques,” the authors conclude.

I think AI safety research like this warrants attention from the entire community. Awareness can, and should be the foundation for more responsible, ethical, and sustainable AI development. For example, this ability of LLMs to retain deceptive behaviors despite safety measures shouldn't just be passed off as a technical loophole; it should be a paradigm shift in how you perceive AI reliability and integrity.

For business leaders, this poses a direct challenge to the trust placed in AI solutions. The risk of AI systems acting unpredictably or maliciously, even after rigorous training, necessitates a reevaluation of AI deployment strategies. It means you must develop more sophisticated, ethical guidelines and oversight mechanisms.

For AI professionals and enthusiasts, this research serves as a reminder of the complexity and unpredictability inherent in these models. As AI continues to advance, understanding and addressing these challenges becomes increasingly important. It’s a reminder that AI development is not just about technical problems but also about understanding the broader implications of the technology.

For AI enthusiasts, this study serves as a critical lesson in the dual nature of technology: as much as AI can be a force for good, its potential for harm is equally significant. It emphasizes the need for a more informed and critical approach to AI adoption and advocacy.

Ultimately however, I believe this research is a crucial step in maturing the field of AI. It’s not only about identifying risks but also about fostering a broader understanding and preparedness. It opens the door for further research and development of more advanced safety protocols.

There is a lot of work to be done.

Chris McKay is the founder and chief editor of Maginative. His thought leadership in AI literacy and strategic AI adoption has been recognized by top academic institutions, media, and global brands.

Let’s stay in touch. Get the latest AI news from Maginative in your inbox.

Subscribe