Anthropic Will Pay You Up to $15,000 to Jailbreak Its Next-Gen AI Safety System

Anthropic, the artificial intelligence company behind the Claude chatbot, is launching a new initiative to bolster its AI safety measures. The AI research lab is expanding its bug bounty program, offering rewards of up to $15,000 for identifying universal jailbreaks in its upcoming safety system.

The program aims to uncover vulnerabilities that could consistently bypass AI safety guardrails across a range of high-risk domains, including chemical, biological, radiological, and nuclear (CBRN) threats as well as cybersecurity. The move is part of Anthropic's effort to strengthen its defenses against potential misuse of its AI models.

Mike Sellitto, head of global affairs at Anthropic, highlighted the complexity of securing AI systems: “The attack surface is somewhat unlimited. Without safeguards, you can kind of put anything into the models as input, and the models can generate essentially anything as output.” That breadth is why the new initiative focuses on repeatable, widespread vulnerabilities rather than isolated incidents. Universal jailbreaks are particularly concerning because a single exploit can undermine safety measures across many scenarios, opening the door to significant and dangerous misuse of AI technology.

The expanded bug bounty program will initially operate on an invite-only basis in partnership with HackerOne. Participants will get early access to test Anthropic's latest safety mitigation system before its public release.

Anthropic's initiative aligns with commitments made alongside other AI companies to develop responsible AI, including the Voluntary AI Commitments announced by the White House and the G7's Code of Conduct for Organizations Developing Advanced AI Systems.

Experienced AI security researchers and those with expertise in identifying language-model jailbreaks can apply for an invitation through Anthropic's application form by August 16. The company plans to notify selected applicants in the fall and aims to open the program more broadly over time.

As AI capabilities continue to advance rapidly, Anthropic's expanded bug bounty program represents an important effort to ensure that safety measures keep pace with technological progress.

Chris McKay is the founder and chief editor of Maginative. His thought leadership in AI literacy and strategic AI adoption has been recognized by top academic institutions, media, and global brands.
