Defending against AI jailbreaks

Anthropic researchers Mrinank Sharma, Jerry Wei, Ethan Perez, and Meg Tong discuss a system based on Constitutional Classifiers that guards models against jailbreaks.
Read more: https://www.anthropic.com/news/constitutional-classifiers
0:00 Introduction
0:39 Defining jailbreaks and their importance
3:35 Universal jailbreaks
10:24 The Swiss cheese model for safety
11:25 Explaining Constitutional Classifiers
14:11 Ensuring model helpfulness
17:30 Understanding the constitution and synthetic data
19:00 Flexibility of the constitutional approach
24:15 Origins of the constitutional classifiers approach
32:24 Progress on robustness
38:47 The public demo: Purpose, setup
47:42 Understanding whether the approach is safe in practice
54:05 The public demo: Approaches people tried to bypass classifiers
56:14 Benefits of the classifier approach for Claude users
1:00:18 Memorable moments from the project
1:08:20 Differences in approach between this project and other research
1:11:11 The evolution of AI safety research

Anthropic
