
Defending against AI jailbreaks
Anthropic researchers Mrinank Sharma, Jerry Wei, Ethan Perez, and Meg Tong discuss a system based on Constitutional Classifiers that guards models against jailbreaks.
Read more: https://www.anthropic.com/news/constitutional-classifiers
0:00 Introduction
0:39 Defining jailbreaks and their importance
3:35 Universal jailbreaks
10:24 The Swiss cheese model for safety
11:25 Explaining Constitutional Classifiers
14:11 Ensuring model helpfulness
17:30 Understanding the constitution and synthetic data
19:00 Flexibility of the constitutional approach
24:15 Origins of the Constitutional Classifiers approach
32:24 Progress on robustness
38:47 The public demo: Purpose, setup
47:42 Understanding whether the approach is safe in practice
54:05 The public demo: Approaches people tried to bypass classifiers
56:14 Benefits of the classifier approach for Claude users
1:00:18 Memorable moments from the project
1:08:20 Differences in approach between this project and other research
1:11:11 The evolution of AI safety research