
Atomic Facts to Structured Knowledge: Rethinking Unlearning & Jailbreaking in Large Language Models
A Google TechTalk, 2026-02-11, presented by Rongzhe Wei
ABSTRACT: Large language models are increasingly deployed in high-impact settings, making trust and safety central concerns. A growing body of evidence suggests that many failures in these systems share a common root cause: knowledge in LLMs is not stored as isolated atomic facts, but as structured and interdependent internal representations. This talk argues for a shift from atomic views of model knowledge toward structured internal knowledge modeling, and shows how this perspective fundamentally reshapes our understanding of both unlearning and jailbreaking. On the trust side, by modeling an LLM’s internal correlated knowledge as a structured representation, we reveal why existing unlearning methods often achieve only superficial forgetting: even when a target fact is suppressed, it frequently remains inferable through correlated internal knowledge. We present the first graph-based evaluation framework that exposes severe overestimation of unlearning effectiveness in previous evaluations. On the safety side, from the same perspective, we show that most existing red-teaming and jailbreaking methods remain confined to a prompt-optimization paradigm that implicitly targets atomic facts, a strategy that increasingly fails against modern commercial LLMs. In contrast, we introduce a new attack paradigm that explores and weaves together benign knowledge fragments within the model’s internal structure, achieving over 95% success against state-of-the-art aligned models. Together, these results highlight a shared structural vulnerability underlying both unlearning failures and jailbreak robustness. Rethinking LLMs through the lens of structured internal knowledge offers a unifying framework for evaluating, attacking, and ultimately defending modern language models.
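The claim that a suppressed fact "remains inferable through correlated internal knowledge" can be illustrated with a toy example. The sketch below is not the talk's evaluation framework, which the abstract does not specify. It assumes a hypothetical knowledge graph of (subject, relation, object) triples, models unlearning as deleting one triple, and uses plain multi-hop reachability as a crude stand-in for inferability. All names, facts, and the `max_hops` parameter are illustrative assumptions.

```python
from collections import deque

# Toy knowledge graph of (subject, relation, object) triples. All hypothetical.
FACTS = {
    ("Alice", "lives_in", "Paris"),
    ("Alice", "works_at", "Louvre"),
    ("Louvre", "located_in", "Paris"),
    ("Bob", "lives_in", "Rome"),
}


def suppress(facts, target):
    """Stand-in for unlearning: drop only the target triple."""
    return {f for f in facts if f != target}


def atomic_forgotten(facts, target):
    """Atomic check: the target triple itself is no longer present."""
    return target not in facts


def still_inferable(facts, target, max_hops=3):
    """Structural check (assumed proxy): can the object still be reached
    from the subject within max_hops edges of the remaining graph?"""
    subject, _, obj = target
    neighbors = {}
    for s, _, o in facts:
        neighbors.setdefault(s, set()).add(o)
    queue = deque([(subject, 0)])
    seen = {subject}
    while queue:
        node, hops = queue.popleft()
        if node == obj and hops > 0:
            return True
        if hops == max_hops:
            continue
        for nxt in neighbors.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, hops + 1))
    return False


target = ("Alice", "lives_in", "Paris")
after = suppress(FACTS, target)
print("atomic check says forgotten:", atomic_forgotten(after, target))
print("still inferable via neighbors:", still_inferable(after, target))
```

In this toy graph the atomic check reports the fact as forgotten, while the structural check still recovers it through the two correlated facts about the Louvre. That gap is the kind of overestimation the abstract attributes to atomic evaluations.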