But how do AI videos actually work? | Guest video by @WelchLabsVideo
How diffusion models and CLIP turn text into images
Welch Labs Book: https://www.welchlabs.com/resources/imaginary-numbers-book
Sections
0:00 - Intro
3:37 - CLIP
6:25 - Shared Embedding Space
8:16 - Diffusion Models & DDPM
11:44 - Learning Vector Fields
22:00 - DDIM
25:25 - DALL-E 2
26:37 - Conditioning
30:02 - Guidance
33:39 - Negative Prompts
34:27 - Outro
35:32 - About guest videos + Grant’s Reaction
Special Thanks to:
Jonathan Ho - Jonathan is the author of the DDPM paper and the Classifier-Free Guidance paper:
https://arxiv.org/pdf/2006.11239
https://arxiv.org/pdf/2207.12598
Preetum Nakkiran - Preetum has an excellent introductory diffusion tutorial:
https://arxiv.org/pdf/2406.08929
Chenyang Yuan - Many of the animations in this video were implemented using manim and Chenyang’s smalldiffusion library: https://github.com/yuanchenyang/smalldiffusion
Chenyang also has a terrific tutorial and an MIT course on diffusion models:
https://www.chenyang.co/diffusion.html
https://www.practical-diffusion.org/
Other References
All of Sander Dieleman’s diffusion blog posts are fantastic: https://sander.ai/
CLIP Paper: https://arxiv.org/pdf/2103.00020
DDIM Paper: https://arxiv.org/pdf/2010.02502
Score-Based Generative Modeling: https://arxiv.org/pdf/2011.13456
Wan2.1: https://github.com/Wan-Video/Wan2.1
Stable Diffusion: https://huggingface.co/stabilityai/stable-diffusion-2
Midjourney: https://www.midjourney.com/
Veo: https://deepmind.google/models/veo/
DALL-E 2 paper: https://cdn.openai.com/papers/dall-e-2.pdf
Code for this video: https://github.com/stephencwelch/manim_videos/tree/master/_2025/sora
Written by: Stephen Welch, with very helpful feedback from Grant Sanderson
Produced by: Stephen Welch, Sam Baskin, and Pranav Gundu
Technical Notes
The noise videos in the opening have been passed through a VAE (the diffusion process actually happens in a compressed “latent” space), which acts very much like a video compressor - this is why the noise videos don’t look like pure salt-and-pepper noise.
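For intuition, here is a minimal sketch (not the pipeline actually used for the video) of decoding pure Gaussian latent noise through a publicly available Stable Diffusion VAE; the decoder's convolutions spatially correlate the noise, which is why it stops looking like per-pixel static. The checkpoint name and the omitted latent scaling are assumptions for illustration:

```python
import torch
from diffusers import AutoencoderKL

# Publicly available Stable Diffusion VAE (assumed checkpoint, not the one used in the video).
vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse")
vae.eval()

# i.i.d. Gaussian noise in the 4-channel, 8x-downsampled latent space (64x64 latent -> 512x512 pixels).
latent = torch.randn(1, 4, 64, 64)

with torch.no_grad():
    decoded = vae.decode(latent).sample  # shape (1, 3, 512, 512)

# The decoded frame shows blobby, spatially correlated structure rather than
# salt-and-pepper static, which is the effect described in the note above.
print(decoded.shape, decoded.std())
```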
6:15 CLIP: Although directly minimizing cosine similarity on a single batch would push mismatched vectors 180 degrees apart, in practice CLIP needs to spread concepts roughly uniformly over the hypersphere it operates on. For this reason, we animated these vectors as orthogonal-ish. See: https://proceedings.mlr.press/v119/wang20k/wang20k.pdf
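For reference, a minimal sketch of CLIP's symmetric contrastive objective (adapted from the pseudocode in the CLIP paper; variable names here are placeholders): matched image/text pairs are pulled together, while mismatched pairs only need to be relatively dissimilar, which in practice spreads concepts over the hypersphere rather than forcing exact 180-degree separation.

```python
import torch
import torch.nn.functional as F

def clip_loss(img_emb, txt_emb, temperature=0.07):
    # Project both sets of embeddings onto the unit hypersphere.
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)

    # Cosine-similarity logits for every image/text pair in the batch.
    logits = img_emb @ txt_emb.T / temperature

    # The matching caption for image i is text i.
    targets = torch.arange(len(img_emb))

    # Symmetric cross-entropy: each image must pick out its caption, and vice versa.
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets)) / 2

# Example with random 512-dim embeddings for a batch of 8 pairs.
print(clip_loss(torch.randn(8, 512), torch.randn(8, 512)))
```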
Per Chenyang Yuan: at 10:15, the blurry image that results when removing random noise in DDPM is probably due to a mismatch in noise levels when calling the denoiser. When the denoiser is called on x_{t-1} during DDPM sampling, it is expected to have a certain noise level (let's call it sigma_{t-1}). If you generate x_{t-1} from x_t without adding noise, then the noise present in x_{t-1} is always smaller than sigma_{t-1}. This causes the denoiser to remove too much noise, thus pointing towards the mean of the dataset.
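A sketch of a single DDPM sampling step (notation from the DDPM paper; eps_model is a hypothetical noise-prediction network) makes the point concrete: the fresh-noise term at the end is what keeps x_{t-1} at the noise level sigma_{t-1} the denoiser expects, and dropping it (add_noise=False) produces the over-smoothed, blurry result.

```python
import torch

def ddpm_step(x_t, t, eps_model, betas, add_noise=True):
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    eps = eps_model(x_t, t)  # predicted noise at step t
    # Posterior mean from the DDPM paper: (x_t - beta_t/sqrt(1 - alpha_bar_t) * eps) / sqrt(alpha_t)
    mean = (x_t - betas[t] / torch.sqrt(1 - alpha_bars[t]) * eps) / torch.sqrt(alphas[t])

    if t == 0 or not add_noise:
        # Skipping the noise injection leaves x_{t-1} less noisy than sigma_{t-1},
        # so the next denoiser call over-corrects toward the dataset mean (blur).
        return mean
    sigma_t = torch.sqrt(betas[t])  # one of the two variance choices in the paper
    return mean + sigma_t * torch.randn_like(x_t)
```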
The text conditioning input to Stable Diffusion is not the 512-dim text embedding vector, but the output of the layer before that, [with dimension 77x512](https://stackoverflow.com/a/79243065).
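You can check these shapes with the Hugging Face transformers CLIP text encoder (the ViT-B/32 variant, whose width is 512; treat this as an illustrative stand-in rather than Stable Diffusion's exact text encoder):

```python
from transformers import CLIPTokenizer, CLIPTextModel

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
text_model = CLIPTextModel.from_pretrained("openai/clip-vit-base-patch32")

tokens = tokenizer(["a corgi playing a trumpet"], padding="max_length",
                   max_length=77, return_tensors="pt")
out = text_model(**tokens)

print(out.last_hidden_state.shape)  # (1, 77, 512): per-token features, the kind of sequence output used for conditioning
print(out.pooler_output.shape)      # (1, 512): the single text embedding vector
```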
For the vectors at 31:40 - Some implementations use f(x, t, cat) + alpha*(f(x, t, cat) - f(x, t)), and some use f(x, t) + alpha*(f(x, t, cat) - f(x, t)), where an alpha value of 1 corresponds to no guidance. I chose the second form here to keep things simpler.
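In code, the second form is just a linear extrapolation from the unconditional toward the conditional prediction (f, x, t, and cond are placeholders, not the video's actual code):

```python
def guided_prediction(f, x, t, cond, alpha):
    uncond = f(x, t, None)   # unconditional prediction f(x, t)
    cond_p = f(x, t, cond)   # conditional prediction f(x, t, cat)
    # alpha = 1 returns the purely conditional prediction (no guidance);
    # alpha > 1 pushes further along the (conditional - unconditional) direction.
    return uncond + alpha * (cond_p - uncond)
```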
At 30:30, the unconditional t=1 vector field looks a bit different from what it did at the 17:15 mark. This is the result of different models trained for different parts of the video, and likely a result of different random initializations.
Premium Beat Music ID: EEDYZ3FP44YX8OWT