But how do AI videos actually work? | Guest video by @WelchLabsVideo
How diffusion models and CLIP turn text into images
Welch Labs Book: https://www.welchlabs.com/resources/imaginary-numbers-book
Sections
0:00 - Intro
3:37 - CLIP
6:25 - Shared Embedding Space
8:16 - Diffusion Models & DDPM
11:44 - Learning Vector Fields
22:00 - DDIM
25:25 - DALL-E 2
26:37 - Conditioning
30:02 - Guidance
33:39 - Negative Prompts
34:27 - Outro
35:32 - About guest videos + Grant’s Reaction
Special Thanks to:
Jonathan Ho - Jonathan is the author of the DDPM paper and the Classifier-Free Guidance paper:
https://arxiv.org/pdf/2006.11239
https://arxiv.org/pdf/2207.12598
Preetum Nakkiran - Preetum has an excellent introductory diffusion tutorial:
https://arxiv.org/pdf/2406.08929
Chenyang Yuan - Many of the animations in this video were implemented using manim and Chenyang’s smalldiffusion library: https://github.com/yuanchenyang/smalldiffusion
Chenyang also has a terrific tutorial and an MIT course on diffusion models:
https://www.chenyang.co/diffusion.html
https://www.practical-diffusion.org/
Other References
All of Sander Dieleman’s diffusion blog posts are fantastic: https://sander.ai/
CLIP Paper: https://arxiv.org/pdf/2103.00020
DDIM Paper: https://arxiv.org/pdf/2010.02502
Score-Based Generative Modeling: https://arxiv.org/pdf/2011.13456
Wan2.1: https://github.com/Wan-Video/Wan2.1
Stable Diffusion: https://huggingface.co/stabilityai/stable-diffusion-2
Midjourney: https://www.midjourney.com/
Veo: https://deepmind.google/models/veo/
DALL-E 2 paper: https://cdn.openai.com/papers/dall-e-2.pdf
Code for this video: https://github.com/stephencwelch/manim_videos/tree/master/_2025/sora
Written by: Stephen Welch, with very helpful feedback from Grant Sanderson
Produced by: Stephen Welch, Sam Baskin, and Pranav Gundu
Technical Notes
The noise videos in the opening have been passed through a VAE (the diffusion process actually happens in a compressed “latent” space), which acts very much like a video compressor - this is why the noise videos don’t look like pure salt-and-pepper noise.
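For intuition, here is a minimal sketch (not the pipeline actually used for the video) of decoding pure Gaussian latent noise through a publicly available Stable Diffusion VAE; the decoder's convolutions spatially correlate the noise, which is why it stops looking like per-pixel static. The checkpoint name and the omitted latent scaling are assumptions for illustration:

```python
import torch
from diffusers import AutoencoderKL

# Publicly available Stable Diffusion VAE (assumed checkpoint, not the one used in the video).
vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse")
vae.eval()

# i.i.d. Gaussian noise in the 4-channel, 8x-downsampled latent space (64x64 latent -> 512x512 pixels).
latent = torch.randn(1, 4, 64, 64)

with torch.no_grad():
    decoded = vae.decode(latent).sample  # shape (1, 3, 512, 512)

# The decoded frame shows blobby, spatially correlated structure rather than
# salt-and-pepper static, which is the effect described in the note above.
print(decoded.shape, decoded.std())
```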
6:15 CLIP: Although directly minimizing cosine similarity on a single batch would push mismatched vectors 180 degrees apart, in practice CLIP needs to spread concepts roughly uniformly over the hypersphere it operates on. For this reason, we animated these vectors as orthogonal-ish. See: https://proceedings.mlr.press/v119/wang20k/wang20k.pdf
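For reference, a minimal sketch of CLIP's symmetric contrastive objective (adapted from the pseudocode in the CLIP paper; variable names here are placeholders): matched image/text pairs are pulled together, while mismatched pairs only need to be relatively dissimilar, which in practice spreads concepts over the hypersphere rather than forcing exact 180-degree separation.

```python
import torch
import torch.nn.functional as F

def clip_loss(img_emb, txt_emb, temperature=0.07):
    # Project both sets of embeddings onto the unit hypersphere.
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)

    # Cosine-similarity logits for every image/text pair in the batch.
    logits = img_emb @ txt_emb.T / temperature

    # The matching caption for image i is text i.
    targets = torch.arange(len(img_emb))

    # Symmetric cross-entropy: each image must pick out its caption, and vice versa.
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets)) / 2

# Example with random 512-dim embeddings for a batch of 8 pairs.
print(clip_loss(torch.randn(8, 512), torch.randn(8, 512)))
```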
Per Chenyang Yuan: at 10:15, the blurry image that results when removing random noise in DDPM is probably due to a mismatch in noise levels when calling the denoiser. When the denoiser is called on x_{t-1} during DDPM sampling, it is expected to have a certain noise level (let's call it sigma_{t-1}). If you generate x_{t-1} from x_t without adding noise, then the noise present in x_{t-1} is always smaller than sigma_{t-1}. This causes the denoiser to remove too much noise, thus pointing towards the mean of the dataset.
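A sketch of a single DDPM sampling step (notation from the DDPM paper; eps_model is a hypothetical noise-prediction network) makes the point concrete: the fresh-noise term at the end is what keeps x_{t-1} at the noise level sigma_{t-1} the denoiser expects, and dropping it (add_noise=False) produces the over-smoothed, blurry result.

```python
import torch

def ddpm_step(x_t, t, eps_model, betas, add_noise=True):
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    eps = eps_model(x_t, t)  # predicted noise at step t
    # Posterior mean from the DDPM paper: (x_t - beta_t/sqrt(1 - alpha_bar_t) * eps) / sqrt(alpha_t)
    mean = (x_t - betas[t] / torch.sqrt(1 - alpha_bars[t]) * eps) / torch.sqrt(alphas[t])

    if t == 0 or not add_noise:
        # Skipping the noise injection leaves x_{t-1} less noisy than sigma_{t-1},
        # so the next denoiser call over-corrects toward the dataset mean (blur).
        return mean
    sigma_t = torch.sqrt(betas[t])  # one of the two variance choices in the paper
    return mean + sigma_t * torch.randn_like(x_t)
```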
The text conditioning input to Stable Diffusion is not the 512-dim text embedding vector, but the output of the layer before that, [with dimension 77x512](https://stackoverflow.com/a/79243065).
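You can check these shapes with the Hugging Face transformers CLIP text encoder (the ViT-B/32 variant, whose width is 512; treat this as an illustrative stand-in rather than Stable Diffusion's exact text encoder):

```python
from transformers import CLIPTokenizer, CLIPTextModel

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
text_model = CLIPTextModel.from_pretrained("openai/clip-vit-base-patch32")

tokens = tokenizer(["a corgi playing a trumpet"], padding="max_length",
                   max_length=77, return_tensors="pt")
out = text_model(**tokens)

print(out.last_hidden_state.shape)  # (1, 77, 512): per-token features, the kind of sequence output used for conditioning
print(out.pooler_output.shape)      # (1, 512): the single text embedding vector
```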
For the vectors at 31:40 - Some implementations use f(x, t, cat) + alpha*(f(x, t, cat) - f(x, t)), and some use f(x, t) + alpha*(f(x, t, cat) - f(x, t)), where an alpha value of 1 corresponds to no guidance. I chose the second form here to keep things simpler.
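In code, the second form is just a linear extrapolation from the unconditional toward the conditional prediction (f, x, t, and cond are placeholders, not the video's actual code):

```python
def guided_prediction(f, x, t, cond, alpha):
    uncond = f(x, t, None)   # unconditional prediction f(x, t)
    cond_p = f(x, t, cond)   # conditional prediction f(x, t, cat)
    # alpha = 1 returns the purely conditional prediction (no guidance);
    # alpha > 1 pushes further along the (conditional - unconditional) direction.
    return uncond + alpha * (cond_p - uncond)
```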
At 30:30, the unconditional t=1 vector field looks a bit different from what it did at the 17:15 mark. This is the result of different models trained for different parts of the video, and likely a result of different random initializations.
Premium Beat Music ID: EEDYZ3FP44YX8OWT