Denoising Diffusion Models - Part 1
Denoising Diffusion Models - Theory Notes¶
Intro¶
This is not intended to be an exhaustive overview of DDPM theory - the following resources are helpful for understanding more;
SDE theory: https://arxiv.org/pdf/2011.13456.pdf
DDPM paper: https://arxiv.org/pdf/2006.11239.pdf
score matching theory: https://random-walks.org/content/misc/score-matching/score-matching.html
Agenda¶
- Cover the intuition of Denoising Diffusion models.
- Diffusion Models as ELBO maximisation.
- Applying reverse process conditioning to arrive at the simplified loss function.
The intuition¶
We want to build a generative model for our data. So, we devise the following recipe;
- Take each data point, and add a small amount of Gaussian noise.
- Repeat this step adding more and more noise up to some fixed number of steps, T.
- We should pick T and the size of the noise so that by the time we have applied T steps of noise, our original data is indistinguishable from Gaussian noise.
- Now, we learn a neural network that learns to reverse this process;
- To train it, we pick a $t$ between $1$ and $T$, and try to predict $x_{t-1}$ using $x_t$ and which step we are on, $t$.
- However, it's even easier; $x_t = x_{t-1} + \epsilon$, so all we need to do is predict the noise.
- Now sampling is clear! We start with some random noise, and gradually use our model to remove noise a bit at a time, until we get something that isn't noise.
- If all this works well, our neural network is a (recursively applied) mapping from parts of the Gaussian domain onto structured spaces (for example, images), and so we can turn noise into unseen samples.
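The forward (noising) half of this recipe is easy to demonstrate. The sketch below is a minimal NumPy illustration; the linear schedule of noise sizes $\beta_t$ and the variance-preserving update are assumptions taken from the DDPM paper rather than anything fixed by the bullets above:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)  # assumed linear noise schedule (from the DDPM paper)

# A deliberately non-Gaussian "dataset": 10,000 copies of the value 3.0.
x = np.full(10_000, 3.0)
for beta in betas:
    # One variance-preserving noising step:
    #   x_t = sqrt(1 - beta_t) * x_{t-1} + sqrt(beta_t) * eps,  eps ~ N(0, 1)
    x = np.sqrt(1.0 - beta) * x + np.sqrt(beta) * rng.standard_normal(x.shape)

# After T steps the samples are statistically indistinguishable from N(0, 1):
# the mean is near 0 and the standard deviation near 1.
print(x.mean(), x.std())
```

The $\sqrt{1-\beta_t}$ shrink on the previous sample is what stops the variance growing without bound as noise is added step after step.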
ELBO Maximisation¶
We have a model, which gives us the probability of our data, and has parameters $\theta$; $P_\theta(X)$.
However, the construction of the model is not amenable to computing $P$ directly;
$$\log P_\theta(X_0) = \log \int_{X_1, X_2, \dots, X_T} P_\theta(X_0, X_1, \dots, X_T)\, dX_1\, dX_2 \dots dX_T$$
This is the usual problem of variational inference, and so applying the ELBO gets us out of the mess.
What is different here is the direction - we are fixing the posterior and learning a model.
$$\log \int_{X_1, \dots, X_T} P_\theta(X_0, X_1, \dots, X_T)\, dX_1 \dots dX_T = \log \int_{X_1, \dots, X_T} P_\theta(X_0, X_1, \dots, X_T)\, \frac{q(X_1, \dots, X_T | X_0)}{q(X_1, \dots, X_T | X_0)}\, dX_1 \dots dX_T$$
$$= \log \mathbb{E}_q\left[\frac{P_\theta(X_0, X_1, \dots, X_T)}{q(X_1, \dots, X_T | X_0)}\right]$$
$$\geq \mathbb{E}_q\left[\log \frac{P_\theta(X_0, X_1, \dots, X_T)}{q(X_1, \dots, X_T | X_0)}\right] = \mathcal{L}$$
where the inequality is Jensen's.
And we have the Markov decompositions $q(X_1, \dots, X_T | X_0) = \prod_{t=1}^{T} q(X_t | X_{t-1})$ and $P_\theta(X_0, X_1, \dots, X_T) = P(X_T) \prod_{t=1}^{T} P_\theta(X_{t-1} | X_t)$.
Applying these, our loss (the negative ELBO) becomes;
$$-\mathcal{L} = \mathbb{E}_q\left[-\log P(X_T) - \sum_{t \geq 1} \log \frac{P_\theta(X_{t-1} | X_t)}{q(X_t | X_{t-1})}\right]$$
Reverse process conditioning¶
Next, a small rewrite of this objective, to express the loss as a sum of KL divergences, helps us because we have very efficient tools for computing the KL divergence between two Gaussians.
$$\mathbb{E}_q\left[-\log P(X_T) - \sum_{t > 1} \log \frac{P_\theta(X_{t-1} | X_t)}{q(X_t | X_{t-1})} - \log \frac{P_\theta(X_0 | X_1)}{q(X_1 | X_0)}\right]$$
Bayes (and conditional independence) gives us;
$$q(X_t | X_{t-1}) = q(X_t | X_{t-1}, X_0) = \frac{q(X_{t-1} | X_t, X_0)\, q(X_t | X_0)}{q(X_{t-1} | X_0)}$$
Also, if we have a term like $\sum_{t > 1} \log\left[\frac{q(X_{t-1} | X_t, X_0)\, q(X_t | X_0)}{q(X_{t-1} | X_0)}\right]$
We can rearrange to simplify like so;
$$\left[\sum_{t > 1} \log q(X_{t-1} | X_t, X_0)\right] + \left[\sum_{t > 1} \log \frac{q(X_t | X_0)}{q(X_{t-1} | X_0)}\right] = \left[\sum_{t > 1} \log q(X_{t-1} | X_t, X_0)\right] + \left[\sum_{t > 1} \log q(X_t | X_0)\right] - \left[\sum_{t > 1} \log q(X_{t-1} | X_0)\right]$$
As you can see from the last two terms, we can do a lot of cancelling and write it as:
$$= \left[\sum_{t > 1} \log q(X_{t-1} | X_t, X_0)\right] + \log q(X_T | X_0) - \log q(X_1 | X_0)$$
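The cancellation in the last two sums is pure algebra (a telescoping sum), so it can be sanity-checked numerically with arbitrary positive stand-ins for the $q(X_t | X_0)$ terms (the values below are random and purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
T = 10
q = rng.uniform(0.1, 1.0, size=T + 1)  # q[t] stands in for q(X_t | X_0)

# Sum over t > 1 of log( q(X_t | X_0) / q(X_{t-1} | X_0) ) ...
lhs = sum(np.log(q[t] / q[t - 1]) for t in range(2, T + 1))
# ... telescopes down to log q(X_T | X_0) - log q(X_1 | X_0).
rhs = np.log(q[T]) - np.log(q[1])
print(np.isclose(lhs, rhs))  # → True
```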
Applying this to our loss function, we end up with;
$$= \mathbb{E}_q\left[-\log \frac{P(X_T)}{q(X_T | X_0)} - \sum_{t > 1} \log \frac{P_\theta(X_{t-1} | X_t)}{q(X_{t-1} | X_t, X_0)} - \log P_\theta(X_0 | X_1)\right]$$
Which is our desired result of a sum of KL divergences (plus a small extra term).
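This form is useful precisely because the KL divergence between two Gaussians has a cheap closed form. A minimal univariate check of that formula against a Monte Carlo estimate (the parameter values here are arbitrary):

```python
import numpy as np

def kl_gaussians(mu1, sigma1, mu2, sigma2):
    """Closed-form KL( N(mu1, sigma1^2) || N(mu2, sigma2^2) )."""
    return (np.log(sigma2 / sigma1)
            + (sigma1 ** 2 + (mu1 - mu2) ** 2) / (2.0 * sigma2 ** 2)
            - 0.5)

# Monte Carlo estimate: KL = E_{x ~ N1}[ log N1(x) - log N2(x) ]
rng = np.random.default_rng(0)
mu1, s1, mu2, s2 = 0.5, 1.2, 0.0, 1.0
x = mu1 + s1 * rng.standard_normal(1_000_000)
log_n1 = -0.5 * ((x - mu1) / s1) ** 2 - np.log(s1) - 0.5 * np.log(2.0 * np.pi)
log_n2 = -0.5 * ((x - mu2) / s2) ** 2 - np.log(s2) - 0.5 * np.log(2.0 * np.pi)

# The two numbers agree to a few decimal places.
print(kl_gaussians(mu1, s1, mu2, s2), (log_n1 - log_n2).mean())
```

No sampling is needed at training time, of course - the closed form is exactly why the KL rewrite is worth doing.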
Last steps¶
The final steps in achieving the final loss function are;
- Apply the formula for the KLD between two Gaussians.
- Use straightforward sums of Gaussians to compute $q(X_t | X_0)$ for any $t$.
- Reparameterise the neural network from predicting the mean (the std is fixed here) to predicting the change in the mean, i.e. the noise.
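The second bullet refers to the standard closed form $q(X_t | X_0) = \mathcal{N}(\sqrt{\bar\alpha_t} X_0, (1 - \bar\alpha_t) I)$ with $\bar\alpha_t = \prod_{s \le t}(1 - \beta_s)$; the schedule and the $\bar\alpha$ notation are assumptions borrowed from the DDPM paper, not derived above. A quick numerical check that iterating the small steps matches the one-shot closed form in distribution:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 200
betas = np.linspace(1e-4, 0.02, T)     # assumed linear schedule
alpha_bar = np.cumprod(1.0 - betas)    # alpha_bar[t] = prod_{s <= t} (1 - beta_s)

x0, n = 2.0, 100_000

# Iterated noising: T small variance-preserving Gaussian steps.
x = np.full(n, x0)
for beta in betas:
    x = np.sqrt(1.0 - beta) * x + np.sqrt(beta) * rng.standard_normal(n)

# One-shot sample from the closed form q(X_T | X_0).
y = (np.sqrt(alpha_bar[-1]) * x0
     + np.sqrt(1.0 - alpha_bar[-1]) * rng.standard_normal(n))

# The two sample clouds should have matching mean and spread.
print(x.mean(), y.mean(), x.std(), y.std())
```

Being able to jump straight to any $t$ like this is what makes training cheap: each minibatch step samples a random $t$ and noises $X_0$ in one shot rather than looping.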