Denoising Diffusion Models - Theory Notes

Intro

This is not intended to be an exhaustive overview of DDPM theory - the following resources are helpful for understanding more:

SDE theory: https://arxiv.org/pdf/2011.13456.pdf
DDPM paper: https://arxiv.org/pdf/2006.11239.pdf
score matching theory: https://random-walks.org/content/misc/score-matching/score-matching.html

Agenda

  1. Cover the intuition of Denoising Diffusion models.
  2. Diffusion Models as ELBO maximisation.
  3. Applying reverse process conditioning to arrive at the simplified loss function.

The intuition

We want to build a generative model for our data. So, we devise the following recipe (sketched in code after this list):

  • Take each data point and add a small amount of Gaussian noise.
  • Repeat this step, adding more and more noise, up to some fixed number of steps, T.
  • We should pick T and the size of the noise so that by the time we have applied T steps of noise, our original data is indistinguishable from Gaussian noise.
  • Now, we learn a neural network that learns to reverse this process;
  • To train it, we pick a t between 1 and T, and try to predict $x_{t-1}$ using $x_{t}$ and which step we are on, t.
  • However, it's even easier: up to a per-step rescaling, $x_{t} = x_{t-1} + \epsilon$ for Gaussian noise $\epsilon$, so all we need to do is predict the noise.
  • Now sampling is clear! We start with some random noise, and gradually use our model to remove noise a bit at a time, until we get something that isn't noise.
  • If all this works well, our neural network is a (recursively applied) mapping from parts of the Gaussian domain onto structured spaces (for example, images), and so we can turn noise into unseen samples.
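As a concrete anchor for this recipe, here is a minimal numpy sketch. The linear schedule, the dimensions, and the placeholder `predict_noise` are illustrative assumptions, not choices prescribed by the papers above; a trained network would replace the placeholder.

```python
import numpy as np

# Illustrative linear noise schedule; beta_t is the noise variance added at step t.
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

rng = np.random.default_rng(0)

def forward_step(x_prev, t):
    """One forward step: q(x_t | x_{t-1}) = N(sqrt(1 - beta_t) * x_{t-1}, beta_t * I)."""
    eps = rng.standard_normal(x_prev.shape)
    return np.sqrt(1.0 - betas[t]) * x_prev + np.sqrt(betas[t]) * eps

# Noising: after T steps, a data point is essentially pure Gaussian noise.
x = rng.standard_normal(8)  # stand-in for one data point
for t in range(T):
    x = forward_step(x, t)

def predict_noise(x_t, t):
    """Hypothetical trained network eps_theta(x_t, t); a real model goes here."""
    return np.zeros_like(x_t)

# Sampling: start from noise and remove a little of it at each step, using the
# posterior mean written in terms of the predicted noise (DDPM-style update).
x = rng.standard_normal(8)
for t in reversed(range(T)):
    eps_hat = predict_noise(x, t)
    mean = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps_hat) / np.sqrt(alphas[t])
    z = rng.standard_normal(x.shape) if t > 0 else 0.0
    x = mean + np.sqrt(betas[t]) * z
```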

ELBO Maximisation

We have a model with parameters $\theta$ which gives us the probability of our data: $P_{\theta}(X)$.

However, the construction of the model is not amenable to computing $P_{\theta}(X_{0})$ directly: $\log P_{\theta}(X_{0}) = \log \int P_{\theta}(X_{0}, X_{1}, \ldots, X_{T})\, dX_{1}\, dX_{2} \cdots dX_{T}$

This is the usual problem of variational inference, and applying the ELBO gets us out of the mess.

What is different here is the direction - we fix the (approximate) posterior $q$, the forward noising process, and learn the model $P_{\theta}$.

$\log \int P_{\theta}(X_{0}, X_{1}, \ldots, X_{T})\, dX_{1} \cdots dX_{T} = \log \int P_{\theta}(X_{0}, X_{1}, \ldots, X_{T}) \frac{q(X_{1}, \ldots, X_{T}|X_{0})}{q(X_{1}, \ldots, X_{T}|X_{0})}\, dX_{1} \cdots dX_{T}$

$= \log E_{q}\left[\frac{P_{\theta}(X_{0}, X_{1}, \ldots, X_{T})}{q(X_{1}, \ldots, X_{T}|X_{0})}\right]$

$\geq E_{q}\left[\log \frac{P_{\theta}(X_{0}, X_{1}, \ldots, X_{T})}{q(X_{1}, \ldots, X_{T}|X_{0})}\right] = \mathcal{L}$ (by Jensen's inequality, since $\log$ is concave)

And we have the Markov decompositions $q(X_{1}, \ldots, X_{T}|X_{0}) = \prod_{t=1}^{T} q(X_{t}|X_{t-1})$ and $P_{\theta}(X_{0}, X_{1}, \ldots, X_{T}) = P(X_{T}) \prod_{t=1}^{T} P_{\theta}(X_{t-1}|X_{t})$, where $P(X_{T})$ is the fixed Gaussian prior.

Applying these, and negating so that we have a loss to minimise, we get:

$-\mathcal{L} = E_{q}\left[-\log P(X_{T}) - \sum_{t \geq 1} \log \frac{P_{\theta}(X_{t-1}|X_{t})}{q(X_{t}|X_{t-1})}\right]$

Reverse process conditioning

Next, a small rewrite of this objective to express the loss as a sum of KL divergences helps us, because the KL divergence between two Gaussians has an efficient closed form.

Separating out the $t = 1$ term of the sum:

$-\mathcal{L} = E_{q}\left[-\log P(X_{T}) - \sum_{t > 1} \log \frac{P_{\theta}(X_{t-1}|X_{t})}{q(X_{t}|X_{t-1})} - \log \frac{P_{\theta}(X_{0}|X_{1})}{q(X_{1}|X_{0})}\right]$

Bayes (and conditional independence, via the Markov property) gives us:

$q(X_{t}|X_{t-1}) = q(X_{t}|X_{t-1}, X_{0}) = \frac{q(X_{t-1}|X_{t}, X_{0}) \cdot q(X_{t}|X_{0})}{q(X_{t-1}|X_{0})}$
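As an aside we will rely on later: assuming the standard DDPM forward step $q(X_{t}|X_{t-1}) = \mathcal{N}(\sqrt{1 - \beta_{t}}\, X_{t-1}, \beta_{t} I)$, with $\alpha_{t} = 1 - \beta_{t}$ and $\bar{\alpha}_{t} = \prod_{s=1}^{t} \alpha_{s}$ (notation from the DDPM paper rather than these notes), this reversed conditional is itself Gaussian with closed-form parameters:

$q(X_{t-1}|X_{t}, X_{0}) = \mathcal{N}(X_{t-1};\, \tilde{\mu}_{t}(X_{t}, X_{0}),\, \tilde{\beta}_{t} I)$, where $\tilde{\mu}_{t}(X_{t}, X_{0}) = \frac{\sqrt{\bar{\alpha}_{t-1}}\, \beta_{t}}{1 - \bar{\alpha}_{t}} X_{0} + \frac{\sqrt{\alpha_{t}}\, (1 - \bar{\alpha}_{t-1})}{1 - \bar{\alpha}_{t}} X_{t}$ and $\tilde{\beta}_{t} = \frac{1 - \bar{\alpha}_{t-1}}{1 - \bar{\alpha}_{t}} \beta_{t}$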

Substituting this into the loss, the $q(X_{t}|X_{t-1})$ factors contribute a term $\sum_{t > 1} \log \left[q(X_{t-1}|X_{t}, X_{0}) \cdot \frac{q(X_{t}|X_{0})}{q(X_{t-1}|X_{0})}\right]$

We can rearrange to simplify like so:

$\left[\sum_{t > 1} \log q(X_{t-1}|X_{t}, X_{0})\right] + \left[\sum_{t > 1} \log \frac{q(X_{t}|X_{0})}{q(X_{t-1}|X_{0})}\right] = \left[\sum_{t > 1} \log q(X_{t-1}|X_{t}, X_{0})\right] + \left[\sum_{t > 1} \log q(X_{t}|X_{0})\right] - \left[\sum_{t > 1} \log q(X_{t-1}|X_{0})\right]$

As you can see, the last two sums telescope, so we can cancel almost everything and write this as:

$= \left[\sum_{t > 1} \log q(X_{t-1}|X_{t}, X_{0})\right] + \log q(X_{T}|X_{0}) - \log q(X_{1}|X_{0})$
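To see the cancelling concretely, with $T = 3$ the last two sums are $[\log q(X_{2}|X_{0}) + \log q(X_{3}|X_{0})] - [\log q(X_{1}|X_{0}) + \log q(X_{2}|X_{0})] = \log q(X_{3}|X_{0}) - \log q(X_{1}|X_{0})$.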

Applying this to our loss function (the leftover $-\log q(X_{1}|X_{0})$ cancels against the $t = 1$ term we separated out earlier), we end up with:

$= E_{q}\left[-\log \frac{P(X_{T})}{q(X_{T}|X_{0})} - \sum_{t > 1} \log \frac{P_{\theta}(X_{t-1}|X_{t})}{q(X_{t-1}|X_{t}, X_{0})} - \log P_{\theta}(X_{0}|X_{1})\right]$

This is our desired result: taking the inner expectation, each summand becomes $D_{KL}(q(X_{t-1}|X_{t}, X_{0}) \| P_{\theta}(X_{t-1}|X_{t}))$, a KL divergence between two Gaussians, leaving only a reconstruction term $-\log P_{\theta}(X_{0}|X_{1})$ and a first term that is constant in $\theta$.
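To make the "efficient closed form" claim concrete, here is a small numpy sketch of the KL divergence between two diagonal Gaussians; the function name and shapes are my own choices, but the per-dimension formula itself is standard.

```python
import numpy as np

def gaussian_kl(mu_q, var_q, mu_p, var_p):
    """KL( N(mu_q, diag(var_q)) || N(mu_p, diag(var_p)) ), summed over dimensions.

    Closed form per dimension:
      0.5 * [ log(var_p / var_q) + (var_q + (mu_q - mu_p)**2) / var_p - 1 ]
    """
    return 0.5 * np.sum(
        np.log(var_p / var_q) + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0
    )

# Sanity check: the KL of a distribution with itself is zero.
mu = np.array([0.5, -1.0]); var = np.array([0.2, 0.3])
assert np.isclose(gaussian_kl(mu, var, mu, var), 0.0)
```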

Last steps

The remaining steps to arrive at the simplified loss function (sketched in code after this list) are:

  1. Apply the closed-form formula for the KL divergence between two Gaussians.
  2. Use straightforward sums of Gaussians to compute $q(X_{t}|X_{0})$ in closed form for any t.
  3. Reparameterise the neural network from predicting the mean (the std is fixed here) to predicting the offset from the mean, i.e. the noise.
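Putting steps 2 and 3 together, a minimal numpy sketch; the schedule and the placeholder model are assumptions carried over from the earlier sketch, and the MSE-on-noise form is the DDPM paper's "simple" objective with the per-step KL weights dropped.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)    # illustrative schedule, as before
alpha_bars = np.cumprod(1.0 - betas)  # abar_t = prod_{s<=t} (1 - beta_s)

rng = np.random.default_rng(0)

def q_sample(x0, t):
    """Step 2: q(x_t | x_0) = N(sqrt(abar_t) * x0, (1 - abar_t) * I),
    so x_t is reachable in one shot rather than by iterating t times."""
    eps = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps
    return xt, eps

def simple_loss(predict_noise, x0):
    """Step 3: after reparameterisation, training reduces to an MSE on the noise."""
    t = int(rng.integers(T))  # a step chosen uniformly; arrays here are 0-indexed
    xt, eps = q_sample(x0, t)
    return np.mean((predict_noise(xt, t) - eps) ** 2)

x0 = rng.standard_normal(8)                # stand-in data point
dummy = lambda xt, t: np.zeros_like(xt)    # hypothetical eps_theta placeholder
print(simple_loss(dummy, x0))
```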