Denoising Diffusion Models - Part 1
Denoising Diffusion Models - Theory Notes¶
Intro¶
This is not intended to be an exhaustive overview of DDPM theory; the following resources are helpful for understanding more:
SDE theory: https://arxiv.org/pdf/2011.13456.pdf
DDPM paper: https://arxiv.org/pdf/2006.11239.pdf
Score-matching theory: https://random-walks.org/content/misc/score-matching/score-matching.html
Agenda¶
- Cover the intuition of Denoising Diffusion models.
- Diffusion Models as ELBO maximisation.
- Applying reverse-process conditioning to achieve the simplified loss function.
The intuition¶
We want to build a generative model for our data. So, we devise the following recipe:
- Take each data point and add a small amount of Gaussian noise.
- Repeat this step, adding more and more noise, up to some fixed number of steps, T.
- We should pick T and the size of the noise so that by the time we have applied T steps of noise, our original data is indistinguishable from Gaussian noise.
- Now, we train a neural network to reverse this process:
- To train it, we pick a step t between 1 and T, and try to predict $x_{t-1}$ using $x_{t}$ and the step index t.
- However, it's even easier: $x_{t}$ is just $x_{t-1}$ plus Gaussian noise (up to scaling factors), so all we need to do is predict the noise.
- Now sampling is clear! We start with pure random noise, and gradually use our model to remove noise a bit at a time, until we get something that isn't noise.
- If all this works well, our neural network is a (recursively applied) mapping from parts of the Gaussian domain onto structured spaces (for example, images), and so we can turn noise into unseen samples; a code sketch of this recipe follows below.
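Here is a minimal sketch of the recipe in PyTorch, assuming the linear $\beta$ schedule from the DDPM paper and a hypothetical `model(x, t)` that predicts the added noise; this illustrates the intuition rather than being a full implementation:

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)      # linear noise schedule (values from the DDPM paper)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)  # cumulative products used by the closed-form q(x_t|x_0)

def add_noise(x0, t):
    """Forward process: jump straight to x_t ~ q(x_t | x_0)."""
    eps = torch.randn_like(x0)
    a_bar = alpha_bars[t]
    return a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * eps, eps

@torch.no_grad()
def sample(model, shape):
    """Reverse process: start from pure noise and denoise one step at a time."""
    x = torch.randn(shape)                 # x_T ~ N(0, I)
    for t in reversed(range(T)):
        eps_hat = model(x, t)              # the network's guess at the noise in x_t
        mean = (x - betas[t] / (1 - alpha_bars[t]).sqrt() * eps_hat) / alphas[t].sqrt()
        z = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + betas[t].sqrt() * z     # fixed per-step std of sqrt(beta_t)
    return x
```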
ELBO Maximisation¶
We have a model with parameters $\theta$ that assigns a probability to our data: $P_{\theta}(X)$.
However, the construction of the model is not amenable to computing $P_{\theta}$ directly, because the latents must be marginalised out: $\log P_{\theta}(X_{0}) = \log \int_{X_1, X_2, \dots, X_T} P_{\theta}(X_{0}, X_{1}, \dots, X_{T})\, dX_1\, dX_2 \dots dX_T$
This is the usual problem of variational inference, and so applying the ELBO gets us out of the mess.
What is different here is the direction: we fix the approximate posterior $q$ (the forward noising process) and learn the generative model $P_{\theta}$.
$\log \int_{X_1, X_2, \dots, X_T} P_{\theta}(X_{0}, X_{1}, \dots, X_{T})\, dX_1\, dX_2 \dots dX_T = \log \int_{X_1, X_2, \dots, X_T} P_{\theta}(X_{0}, X_{1}, \dots, X_{T}) \frac{q(X_{1},\dots,X_{T}|X_{0})}{q(X_{1},\dots,X_{T}|X_{0})}\, dX_1\, dX_2 \dots dX_T$
$= \log E_{q}\left[\frac{P_{\theta}(X_{0}, X_{1}, \dots, X_{T})}{q(X_{1},\dots,X_{T}|X_{0})}\right]$
$\geq E_{q}\left[\log \frac{P_{\theta}(X_{0}, X_{1}, \dots, X_{T})}{q(X_{1},\dots,X_{T}|X_{0})}\right] = \mathcal{L}$, by Jensen's inequality.
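As a quick sanity check of the Jensen step, here is a toy Monte Carlo comparison; the weights `w` are made-up stand-ins for the ratio $P_{\theta}(X_{0},\dots,X_{T})/q(X_{1},\dots,X_{T}|X_{0})$:

```python
import torch

torch.manual_seed(0)
w = torch.rand(100_000) + 0.1   # positive stand-ins for the ratio inside the expectation
print(torch.log(w.mean()))      # log E_q[w]  -> the log-likelihood side
print(torch.log(w).mean())      # E_q[log w] -> the ELBO side, always <= the line above
```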
And we have the Markov decompositions $q(X_{1},\dots,X_{T}|X_{0}) = \prod_{t=1}^{T} q(X_{t}|X_{t-1})$ and $P_{\theta}(X_{0}, X_{1}, \dots, X_{T}) = P(X_{T})\prod_{t=1}^{T} P_{\theta}(X_{t-1}|X_{t})$.
Applying these, and flipping the sign so that we have a loss to minimise, we get:
$-\mathcal{L} = E_{q}\left[-\log P(X_{T}) - \sum_{t \geq 1} \log \frac{P_{\theta}(X_{t-1}|X_{t})}{q(X_{t}|X_{t-1})}\right]$
Reverse process conditioning¶
Next, a small rewrite of this objective expresses the loss as a sum of KL divergences. This helps because the KL divergence between two Gaussians has a simple closed form that we can compute efficiently.
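For reference, here is a minimal sketch of that closed form for Gaussians with diagonal covariance (a standard result; the function name is ours):

```python
import torch

def gaussian_kl(mu1, var1, mu2, var2):
    """KL( N(mu1, var1) || N(mu2, var2) ) with diagonal covariances, summed over dimensions."""
    return 0.5 * torch.sum(torch.log(var2 / var1) + (var1 + (mu1 - mu2) ** 2) / var2 - 1.0)
```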
Separating out the $t = 1$ term:
$-\mathcal{L} = E_{q}\left[-\log P(X_{T}) - \sum_{t \gt 1} \log \frac{P_{\theta}(X_{t-1}|X_{t})}{q(X_{t}|X_{t-1})} - \log \frac{P_{\theta}(X_{0}|X_{1})}{q(X_{1}|X_{0})}\right]$
Bayes (and the conditional independence given by the Markov property) gives us:
$q(X_{t}|X_{t-1}) = q(X_{t}|X_{t-1}, X_{0}) = \frac{q(X_{t-1}|X_{t}, X_{0}) \cdot q(X_{t}|X_{0})}{q(X_{t-1}|X_{0})}$
Substituting this into the sum above, we get a term like $\sum_{t \gt 1} \log\left[ q(X_{t-1}|X_{t}, X_{0}) \cdot \frac{q(X_{t}|X_{0})}{q(X_{t-1}|X_{0})}\right]$
We can rearrange to simplify like so:
$\left[\sum_{t \gt 1} \log q(X_{t-1}|X_{t}, X_{0})\right] + \left[\sum_{t \gt 1} \log \frac{q(X_{t}|X_{0})}{q(X_{t-1}|X_{0})}\right] = \left[\sum_{t \gt 1} \log q(X_{t-1}|X_{t}, X_{0})\right] + \left[\sum_{t \gt 1} \log q(X_{t}|X_{0})\right] - \left[\sum_{t \gt 1} \log q(X_{t-1}|X_{0})\right]$
The last two sums telescope: every term except $q(X_{T}|X_{0})$ and $q(X_{1}|X_{0})$ appears once with each sign, so we can write it as:
$= \left[\sum_{t \gt 1} \log q(X_{t-1}|X_{t}, X_{0})\right] + \log q(X_{T}|X_{0}) - \log q(X_{1}|X_{0})$
Applying this to our loss function (the $\log q(X_{1}|X_{0})$ contributions cancel), we end up with:
$-\mathcal{L} = E_{q}\left[-\log \frac{P(X_{T})}{q(X_{T}|X_{0})} - \sum_{t \gt 1} \log \frac{P_{\theta}(X_{t-1}|X_{t})}{q(X_{t-1}|X_{t}, X_{0})} - \log P_{\theta}(X_{0}|X_{1})\right]$
This is our desired result: a sum of KL divergences, plus a small extra reconstruction term, $-\log P_{\theta}(X_{0}|X_{1})$.
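Written out term by term (this matches Eq. (5) of the DDPM paper):
$-\mathcal{L} = E_{q}\left[ D_{KL}\big(q(X_{T}|X_{0}) \,\|\, P(X_{T})\big) + \sum_{t \gt 1} D_{KL}\big(q(X_{t-1}|X_{t}, X_{0}) \,\|\, P_{\theta}(X_{t-1}|X_{t})\big) - \log P_{\theta}(X_{0}|X_{1})\right]$
The first term has no learnable parameters, so training reduces to matching each learned reverse step to the fixed forward posterior, plus the reconstruction term.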
Last steps¶
The final steps in reaching the simplified loss function (sketched in code below) are:
- Apply the closed-form KL divergence between two Gaussians.
- Use straightforward sums of Gaussians to compute $q(X_{t}|X_{0})$ in closed form for any t.
- Reparameterise the neural network from predicting the mean (the std is fixed here) to predicting the change in the mean, i.e. the noise.
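Putting the last two steps together, here is a minimal sketch of the simplified training loss, assuming a hypothetical noise-prediction network `model(x, t)`. It uses the closed form $q(X_{t}|X_{0}) = \mathcal{N}(\sqrt{\bar{\alpha}_{t}} X_{0}, (1 - \bar{\alpha}_{t}) I)$, where $\alpha_{t} = 1 - \beta_{t}$ and $\bar{\alpha}_{t} = \prod_{s \leq t} \alpha_{s}$, together with the noise-prediction MSE from the DDPM paper:

```python
import torch

def simple_loss(model, x0, betas):
    """Simplified DDPM loss: add noise at a random step t, then regress the noise."""
    T = betas.shape[0]
    alpha_bars = torch.cumprod(1.0 - betas, dim=0)

    # Pick a random step t per example, and fetch the matching cumulative alpha.
    t = torch.randint(0, T, (x0.shape[0],))
    a_bar = alpha_bars[t].view(-1, *([1] * (x0.dim() - 1)))  # broadcast over data dims

    # Sample x_t ~ q(x_t | x_0) in closed form: sqrt(a_bar) x_0 + sqrt(1 - a_bar) eps.
    eps = torch.randn_like(x0)
    xt = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * eps

    # The network predicts the noise; the simplified objective is plain MSE.
    return torch.mean((eps - model(xt, t)) ** 2)
```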