Denoising Diffusion Models - Part 1
Denoising Diffusion Models - Theory Notes¶
Intro¶
This is not intended to be an exhaustive overview of DDPM theory - the following resources are helpful for understanding more;
SDE theory: https://arxiv.org/pdf/2011.13456.pdf
DDPM paper: https://arxiv.org/pdf/2006.11239.pdf
score matching theory: https://random-walks.org/content/misc/score-matching/score-matching.html
Agenda¶
- Cover the intuition of Denoising Diffusion models.
- Diffusion Models as ELBO maximisation.
- Applying reverse process conditioning to arrive at the simplified loss function.
The intuition¶
We want to build a generative model for our data. So, we devise the following recipe;
- Take each data point, and add a small amount of Gaussian noise.
- Repeat this step adding more and more noise up to some fixed number of steps, T.
- We should pick T and the size of the noise so that by the time we have applied T steps of noise, our original data is indistinguishable from Gaussian noise.
- Now, we learn a neural network that learns to reverse this process;
- To train it, we pick a $t$ between $1$ and $T$, and try to predict $x_{t-1}$ using $x_t$ and which step we are on, $t$.
- However, it's even easier; $x_t = x_{t-1} + \epsilon$, so all we need to do is predict the noise.
- Now sampling is clear! We start with some random noise, and gradually use our model to remove noise a bit at a time, until we get something that isn't noise.
- If all this works well, our neural network is a (recursively applied) mapping from parts of the Gaussian domain onto structured spaces (for example, images), and so we can turn noise into unseen samples.
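The forward (noising) half of this recipe is easy to demonstrate. The sketch below is a minimal NumPy illustration; the linear schedule of noise sizes $\beta_t$ and the variance-preserving update are assumptions taken from the DDPM paper rather than anything fixed by the bullets above:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)  # assumed linear noise schedule (from the DDPM paper)

# A deliberately non-Gaussian "dataset": 10,000 copies of the value 3.0.
x = np.full(10_000, 3.0)
for beta in betas:
    # One variance-preserving noising step:
    #   x_t = sqrt(1 - beta_t) * x_{t-1} + sqrt(beta_t) * eps,  eps ~ N(0, 1)
    x = np.sqrt(1.0 - beta) * x + np.sqrt(beta) * rng.standard_normal(x.shape)

# After T steps the samples are statistically indistinguishable from N(0, 1):
# the mean is near 0 and the standard deviation near 1.
print(x.mean(), x.std())
```

The $\sqrt{1-\beta_t}$ shrink on the previous sample is what stops the variance growing without bound as noise is added step after step.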
ELBO Maximisation¶
We have a model, which gives us the probability of our data, and has parameters $\theta$; $P_\theta(X)$.
However, the construction of the model is not amenable to computing $P$ directly;
$$\log P_\theta(X_0) = \log \int_{X_1, X_2, \dots, X_T} P_\theta(X_0, X_1, \dots, X_T)\, dX_1\, dX_2 \dots dX_T$$
This is the usual problem of variational inference, and so applying the ELBO gets us out of the mess.
What is different here is the direction - we are fixing the posterior and learning a model.
$$\log \int_{X_1, \dots, X_T} P_\theta(X_0, X_1, \dots, X_T)\, dX_1 \dots dX_T = \log \int_{X_1, \dots, X_T} P_\theta(X_0, X_1, \dots, X_T)\, \frac{q(X_1, \dots, X_T | X_0)}{q(X_1, \dots, X_T | X_0)}\, dX_1 \dots dX_T$$
$$= \log \mathbb{E}_q\left[\frac{P_\theta(X_0, X_1, \dots, X_T)}{q(X_1, \dots, X_T | X_0)}\right]$$
$$\geq \mathbb{E}_q\left[\log \frac{P_\theta(X_0, X_1, \dots, X_T)}{q(X_1, \dots, X_T | X_0)}\right] = \mathcal{L}$$
where the inequality is Jensen's.
And we have the Markov decompositions $q(X_1, \dots, X_T | X_0) = \prod_{t=1}^{T} q(X_t | X_{t-1})$ and $P_\theta(X_0, X_1, \dots, X_T) = P(X_T) \prod_{t=1}^{T} P_\theta(X_{t-1} | X_t)$.
Applying these, our loss (the negative ELBO) becomes;
$$-\mathcal{L} = \mathbb{E}_q\left[-\log P(X_T) - \sum_{t \geq 1} \log \frac{P_\theta(X_{t-1} | X_t)}{q(X_t | X_{t-1})}\right]$$
Reverse process conditioning¶
Next, a small rewrite of this objective, to express the loss as a sum of KL divergences, helps us because we have very efficient tools for computing the KL divergence between two Gaussians.
$$\mathbb{E}_q\left[-\log P(X_T) - \sum_{t > 1} \log \frac{P_\theta(X_{t-1} | X_t)}{q(X_t | X_{t-1})} - \log \frac{P_\theta(X_0 | X_1)}{q(X_1 | X_0)}\right]$$
Bayes (and conditional independence) gives us;
$$q(X_t | X_{t-1}) = q(X_t | X_{t-1}, X_0) = \frac{q(X_{t-1} | X_t, X_0)\, q(X_t | X_0)}{q(X_{t-1} | X_0)}$$
Also, if we have a term like $\sum_{t > 1} \log\left[\frac{q(X_{t-1} | X_t, X_0)\, q(X_t | X_0)}{q(X_{t-1} | X_0)}\right]$
We can rearrange to simplify like so;
$$\left[\sum_{t > 1} \log q(X_{t-1} | X_t, X_0)\right] + \left[\sum_{t > 1} \log \frac{q(X_t | X_0)}{q(X_{t-1} | X_0)}\right] = \left[\sum_{t > 1} \log q(X_{t-1} | X_t, X_0)\right] + \left[\sum_{t > 1} \log q(X_t | X_0)\right] - \left[\sum_{t > 1} \log q(X_{t-1} | X_0)\right]$$
As you can see from the last two terms, we can do a lot of cancelling and write it as:
$$= \left[\sum_{t > 1} \log q(X_{t-1} | X_t, X_0)\right] + \log q(X_T | X_0) - \log q(X_1 | X_0)$$
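The cancellation in the last two sums is pure algebra (a telescoping sum), so it can be sanity-checked numerically with arbitrary positive stand-ins for the $q(X_t | X_0)$ terms (the values below are random and purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
T = 10
q = rng.uniform(0.1, 1.0, size=T + 1)  # q[t] stands in for q(X_t | X_0)

# Sum over t > 1 of log( q(X_t | X_0) / q(X_{t-1} | X_0) ) ...
lhs = sum(np.log(q[t] / q[t - 1]) for t in range(2, T + 1))
# ... telescopes down to log q(X_T | X_0) - log q(X_1 | X_0).
rhs = np.log(q[T]) - np.log(q[1])
print(np.isclose(lhs, rhs))  # → True
```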
Applying this to our loss function, we end up with;
$$= \mathbb{E}_q\left[-\log \frac{P(X_T)}{q(X_T | X_0)} - \sum_{t > 1} \log \frac{P_\theta(X_{t-1} | X_t)}{q(X_{t-1} | X_t, X_0)} - \log P_\theta(X_0 | X_1)\right]$$
Which is our desired result of a sum of KL divergences (plus a small extra term).
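This form is useful precisely because the KL divergence between two Gaussians has a cheap closed form. A minimal univariate check of that formula against a Monte Carlo estimate (the parameter values here are arbitrary):

```python
import numpy as np

def kl_gaussians(mu1, sigma1, mu2, sigma2):
    """Closed-form KL( N(mu1, sigma1^2) || N(mu2, sigma2^2) )."""
    return (np.log(sigma2 / sigma1)
            + (sigma1 ** 2 + (mu1 - mu2) ** 2) / (2.0 * sigma2 ** 2)
            - 0.5)

# Monte Carlo estimate: KL = E_{x ~ N1}[ log N1(x) - log N2(x) ]
rng = np.random.default_rng(0)
mu1, s1, mu2, s2 = 0.5, 1.2, 0.0, 1.0
x = mu1 + s1 * rng.standard_normal(1_000_000)
log_n1 = -0.5 * ((x - mu1) / s1) ** 2 - np.log(s1) - 0.5 * np.log(2.0 * np.pi)
log_n2 = -0.5 * ((x - mu2) / s2) ** 2 - np.log(s2) - 0.5 * np.log(2.0 * np.pi)

# The two numbers agree to a few decimal places.
print(kl_gaussians(mu1, s1, mu2, s2), (log_n1 - log_n2).mean())
```

No sampling is needed at training time, of course - the closed form is exactly why the KL rewrite is worth doing.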
Last steps¶
The final steps in achieving the final loss function are;
- Apply the formula for the KLD between two Gaussians.
- Use straightforward sums of Gaussians to compute $q(X_t | X_0)$ for any $t$.
- Reparameterise the neural network from predicting the mean (the std is fixed here) to predicting the change in the mean, i.e. the noise.
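The second bullet refers to the standard closed form $q(X_t | X_0) = \mathcal{N}(\sqrt{\bar\alpha_t} X_0, (1 - \bar\alpha_t) I)$ with $\bar\alpha_t = \prod_{s \le t}(1 - \beta_s)$; the schedule and the $\bar\alpha$ notation are assumptions borrowed from the DDPM paper, not derived above. A quick numerical check that iterating the small steps matches the one-shot closed form in distribution:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 200
betas = np.linspace(1e-4, 0.02, T)     # assumed linear schedule
alpha_bar = np.cumprod(1.0 - betas)    # alpha_bar[t] = prod_{s <= t} (1 - beta_s)

x0, n = 2.0, 100_000

# Iterated noising: T small variance-preserving Gaussian steps.
x = np.full(n, x0)
for beta in betas:
    x = np.sqrt(1.0 - beta) * x + np.sqrt(beta) * rng.standard_normal(n)

# One-shot sample from the closed form q(X_T | X_0).
y = (np.sqrt(alpha_bar[-1]) * x0
     + np.sqrt(1.0 - alpha_bar[-1]) * rng.standard_normal(n))

# The two sample clouds should have matching mean and spread.
print(x.mean(), y.mean(), x.std(), y.std())
```

Being able to jump straight to any $t$ like this is what makes training cheap: each minibatch step samples a random $t$ and noises $X_0$ in one shot rather than looping.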