Deriving the Evidence Lower Bound

Recap

In the previous post, I talked about the different types of bounds we could use, and why we need them. As a quick recap, we are looking to compute the marginal of a joint distribution in a general setting:

$P(X) = \int_{Z} P(X, Z) dZ$

Previously, I showed why this can be incredibly difficult to compute using standard methods, such as numerical integration, because of the curse of dimensionality.
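
To make this concrete, here is a minimal sketch of the integral as a brute-force grid sum, assuming a toy conjugate model ($Z \sim N(0, 1)$, $X \mid Z \sim N(Z, 1)$; all the names and numbers are purely illustrative). In one dimension the grid works fine; with $d$ latent dimensions the same approach would need $n^d$ points, which is exactly the curse of dimensionality.

```python
import numpy as np
from scipy.stats import norm

# Toy conjugate model (illustrative only): Z ~ N(0, 1), X | Z ~ N(Z, 1).
x_obs = 1.5

# Brute-force grid marginalization: P(X) ~= sum_z P(X | z) P(z) dz.
# With d latent dimensions this grid would need n**d points.
z_grid = np.linspace(-10.0, 10.0, 2001)
dz = z_grid[1] - z_grid[0]
p_x_grid = np.sum(norm.pdf(x_obs, loc=z_grid) * norm.pdf(z_grid)) * dz

# This model is conjugate, so the marginal is known: X ~ N(0, 2).
p_x_exact = norm.pdf(x_obs, loc=0.0, scale=np.sqrt(2.0))
print(p_x_grid, p_x_exact)  # the two should agree to several decimals
```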

Evidence

The quantity we are going to derive, in two different ways, is the ELBO, or Evidence Lower BOund. As suggested by the name, it is a lower bound on (the log of) the so-called Model Evidence, also termed the probability of the data, $P(X)$.

Via Jensen's Inequality

Let us start with the premise that we wish to find $\log P(X)$.

$\log P(X) = \log \left[ \int_{Z} P(X,Z) \, dZ \right]$

Introducing another distribution over $Z$, $q(Z)$, we can multiply and divide by it to rewrite this as:

$\log P(X) = \log \left[ \int_{Z} P(X,Z) \frac{q(Z)}{q(Z)} \, dZ \right] = \log E_{q}\left[\frac{P(X,Z)}{q(Z)}\right]$

Since $\log$ is concave, Jensen's Inequality tells us that $\log E_{q}\left[\frac{P(X,Z)}{q(Z)}\right] \geq E_{q}\left[\log \frac{P(X,Z)}{q(Z)}\right]$.
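
As a quick numerical illustration of the inequality itself (a hedged sketch, with an arbitrary positive random variable standing in for the ratio $\frac{P(X,Z)}{q(Z)}$):

```python
import numpy as np

rng = np.random.default_rng(0)

# Jensen's Inequality for the concave log: log E[Y] >= E[log Y].
# Y here is an arbitrary positive random variable (lognormal), just
# to show the direction of the inequality on samples.
y = rng.lognormal(mean=0.0, sigma=1.0, size=100_000)
print(np.log(np.mean(y)), np.mean(np.log(y)))  # ~0.5 vs ~0.0
```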

Applying this inequality gives us:

$\log P(X) = \log E_{q}\left[\frac{P(X,Z)}{q(Z)}\right] \geq E_{q}\left[\log \frac{P(X,Z)}{q(Z)}\right]$

$\log P(X) \geq E_{q}\left[\log \frac{P(X,Z)}{q(Z)}\right] = \mathcal{L}$

From this, the role of the ELBO is obvious: it is a lower bound on the log of the "evidence", $P(X)$, so we can use it to get a tractable approximation to the evidence.
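
As a sanity check on the bound, here is a small Monte Carlo sketch reusing the toy Gaussian model from the recap, with an arbitrary (deliberately imperfect) choice of $q(Z)$; the estimate of $\mathcal{L}$ should land below the exact $\log P(X)$:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
x_obs = 1.5

# Toy model again: Z ~ N(0, 1), X | Z ~ N(Z, 1), so log P(X) is exact.
log_p_x = norm.logpdf(x_obs, loc=0.0, scale=np.sqrt(2.0))

# An arbitrary, deliberately imperfect q(Z) = N(0.5, 1.5^2).
q_mu, q_sigma = 0.5, 1.5
z = rng.normal(q_mu, q_sigma, size=100_000)

# ELBO = E_q[log P(X, Z) - log q(Z)], estimated by Monte Carlo.
log_joint = norm.logpdf(x_obs, loc=z) + norm.logpdf(z)
log_q = norm.logpdf(z, loc=q_mu, scale=q_sigma)
elbo = np.mean(log_joint - log_q)

print(elbo, log_p_x)  # the ELBO should come out below log P(X)
```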

Via KL-Divergence

We can also arrive at the ELBO via a completely different route.

Often, we actually want to approximate the posterior $P(Z \mid X)$. This is where our distribution $q(Z)$ comes in: we want to choose some distribution $q(Z)$ to approximate the true posterior $P(Z \mid X)$.

A really common way to measure how close two probability distributions are is the KL-Divergence. It is a non-negative measure of dissimilarity that equals zero exactly when the two distributions are identical.

So, in order to encode the idea that our approximation is close, we want to have some 'small' (whatever that means...) KL-Divergence. Writing out the definition, with $q$ as the averaging distribution:

$D_{KL}(q(Z) \mid\mid P(Z \mid X)) = -E_{q}\left[\log \frac{P(Z \mid X)}{q(Z)}\right] = -E_{q}[\log P(Z \mid X) - \log q(Z)]$

$-E_{q}[\log P(Z \mid X) - \log q(Z)] = -E_{q}\left[\log \frac{P(Z, X)}{P(X)} - \log q(Z)\right] = -E_{q}[\log P(Z, X) - \log P(X) - \log q(Z)]$

$\log P(X)$ is independent of $Z$, so its expectation under $q(Z)$ is just itself. We can thus split the last step into two terms:

$-E_{q}[\log P(Z, X) - \log P(X) - \log q(Z)] = -E_{q}[\log P(Z, X) - \log q(Z)] + E_{q}[\log P(X)]$

Therefore, we arrive at:

$D_{KL}(q(Z) \mid\mid P(Z \mid X)) = -E_{q}[\log P(Z, X) - \log q(Z)] + \log P(X)$

The expectation term is exactly the quantity we called $\mathcal{L}$ in the previous section, so rewriting:

$D_{KL}(q(Z) \mid\mid P(Z \mid X)) = -\mathcal{L} + \log P(X)$
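
Because the toy model from the recap is conjugate, its true posterior is known in closed form ($Z \mid X = x \sim N(x/2, 1/2)$), so we can verify this identity numerically. The sketch below reuses the same illustrative $q(Z)$ and checks that $D_{KL} + \mathcal{L}$ matches $\log P(X)$ up to Monte Carlo error:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
x_obs = 1.5

# Conjugate posterior for the toy model: Z | X = x ~ N(x / 2, 1 / 2).
post_mu, post_var = x_obs / 2.0, 0.5

# The same arbitrary q(Z) = N(0.5, 1.5^2) as before.
q_mu, q_sigma = 0.5, 1.5

# Closed-form KL between two univariate Gaussians, q || posterior.
kl = (np.log(np.sqrt(post_var) / q_sigma)
      + (q_sigma**2 + (q_mu - post_mu)**2) / (2.0 * post_var)
      - 0.5)

# Monte Carlo estimate of the ELBO, as in the previous sketch.
z = rng.normal(q_mu, q_sigma, size=100_000)
elbo = np.mean(norm.logpdf(x_obs, loc=z) + norm.logpdf(z)
               - norm.logpdf(z, loc=q_mu, scale=q_sigma))

log_p_x = norm.logpdf(x_obs, loc=0.0, scale=np.sqrt(2.0))
print(kl + elbo, log_p_x)  # should match up to Monte Carlo error
```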

The original goal was to find an approximation $q(Z)$ that is close to the true posterior, so we are varying $q$ in order to minimise the KL-Divergence between it and the posterior. The log evidence is independent of $q$, so no matter how we vary $q$, it is just a constant term. To minimise the LHS, then, we have to minimise $-\mathcal{L}$, which is equivalent to maximising $\mathcal{L}$.

Looking at the ELBO from this perspective, we see that the $q$ that maximises the ELBO also minimises the KL-Divergence between itself and the true posterior.
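
To make that concrete, here is a final sketch that sweeps the mean of $q$ (variance held fixed at an arbitrary value) and computes both quantities in closed form via the identity above; the $q$ that maximises the ELBO is exactly the one that minimises the KL-Divergence:

```python
import numpy as np
from scipy.stats import norm

x_obs = 1.5
post_mu, post_var = x_obs / 2.0, 0.5
log_p_x = norm.logpdf(x_obs, loc=0.0, scale=np.sqrt(2.0))

# Sweep the mean of q(Z) = N(mu, 1.5^2) over a grid, using the
# identity ELBO = log P(X) - KL to get both curves in closed form.
q_sigma = 1.5
mus = np.linspace(-2.0, 3.0, 501)
kl = (np.log(np.sqrt(post_var) / q_sigma)
      + (q_sigma**2 + (mus - post_mu)**2) / (2.0 * post_var)
      - 0.5)
elbo = log_p_x - kl

# Both criteria pick the same mu: the true posterior mean.
print(mus[np.argmax(elbo)], mus[np.argmin(kl)], post_mu)
```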

Summary

From the above two angles, we can see the roles that the component parts of the ELBO play. The ELBO itself is a lower bound on the log evidence, whilst the distribution $q$ serves as the approximation of the true posterior.