# Deriving the Evidence Lower Bound

## Recap

In the previous post, I talked about different types of bounds we could have, and why we need to use them. As a quick recap, we are looking to compute the marginalization of a joint distribution in a general setting:

$P(X) = \int_{Z} P(X, Z) dZ$

Previously, I showed how this can be an incredibly difficult thing to compute using standard methods, such as numerical integration, because of the curse of dimensionality.

## Evidence

The quantity we are going to derive, in a few ways, is the ELBO, or Evidence Lower BOund. As the name suggests, it is a bound on the so-called model evidence (also termed the probability of the data), $P(X)$; more precisely, as we will see, it bounds the logarithm of the evidence.

### Via Jensen's Inequality

Let us start with the premise that we wish to find $\log P(X)$.

$\log P(X) = \log \left[\int_{Z} P(X,Z) \, dZ\right]$

Introducing another distribution $q(Z)$ on $Z$ (non-zero wherever $P(X,Z)$ is non-zero), we can rewrite this as:

$\log P(X) = \log \left[\int_{Z} P(X,Z)\frac{q(Z)}{q(Z)} \, dZ\right] = \log \ E_{q}\left[\frac{P(X,Z)}{q(Z)}\right]$
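The rewrite above says that $P(X)$ is the expectation of the ratio $P(X,Z)/q(Z)$ under $q$, which can be checked by sampling. Below is a minimal sketch under an assumed toy model that is not from this post: $Z \sim N(0,1)$ and $X \mid Z \sim N(Z,1)$, for which the evidence is known in closed form as $N(X; 0, 2)$. The function names, the choice $q = N(0, 2)$, and the observed value are all illustrative assumptions.

```python
import math
import random


def normal_pdf(x, mean, var):
    """Density of a univariate normal with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def joint(x, z):
    """Assumed toy joint P(X, Z) = P(Z) P(X | Z) with Z ~ N(0, 1), X | Z ~ N(Z, 1)."""
    return normal_pdf(z, 0.0, 1.0) * normal_pdf(x, z, 1.0)


def evidence_estimate(x, q_mean, q_var, n_samples=100_000, seed=0):
    """Monte Carlo estimate of E_q[P(X, Z) / q(Z)] with q = N(q_mean, q_var)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        z = rng.gauss(q_mean, math.sqrt(q_var))  # draw Z from q
        total += joint(x, z) / normal_pdf(z, q_mean, q_var)
    return total / n_samples


x_obs = 1.0  # hypothetical observation
estimate = evidence_estimate(x_obs, q_mean=0.0, q_var=2.0)
exact = normal_pdf(x_obs, 0.0, 2.0)  # closed-form evidence for this toy model
print(f"estimate={estimate:.4f} exact={exact:.4f}")
```

In this toy model the estimate should land close to the closed-form value. How close depends on how well $q$ covers the region where $P(X,Z)$ is large.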

Because $\log$ is concave, Jensen's Inequality tells us that $\log \ E_{q}\left[\frac{P(X,Z)}{q(Z)}\right] \geq \ E_{q}\left[\log \frac{P(X,Z)}{q(Z)}\right]$

This gives us:

$\log \ P(X) = \log \ E_{q}\left[\frac{P(X,Z)}{q(Z)}\right] \geq \ E_{q}\left[\log \frac{P(X,Z)}{q(Z)}\right]$

$\log \ P(X) \geq \ E_{q}\left[\log \frac{P(X,Z)}{q(Z)}\right] = \mathcal{L}$

From this, the role of the ELBO is clear: it is a lower bound on the log of the "evidence", $\log P(X)$, so we can use it as an approximation to the log evidence.
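To see the bound in action, here is a small sketch with a discrete latent variable, where every expectation is a finite sum and can be computed exactly. The joint values and the candidate distributions are made up for illustration; the last candidate is deliberately chosen to equal the normalised joint, for which the bound turns out to be tight.

```python
import math

# Hypothetical values of P(X = x_obs, Z = k) for three latent states k.
joint = [0.10, 0.25, 0.05]
log_evidence = math.log(sum(joint))  # log P(X), summing Z out exactly


def elbo(q, joint):
    """E_q[log P(X, Z) - log q(Z)] for a discrete q (terms with q_k = 0 contribute 0)."""
    return sum(qk * (math.log(pk) - math.log(qk)) for qk, pk in zip(q, joint) if qk > 0)


candidates = [
    [1 / 3, 1 / 3, 1 / 3],   # uniform guess
    [0.8, 0.1, 0.1],         # poor guess
    [0.25, 0.625, 0.125],    # joint / evidence, i.e. the normalised joint
]

# Gap between the log evidence and the bound for each candidate q.
gaps = [log_evidence - elbo(q, joint) for q in candidates]
for q, gap in zip(candidates, gaps):
    print(q, f"gap={gap:.6f}")
```

Every gap is non-negative, as Jensen's Inequality promises, and the gap shrinks as $q$ gets closer to the normalised joint.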

### Via KL-Divergence

We can also get to the ELBO from a completely different route.

Often, we actually want to approximate the posterior $P(Z \mid X)$. This is where our distribution $q(Z)$ comes in: we want to choose some distribution $q(Z)$ to approximate the true posterior $P(Z \mid X)$.

A really common way to measure how different two probability distributions are is the KL-divergence. It is non-negative, is 0 only when the two distributions are identical, and is not symmetric in its arguments, so it is a measure of dissimilarity rather than a true distance.
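As a quick illustration of these properties, here is a minimal sketch of the KL-divergence for discrete distributions. The two distributions are arbitrary examples, and the function assumes that `p` is non-zero wherever `q` is non-zero.

```python
import math


def kl_divergence(q, p):
    """D_KL(q || p) = sum_k q_k log(q_k / p_k), assuming p_k > 0 wherever q_k > 0."""
    return sum(qk * math.log(qk / pk) for qk, pk in zip(q, p) if qk > 0)


# Two arbitrary example distributions over three states.
q = [0.5, 0.4, 0.1]
p = [0.2, 0.3, 0.5]

kl_qp = kl_divergence(q, p)  # divergence of q from p
kl_pq = kl_divergence(p, q)  # arguments swapped: a different number in general
kl_qq = kl_divergence(q, q)  # identical distributions

print(kl_qp, kl_pq, kl_qq)
```

Both cross terms are positive and differ from one another, while the divergence of a distribution from itself is zero.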

So, in order to encode the idea that our approximation is close, we want the KL-Divergence to be 'small' (whatever that means...).

$D_{KL}(q(Z)\mid\mid P(Z\mid X)) = -E_{q}\left[\log\frac{P(Z\mid X)}{q(Z)}\right] = -E_{q}[\log \ P(Z\mid X) - \log \ q(Z)]$

$-E_{q}[\log \ P(Z\mid X) - \log \ q(Z)] = -E_{q}\left[\log \frac{P(Z, X)}{P(X)} - \log \ q(Z)\right] = -E_{q}[\log \ P(Z, X) \ - \log \ P(X) - \log \ q(Z)]$

$\log P(X)$ is independent of $Z$, so its expectation under $q(Z)$ is itself. We can thus split the last step into two terms:

$-E_{q}[\log \ P(Z, X) \ - \log \ P(X) - \log \ q(Z)] = -E_{q}[\log \ P(Z, X) \ - \log \ q(Z)] + E_{q}[\log \ P(X)]$

Therefore, we arrive at:

$D_{KL}(q(Z)\mid\mid P(Z\mid X))= -E_{q}[\log \ P(Z, X) \ - \log \ q(Z)] + \log \ P(X)$

The expectation term is the same as the term we called $\mathcal{L}$ in the previous section, so rewriting:

$D_{KL}(q(Z)\mid\mid P(Z\mid X))= -\mathcal{L} + \log \ P(X)$
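The identity above can be checked numerically on a small discrete example, where both sides are finite sums. This is only a sanity check under made-up numbers: the joint values and the choice of $q$ are hypothetical.

```python
import math

# Hypothetical values of P(X = x_obs, Z = k) for three latent states k.
joint = [0.10, 0.25, 0.05]
evidence = sum(joint)                       # P(X)
posterior = [p / evidence for p in joint]   # P(Z | X) = P(X, Z) / P(X)


def elbo(q):
    """L = E_q[log P(Z, X) - log q(Z)] for a discrete q."""
    return sum(qk * (math.log(pk) - math.log(qk)) for qk, pk in zip(q, joint) if qk > 0)


def kl(q, p):
    """D_KL(q || p) for discrete distributions with p_k > 0 wherever q_k > 0."""
    return sum(qk * math.log(qk / pk) for qk, pk in zip(q, p) if qk > 0)


q = [0.6, 0.3, 0.1]  # an arbitrary approximating distribution

lhs = kl(q, posterior)                # D_KL(q(Z) || P(Z | X))
rhs = -elbo(q) + math.log(evidence)   # -L + log P(X)
print(lhs, rhs)
```

The two printed values should agree up to floating-point rounding for any valid choice of `q`.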

The original goal was to find an approximation $q(Z)$ that is close to the true posterior, so we vary $q$ to minimise the KL-Divergence between it and the posterior. The log evidence is independent of $q$, so however we vary $q$ it is just a constant term. To minimise the LHS, we therefore have to minimise $-\mathcal{L}$, which is equivalent to maximising $\mathcal{L}$. Since the KL-Divergence is non-negative, this identity also gives $\log P(X) \geq \mathcal{L}$, with equality exactly when $q(Z) = P(Z \mid X)$.

Looking at the ELBO from this perspective, we see that the $q$ that maximises the ELBO also minimises the KL-Divergence between itself and the true posterior.
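A minimal sketch of this idea, assuming a binary latent variable so that $q$ is described by the single number $q(Z=1)$: we scan a grid of candidate values and keep the one with the largest ELBO. The joint values and the grid resolution are illustrative assumptions, and a brute-force grid search stands in for whatever optimiser one would use in practice.

```python
import math

# Hypothetical values of P(X = x_obs, Z = 0) and P(X = x_obs, Z = 1).
joint = [0.12, 0.28]
evidence = sum(joint)
true_posterior_z1 = joint[1] / evidence  # P(Z = 1 | X), available here by enumeration


def elbo(q1):
    """ELBO for the distribution q with q(Z = 1) = q1 and q(Z = 0) = 1 - q1."""
    q = [1.0 - q1, q1]
    return sum(qk * (math.log(pk) - math.log(qk)) for qk, pk in zip(q, joint))


# Candidate values of q(Z = 1), excluding the endpoints 0 and 1.
grid = [i / 1000 for i in range(1, 1000)]
best_q1 = max(grid, key=elbo)   # the q that maximises the ELBO on the grid
best_elbo = elbo(best_q1)

print(f"best q(Z=1)={best_q1:.3f}, true posterior={true_posterior_z1:.3f}")
print(f"best ELBO={best_elbo:.6f}, log evidence={math.log(evidence):.6f}")
```

On this toy problem the ELBO-maximising $q$ matches the true posterior, and the ELBO at that point matches the log evidence.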

## Summary

From the above two angles, we can see the roles that the component parts of the ELBO play. The ELBO itself is a lower bound on the log evidence, whilst the distribution $q$ serves as the approximation of the true posterior.