Renyi Alpha Divergence

In many areas of machine learning, we use the KL-divergence to measure the 'distance' between probability distributions. However, the KL-divergence is a special case of a wider family of $\alpha$-divergences. One of particular interest in the VI literature is the Renyi $\alpha$ divergence, and this post is a short note on this family. This post is one of a series, and is mainly theory, based on Renyi Divergence Variational Inference (submitted to NIPS 2016).

Alpha divergence recap

The Renyi alpha divergence is given by:

$D_{\alpha}[q\mid \mid p] =\frac{1}{\alpha-1} log \int q(x)^{\alpha}p(x)^{1-\alpha}dx$

$ = \frac{1}{\alpha-1} log \int q(x) q(x)^{\alpha - 1}p(x)^{1-\alpha}dx$

$ = \frac{1}{\alpha-1} log \int q(x) (\frac{p(x)}{q(x)})^{1-\alpha}dx$

$ = \frac{1}{\alpha-1} log E_{q}[(\frac{p(x)}{q(x)})^{1-\alpha}]$
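To make this concrete, here is a small numerical sketch (my own illustrative example, not taken from the paper): it evaluates $D_{\alpha}[q \mid \mid p]$ between two one-dimensional Gaussians by discretising the integral on a grid.

```python
# A minimal numerical sketch of the Renyi alpha divergence between two 1-D
# Gaussians, approximating the integral on a grid. The choice of q, p and the
# grid are illustrative assumptions, not taken from the paper.
import numpy as np
from scipy.stats import norm

def renyi_divergence(q, p, alpha, grid):
    """D_alpha[q || p] = 1/(alpha - 1) * log( integral of q^alpha * p^(1 - alpha) )."""
    dx = grid[1] - grid[0]
    integrand = q.pdf(grid) ** alpha * p.pdf(grid) ** (1.0 - alpha)
    return np.log(np.sum(integrand) * dx) / (alpha - 1.0)

grid = np.linspace(-10.0, 10.0, 20001)
q = norm(0.0, 1.0)      # q(x) = N(0, 1)
p = norm(1.0, 1.5)      # p(x) = N(1, 1.5^2)

for alpha in [0.0, 0.5, 2.0]:   # alpha = 1 is excluded: the formula is undefined there and is handled as a limit
    print(alpha, renyi_divergence(q, p, alpha, grid))
```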

Why Renyi Alpha divergence?

The presence of the $\alpha$ parameter allows us to tailor the strength of the bound that we use. If we set $\alpha$ to 0, we get:

$D_{0}[q\mid \mid p] =-log \int p(x)dx = 0$

This means the bound becomes exact (maximising it is equivalent to maximising the marginal likelihood directly), provided $q$ is non-zero wherever $p$ is. If we instead let $\alpha$ tend to 1, we recover the KL divergence. By choosing some other value, we can generate different behaviour and manipulate the type and tightness of the bound. Many of these points will become clear later in this post, or in subsequent posts.

If we take the limit as $\alpha \rightarrow 1$, we recover the KL-divergence, which we now prove.

$\lim_{\alpha\to 1} \frac{1}{\alpha-1} log \int q(x)^{\alpha}p(x)^{1-\alpha}dx$

However, if we take this limit directly, the integral becomes $\int q(x)dx = 1$, so the log term becomes $log(1) = 0$, while the $\alpha - 1$ term in the denominator also goes to 0.

Viewing the expression as the ratio $\frac{log \int q(x)^{\alpha}p(x)^{1-\alpha}dx}{\alpha - 1}$, we therefore have a limit of the indeterminate form $\frac{0}{0}$, which can be evaluated using L'Hôpital's rule.

In short, we differentiate the numerator with respect to the limiting variable, $\alpha$, do the same to the denominator, and then take the limit of the ratio of the derivatives instead.

The derivative of the denominator is easy: it is simply 1. The derivative of the numerator takes a little more work:

$\frac{d}{d\alpha} log \int q(x)^{\alpha}p(x)^{1-\alpha}dx$

$= \frac{1}{\int q(x)^{\alpha}p(x)^{1-\alpha}dx}\frac{d}{d\alpha} \int q(x)^{\alpha}p(x)^{1-\alpha}dx$

The remaining step is to compute the derivative of the integral: by Leibniz's rule we can differentiate under the integral sign, writing the integrand in exponential form to make the $\alpha$ dependence explicit.

$= \frac{1}{\int q(x)^{\alpha}p(x)^{1-\alpha}dx} \int \frac{d}{d\alpha} e^{\alpha \ log q(x)} e^{(1-\alpha) \ log p(x)} dx$

$= \frac{1}{\int q(x)^{\alpha}p(x)^{1-\alpha}dx} \int p(x)^{1-\alpha}q(x)^{\alpha}log \ q(x) - q(x)^{\alpha}p(x)^{1-\alpha} log \ p(x) \ dx$

We can now take the limit of this ratio of derivatives (recalling that the derivative of the denominator is 1):

$\lim_{\alpha \to 1} \frac{1}{\int q(x)^{\alpha}p(x)^{1-\alpha}dx} \int p(x)^{1-\alpha}q(x)^{\alpha}log \ q(x) - q(x)^{\alpha}p(x)^{1-\alpha} log \ p(x) \ dx$

$= \frac{1}{\int q(x)dx} \int q(x)log \ q(x) - q(x) log \ p(x) dx$

$= \int q(x)log \ q(x) - q(x) log \ p(x) dx \ \ = \ KL(q \mid \mid p)$
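As a quick numerical sanity check of this limit (again an illustrative sketch with Gaussians of my own choosing), the value computed from the general Renyi formula approaches a directly computed KL as $\alpha$ approaches 1:

```python
# Numerical check that D_alpha[q || p] -> KL(q || p) as alpha -> 1, using two
# illustrative 1-D Gaussians and a grid approximation of the integrals.
import numpy as np
from scipy.stats import norm

grid = np.linspace(-12.0, 12.0, 40001)
dx = grid[1] - grid[0]
qx = norm(0.0, 1.0).pdf(grid)
px = norm(1.0, 1.5).pdf(grid)

kl = np.sum(qx * (np.log(qx) - np.log(px))) * dx   # KL(q || p) on the same grid

for alpha in [0.9, 0.99, 0.999]:
    d_alpha = np.log(np.sum(qx ** alpha * px ** (1.0 - alpha)) * dx) / (alpha - 1.0)
    print(f"alpha={alpha}: D_alpha={d_alpha:.4f}, KL={kl:.4f}")
```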

Deriving the Renyi Objective

The previous section provides some guidance on how to derive a Renyi-based bound on the marginal likelihood. Recalling the derivation of the ELBO, we proceed along a similar path, except using the Renyi divergence rather than the KL. Using the expectation form derived above:

$D_{\alpha}(q(z)\mid \mid p(z \mid x)) = \frac{1}{\alpha -1} log \ E_{q}[(\frac{p(z \mid x)}{q(z)})^{1-\alpha}]$

Expanding the posterior over $z$, $p(z \mid x)$, using Bayes' theorem, we get:

$p(z \mid x) = \frac{p(z, x)}{p(x)}$

$D_{\alpha}(q(z)\mid \mid p(z \mid x)) = \frac{1}{\alpha -1} log \ E_{q}[(\frac{p(z, x)}{p(x)q(z)})^{1-\alpha}]$

Rearranging slightly, and then separating terms:

$D_{\alpha}(q(z)\mid \mid p(z \mid x)) = \frac{1}{\alpha -1} log \ E_{q}[(\frac{p(z, x)}{q(z)})^{1-\alpha} \ \frac{1}{p(x)^{1-\alpha}}] =\frac{1}{\alpha -1} log \ E_{q}[(\frac{p(z, x)}{q(z)})^{1-\alpha} p(x)^{\alpha -1}] $

In the last expression, the marginal likelihood is constant with respect to the expectation, so we can move it outside:

$=\frac{1}{\alpha -1} log \ ( E_{q}[(\frac{p(z, x)}{q(z)})^{1-\alpha}] p(x)^{\alpha -1}) $

$=\frac{1}{\alpha -1} (log \ E_{q}[(\frac{p(z, x)}{q(z)})^{1-\alpha}] \ + \ log \ p(x)^{\alpha -1}) $

$=\frac{1}{\alpha -1} log \ E_{q}[(\frac{p(z, x)}{q(z)})^{1-\alpha}] \ + \ log \ p(x) $

We now have a familiar form: the first term has the same form as a Renyi divergence, but between the approximation and the (unnormalised) joint, and the second term is the log marginal likelihood.

$D_{\alpha}(q(z)\mid \mid p(z \mid x)) =\frac{1}{\alpha -1} log \ E_{q}[(\frac{p(z, x)}{q(z)})^{1-\alpha}] \ + \ log \ p(x) $

$D_{\alpha}(q(z)\mid \mid p(z \mid x)) =- \frac{1}{1 - \alpha} log \ E_{q}[(\frac{p(z, x)}{q(z)})^{1-\alpha}] \ + \ log \ p(x) $

$D_{\alpha}(q(z)\mid \mid p(z \mid x)) =- \mathcal{L_\alpha} \ + \ log \ p(x) $

where $\mathcal{L_\alpha} = \frac{1}{1 - \alpha} log \ E_{q}[(\frac{p(z, x)}{q(z)})^{1-\alpha}] $

Here we have a decomposition equivalent to the one we ended up with when we derived the ELBO. Since $log \ p(x)$ does not depend on $q$, minimising the Renyi divergence between the approximate and true posterior is equivalent to maximising the Renyi bound, $\mathcal{L_{\alpha}}$.

As mentioned earlier, if we choose $\alpha$ to be 0, it is easy to show that $D_{\alpha}(q(z)\mid \mid p(z \mid x))$ becomes 0, and $\mathcal{L_{0}}$ simplifies to $log \ E_{q}[\frac{p(z, x)}{q(z)}] = log \int p(z, x) dz = log \ p(x)$: nothing more than integrating $z$ out of the joint, hence the bound is exact. As the Renyi divergence is continuous in $\alpha$, by picking a value between 0 and 1 we get something between the exact marginal likelihood and the ELBO solution.
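We can see this interpolation numerically with a toy conjugate Gaussian model (the model, the observation and the deliberately imperfect $q$ below are my own illustrative choices, not from the paper): prior $p(z) = N(0, 1)$, likelihood $p(x \mid z) = N(x; z, 1)$, so $log \ p(x)$ is available in closed form and we can compare Monte Carlo estimates of $\mathcal{L_{\alpha}}$ against it.

```python
# A toy conjugate Gaussian model where log p(x) is known exactly, so we can see
# how the Renyi bound L_alpha interpolates between the exact log marginal
# likelihood (alpha = 0) and the ELBO (alpha -> 1). The model, observation and
# approximate posterior q are illustrative assumptions, not taken from the paper.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
x = 2.0                                          # single observation
log_px = norm(0.0, np.sqrt(2.0)).logpdf(x)       # exact log p(x) for this model

q = norm(0.5, 0.9)                               # a deliberately imperfect q(z)
z = q.rvs(size=200_000, random_state=rng)
log_w = norm(0.0, 1.0).logpdf(z) + norm(z, 1.0).logpdf(x) - q.logpdf(z)  # log p(z,x) - log q(z)

print("exact log p(x):   ", log_px)
print("ELBO (alpha -> 1):", np.mean(log_w))
for alpha in [0.0, 0.5]:
    L = np.log(np.mean(np.exp((1.0 - alpha) * log_w))) / (1.0 - alpha)
    print(f"L_alpha (alpha={alpha}):", L)
```

With enough samples, $\mathcal{L_{0}}$ should sit very close to the exact $log \ p(x)$, the ELBO should sit below it, and $\alpha = 0.5$ should land somewhere in between.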

Practicalities of using the Renyi Alpha divergence

When we use the KL divergence, we approximate the expectations by Monte Carlo, as covered in a previous post. For the KL divergence, we are lucky that this happens to give an unbiased estimate of the true expectation, so we can use it in our optimisation procedure and be confident about convergence.

Let's do the same with the Renyi bound:

$\hat{\mathcal{L_\alpha}} = \frac{1}{1 - \alpha} log \ \frac{1}{N} \sum_{n}^{N} [(\frac{p(z_{n}, x)}{q(z_{n})})^{1-\alpha}]$
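As a concrete sketch (reusing the same illustrative toy Gaussian model and $q$ as above, my own assumptions rather than the paper's), the estimator might look like this:

```python
# A minimal implementation of the estimator hat{L}_alpha above, for the same toy
# conjugate Gaussian model (N(0,1) prior, N(x; z, 1) likelihood, q = N(0.5, 0.9)).
import numpy as np
from scipy.stats import norm

def L_alpha_hat(alpha, n_samples, rng, x=2.0):
    q = norm(0.5, 0.9)
    z = q.rvs(size=n_samples, random_state=rng)
    log_w = norm(0.0, 1.0).logpdf(z) + norm(z, 1.0).logpdf(x) - q.logpdf(z)
    if np.isclose(alpha, 1.0):            # the alpha -> 1 limit is the usual ELBO estimator
        return np.mean(log_w)
    return np.log(np.mean(np.exp((1.0 - alpha) * log_w))) / (1.0 - alpha)

rng = np.random.default_rng(1)
print(L_alpha_hat(alpha=0.5, n_samples=1_000, rng=rng))
```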

For this estimator to be unbiased, its expectation over the samples $z_{n} \sim q$ should equal the true value, $\mathcal{L_{\alpha}}$.

$E_{q}[\hat{\mathcal{L_\alpha}}] = E_{q}[ \frac{1}{1 - \alpha} log \ \frac{1}{N} \sum_{n}^{N} [(\frac{p(z_{n}, x)}{q(z_{n})})^{1-\alpha}]]$

$= \frac{1}{1 - \alpha} E_{q}[ log \ \frac{1}{N} \sum_{n}^{N} [(\frac{p(z_{n}, x)}{q(z_{n})})^{1-\alpha}]]$

This is somewhat problematic, though, as the log complicates matters. By Jensen's inequality (recalling that the log is concave, so $E[log \ X] \le log \ E[X]$, and noting that $\frac{1}{1-\alpha}$ is positive for $\alpha < 1$):

$\frac{1}{1 - \alpha} E_{q}[ log \ \frac{1}{N} \sum_{n}^{N} [(\frac{p(z_{n}, x)}{q(z_{n})})^{1-\alpha}]] \le \frac{1}{1 - \alpha} log \ \frac{1}{N} E_{q}[ \sum_{n}^{N} [(\frac{p(z_{n}, x)}{q(z_{n})})^{1-\alpha}]]$

Since the $z_{n}$ are i.i.d. draws from $q$, the average of expectations on the right-hand side collapses to a single expectation, which is exactly $\mathcal{L_{\alpha}}$:

$\frac{1}{1 - \alpha} E_{q}[ log \ \frac{1}{N} \sum_{n}^{N} [(\frac{p(z_{n}, x)}{q(z_{n})})^{1-\alpha}]] \le \frac{1}{1 - \alpha} log \ E_{q}[(\frac{p(z, x)}{q(z)})^{1-\alpha}] \ = \ \mathcal{L_{\alpha}}$

This is not great, as we are trying to maximise $\mathcal{L_{\alpha}}$: in expectation our estimate will be an underestimate of the bound (for $\alpha < 1$), making the bound we actually optimise less tight.
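We can check this empirically (an illustrative experiment on the same toy model, not a result from the paper): averaging many independent estimates $\hat{\mathcal{L_\alpha}}$ for small $N$ and comparing against a long-run reference value, the average should sit below the reference, with the gap shrinking as $N$ grows.

```python
# Empirical check of the downward bias: average many independent Monte Carlo
# estimates of L_alpha for small N and compare against a long-run reference.
# Same illustrative toy Gaussian model and q as above.
import numpy as np
from scipy.stats import norm

x, alpha = 2.0, 0.5
q = norm(0.5, 0.9)
rng = np.random.default_rng(2)

def log_w(z):
    """log p(z, x) - log q(z) for the toy model."""
    return norm(0.0, 1.0).logpdf(z) + norm(z, 1.0).logpdf(x) - q.logpdf(z)

def L_hat(z):
    """hat{L}_alpha computed from samples z along the last axis."""
    return np.log(np.mean(np.exp((1.0 - alpha) * log_w(z)), axis=-1)) / (1.0 - alpha)

reference = L_hat(q.rvs(size=2_000_000, random_state=rng))   # stand-in for the true bound

for n in [1, 5, 25]:
    z = q.rvs(size=(10_000, n), random_state=rng)             # 10,000 independent N-sample estimates
    print(n, np.mean(L_hat(z)) - reference)                   # typically negative, shrinking towards 0 as N grows
```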

Summary

In this post, I have introduced the Renyi alpha divergence, and we have seen some of its important properties. We have shown that the KL-divergence is a special case, shown how we can derive a Renyi lower bound on the log marginal likelihood, and shown that a Monte Carlo approximation of this bound is biased to underestimate it.

In the next post, I will show some more properties of the Renyi divergence bound, as well as use it in a practical problem or two!