# Reparameterization Trick

## Introduction

In the previous post, we implemented our first practical variational algorithm in TensorFlow. However, as briefly mentioned there, the Monte Carlo estimate of the ELBO relies on gradients flowing correctly through the sampling distribution. In this post we discuss that point in detail.

## Why must sampling be differentiable?

Recall the lower bound:

$\mathcal{L} = \int_{Z} q(Z) \log \frac{P(Z, X)}{q(Z)} \, dZ$

Our strategy when pursuing the Monte Carlo estimate was to approximate the integral by a sum over samples drawn from $q(Z)$.

The important thing to remember is that when we take the gradient of the ELBO with respect to the parameters, it is easy to overlook the effect the parameters have on the sampling distribution itself.

As a simple example, consider sampling from a uniform distribution $U[a, b]$. If we change the values of $a$ and $b$, this obviously changes our samples! But if we generate the samples outside of TensorFlow, this gradient information is lost.

We can see this requirement if we compute the gradient of the expectation of a function of $x$, $a$ and $b$ with respect to the parameters. Let's take the example of the continuous uniform parameterized by $a$ and $b$, with density $p(x) = \frac{1}{b-a}$:

$m = \int_{a}^{b} f(x, a, b) \, p(x) \, dx$

$f(x, a, b) = x + a + b$

$m =\frac{1}{b-a} \int_{a}^{b} (x+a+b) \, dx$

$= \frac{1}{b-a} \left( \left[\tfrac{1}{2} x^{2}\right]^{b}_{a} + [ax]^{b}_{a} + [bx]^{b}_{a} \right) = \frac{b+a}{2} + \frac{a(b-a)}{b-a} + \frac{b(b-a)}{b-a} = \frac{b+a}{2} + a + b$

$ \frac{d}{da} m = \frac{3}{2}$

$ \frac{d}{db} m = \frac{3}{2}$

That is the analytical solution, but we could also approximate the expectation by Monte Carlo:

$\frac{1}{N}\sum_{i}^{N} (x_{i}+a+b)$

What we would like is for the gradient of this MC approximation to be close to the analytical gradient.

```
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('ggplot')
tf.__version__
```

```
graph = tf.Graph()
with graph.as_default():
    x = tf.placeholder(shape=(None,), dtype=tf.float32)
    a = tf.Variable(1, dtype=tf.float32)
    b = tf.Variable(3, dtype=tf.float32)
    mc_mean = tf.reduce_mean(x + a + b)
    grads = tf.gradients(mc_mean, [a, b])
    init_op = tf.global_variables_initializer()
```

```
sess = tf.InteractiveSession(graph=graph)
sess.run(init_op)
```

```
X = np.random.uniform(low=1, high=3, size=(1000,))
feed_dict = {x:X}
print(sess.run([grads], feed_dict=feed_dict))
```

Hmmm... that's not right: both gradients come out as $1.0$ rather than the analytical $\frac{3}{2}$. The reason is that the sampling operation also depends on the parameters $a$ and $b$, but it happens outside of TensorFlow, so no gradient flows from the samples into the mc_mean variable.

In order to deal with this, we can reparameterize the sampling procedure. If we make random draws from $U[0,1]$, we can transform them into draws from $U[a,b]$ via:

$a + (b-a)x$

where $x \sim U[0,1]$.

```
graph = tf.Graph()
with graph.as_default():
    x = tf.placeholder(shape=(None,), dtype=tf.float32)
    a = tf.Variable(1, dtype=tf.float32)
    b = tf.Variable(3, dtype=tf.float32)
    z = a + (b - a) * x
    mc_mean = tf.reduce_mean(z + a + b)
    grads = tf.gradients(mc_mean, [a, b])
    init_op = tf.global_variables_initializer()
```

```
sess = tf.InteractiveSession(graph=graph)
sess.run(init_op)
```

```
X = np.random.uniform(low=0, high=1, size=(1000,))
feed_dict = {x:X}
print(sess.run([grads], feed_dict=feed_dict))
```
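This time both gradient estimates come out close to the analytical value of $\frac{3}{2}$. For intuition, the same comparison can be sketched in plain NumPy, differentiating the reparameterized estimator by hand (a minimal illustration; the variable names are my own):

```python
import numpy as np

rng = np.random.default_rng(0)
u = rng.uniform(0.0, 1.0, size=100_000)  # base draws, free of a and b

# Without reparameterization the samples are treated as constants, so
# d/da mean(x + a + b) = 1 and d/db = 1, missing the analytical 3/2.
naive_grad_a, naive_grad_b = 1.0, 1.0

# With x = a + (b - a) * u we have f = x + a + b = 2a + b + (b - a) * u,
# so differentiating through the transform gives:
grad_a = np.mean(2.0 - u)  # E[2 - u] = 3/2
grad_b = np.mean(1.0 + u)  # E[1 + u] = 3/2
print(naive_grad_a, naive_grad_b, grad_a, grad_b)
```

The naive estimator is not wrong about the function it was given; it simply never sees the dependence of the samples on the parameters.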

## Theory

The idea of reparameterization was largely popularized by the 2014 paper Auto-Encoding Variational Bayes (Kingma & Welling, 2014). In it, the authors lay out three ways to reparameterize probability distributions so that samples carry the appropriate gradient information.

If the distribution we wish to sample from has a tractable inverse CDF, we can use it to transform a $U[0,1]$ sample into a sample from the target distribution, with the gradient properly accounted for.

The approach used for this post, and also for the Gaussian case, falls under the location-scale family: a large class of distributions have a 'standard' form, draws from which can be transformed into draws from the same family with a different scale and/or location.
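For the Gaussian this is the familiar $z = \mu + \sigma \epsilon$ with $\epsilon \sim N(0, 1)$; a minimal NumPy sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 2.0, 0.5

# Parameter-free base draws from the 'standard' member of the family.
eps = rng.standard_normal(100_000)

# Location-scale transform: z is normal with mean mu and std sigma,
# and an explicit, differentiable function of mu and sigma.
z = mu + sigma * eps
print(z.mean(), z.std())  # approximately mu and sigma
```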

The third option is transformation and/or composition of draws from other distributions. For example, we can exponentiate random normal draws to get a log-normal, where the normal draws themselves come from the location-scale reparameterization.
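The log-normal example can be sketched by composing the two transforms (a NumPy illustration; the closed-form mean $e^{\mu + \sigma^2/2}$ used as a check is a standard identity):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 0.0, 0.25
eps = rng.standard_normal(200_000)

# Compose two reparameterized steps: location-scale gives a normal
# draw, and exponentiating it gives a log-normal draw.
z = np.exp(mu + sigma * eps)

# Sample mean versus the closed-form log-normal mean exp(mu + sigma^2/2).
print(z.mean(), np.exp(mu + sigma**2 / 2))
```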

### Example of an inverse CDF

A simple example of the inverse-CDF method is the Rayleigh distribution, whose CDF is:

$C = 1 - \exp\left(-\frac{x^2}{2\sigma^{2}}\right)$

We can invert this as follows:

$1-C = \exp\left(-\frac{x^2}{2\sigma^{2}}\right)$

$\ln (1-C) = -\frac{x^2}{2\sigma^{2}}$

$x = \sigma\sqrt{-2 \ln (1-C)}$

So we can generate a value for $C$ from a standard continuous uniform and then transform it into a Rayleigh-distributed variable, with the dependence on the parameters properly encoded.

This method works for a wide range of distributions and is an alternative when a location-scale type transform isn't possible.
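As a check, here is the inversion in NumPy, using $x = \sigma\sqrt{-2\ln(1-C)}$ with $C \sim U[0,1]$ (a sketch; the comparison against the known Rayleigh mean $\sigma\sqrt{\pi/2}$ is just a sanity check):

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 2.0
c = rng.uniform(0.0, 1.0, size=200_000)

# Inverse-CDF transform: a U[0,1] draw becomes a Rayleigh draw,
# with the dependence on sigma explicit and differentiable.
x = sigma * np.sqrt(-2.0 * np.log(1.0 - c))

# Sample mean versus the known Rayleigh mean sigma * sqrt(pi / 2).
print(x.mean(), sigma * np.sqrt(np.pi / 2))
```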

## Checking the status of distributions in TensorFlow

With third-party libraries, it is sometimes not obvious whether gradient information is allowed to flow across the sampling operation. Both behaviours are useful at times, but not all distributions can be framed in a reparameterized way.

TensorFlow has the flag FULLY_REPARAMETERIZED for distributions whose samples let gradients flow, and NOT_REPARAMETERIZED for those that don't.

We can check this within the TensorFlow graph by doing:

```
distribution_instance.reparameterization_type == tf.contrib.distributions.FULLY_REPARAMETERIZED
```

## Summary

In this post, we have demonstrated why it is important to ensure that gradients are properly defined across a sampling operation, which is particularly relevant when computing gradients of Monte Carlo estimates in automatic differentiation frameworks.