Using the ELBO in practice

Introduction

In the previous post, we derived the ELBO in practice. We saw that by maximising the lower bound, given by;

$E_{q}[log \ P(X,Z) - log \ q(Z)] = \mathcal{L}$

we can approximate the posterior $P(Z\mid X)$ with $q(Z)$.

For this post, I will go through how we can use Tensorflow to to learn an approximation to the true posterior.

Computing the ELBO in practice

What we will concerned with in this post is how to compute the ELBO in practice, without having to resort to special case simplifications. This approach forms part of a set of approaches termed 'Black Box' Variational Inference.

$\mathcal{L} = E_{q}[log \ P(X,Z) - log \ q(Z)] $

The advantage of this formulation is that as we want a tractable approximation it should be simple to sample from $q$. This means we are able to approximate the Expecation as;

$E_{q}[log \ P(X,Z) - log \ q(Z)] \approx \frac{1}{M}\sum_{i}^{M} [log \ P(X, Z_{i}) - log \ q(Z_{i})]$

Where $Z_{i}$ is sampled under $q$.

We can split these terms out a little to make things a little easier;

$ = \frac{1}{M}\sum_{i}^{M} log \ P(X \mid Z_{i}) + \frac{1}{M}\sum_{i}^{M}log \ P(Z_{i}) - \frac{1}{M}\sum_{i}^{M}log \ q(Z_{i})$

As the likelihood is almost always I.I.D, we can further split up the likelihood term;

$ = \frac{1}{M}\sum_{i}^{M} \sum_{j}^{N}log \ P(X_{j} \mid Z_{i}) + \frac{1}{M}\sum_{i}^{M}log \ P(Z_{i}) - \frac{1}{M}\sum_{i}^{M}log \ q(Z_{i})$

Using the above formula we can easily compute a Monte carlo estimate of the ELBO, irrelevant of the form of the joint distribution.

There is one subtly, however, lurking beneath the surface.

Using the MC ELBO estimate

The esimate we have derived above is very simple to implement providing we can evaluate the joint distribution and we can sample from $q$.

Our goal is to maximise the ELBO, using the MC approximation. If we wish to use a library like tensorflow, or other library which uses automatic differentiation in order to do gradient based optimization, we have to ensure that all the gradients flow correctly.

In a general sense, sampling is not an operation over which gradients will flow under automatic differentiation. Intuitively, we can see this is the case because sampling is non-deterministic, so the gradients will also be non-deterministic. We will explore this further in a future post.

One way to get around this is to use the so-called Reparameterization Trick. Within Tensorflow, all variables that implement some method for sampling over which gradients will flow have a FULLY_REPARAMETERIZED flag. You can see this check in the code that follows.

In future posts, we will look into the details of the Reparameterization Trick, as well as ways to deal with distributions where this is not possible.

In [1]:
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('ggplot')

tf.__version__
Out[1]:
'1.3.0'

Problem setup

In order to introduce the ELBO, and to be able to examine the results, we will use a conjugate distribution so we can easily derive the analytical solution.

We will use a simple linear regression, treating the weights, $W$ as our latent variable.

$P(W) = N(W; 0.8, 1.0)$

$P(t\mid X, W) = N(t; y, S_{N})$

where $y = Xw$

Our aim is to compute an approximation to the posterior $P(W| \ t, \ X)$.

Firstly, we will generate the data we will use.

In [2]:
N = 200

X = np.random.uniform(low=-50, high=50, size=(N, 1))
T = 3.2*X + np.random.normal(size=(N, 1), scale=5.5)
In [3]:
plt.scatter(X,T)
Out[3]:
<matplotlib.collections.PathCollection at 0x7fe4552bc160>

Below is a simple tensorflow graph implementing the theory in the above sections. One additional trick, is we have wrapped the variable representing the standard deviation of the $q$ distribution in a softplus - this is to ensure our system doesn't degenerate by forcing the std to become either numerically 0 or negative.

In [10]:
graph = tf.Graph()
with graph.as_default(), tf.device('/cpu'):
    latent_samples = 100
    t = tf.placeholder(shape=(None,1), dtype=tf.float32) # targets
    x = tf.placeholder(shape=(None,1), dtype=tf.float32) # independent variables
    
    q_m = tf.Variable(initial_value=1.0) # variational mean
    q_s = tf.Variable(initial_value=0.3) # variational std
    
    s = 1.0 # prior std
    m = 0.9 # prior mean
    s_l = 5.5 # likelihood std
    
    prior_dist = tf.distributions.Normal(m,s)       # prior distribution
    q_dist = tf.distributions.Normal(q_m, tf.nn.softplus(q_s)) #q distribution
    
    #Show that this is reparametrizable
    assert(q_dist.reparameterization_type 
           == tf.contrib.distributions.FULLY_REPARAMETERIZED)
    
    w = q_dist.sample(latent_samples) #Sample from q 
    w_u = tf.unstack(w) #We want to iterate over each sample, so unpack them
    log_liklihood = [] #List to hold each term for the monte carlo estimate
    log_prior = []
    log_q = []
    for z in w_u:
        #For each sample from W, compute the log probability 
        #of t under the likelihood;
        log_liklihood.append(
            tf.reduce_sum(tf.distributions.Normal(x*z, s_l).log_prob(t)))
        # The log probability of the sample under the prior
        log_prior.append(prior_dist.log_prob(z))
        # log probability of the sample under the q distribution
        log_q.append(q_dist.log_prob(z))
 
    mc_kld=  tf.reduce_mean(log_q) - tf.reduce_mean(log_prior) 
                            #average each term over samples of w 
                            #to get monte carlo estimate of 
                            #the expectation
    
    loss = tf.reduce_mean(log_liklihood) - mc_kld
    train_op = tf.train.AdamOptimizer(0.2).minimize(-loss)
    init_op = tf.global_variables_initializer()
In [11]:
sess = tf.InteractiveSession(graph=graph)
sess.run(init_op)
In [12]:
for i in range(2001):
    feed_dict = {x: X, t:T}
    sess.run([train_op], feed_dict=feed_dict)
    
    if i % 250 == 0:
        print(sess.run([q_m, tf.square(tf.nn.softplus(q_s))]))
        q_mean = q_m.eval()
        q_std = sess.run([tf.nn.softplus(q_s)])
[1.2, 0.55412644]
[3.1951087, 0.00074953539]
[3.1954536, 0.00036324651]
[3.1947501, 0.00025111937]
[3.1938899, 0.00020788528]
[3.1952934, 0.00019261954]
[3.1944358, 0.00018552219]
[3.1945758, 0.00018313997]
[3.1910744, 0.00018014094]

We can derive the mean and variance of the posterior analytically, (for more information, see PRML by Bishop).

The posterior variance is given by $S_{P} = (\frac{1}{5.5^{2}}X^{T}X + 1)^{-1}$.

The posterior mean is given by $m_{P} = S_{P}(0.9+\frac{1}{5.5^{2}}X^{T}t)$

In [13]:
analytical_posterior_var = ((1/5.5**2)*X.T@ X +1)**-1
analytical_posterior_var
Out[13]:
array([[ 0.00018215]])
In [14]:
analytical_posterior_mean = analytical_posterior_var*(0.9+((1/5.5**2)*X.T @ T))
analytical_posterior_mean
Out[14]:
array([[ 3.1941171]])
In [15]:
from scipy.stats import norm
xn = np.arange(3.1, 3.5, 0.0001)
true_dist = norm(loc = analytical_posterior_mean, 
                 scale =(analytical_posterior_var)**0.5)
q_dist = norm(loc = q_mean, scale = q_std)
yn = true_dist.pdf(xn).ravel()
plt.plot(xn, yn, linewidth=3, label="True Posterior")
plt.plot(xn, q_dist.pdf(xn).ravel(), '--', linewidth=3,
         label="Approximation")
plt.legend()
Out[15]:
<matplotlib.legend.Legend at 0x7fe41c716f98>

As we can see, this is a very close approximation, and if we run for more steps, we will get an exact analytical solution.

Summary

In this post, we have introduced the Monte Carlo approach to computing the ELBO, talked very briefly about some considerations, and used tensorflow to compute a simple example.