Hamiltonian Monte Carlo

Background

Suppose we wish to sample from a distribution $p(\mathbf{x}) = \frac{1}{Z_{x}} e^{H_{x}(\mathbf{x})}$.
We introduce a secondary, independent momentum variable $\mathbf{y}$ with distribution $p(\mathbf{y}) = \frac{1}{Z_{y}} e^{H_{y}(\mathbf{y})}$.

This gives us the joint distribution $p(\mathbf{x}, \mathbf{y}) = \frac{1}{Z_{x}} \frac{1}{Z_{y}} e^{H_{x}(\mathbf{x}) + H_{y}(\mathbf{y})}$,

where we call the term $H(\mathbf{x}, \mathbf{y}) = H_{x}(\mathbf{x}) + H_{y}(\mathbf{y})$ the joint energy.

We sample from the desired distribution by sampling from this joint distribution, moving in a way given by Hamilton's equations; because the joint factorises, keeping only the $\mathbf{x}$ values gives samples from the target.

In [1]:
from autograd import numpy as np
from autograd import grad
import matplotlib.pyplot as plt
%matplotlib inline

Consider the case where both the momentum and the target distributions are standard 1D Gaussians. We then get the following joint energy:

In [2]:
def energy(p, q):
    # joint energy H(p, q) = H_x(p) + H_y(q) for standard Gaussian
    # target and momentum; here p is the position and q the momentum
    return -p**2 / 2 - q**2 / 2
In [3]:
x = np.arange(-3, 3, 0.01)
y = np.arange(-3, 3, 0.01)
In [4]:
X, Y = np.meshgrid(x,y)
In [5]:
Z = energy(X, Y)
In [6]:
fig, ax = plt.subplots()
ax.contour(X, Y, Z, 30)
ax.set_xlabel('x - position')
ax.set_ylabel('y - momentum')
ax.set_title("Joint Energy surface");

As the plot above shows, when we take a simple Gaussian for both our target and momentum distributions, the contours of the joint energy are concentric circles, which makes the surface easy to visualise.

The high-level approach

  • Our first step is to randomly pick a contour.
  • The next step is to explore the joint space by aiming to stick to that contour.

Picking a Contour

We do this rather simply, by sampling the momentum variable at random from its distribution. Combined with our current position $\mathbf{x}$, this gives us a starting energy level.
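For example, with the energy function defined above (the current position value of 0.5 is just an illustrative choice):

y0 = np.random.normal()   # sample a fresh momentum from its Gaussian distribution
x0 = 0.5                  # our current position
level = energy(x0, y0)    # this energy value picks out the contour we will explore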

Exploring the contour

Once we have an energy level (equivalently, a {position, momentum} pair), we look to move around the space such that our chance of acceptance at the end of the process is high. We explore using $L$ steps of step size $\epsilon$; both are hyperparameters to be tuned.
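In outline, the exploration looks like this (a hypothetical skeleton; integrator_step stands in for the discretized Hamiltonian dynamics developed below):

x, y = x0, y0
for step in range(L):
    # one approximate step of Hamiltonian dynamics with step size eps
    x, y = integrator_step(x, y, eps)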

Accepting the new point

Let's say we start at a point $(\mathbf{x}^{(i)}, \mathbf{y}^{(i)})$ and move to $(\mathbf{x'}, \mathbf{y'})$. As per Metropolis-Hastings, we accept the new point with probability:
$\min \left[1, \frac{\exp(H(\mathbf{x'}, \mathbf{y'}))}{\exp(H(\mathbf{x}^{(i)}, \mathbf{y}^{(i)}))}\right]$

(note that the proposed point sits in the numerator, since under our convention $p(\mathbf{x}, \mathbf{y}) \propto e^{H(\mathbf{x}, \mathbf{y})}$).

If the point is accepted:
$(\mathbf{x}^{(i+1)}, \mathbf{y}^{(i+1)}) = (\mathbf{x'}, \mathbf{y'})$
else:
$(\mathbf{x}^{(i+1)}, \mathbf{y}^{(i+1)}) = (\mathbf{x}^{(i)}, \mathbf{y}^{(i)})$
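As a small sketch, the accept/reject step can be written as a stand-alone helper (a hypothetical function for illustration; the full implementation later in the notebook inlines this logic):

def accept_proposal(H_old, H_new):
    # with p proportional to exp(H), accept with probability min(1, exp(H_new - H_old))
    accept_prob = np.minimum(1.0, np.exp(H_new - H_old))
    return np.random.uniform() < accept_prob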

It should be noted that if we move in a way that exactly conserves the joint energy, we will accept every point. However, it is not that simple: we usually have to solve Hamilton's equations approximately and numerically, so an exact solution is not always possible.

If we simulate very precisely, the method becomes computationally expensive. If, however, we are not precise enough, we will reject a large proportion of samples, which is also very inefficient.

Hamilton's equations

The equations which govern Hamiltonian dynamics are:

$\frac{dx_{i}}{dt} = \frac{\partial H}{\partial y_{i}}$
$\frac{dy_{i}}{dt} = -\frac{\partial H}{\partial x_{i}}$

This gives us a way to move from a known point to a new point. If we consider $x_{i}$ and $y_{i}$ to be functions of time $t$, we see that the rate of change of $\mathbf{x}$ and $\mathbf{y}$ is given by the partial derivatives of the energy.

Hamiltonian dynamics conserve the energy, and so are perfect for this use case: essentially we pick a level on the surface, then explore along that energy level.
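To see the conservation explicitly, a one-line chain-rule check using the two equations above:

$\frac{dH}{dt} = \sum_{i} \left( \frac{\partial H}{\partial x_{i}} \frac{dx_{i}}{dt} + \frac{\partial H}{\partial y_{i}} \frac{dy_{i}}{dt} \right) = \sum_{i} \left( \frac{\partial H}{\partial x_{i}} \frac{\partial H}{\partial y_{i}} - \frac{\partial H}{\partial y_{i}} \frac{\partial H}{\partial x_{i}} \right) = 0$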

Simple Euler Discretization

If we write:
$\frac{dx_{i}}{dt} \approx \frac{x_{i}(t+\epsilon) - x_{i}(t)}{\epsilon}$
we get:
$\frac{\partial H}{\partial y_{i}} \approx \frac{x_{i}(t+\epsilon) - x_{i}(t)}{\epsilon}$

Which we can rearrange to:

$ x_{i}(t+\epsilon) \approx x_{i}(t) + \epsilon \frac{\partial H}{\partial y_{i}}(t)$

The same approach also gives:
$ y_{i}(t+\epsilon) \approx y_{i}(t) - \epsilon \frac{\partial H}{\partial x_{i}}(t)$

If we set $\epsilon$ to be small, this approximation is close, and it is exact in the limit as $\epsilon$ goes to 0.

Computational Demonstration of Euler Discretization

Let's use the 1D probability distribution defined above, and set the starting momentum $y$ to 2 and the starting position to 0.5. This places us on the contour shown below:

In [7]:
x_start = np.array([0.5])
y_start = np.array([2.0])  # use a float array so autograd can differentiate with respect to it
start_energy = energy(x_start, y_start)
In [8]:
fig, ax = plt.subplots()
ax.contour(X, Y, Z)
ax.scatter(x_start, y_start, marker='x', c='r')
ax.set_xlabel('x - position')
ax.set_ylabel('y - momentum');
In [9]:
dHdx = grad(energy)     # use autograd to get the gradient with respect to position
dHdy = grad(energy, 1)  # gradient with respect to momentum
eps = 0.2  # step size
L = 20     # number of steps
In [10]:
fig, ax = plt.subplots()
ax.contour(X, Y, Z, 15)
ax.scatter(x_start, y_start, marker='x', c='r')
ax.set_xlabel('x - position')
ax.set_ylabel('y - momentum')
ax.set_title('Euler Discretization')

x_current = x_start
y_current = y_start

for t in range(L):
    # simple Euler: both updates use gradients evaluated at the current point
    x_new = x_current + eps * dHdy(x_current, y_current)
    y_new = y_current - eps * dHdx(x_current, y_current)
    ax.scatter(x_new, y_new, marker='x', c='r')
    x_current = x_new
    y_current = y_new

As we can see, this discretization is far from perfect: we drift away from the starting energy level quite quickly. The only way to combat this under this scheme is to take many more steps with a smaller step size, which is computationally expensive.
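We can quantify the drift directly, reusing the final x_current and y_current values from the loop above (a small diagnostic, not part of the original sampler):

drift = energy(x_current, y_current) - start_energy
print("Energy change after", L, "Euler steps:", drift)  # grows in magnitude as L increases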

Modified Euler Discretization

A simple improvement is to update the momentum first, and then use the freshly updated momentum when moving the position:

$ y_{i}(t+\epsilon) \approx y_{i}(t) - \epsilon \frac{\partial H}{\partial x_{i}}(t)$

$ x_{i}(t+\epsilon) \approx x_{i}(t) + \epsilon \frac{\partial H}{\partial y_{i}}(t+\epsilon)$

In [11]:
fig, ax = plt.subplots()
ax.contour(X, Y, Z, 15)
ax.scatter(x_start, y_start, marker='x', c='r')
ax.set_xlabel('x - position')
ax.set_ylabel('y - momentum')
ax.set_title('Modified Euler')

x_current = x_start
y_current = y_start

eps = 0.2
L = 30
for t in range(L):
    # update the momentum first, then use the new momentum for the position update
    y_new = y_current - eps * dHdx(x_current, y_current)
    x_new = x_current + eps * dHdy(x_current, y_new)
    
    ax.scatter(x_new, y_new, marker='x', c='r')
    x_current = x_new
    y_current = y_new

Whilst this is not perfect (you can see it crossing contours above), it does much better, as it doesn't spiral off into infinity.

Leapfrog Discretization

This is essentially a refinement of the modified Euler approach: we take a half step for the momentum, a full step for the position using the half-stepped momentum, and then a second half step for the momentum.

$ y_{i}(t+\frac{\epsilon}{2}) \approx y_{i}(t) - \frac{\epsilon}{2} \frac{\partial H}{\partial x_{i}}(t)$

$ x_{i}(t+\epsilon) \approx x_{i}(t) + \epsilon \frac{\partial H}{\partial y_{i}}(t+\frac{\epsilon}{2})$

$ y_{i}(t+\epsilon) \approx y_{i}(t+\frac{\epsilon}{2}) - \frac{\epsilon}{2} \frac{\partial H}{\partial x_{i}}(t+\epsilon)$

In [12]:
fig, ax = plt.subplots()
ax.contour(X, Y, Z, 15)
ax.scatter(x_start, y_start, marker='x', c='r')
ax.set_xlabel('x - position')
ax.set_ylabel('y - momentum')
ax.set_title('Leapfrog')

x_current = x_start
y_current = y_start

eps = 0.2
L = 30
for t in range(L):
    y_half_eps = y_current - 0.5 * eps * dHdx(x_current, y_current)  # half step for momentum
    x_new = x_current + eps * dHdy(x_current, y_half_eps)            # full step for position
    y_new = y_half_eps - 0.5 * eps * dHdx(x_new, y_half_eps)         # second half step, at the new position
    
    ax.scatter(x_new, y_new, marker='x', c='r')
    x_current = x_new
    y_current = y_new

This is much better! We can even try it with a much larger step size and it will stay stable.

In [13]:
fig, ax = plt.subplots()
ax.contour(X, Y, Z, 15)
ax.scatter(x_start, y_start, marker='x', c='r')
ax.set_xlabel('x - position')
ax.set_ylabel('y - momentum')
ax.set_title('Leapfrog')

x_current = x_start
y_current = y_start

eps = 1.2
L = 50
for t in range(L):
    y_half_eps = y_current - 0.5 * eps * dHdx(x_current, y_current)
    x_new = x_current + eps * dHdy(x_current, y_half_eps)
    y_new = y_half_eps - 0.5 * eps * dHdx(x_new, y_half_eps)  # evaluated at the new position
    
    ax.scatter(x_new, y_new, marker='x', c='r')
    x_current = x_new
    y_current = y_new

The important point is that even though this is not perfect, it is stable and doesn't diverge to infinity.
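One more property worth checking (a quick sketch, not part of the original notebook): leapfrog is time-reversible. If we run it forward for L steps, flip the sign of the momentum, and run it forward again, we land back at the starting point up to floating-point error; this reversibility is part of what makes the overall sampler valid.

def leapfrog_steps(x, y, eps, L):
    # run L leapfrog steps from (x, y) and return the end point
    for _ in range(L):
        y_half = y - 0.5 * eps * dHdx(x, y)
        x = x + eps * dHdy(x, y_half)
        y = y_half - 0.5 * eps * dHdx(x, y_half)
    return x, y

x_fwd, y_fwd = leapfrog_steps(x_start, y_start, 1.2, 50)
x_back, y_back = leapfrog_steps(x_fwd, -y_fwd, 1.2, 50)
print(np.allclose(x_back, x_start), np.allclose(-y_back, y_start))  # both should be True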

Putting it all together

Now that we have a better understanding of how to move around the energy landscape robustly, we can combine it with the earlier steps to get an algorithm that generates samples from the target distribution.

In [14]:
def HMC_sample(x_start, y_start, H, dHdx, dHdy, eps=0.3, L=5):
    x_current = x_start
    y_current = y_start
    
    # randomise the direction of travel along the contour
    eps = eps * np.random.choice([-1, 1])
    
    # L leapfrog steps
    for t in range(L):
        y_half_eps = y_current - 0.5 * eps * dHdx(x_current, y_current)
        x_new = x_current + eps * dHdy(x_current, y_half_eps)
        y_new = y_half_eps - 0.5 * eps * dHdx(x_new, y_half_eps)

        x_current = x_new
        y_current = y_new
        
    # Metropolis correction: with p proportional to exp(H), accept with
    # probability min(1, exp(H_new - H_old))
    accept_threshold = np.minimum(1.0, np.exp(H(x_current, y_current) - H(x_start, y_start)))
    if np.random.uniform() < accept_threshold:
        return x_current, True
    return x_start, False
In [15]:
x_start = np.array([0.5])
dHdx = grad(energy)
dHdy = grad(energy, 1)

xs = []
n_accepted = 0
for i in range(10000):
    y_start = np.random.normal()  # resample the momentum at every iteration
    sample, accepted = HMC_sample(x_start, y_start, energy, dHdx, dHdy)
    xs.append(sample)
    x_start = sample
    n_accepted += accepted
    
print("Accepted:",n_accepted/100, "%")
Accepted: 99.39 %
In [16]:
from scipy.stats import norm
x = np.arange(-4, 4, 0.1)
y = norm.pdf(x)  # exact standard normal density for comparison
plt.hist(np.array(xs), bins=30, density=True);
plt.plot(x,y)
Out[16]:
[<matplotlib.lines.Line2D at 0x7f5db592b710>]
To check that the sampler works for a non-standard target, let's use a Gaussian with mean 2.5 and variance 3 as the target, keeping the standard normal momentum distribution.

In [20]:
def energy_position(p):
    # Gaussian target with mean 2.5 and variance 3: H_x(p) = -(p - 2.5)^2 / (2 * 3)
    return -(p-2.5)**2/6

def energy_momentum(q):
    # standard normal momentum, as before
    return -q**2/2

def non_standard_energy(p,q):
    return energy_position(p) + energy_momentum(q)
In [21]:
x_start = np.array([0.5])
dHdx = grad(non_standard_energy)
dHdy = grad(non_standard_energy, 1)

xs = []
n_accepted = 0
for i in range(10000):
    y_start = np.random.normal()
    sample, accepted = HMC_sample(x_start, y_start, non_standard_energy, dHdx, dHdy)
    xs.append(sample)
    x_start = sample
    n_accepted += accepted
    
print("Accepted:",n_accepted/100, "%")
Accepted: 99.85 %
In [24]:
from scipy.stats import norm
x = np.arange(-4, 9, 0.5)
y = norm.pdf(x, loc=2.5, scale=3**0.5)  # exact target density: mean 2.5, variance 3
plt.hist(np.array(xs), bins=30, density=True);
plt.plot(x,y)
Out[24]:
[<matplotlib.lines.Line2D at 0x7f5db59475c0>]

Summary

In this notebook, I have shown a basic implementation of HMC and the core idea of the approach. By introducing momentum variables, we can use Hamilton's equations to move around the joint space in a way that preserves volume and approximately conserves the energy.

By using gradient information, HMC can hopefully be a bit smarter than a pure random walk that merely hopes to stumble upon areas of high probability mass. Of course, there is a lot of complexity not covered here; in particular, the choices of step size and number of steps are critical for good performance.