Dropout as a Bayesian Method

Dropout is a common method to prevent overfitting in neural networks. This post covers Dropout as a Bayesian Approximation.

Dropout

Dropout is a way to reduce overfitting during training of neural networks. The idea is simple: if a single input sample to a layer in our network is N-dimensional, we sample N independent Bernoulli variables with probability $p$ (also called the dropout probability), and multiply each entry of the input by the corresponding Bernoulli sample.

$\mathbf{z} = \left[\begin{matrix}z_{1} \\ \vdots \\ z_{i} \\ \vdots \\ z_{N}\end{matrix}\right]$

$z_{i} \sim Bernoulli(p)$

$\mathbf{\hat{y}} = \sigma((\mathbf{x} \cdot \mathbf{z})^{T}W +b)$

Where $\sigma$ is some activation function, $\mathbf{x}$ is an N-dimensional vector representing a single sample, $\mathbf{x} \cdot \mathbf{z}$ denotes the elementwise product, and $W$ and $b$ are the weights and biases of the layer respectively.
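
As a concrete illustration, here is a minimal numpy sketch of one such layer (the function name `dropout_layer` and the ReLU standing in for $\sigma$ are my own choices, not from the paper):

```python
import numpy as np

def dropout_layer(x, W, b, p, rng):
    # z_i ~ Bernoulli(p): each input unit is kept with probability p
    z = rng.binomial(1, p, size=x.shape)
    # elementwise mask, then the affine map; ReLU stands in for sigma here
    return np.maximum(0.0, (x * z) @ W + b)

rng = np.random.default_rng(0)
x = rng.normal(size=4)          # a single 4-dimensional sample
W = rng.normal(size=(4, 3))     # layer weights
b = np.zeros(3)                 # layer biases
y_hat = dropout_layer(x, W, b, p=0.8, rng=rng)
```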

For deep networks, we can simply repeat the process, like so:

$\mathbf{\tilde{y}} = \sigma((\mathbf{\hat{y}} \cdot \mathbf{z_{2}})^{T}W_{2} +b_{2})$

To train the network, we minimise a loss function, which I denote by $Er()$, plus an additional penalty on the parameters themselves, such as a complexity penalty:

$\mathcal{L} = Er(\mathbf{\tilde{y}}, \mathbf{y}) + \lambda(||W||_{2}^{2}+||W_{2}||_{2}^{2}+||b||_{2}^{2} + ||b_{2}||_{2}^{2})$

For regression, $Er$ could be the usual squared error:

$Er(\mathbf{\tilde{y}}, \mathbf{y}) = ||\mathbf{\tilde{y}}- \mathbf{y}||_{2}^{2}$
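
A small sketch of this combined loss, under the assumption that `params` collects all the weight and bias arrays and `lam` plays the role of $\lambda$ (the names are mine):

```python
import numpy as np

def regularised_loss(y_pred, y_true, params, lam):
    # squared-error term Er(y_tilde, y)
    err = np.sum((y_pred - y_true) ** 2)
    # L2 complexity penalty over all weight and bias arrays
    penalty = lam * sum(np.sum(p ** 2) for p in params)
    return err + penalty

# e.g. params = [W, b, W2, b2] from the two-layer network above
loss = regularised_loss(np.array([1.0, 2.0]), np.array([0.5, 2.5]),
                        params=[np.ones((3, 2)), np.zeros(2)], lam=1e-3)
```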

Intuition

Dropout works because, for every sample, each feature may be zeroed out (with probability $1-p$ under the convention above). This means that to make good predictions, the network cannot rely too heavily on any specific feature. Generalisation is, at a high level, having a network that does not depend on patterns that are specific to the training set and do not hold in general.

Overfitting, in this sense, means the network is too sensitive to particular features. Dropout helps combat this because any feature may be set to 0: if the network relies on very particular intricacies of a feature, it will perform terribly whenever that feature is zeroed. So, to achieve a good loss, the network is forced not to exploit small, transient patterns in the training set.

Dropout as a Bayesian method

The basic approach we will take is to view our network as a Gaussian process. Let the final network outputs be $\mathbf{Y}$, and the inputs be $\mathbf{X}$. A Gaussian process with a suitable covariance function can model a very rich class of functions, so this view is reasonable.

Let us consider a slightly simpler functional form than the one above:

$\mathbf{\tilde{y}} = \sigma(\mathbf{X}W_{1} +b)W_{2} + \epsilon$

$\mathbf{X}$ has shape $N \times d$, $W_{1}$ is $d \times K$, and $W_{2}$ is $K \times 1$.

We let $W$ correspond to a weight matrix after dropout is applied. If we denote the 'pre-dropout' matrix by $M$, then $W = M \cdot \mathbf{z}$, with the mask zeroing out whole rows of $M$.

With Gaussian noise $\epsilon$ of precision $\tau$, this is exactly the likelihood we get for a linear regression on the features $\sigma(\mathbf{X}W_{1}+b)$, so we can equally say:

$\mathbf{\tilde{y}} \sim \mathcal{N}(\sigma(\mathbf{X}W_{1} +b)W_{2}, \tau^{-1}I)$

Let us place a prior on all the parameters, with the priors on $W_{1}$ and $W_{2}$ being standard Gaussians.

$P(\mathbf{\tilde{y}}, W_{2}, W_{1}, b | \mathbf{X}) = P(\mathbf{\tilde{y}}| W_{2}, W_{1}, b, \mathbf{X})P(W_{1})P( W_{2})P(b)$

Here we have an analogue to standard loss functions: if each of the terms above is normal, performing MAP estimation is equivalent to minimising:

$||\mathbf{y}-\mathbf{\hat{y}}||^{2} + ||W_{1}||^{2}+ ||W_{2}||^{2}+ ||b||^{2}$

Which is a standard sum-of-squares loss function with regularisation terms.
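
To spell this out, writing $\mathbf{y}$ for the observed outputs and $\mathbf{\hat{y}} = \sigma(\mathbf{X}W_{1}+b)W_{2}$ for the network's prediction (and assuming the prior on $b$ is also a standard Gaussian), the negative log of the joint density above is, up to additive constants,

$\frac{\tau}{2}||\mathbf{y}-\mathbf{\hat{y}}||_{2}^{2} + \frac{1}{2}||W_{1}||_{2}^{2}+ \frac{1}{2}||W_{2}||_{2}^{2}+ \frac{1}{2}||b||_{2}^{2}$

so minimising it (i.e. MAP estimation) matches the regularised sum-of-squares loss above, with the regularisation strength set by $1/\tau$.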

As a Gaussian Process

We can also write this as a Gaussian process, which is the approach used in the paper:

$\mathcal{N}(\mathbf{\tilde{y}}| f, \tau^{-1}I_{N})\mathcal{N}(f|0, K)$

$f = \Phi W_{2}$

where $\Phi = \sigma(\mathbf{X}W_{1}+b)$

We can find the moments of $f$, knowing that the priors $P(W_{1})$ and $P(W_{2})$ are standard normal.

$E[f] = E[\Phi W_{2}] = E[\Phi]E[ W_{2}] = 0$

$E[f f^{T}] = E[\Phi W_{2}W_{2}^{T} \Phi^{T}] = E[\Phi E[W_{2}W_{2}^{T}] \Phi^{T}] = E[\Phi \Phi^{T}]$

For a pair of samples, $x$ and $y$ (which are rows of $\mathbf{X}$), we can write this expectation as

$K(x,y)= \int \sigma(w^{T}x +b) \sigma(w^{T}y +b) P(w)P(b) dwdb$

and so:

$K_{n,m} = K(\mathbf{X}_{n},\mathbf{X}_{m})$

We can approximate the integral with Monte Carlo samples:

$\hat{K}(x,y)= \frac{1}{K} \sum_{k=1}^{K} \sigma(w_{k}^{T}x +b_{k}) \sigma(w_{k}^{T}y +b_{k})$
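
A minimal numpy sketch of this Monte Carlo estimate, using ReLU as the activation for concreteness (the function names are mine):

```python
import numpy as np

def relu(a):
    return np.maximum(a, 0.0)

def mc_kernel(x, y, n_samples=10_000, seed=0, sigma=relu):
    # Monte Carlo estimate of K(x, y) = E_{w,b}[ sigma(w^T x + b) sigma(w^T y + b) ]
    # with w ~ N(0, I) and b ~ N(0, 1), as in the integral above
    rng = np.random.default_rng(seed)
    w = rng.normal(size=(n_samples, x.shape[0]))  # one weight vector per sample
    b = rng.normal(size=n_samples)                # one bias per sample
    return np.mean(sigma(w @ x + b) * sigma(w @ y + b))

x = np.array([1.0, -0.5])
y = np.array([0.3, 2.0])
print(mc_kernel(x, y))
```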

Uncertainty Estimates for a New Point

Given some new sample, $x'$, we want the distribution of $y'$.

$P(y' | x', \mathbf{X}, y) = \int P(y' | x', W_{1}, W_{2}, b)P(W_{1}, W_{2}, b | \mathbf{X}, y) dW_{1}dW_{2}db$

We make a simple approximation to the posterior, setting $q(W)$ to a factorised distribution whose factors $q(w_{k})$ are two-component Gaussian mixtures:

$q(W) = \prod_{k=1}^{K} q(w_{k})$

$q(w_{k}) = p \mathcal{N}(m_{k}, \sigma I) + (1-p) \mathcal{N}(0, \sigma I) $

To sample a $w_{k}$ from this, we can use ancestral sampling.

Consider a Bernoulli random variable $z$ with success probability $p$. Sample $z$; if $z=1$, sample from $\mathcal{N}(m_{k}, \sigma I)$, otherwise sample from $\mathcal{N}(0, \sigma I)$.

If we let $\sigma$ tend to 0, this procedure tends towards:

if $z=1$, our sample is $m_{k}$, else our sample is 0.

This is exactly equal to:

$w_{k} = m_{k}\cdot z$

Which is just applying dropout. So applying dropout with probability $p$ is the limiting case of sampling from a mixture of two Gaussians, one with mean $m_{k}$ and the other with mean 0.
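
A small sketch of the ancestral sampling procedure, showing that as $\sigma \to 0$ a sample collapses to the dropout mask times the mean (the names are my own):

```python
import numpy as np

def sample_mixture_row(m_k, p, sigma, rng):
    # ancestral sample from the two-component mixture above (sigma is the noise scale)
    z = rng.binomial(1, p)                        # choose a mixture component
    mean = m_k if z == 1 else np.zeros_like(m_k)  # its mean
    return mean + sigma * rng.normal(size=m_k.shape)

rng = np.random.default_rng(0)
m_k = np.array([0.5, -1.2, 2.0])
print(sample_mixture_row(m_k, p=0.8, sigma=1e-8, rng=rng))  # ~ m_k or ~ 0
```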

Now that we know how to draw approximate samples from the posterior, we can approximately evaluate the predictive uncertainty by averaging over $T$ such samples:

$P(y' | x', \mathbf{X}, y) \approx \frac{1}{T} \sum_{t=1}^{T} P(y' | x', W_{1}^{t}, W_{2}^{t}, b^{t})$

Where

$W^{t} = M \cdot \mathbf{z}^{t}$

In practice, this corresponds to simply keeping dropout switched on at test time: we draw several stochastic forward passes, compute the probability of each output under the corresponding normal distribution, and average the results for each data point.
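
As a sketch, here is what Monte Carlo dropout prediction might look like for the two-layer regression network above, reporting the sample mean and standard deviation of the stochastic forward passes (all names are mine, and ReLU again stands in for $\sigma$):

```python
import numpy as np

def mc_dropout_predict(x, M1, b1, M2, p, T=100, seed=0):
    # T stochastic forward passes with dropout kept on at test time
    rng = np.random.default_rng(seed)
    preds = []
    for _ in range(T):
        z1 = rng.binomial(1, p, size=x.shape)    # fresh mask on the inputs
        h = np.maximum(0.0, (x * z1) @ M1 + b1)  # first layer, ReLU for sigma
        z2 = rng.binomial(1, p, size=h.shape)    # fresh mask on the hidden units
        preds.append((h * z2) @ M2)              # linear output layer
    preds = np.array(preds)
    return preds.mean(axis=0), preds.std(axis=0)
```

The paper's full predictive variance also adds the observation noise $\tau^{-1}$ to the sample variance of these passes.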

Discussion

For me, the key insight is the link between dropout and a mixture of Gaussians. Under this framework, training the network corresponds to learning the means of one of the mixture components. Of course, the real crux of the issue is the appropriateness of this approximate posterior: the uncertainty estimates we generate are only as good as our approximation to the true posterior. Knowing whether this choice is good enough to give useful output is, of course, a much harder problem.

Empirically, we see that dropout improves generalisation of networks when appropriately applied.

We could view this as saying that overfitting corresponds to learning parameters $\omega = [W_1, W_2, b]$ such that $P(Y | X, \omega)$ is high but $P(Y | X)$ is not. If this is the case, applying dropout is analogous to optimising the model evidence rather than the likelihood, which is a well-established Bayesian method for avoiding overfitting.

Summary

This post was a short summary of the paper by Gal et al. I mainly focussed on the interpretation, and on using this approach to compute distributions over the output of a network. In the appendix of the original paper, the authors present a more complete set of results, such as how this applies to classification and formulae for a variety of useful quantities like the marginal likelihood, as well as expanding on some points I skimmed over here.