# Intro

The idea of this post is to use the Advent of Code problems as a jumping-off point for some ML theory. Unfortunately, these posts take a lot longer to produce than it does to solve the problems the easy way!

The key point I want to demonstrate here is just how far we have to go on differentiable sparsity.

## The Problem

The crux of the problem is to find two numbers in a list that sum to 2020.

You can see the full description here: https://adventofcode.com/2020/day/1

```
import torch
from torch import nn
from torch.autograd import Variable
from torch.nn import functional as F
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
```

```
with open('input.txt') as f:
    x_data = [float(d) for d in f.readlines()]
```

```
# Brute force: form the matrix of all pairwise sums and find the
# pair whose sum is closest to 2020.
x = np.array(x_data).reshape(-1, 1)
all_sums = x + x.T
np.unravel_index(np.argmin((all_sums - 2020)**2), (len(x), len(x)))
```

```
x[147] + x[162], x[147] * x[162]  # check the sum, then the puzzle answer (the product)
```

Well, that was no fun. There's not even a deep learning framework involved! What use is this?

## The problem as a regression

We can see that actually, this is just a good old regression problem: find $w$ such that $Xw = y$, with $X$ our input data and $y = 2020$, subject to a regularisation constraint.

So what we want is a solution $w$ that has 2 non-zero entries. Counting the non-zero entries is exactly the $L_0$ norm.
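To make that concrete, here is a minimal sketch (with made-up toy vectors) of what the $L_0$ "norm" measures: it just counts non-zero entries, so a valid solution here is any $w$ with exactly two of them.

```python
import numpy as np

# The L0 "norm" simply counts non-zero entries.
def l0_norm(w):
    return int(np.count_nonzero(w))

assert l0_norm(np.array([0.0, 1.0, 0.0, 1.0])) == 2  # a valid 2-sparse solution
assert l0_norm(np.array([0.5, 1.0, 0.5])) == 3       # too many active entries
```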

Ok, sure. This is not the best way to solve this problem. But it has several interesting features:

- It has an exact solution.
- It is small enough that a naive solution runs in trivial time.
- We don't even have to learn any weights - just an indicator of whether each variable should be included or not.
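Since the decision is purely "include or not", exact search over the 2-sparse supports is trivial at toy scale. A sketch on a made-up 5-element instance:

```python
import itertools
import numpy as np

# Hypothetical toy instance: enumerate every 2-sparse support directly.
x = np.array([1000.0, 5.0, 1020.0, 90.0, 6.0])
pairs = itertools.combinations(range(len(x)), 2)
best = min(pairs, key=lambda ij: (x[list(ij)].sum() - 2020.0) ** 2)
assert x[list(best)].sum() == 2020.0  # indices 0 and 2: 1000 + 1020
```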

Surely, in 2020, deep learning has got this?

### Problem with L0

We can't just slap on the L0 norm as a regularizer and start training, because the L0 norm is non-differentiable. This is actually the start of quite a deep rabbit hole into sparse models.

This problem is closely related to sparsification - so this is where we start.

What we need is a solution to the problem with $||w||_{0}=2$.

### L1 relaxation

The L1 is the convex envelope of the L0. Essentially, it is convex (and piecewise differentiable) and has the same minimum point. Unfortunately, this is not really appropriate here: the relaxation is $||w||_{1} = 2$, but $[0.5, 1, 0.5]$ and $[1,1,0]$ have the same L1 norm (and are therefore considered equally optimal), despite having very different sparsity.
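The two vectors from the text make the failure mode easy to check directly: identical L1 norm, very different L0.

```python
import numpy as np

a = np.array([0.5, 1.0, 0.5])
b = np.array([1.0, 1.0, 0.0])
# Same L1 norm, so the L1 relaxation cannot tell them apart...
assert np.abs(a).sum() == np.abs(b).sum() == 2.0
# ...but only one of them is 2-sparse.
assert np.count_nonzero(a) == 3
assert np.count_nonzero(b) == 2
```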

There are lots of other relaxations that we won't go into here.

### An approximation to the L0

We can experiment with some cutting-edge theory! The paper I'm going to look at is '**Learning Sparse Neural Networks through $L_0$ Regularization**': https://arxiv.org/abs/1712.01312

I won't labour the point - I suggest going through the paper. The essential idea is that you have some distribution that generates random variables between 0 and 1, and you compute the expectation of your loss function under draws from it. If the random variable were Bernoulli, you would be trying to learn the parameter $p$, so that it is ideally 1 for the important features. If you learn parameters that are sufficiently close to 1 or 0, the expectation of your loss function becomes the loss under the chosen features. You place a penalty on the parameter values to make sure they obey sensible constraints, and then minimise the whole thing.
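A sketch of that collapse, with made-up "converged" probabilities: once every $\pi_i$ is exactly 0 or 1, the Bernoulli draws are deterministic and the expected loss reduces to the loss under the selected features.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.array([1000.0, 5.0, 1020.0, 90.0, 6.0])
pi = np.array([1.0, 0.0, 1.0, 0.0, 0.0])   # hypothetical fully-converged probabilities
z = rng.random((1000, len(x))) < pi        # Bernoulli(pi) draws: here all identical
losses = ((z * x).sum(axis=1) - 2020.0) ** 2
# The Monte Carlo expectation is exactly the loss of the chosen features.
assert losses.mean() == 0.0
```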

If $x$ is $n \times 1$, you have $n$ parameters to learn: each $x_{i}$ has an associated parameter $\pi_{i}$, and we generate Bernoulli random variables $z_{i}$ with $\pi_{i} = P(z_{i}=1)$. We want these probabilities to be 1 for the variables we care about and 0 otherwise, so we have a regularizer:

$\mathcal{L}_{r} = (\sum_{i} \pi_{i} -2)^{2}$

And a likelihood loss:

$\mathcal{L}_{l} = E_{q(\mathbf{z}|\mathbf{\pi})}[(\mathbf{x}^{T}\mathbf{z} - 2020)^{2}]$

The problem is that generating Bernoulli variables in a differentiable way is not easy, so we need other ways to estimate the gradient. The paper suggests an alternative: use the Concrete distribution. Again, I won't go over that here in the interest of brevity, but check out the paper https://arxiv.org/abs/1611.00712 .

The important thing to note is that we generate random variables and compute an expectation whilst optimising over the distribution parameters - this makes the objective nice and smooth, and we can use the reparameterisation trick to differentiate through the sampling if we need to.
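A minimal sketch of the reparameterisation trick itself (using a Gaussian rather than the Concrete distribution, for brevity): the randomness is sampled outside the computation graph, so gradients reach the distribution parameters.

```python
import torch

# z ~ N(mu, sigma^2) via z = mu + sigma * eps, eps ~ N(0, 1).
mu = torch.tensor(1.0, requires_grad=True)
sigma = torch.tensor(0.5, requires_grad=True)
eps = torch.randn(())          # the noise carries no parameters
z = mu + sigma * eps
z.backward()
assert mu.grad.item() == 1.0             # dz/dmu = 1
assert torch.isclose(sigma.grad, eps)    # dz/dsigma = eps
```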

And so, we have our objective, and we'll give it a go on a smaller problem.

```
# Code heavily borrowed from the paper's implementation.
def hard_sigmoid(x):
    return torch.min(torch.max(x, torch.zeros_like(x)), torch.ones_like(x))


class L0Norm(nn.Module):
    def __init__(self, origin, beta=1/3, gamma=-0.1, zeta=1.1):
        super(L0Norm, self).__init__()
        self._origin = origin
        self._size = self._origin.size()
        # One location parameter per input element
        self.loc = nn.Parameter(torch.zeros(self._size).normal_(0, 0.01))
        self.beta = beta
        self.register_buffer("uniform", torch.zeros(self._size))
        self.gamma = gamma
        self.zeta = zeta
        self.gamma_zeta_ratio = np.log(-gamma / zeta)

    def get_transform(self, params):
        # Probability that each gate is non-zero
        return torch.sigmoid(params - self.beta * self.gamma_zeta_ratio)

    def get_indicators(self):
        # Hard-concrete sample: logistic noise, stretch, then clamp to [0, 1]
        self.uniform.uniform_()
        u = Variable(self.uniform)
        s = torch.sigmoid((torch.log(u) - torch.log(1 - u) + self.loc) / self.beta)
        s = s * (self.zeta - self.gamma) + self.gamma
        # Penalise the expected number of active gates away from 2
        penalty = self.get_transform(self.loc).sum()
        return hard_sigmoid(s), (penalty - 2)**2

    def forward(self, input_data):
        mask, penalty = self.get_indicators()
        return (mask * input_data).sum(), penalty
```

```
def train(x, n_iter, lambdas):
    m = L0Norm(x, gamma=-1.0, zeta=2)
    loss_likelihood = torch.nn.MSELoss()
    y_true = torch.tensor(2020.0)
    optimizer = torch.optim.Adam(m.parameters(), lr=1e-2)
    losses = []
    penalties = []
    # Step the regularisation strength through the schedule in lambdas
    for i in lambdas:
        for t in range(n_iter):
            y_hat, penalty = m(x)
            loss = loss_likelihood(y_hat, y_true) + i * penalty
            if loss < 1e-2:
                print("Stopping with loss", loss)
                return m
            penalties.append(penalty.item())
            losses.append(loss.item())
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return m
```

### Toy Example

```
x_trial = torch.tensor([1000, 5, 1020, 90.0, 6.0])
m = train(x_trial, 5000, np.linspace(0.1, 100, 10))
```

```
w = m.get_transform(m.loc).round()
w
```

This worked quite well - that's the correct answer!
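As a sanity check, the mask a successful run recovers (reproduced by hand here, since training is stochastic) selects exactly the two entries that hit the target:

```python
import torch

x_trial = torch.tensor([1000.0, 5.0, 1020.0, 90.0, 6.0])
w = torch.tensor([1.0, 0.0, 1.0, 0.0, 0.0])  # mask from a successful run
assert int(w.sum()) == 2                      # exactly two features selected
assert float((x_trial * w).sum()) == 2020.0   # and they sum to the target
```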

### Actual problem

```
x_real = torch.tensor(x_data)
m = train(x_real, 150000, np.linspace(10000, 10000, 1))
```

```
w = m.get_transform(m.loc).round()
```

```
(x_real*w).sum()
```

```
w
```

This one doesn't work so well! I toyed around a lot with the training parameters. It seems there are two problems: the size of the space to search, and the shape of the space of potential solutions - in the toy example there are only two main feasible solutions, but in the real example there are lots more "close" solutions. It was also a massive pain - you have to care about the regularisation hyperparameter, and it's not clear when it has converged, because the loss moves in a step-like fashion as the placement of the 1s changes.

If you didn't know there was a zero-loss solution, I don't know how you'd decide this had finished training.

#### In true deep learning style, we got something close to the right answer, with too many components.

Hopefully this is a bit thought-provoking! Is this the best we have for proper L0 sparsity with deep learning?!

# Summary

This only sort of worked - for small problems with clear solutions it seemed to work quite reliably, for example the trial problem I used first. However, I failed to get something working on the bigger Advent of Code problem, even with lots of parameter tweaking and really long training cycles - feel free to play with it yourself!

I think it's safe to say machine learning hasn't come to the rescue here. Maybe we need to add some bigger neural networks.

What this simple example shows is just how complex the problem of sparsity is - proper L0 sparsity is NP-hard, and while it's quite easy to get lots of small weights, even when an exact solution exists for a small problem it takes quite some effort to get sensible outputs using traditional differentiation-based optimisation. A timely reminder that just because a relaxation has the same optimal point as its true counterpart, actually getting to that solution can be non-trivial or impossible.
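To give a feel for why exact L0 optimisation is hard in general: the number of candidate supports grows combinatorially. Pairs are easy to enumerate, but larger supports quickly become hopeless (sizes below are illustrative, assuming an input of roughly 200 numbers).

```python
from math import comb

n = 200                          # roughly the size of the puzzle input
assert comb(n, 2) == 19900       # 2-sparse supports: trivial to enumerate
assert comb(n, 10) > 10**16      # 10-sparse supports: far beyond brute force
```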