Graphical Models - Part 1

DAGs and D-Separation

Graphical models are a really useful way to simplify many different steps in Bayesian computation. In this post we will cover a few techniques I have found useful!

This post is about D-separation. You may wonder why it's actually useful, as it seems like a bit of a tangent from the kind of thing we normally do in Bayesian computation.

Well, hidden Markov models are one example - when we have a large graphical model, such as time series data where each point depends on previous points, a graphical approach lets us simplify things very easily. We'll see that in a future post.

In [1]:
from IPython.display import Image

DAGs

Directed graphs are a way to encode dependence relationships in joint distributions. Often these are used to represent hierarchical models. Consider the hierarchical model:

$p(x_1, x_2, x_3, x_4, x_5, x_6) = p(x_6|x_4, x_5)p(x_4| x_1, x_2) p(x_5| x_2, x_3)p(x_1)p(x_2)p(x_3)$

We can draw this as a directed graphical model, shown below.

In [2]:
Image('images/DAG.jpg')
Out[2]:

An important property we shall assume unless stated otherwise is that the directed graph is acyclic. This is easiest to understand if we imagine travelling between nodes along the arrows, in the tail to head direction only. If we can find a path that returns to the starting node while only following arrows in that direction, then our graph has a cycle; if no such path exists, the graph is acyclic.

Cycles are not particularly common in Bayesian models: our graph usually comes from a hierarchical relationship like the one above, and so it is naturally acyclic by construction.
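As an aside, one convenient way to encode a DAG like this in code is with networkx (a quick sketch; the node labels below are just my own naming): we add one directed edge from each parent to its child in the factorisation, and can then check acyclicity directly.

In [ ]:
import networkx as nx

# One directed edge from each parent to its child in the factorisation
# p(x6|x4,x5) p(x4|x1,x2) p(x5|x2,x3) p(x1) p(x2) p(x3).
dag = nx.DiGraph()
dag.add_edges_from([
    ("x1", "x4"), ("x2", "x4"),  # parents of x4
    ("x2", "x5"), ("x3", "x5"),  # parents of x5
    ("x4", "x6"), ("x5", "x6"),  # parents of x6
])

print(nx.is_directed_acyclic_graph(dag))  # True: no directed cycles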

D-Separation

D-Separation is a really useful shortcut to computing conditional independence relationships, even in relatively small graphs.

Let's consider the previous graph again:

$p(x_1, x_2, x_3, x_4, x_5, x_6) = p(x_6|x_4, x_5)p(x_4| x_1, x_2) p(x_5| x_2, x_3)p(x_1)p(x_2)p(x_3)$

In [4]:
Image('images/DAG.jpg')
Out[4]:

Before we start, it's probably useful to consider an example that shows why it's much easier to use D-separation than traditional probabilistic manipulations.

Let's imagine I observe a value for $X_6$. Conditioned on $X_6$, are $X_4$ and $X_3$ independent? Or, in notation, does $p(X_4, X_3| X_6) = p(X_4| X_6)p(X_3| X_6)$?

Have a go at working this out in this instance.

A simpler example to motivate D-Separation

Let's first consider a simpler graph, so we can see how D-separation works: $p(X_4, X_5, X_6) = p(X_6|X_4, X_5)p(X_4)p(X_5)$

In [5]:
Image('images/DAG2.jpg')
Out[5]:

We can use Bayes' theorem here to check for conditional independence.

$p(X_4, X_5, X_6) = p(X_6|X_4, X_5)p(X_4)p(X_5)$

We want $p(X_4, X_5 | X_6)$.

$p(X_4, X_5 | X_6) = \frac{p(X_6|X_4, X_5)p(X_4)p(X_5)}{p(X_6)}$

This in general doesn't factorise into something of the form $p(X_4 | X_6)p( X_5 | X_6)$.

In fact, in this graph, whichever node we condition on, the other two are not conditionally independent.

Intuitively, suppose we know the value of $X_6$. Then $X_4$ and $X_5$ become "coupled".

Consider $p(X_6|X_4, X_5) = \mathcal{N}(X_6|X_4, X_5)$, a normal distribution with mean $X_4$ and variance $X_5$.

If we observe $X_6=10$, this tells us about likely configurations of both $X_4$ and $X_5$ together. $X_4 = 8$ and $X_5=5$ are reasonable, whereas $X_4 = 200$ and $X_5=5$ are not.

So we see, when we condition on a node where arrows meet head to head, we create a dependence between the parents.
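To see this numerically, here is a small simulation sketch. It uses a slightly different but analogous model, $X_6 = X_4 + X_5 + \text{noise}$, rather than the exact density above: marginally $X_4$ and $X_5$ are uncorrelated, but among the samples where $X_6$ lands near a fixed value they become strongly coupled.

In [ ]:
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# x4 and x5 are independent a priori; x6 depends on both (a head to head node).
x4 = rng.normal(0.0, 1.0, n)
x5 = rng.normal(0.0, 1.0, n)
x6 = x4 + x5 + rng.normal(0.0, 0.1, n)

# Marginally, x4 and x5 are essentially uncorrelated.
print(np.corrcoef(x4, x5)[0, 1])

# Condition on x6 by keeping only the samples where it lands near 1.
keep = np.abs(x6 - 1.0) < 0.05
print(np.corrcoef(x4[keep], x5[keep])[0, 1])  # strongly negative: x4 and x5 are now coupled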

Chain DAGs

If we have a graph we can draw as a chain, for example $p(x_1, x_2, x_3) = p(x_1)p(x_2| x_1) p(x_3|x_2)$, let's see what happens when we condition on $x_2$.

$p(x_1, x_3| x_2) = \frac{p(x_1)p(x_2| x_1) p(x_3|x_2)}{p(x_2)}$

We want to check whether this is equal to $p(x_1| x_2)p(x_3| x_2)$:

$p(x_1, x_3| x_2) = \frac{p(x_1)p(x_2| x_1) }{p(x_2)}p(x_3|x_2) = p(x_1| x_2)p(x_3| x_2)$

Again this makes sense intuitively; let's take normal distributions as an example.

$p(x_1)p(x_2| x_1) p(x_3|x_2) = \mathcal{N}(x_1| 0, 1)\mathcal{N}(x_2| x_1, 1)\mathcal{N}(x_3| x_2, 1)$

Knowing $x_2=3$ tells us about likely values of $x_1$: $x_1 = 2$ is a reasonable value given $x_2$, but $x_1 = -2$ is probably not.

We can do the same with $x_3$. But notice that in this example we are talking about the effect of $x_2$ on each node separately, not on joint configurations like before. This is intuitively what conditional independence means.

So, from this example, we see that conditioning on a head to tail node blocks the path and gives conditional independence.
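We can check this numerically too. Below is a quick simulation sketch of the Gaussian chain above: marginally $x_1$ and $x_3$ are correlated, but among the samples where $x_2$ is pinned near a value the correlation essentially disappears.

In [ ]:
import numpy as np

rng = np.random.default_rng(1)
n = 500_000

# The Gaussian chain x1 -> x2 -> x3 from above.
x1 = rng.normal(0.0, 1.0, n)
x2 = rng.normal(x1, 1.0)
x3 = rng.normal(x2, 1.0)

# Marginally, x1 and x3 are correlated (information flows along the chain).
print(np.corrcoef(x1, x3)[0, 1])

# Condition on x2 by keeping only the samples where it lands near 3.
keep = np.abs(x2 - 3.0) < 0.05
print(np.corrcoef(x1[keep], x3[keep])[0, 1])  # close to zero: x1 and x3 decouple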

General D-Separation

Now that we have the intuition, we can state the general D-separation algorithm for a DAG.

We look for conditional independence between two nodes, a and b, conditioning on a set of nodes C.

The general plan is to consider every path between a and b (we can travel along arrows head to tail or tail to head; the arrow direction doesn't matter when listing paths), and check the conditions below at each node on the path.

We say a and b are conditionally independent given C if every path between a and b is blocked.

A path is blocked if there is a node on that path such that either:

1) the node is in the set C, and the arrows on the path meet tail to tail or head to tail at the node, or

2) the arrows on the path meet head to head at the node, and neither the node nor any of its descendants is in C.

You can go back and verify this on the simple examples to check you understand the conditions.
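To make the algorithm concrete, here is a minimal sketch of the path-based check described above, written with networkx (the function name is_d_separated and the node labels are my own, not a library API; recent networkx releases also ship their own d-separation routines). It enumerates every path between the two nodes and applies the two blocking conditions at each intermediate node, and the simple examples from earlier come out as expected.

In [ ]:
import networkx as nx

def is_d_separated(G, a, b, conditioning_set):
    """Return True if nodes a and b are d-separated given `conditioning_set` in the DAG G."""
    C = set(conditioning_set)
    undirected = G.to_undirected()
    # a and b are d-separated iff every path between them is blocked.
    for path in nx.all_simple_paths(undirected, a, b):
        blocked = False
        # Walk the intermediate nodes of the path in (previous, node, next) triples.
        for prev_node, node, next_node in zip(path, path[1:], path[2:]):
            head_to_head = G.has_edge(prev_node, node) and G.has_edge(next_node, node)
            if head_to_head:
                # Condition 2: a head to head node blocks the path only if neither
                # it nor any of its descendants is in the conditioning set.
                if node not in C and not (nx.descendants(G, node) & C):
                    blocked = True
                    break
            else:
                # Condition 1: a tail to tail or head to tail node blocks the path
                # if it is in the conditioning set.
                if node in C:
                    blocked = True
                    break
        if not blocked:
            return False  # an unblocked path exists, so a and b are not d-separated
    return True

# The head to head example: x4 -> x6 <- x5
collider = nx.DiGraph([("x4", "x6"), ("x5", "x6")])
print(is_d_separated(collider, "x4", "x5", set()))   # True: independent with nothing observed
print(is_d_separated(collider, "x4", "x5", {"x6"}))  # False: conditioning on x6 couples them

# The chain example: x1 -> x2 -> x3
chain = nx.DiGraph([("x1", "x2"), ("x2", "x3")])
print(is_d_separated(chain, "x1", "x3", {"x2"}))     # True: conditioning on x2 blocks the path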

Back to the complex example

$p(x_1, x_2, x_3, x_4, x_5, x_6) = p(x_6|x_4, x_5)p(x_4| x_1, x_2) p(x_5| x_2, x_3)p(x_1)p(x_2)p(x_3)$

Given $x_6$, are $x_3$ and $x_4$ conditionally independent?

In [6]:
Image('images/DAG.jpg')
Out[6]:

There are two paths from $x_3$ to $x_4$:

1) $x_3, x_5, x_2, x_4$

2) $x_3, x_5, x_6, x_4$

Path 1

We have to check the conditions at the intermediate nodes $x_5$ and $x_2$.

On this path the arrows meet head to head at $x_5$. $x_5 \notin \{x_6\}$, but $x_5$ has a descendant, $x_6$, which is in the conditioning set, so condition 2 does not hold. $x_5$ is not in the conditioning set, so condition 1 does not apply either. Thus the path is not blocked at $x_5$.

Equally, $x_2$ does not block the path: the arrows meet tail to tail at $x_2$ (it is a parent of both $x_5$ and $x_4$), so only condition 1 could apply, and $x_2$ is not in the conditioning set.

So, we can stop here. There is at least one path that is not blocked, and so $x_3$ and $x_4$ are not conditionally independent given $x_6$!
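As a sanity check, the helper sketched after the algorithm statement gives the same answer on this graph (this cell re-uses the dag and is_d_separated objects defined in the earlier cells):

In [ ]:
# Re-using `dag` and `is_d_separated` from the earlier cells.
print(is_d_separated(dag, "x3", "x4", {"x6"}))  # False: at least one path is unblocked
print(is_d_separated(dag, "x3", "x4", set()))   # True: with nothing observed, x3 and x4 are independent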

Summary

This last example shows the power of DAGs. If we are doing inference, we condition on some part of our distribution, and then do operations on the rest of the graph. We can have a really complex graphical model, and D-separation lets us find general independence properties in our posterior - which can make computation much easier.

Doing the same thing with the product and sum rules and Bayes' theorem is of course possible. But in a large graph it can be incredibly difficult!