Today we are going to take a deep look at how to perform inference about systems where we suspect the data is jointly Gaussian.

Why do we care so much about Gaussian distributions, rather than any other continuous distribution? The normal distribution is used everywhere: modeling white noise in econometrics, log-returns of stock prices, and noisy observations in control. Gaussian random variables also have several nice properties. One reason is computational: the sum of two independent Gaussians is also Gaussian, conditioning and marginalizing a Gaussian yields a Gaussian, and so on. Another is that the Central Limit Theorem suggests the Gaussian is a sort of “attractor”: averages of independent random variables tend toward Gaussians.

**The Intuition Behind Multivariate Gaussian**

Most people have a clear intuition about the standard normal distribution. If we have $X \sim \mathcal{N}(\mu, \sigma^2)$, then random draws of $X$ will on average have value $\mu$, and $\sigma^2$ gives us a sense of how spread out the distribution is. The probability density function is, as everyone knows,

$$p(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right).$$

What does it mean to have a multivariate Gaussian? First, we can easily imagine having an $n$-dimensional vector $X = (X_1, \dots, X_n)$, where each $X_i$ is drawn from its own Gaussian independently, i.e. $X_i \sim \mathcal{N}(\mu_i, \sigma_i^2)$. Because each entry is independent, it’s clear that the probability density function for such a vector would be

$$p(x) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma_i^2}} \exp\left(-\frac{(x_i - \mu_i)^2}{2\sigma_i^2}\right).$$

We can simplify this by using matrix notation. In particular, let’s define

$$\mu = (\mu_1, \dots, \mu_n)^\top, \qquad \Sigma = \operatorname{diag}(\sigma_1^2, \dots, \sigma_n^2).$$

Then we can simplify the pdf for the multivariate Gaussian to be

$$p(x) = \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(x - \mu)^\top \Sigma^{-1} (x - \mu)\right).$$
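As a quick numerical sanity check (a sketch using NumPy and SciPy, with made-up values), the product of independent univariate densities should match the multivariate density with a diagonal covariance matrix:

```python
import numpy as np
from scipy.stats import norm, multivariate_normal

# Hypothetical example: three independent Gaussians.
mu = np.array([0.0, 1.0, -2.0])
sigma = np.array([1.0, 0.5, 2.0])
x = np.array([0.3, 0.8, -1.5])

# Product of the univariate densities...
p_product = np.prod(norm.pdf(x, loc=mu, scale=sigma))

# ...equals the multivariate density with diagonal covariance.
Sigma = np.diag(sigma**2)
p_joint = multivariate_normal.pdf(x, mean=mu, cov=Sigma)

print(np.isclose(p_product, p_joint))  # True
```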

What if the elements of an $n$-dimensional vector $Y$ are linear combinations of independent Gaussians? That is, what if the vector is $Y = (Y_1, \dots, Y_n)$ so that

$$Y_i = a_{i1} X_1 + a_{i2} X_2 + \dots + a_{in} X_n,$$

where the $X_j$ are drawn from independent univariate Gaussians $X_j \sim \mathcal{N}(\mu_j, \sigma_j^2)$. As the above notation hints at, let’s say that we have a matrix $A$ such that

$$Y = AX.$$

What is $p_Y(y)$? Well, this is just using a change of variables. From calculus we know that if $Y = f(X)$ and we know the pdf for $X$, then the pdf for $Y$ is given by

$$p_Y(y) = p_X(f^{-1}(y)) \left| \det \frac{\partial x}{\partial y} \right|.$$

We note that $A$ already encodes the information that the $\sigma_j$ do. To make computation cleaner, let’s assume that $\sigma_j = 1$ for every $j$ (in other words, that $\Sigma$ is the identity matrix). Applying the change of variables with $f(x) = Ax$, so that $f^{-1}(y) = A^{-1}y$ and $|\det A^{-1}| = |AA^\top|^{-1/2}$, we now have

$$p_Y(y) = \frac{1}{(2\pi)^{n/2} |AA^\top|^{1/2}} \exp\left(-\frac{1}{2}(y - A\mu)^\top (AA^\top)^{-1} (y - A\mu)\right).$$

What do we see here? By taking linear combinations of independent Gaussian random variables to obtain $Y = AX$, we have a new Gaussian parameterized by mean $A\mu$ and covariance matrix $AA^\top$; that is, $Y \sim \mathcal{N}(A\mu, AA^\top)$.
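We can confirm this empirically (a sketch with an arbitrary choice of $A$ and $\mu$): sampling $Y = AX$ many times, the sample mean and covariance should approach $A\mu$ and $AA^\top$.

```python
import numpy as np

rng = np.random.default_rng(0)

# X has independent N(mu_j, 1) entries (Sigma = I), and Y = A X.
n, num_samples = 3, 200_000
mu = np.array([1.0, 0.0, -1.0])
A = np.array([[1.0, 0.0, 0.0],
              [0.5, 1.0, 0.0],
              [0.2, 0.3, 1.0]])

X = mu + rng.standard_normal((num_samples, n))   # rows are draws of X
Y = X @ A.T                                      # Y = A X for each draw

# Empirical mean and covariance of Y approach A mu and A A^T.
print(np.allclose(Y.mean(axis=0), A @ mu, atol=0.02))
print(np.allclose(np.cov(Y.T), A @ A.T, atol=0.03))
```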

This is a tricky point that took me a while to understand: first, that covariance is a measure of linear dependence; second, that a multivariate Gaussian can be interpreted as linearly composing independent Gaussians.

**Two Different Forms**

We have now seen one form of the multivariate Gaussian. This is the **covariance form**

$$p(x) = \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(x - \mu)^\top \Sigma^{-1} (x - \mu)\right),$$

which we denote $X \sim \mathcal{N}(\mu, \Sigma)$.

We can do some algebra to obtain another form. Expanding the quadratic and absorbing everything that does not depend on $x$ into the normalizing constant:

$$p(x) \propto \exp\left(-\frac{1}{2} x^\top \Sigma^{-1} x + x^\top \Sigma^{-1} \mu\right).$$

Now define $J = \Sigma^{-1}$ and $h = \Sigma^{-1}\mu$. Then we can re-write the above as

$$p(x) \propto \exp\left(-\frac{1}{2} x^\top J x + h^\top x\right).$$

This is called the *information form* of a Gaussian. The matrix $J$ is known as the information (or precision) matrix, while $h$ is known as the potential vector. We will refer to a Gaussian parameterized in this form as being $X \sim \mathcal{N}^{-1}(h, J)$.

Why do we need to switch forms for Gaussians? It turns out that each form has advantages and disadvantages for different tasks.

In what follows, partition $X = (X_1, X_2)$ with

$$\mu = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}, \qquad \Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}.$$

The covariance form is:

- Easy to recover the mean and covariance. This comes directly from the parameterization.
- Easy to marginalize. Through some lengthy computation we can show that $X_1 \sim \mathcal{N}(\mu_1, \Sigma_{11})$.
- Hard to perform conditioning. We can show that $X_1 \mid X_2 = x_2 \sim \mathcal{N}\left(\mu_1 + \Sigma_{12}\Sigma_{22}^{-1}(x_2 - \mu_2),\; \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}\right)$.
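The covariance-form operations can be sketched in a few lines of NumPy (the numbers here are hypothetical, and `condition_cov_form` is just an illustrative helper name):

```python
import numpy as np

def condition_cov_form(mu, Sigma, idx_keep, idx_obs, x_obs):
    """Condition N(mu, Sigma) on X[idx_obs] = x_obs (covariance form)."""
    m1, m2 = mu[idx_keep], mu[idx_obs]
    S11 = Sigma[np.ix_(idx_keep, idx_keep)]
    S12 = Sigma[np.ix_(idx_keep, idx_obs)]
    S22 = Sigma[np.ix_(idx_obs, idx_obs)]
    mu_cond = m1 + S12 @ np.linalg.solve(S22, x_obs - m2)
    Sigma_cond = S11 - S12 @ np.linalg.solve(S22, S12.T)
    return mu_cond, Sigma_cond

# Hypothetical 3-dimensional example.
mu = np.array([0.0, 1.0, 2.0])
Sigma = np.array([[2.0, 0.6, 0.3],
                  [0.6, 1.0, 0.2],
                  [0.3, 0.2, 1.5]])

# Marginalizing is just selecting the relevant blocks:
mu_marg, Sigma_marg = mu[:2], Sigma[:2, :2]

# Conditioning requires solving against Sigma_22:
mu_c, Sigma_c = condition_cov_form(mu, Sigma, [0, 1], [2], np.array([2.5]))
print(mu_c, Sigma_c)
```

Note how marginalization is free (read off the blocks) while conditioning needs a linear solve, matching the list above.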

On the other hand, partitioning $J$ and $h$ into blocks in the same way, the information form is:

- Easy to perform conditioning. Indeed $X_1 \mid X_2 = x_2 \sim \mathcal{N}^{-1}(h_1 - J_{12} x_2,\; J_{11})$. Intuitively we see that conditioning “shifts” the potential by the cross-terms between $X_1$ and $X_2$.
- Hard to marginalize. We have $X_1 \sim \mathcal{N}^{-1}(h_1 - J_{12} J_{22}^{-1} h_2,\; J_{11} - J_{12} J_{22}^{-1} J_{21})$.
- Easy to understand in terms of an undirected graphical model. We will see this shortly.
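A quick numerical check (a sketch with hypothetical values) that information-form conditioning agrees with the covariance-form formula:

```python
import numpy as np

# Hypothetical 3-d Gaussian; J = Sigma^{-1}, h = Sigma^{-1} mu.
mu = np.array([0.0, 1.0, 2.0])
Sigma = np.array([[2.0, 0.6, 0.3],
                  [0.6, 1.0, 0.2],
                  [0.3, 0.2, 1.5]])
J = np.linalg.inv(Sigma)
h = J @ mu

# Condition on x3 = 2.5: in information form this is just a shift,
# h' = h1 - J12 x3, J' = J11 -- no matrix inversion needed.
x3 = np.array([2.5])
h_cond = h[:2] - J[:2, 2:] @ x3
J_cond = J[:2, :2]

# Convert back to covariance form and compare with the covariance-form
# conditioning formula mu1 + S12 S22^{-1}(x3 - mu2).
Sigma_cond = np.linalg.inv(J_cond)
mu_cond = Sigma_cond @ h_cond
mu_expected = mu[:2] + Sigma[:2, 2:] @ np.linalg.solve(Sigma[2:, 2:], x3 - mu[2:])
print(np.allclose(mu_cond, mu_expected))  # True
```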

How do we convert between covariance and information form? We note that $J = \Sigma^{-1}$ and $h = \Sigma^{-1}\mu$, so equivalently $\Sigma = J^{-1}$ and $\mu = J^{-1}h$. So if we knew the covariance form, we could go to the information form by performing a matrix inversion, and vice versa, but this seems expensive. Thankfully we can use tools like Schur complements and the Matrix Inversion Lemma to make this easier.

That is, for these inference problems, there is a clean way to obtain the block inverses we need without inverting the full matrix.
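For instance, the top-left block of a block matrix’s inverse is the inverse of a Schur complement; a small numerical check (with a random positive-definite matrix) illustrates this:

```python
import numpy as np

rng = np.random.default_rng(1)

# Build a random symmetric positive-definite matrix and partition it.
M = rng.standard_normal((5, 5))
S = M @ M.T + 5 * np.eye(5)
k = 2
A, B = S[:k, :k], S[:k, k:]
C, D = S[k:, :k], S[k:, k:]

# Schur complement of D in S.
schur = A - B @ np.linalg.solve(D, C)

# Block-inverse identity: the top-left block of S^{-1} equals the
# inverse of the Schur complement -- no need to invert all of S.
S_inv = np.linalg.inv(S)
print(np.allclose(S_inv[:k, :k], np.linalg.inv(schur)))  # True
```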

Converting between the two forms seems desirable. Conditioning, which is very important for inference, is quite easy in the information form. But if we ever want a point estimate, then we want to know the mean, so we need to switch to covariance form. Techniques such as Kalman filtering and Gaussian belief propagation do just this.

**Gaussian Graphical Models**

It turns out that in the information form, a Gaussian random vector has a very natural interpretation as an undirected graphical model: the information matrix encodes a minimal undirected I-map for $X$. We can see this quite naturally if we consider that

$$p(x) \propto \exp\left(-\frac{1}{2} x^\top J x + h^\top x\right) = \prod_{i} \exp\left(-\frac{1}{2} J_{ii} x_i^2 + h_i x_i\right) \prod_{i < j} \exp\left(-J_{ij} x_i x_j\right)$$

and let

$$\phi_i(x_i) = \exp\left(-\frac{1}{2} J_{ii} x_i^2 + h_i x_i\right), \qquad \phi_{ij}(x_i, x_j) = \exp\left(-J_{ij} x_i x_j\right).$$

Thus, the information form very naturally yields pairwise clique potentials: we draw an edge between nodes $i$ and $j$ whenever $J_{ij} \neq 0$. We also have the following theorem about the pairwise conditional independencies.

Theorem 1: For $X \sim \mathcal{N}^{-1}(h, J)$, $X_i \perp X_j \mid X_{\text{rest}}$ if and only if $J_{ij} = 0$.

Proof: We saw earlier that the only factor coupling $x_i$ and $x_j$ is the pairwise potential $\exp(-J_{ij} x_i x_j)$. Thus, if $J_{ij} = 0$, the density factorizes into a term involving $x_i$ and a term involving $x_j$ once the remaining coordinates are fixed, so the two are conditionally independent given the rest of the vertices. Conversely, if $J_{ij} \neq 0$, this cross-term does not vanish and no such factorization exists.

Thus, the information matrix encodes the conditional independence structure as well.
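A small example makes this concrete (a sketch with an arbitrary chain-structured precision matrix): in a chain $X_1 - X_2 - X_3$, the endpoints are marginally correlated but conditionally independent given the middle node.

```python
import numpy as np

# A 3-node chain X1 -- X2 -- X3: tridiagonal information matrix, so
# J[0, 2] = 0 encodes X1 independent of X3 given X2.
J = np.array([[ 2.0, -0.8,  0.0],
              [-0.8,  2.0, -0.8],
              [ 0.0, -0.8,  2.0]])
Sigma = np.linalg.inv(J)

# Marginally, X1 and X3 are correlated...
print(abs(Sigma[0, 2]) > 1e-6)  # True

# ...but conditioning on X2 decouples them: by information-form
# conditioning, the conditional precision of (X1, X3) given X2 is J
# restricted to indices {0, 2}, which is diagonal.
J_sub = J[np.ix_([0, 2], [0, 2])]
Sigma_cond = np.linalg.inv(J_sub)
print(np.isclose(Sigma_cond[0, 1], 0.0))  # True
```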

**Application: Gauss Markov Process**

Suppose we have a time series $x_0, x_1, \dots, x_T$ where $x_0 \sim \mathcal{N}(0, \Sigma_0)$ and

$$x_{t+1} = A x_t + w_t,$$

where $w_t \sim \mathcal{N}(0, Q)$. This is a linear dynamical system where the next point is a linear transformation of the previous point, plus some Gaussian noise. Now what if we don’t observe $x_t$ but instead observe

$$y_t = C x_t + v_t,$$

where $v_t \sim \mathcal{N}(0, R)$? This is the *hidden Gauss Markov process*, which is at the heart of several financial time series models and other applications.
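A minimal one-dimensional simulation of this model (all parameter values here are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)

# Hidden Gauss-Markov simulation with illustrative scalar parameters.
T = 100
A, C = 0.9, 1.0          # state transition and observation coefficients
Q, R = 0.1, 0.5          # process and observation noise variances

x = np.zeros(T)          # hidden states
y = np.zeros(T)          # observations
x[0] = rng.normal(0.0, 1.0)
y[0] = C * x[0] + rng.normal(0.0, np.sqrt(R))
for t in range(T - 1):
    x[t + 1] = A * x[t] + rng.normal(0.0, np.sqrt(Q))      # dynamics
    y[t + 1] = C * x[t + 1] + rng.normal(0.0, np.sqrt(R))  # measurement

# We get to see y; recovering x from it is the filtering problem.
print(x.shape, y.shape)
```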

Are we able to recover the hidden $x_t$ given the observations $y_t$? This is a direct application of Kalman filtering, which we shall see later.

**Conclusion**

The multivariate Gaussian is a commonly used model for continuous data. We showed how the covariance matrix naturally arises from the construction of a Gaussian vector as a linear transformation of independent Gaussian random variables. We then introduced the *information form* of the multivariate Gaussian and showed how it makes conditioning easy and has an intuitive representation as an undirected graphical model. Finally, we introduced how the multivariate Gaussian can be used to model a linear dynamical system. In the future we will discuss Kalman filtering as a way to perform inference for such systems.