# Learning Gaussian Process Covariances

A Gaussian process is a non-parametric model which can represent a complex function using a growing set of data. Unlike a neural network, which can also learn a complex functions, a Gaussian process can also provide variance (uncertainty) of a data since the model is based on a simple Gaussian distribution.

However, like many machine learning models, we have to define a set of functions to define a Gaussian process. In a Gaussian process, the function of uttermost importance is a covariance function. It is common to use a predetermined function with fixed constants for a covariance, but it is more pragmatic to learn a function rather than search high dimensional space using sampling based methods to find the best set of parameters for the covariance function.

In this post, I summarize a simple gradient based method and a scalable version of learning a covariance function.

## Brief Summary of a Gaussian Process

In a Gaussian process, we assume that all observations are sample from a Gaussian distribution and any subset of the random variables (observations or predictions at a new data point) will follow a Gaussian distribution with specific mean $m(\mathbf{x})$ and covariance $K(X, X)$.

Let $\mathcal{D} = { (\mathbf{x}_i, y_i) }_i^n$ be a dataset of of input $\mathbf{x}_i$ and corresponding output $y_i$. We assume that the observation is noisy and the noise free output at value $\mathbf{x}$ is $\mathbf{f}(\mathbf{x})$, i.e., $\mathbf{y}(\mathbf{x}) = \mathbf{f}(\mathbf{x}) + \epsilon$.

Given the Gaussian process assumption, all subsets follow a Gaussian distribution and thus, the entire dataset can be represented using a single Gaussian distribution.

$$ \mathbf{f} | X, \mathbf{y} \sim \mathcal{N}(\mathbf{\bar{f}}, K(X, X)) $$Please refer to the previous post about a Gaussian process for details.

## Learning the Covariance $K(X, X)$

In many cases, the covariance function $K(X, X)$ is predefined as a simple function such as a squared exponential. There are many variants of the function, but in its simplest form, the squared exponential function contains at least two hyper-parameters, $c$ and $\sigma$

$$ k(\mathbf{x}_1, \mathbf{x}_2) = c \exp \left( \frac{|\mathbf{x}_1 - \mathbf{x}_2|^2}{\sigma^2} \right). $$We can use simple grid search or MCMC to find the optimal hyper-parameters. However, as the function gets more complex, finding optimal hyper-parameters can become a daunting task pretty quickly as the dimension gets larger.

Instead, we can use a simple gradient descent based method with multiple initializations to find the optimal hyper-parameters.

### Gradient of the Posterior Probability

To take gradient steps w.r.t. hyper-parameters, we need to compute the gradients w.r.t. hyper-parameters. Let all the hyper-parameters in a covariance function as $\theta$.

$$ \begin{align} \log p(\mathbf{y}| X, \mathbf{\bar{f}}; \theta) & = - \frac{1}{2} \log |K| - \frac{1}{2} (\mathbf{y} - \mathbf{\bar{f}})^T K^{-1} (\mathbf{y} - \mathbf{\bar{f}}) + c\\ \nabla_{\theta_i} \log p(\mathbf{y}| X, \mathbf{\bar{f}}; \theta) & = - \frac{1}{2} \mathrm{Tr} \left( K^{-1} \frac{\partial K}{\partial \theta_i} \right) - \frac{1}{2} (\mathbf{y} - \mathbf{\bar{f}})^T K^{-1} \frac{\partial K}{\partial \theta_i} K^{-1} (\mathbf{y} - \mathbf{\bar{f}}) \end{align} $$Given the gradient, we can use a gradient based optimizer of our choice to learn the hyper-parameters (or simply parameters) of a Gaussian process.

## Scalability of the Gradient

In the previous section, we assumed that we can compute the gradient exactly. However, if the dimension of the vector $y$, $n$ increases, it might not be possible to compute the above gradient in a reasonable time and cost. Let’s analyze the computational complexity of each term.

First, note that $K^{-1}y$ requires solving a linear system which takes $O(n^3)$ complexity if we use a decomposition based method or $O(\sqrt{\kappa} n^2)$ if we use an iterative method like Conjugate Gradient, where $\kappa$ is the condition number of $K$.

Now, we can compute the complexity of each term. The first term, $K^{-1} \frac{\partial K}{\partial \theta_i}$, can take $O(\sqrt{\kappa} n^3)$ if we use iterative method or $O(n^3)$ if we can cache decomposition. The second term would only take $O(\sqrt{\kappa} n^2)$ as solving the linear system takes the most time.

## Sampling the Gradient

As the dimension of the problem gets larger, it would be impractical to solve the system using a matrix decomposition and we need to resort to an approximate method. The paper by Filippone and Engler ^{1} propose to sample unbiased gradient using i.i.d. $N_s$ vectors. For example, let $r^j$ be the $j$th element of the vector $\mathbf{r}$. If we set $r^j \in {-1, 1}$ with equal probability, $\mathbb{E}(\mathbf{r}\mathbf{r}^T) = I$ and

We can solve $K^{-1}\mathbf{r}$ easily using Conjugate Gradient and thus, the complexity of the above equation is $O(\sqrt{\kappa}n^2 N_s)$ where $N_s$ is the number of samples. Finally, the gradient becomes

$$ \nabla_{\theta_i} \log p(\mathbf{y}| X, \mathbf{\bar{f}}; \theta) \approx - \frac{1}{2N} \sum_i^N \mathbf{r}_i^T K^{-1} \frac{\partial K}{\partial \theta_i} \mathbf{r}_i - \frac{1}{2} (\mathbf{y} - \mathbf{\bar{f}})^T K^{-1} \frac{\partial K}{\partial \theta_i} K^{-1} (\mathbf{y} - \mathbf{\bar{f}}) $$## Conclusion

In this post, we covered how to train a covariance function in a Gaussian process using gradient based methods. As the method is not very scalable, we also discussed how to use random samples to approximate the gradient.

## References

M. Filippone and R. Engler, Enabling scalable stochastic gradient-based inference for Gaussian processes by employing the Unbiased LInear System SolvEr (ULISSE), ICML’15 ↩

## Leave a Comment