Learning Gaussian Process Covariances

A Gaussian process is a non-parametric model that can represent a complex function using a growing set of data. Unlike a neural network, which can also learn complex functions, a Gaussian process additionally provides the variance (uncertainty) of a prediction, since the model is based on a simple Gaussian distribution.

However, like many machine learning models, a Gaussian process requires us to define a few functions up front. The function of utmost importance is the covariance function. It is common to use a predetermined covariance function with fixed constants, but it is more pragmatic to learn the function than to search a high-dimensional space with sampling-based methods for the best set of covariance parameters.

In this post, I summarize a simple gradient-based method for learning a covariance function, along with a scalable variant.

Brief Summary of a Gaussian Process

In a Gaussian process, we assume that all observations are sampled from a Gaussian distribution, and that any subset of the random variables (observations or predictions at new data points) follows a Gaussian distribution with a specific mean $m(\mathbf{x})$ and covariance $K(X, X)$.

Let $\mathcal{D} = \{ (\mathbf{x}_i, y_i) \}_{i=1}^n$ be a dataset of inputs $\mathbf{x}_i$ and corresponding outputs $y_i$. We assume that the observations are noisy and that the noise-free output at $\mathbf{x}$ is $\mathbf{f}(\mathbf{x})$, i.e., $\mathbf{y}(\mathbf{x}) = \mathbf{f}(\mathbf{x}) + \epsilon$.

Given the Gaussian process assumption, every subset of observations follows a Gaussian distribution, and thus the entire dataset can be represented using a single Gaussian distribution:

$$ \mathbf{y} | X \sim \mathcal{N}(\mathbf{\bar{f}}, K(X, X)) $$

Please refer to the previous post about a Gaussian process for details.

Learning the Covariance $K(X, X)$

In many cases, the covariance function $K(X, X)$ is predefined as a simple function such as the squared exponential. There are many variants of this function, but in its simplest form, the squared exponential contains at least two hyper-parameters, $c$ and $\sigma$:

$$ k(\mathbf{x}_1, \mathbf{x}_2) = c \exp \left( -\frac{\|\mathbf{x}_1 - \mathbf{x}_2\|^2}{\sigma^2} \right). $$
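As a concrete reference, here is a minimal NumPy sketch of this kernel; the function name and the vectorized distance computation are my own choices, not tied to any particular library.

```python
import numpy as np

def squared_exponential(X1, X2, c, sigma):
    """Squared exponential covariance between rows of X1 (n1, d) and X2 (n2, d)."""
    # Pairwise squared Euclidean distances without explicit Python loops.
    sq_dists = (np.sum(X1 ** 2, axis=1)[:, None]
                + np.sum(X2 ** 2, axis=1)[None, :]
                - 2.0 * X1 @ X2.T)
    return c * np.exp(-sq_dists / sigma ** 2)
```

For example, `K = squared_exponential(X, X, c=1.0, sigma=0.5)` builds the full covariance matrix $K(X, X)$ for a data matrix `X` of shape $(n, d)$.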

We can use simple grid search or MCMC to find the optimal hyper-parameters. However, as the function gets more complex, finding optimal hyper-parameters can become a daunting task pretty quickly as the dimension gets larger.

Instead, we can use a simple gradient-descent method with multiple initializations to find the optimal hyper-parameters.

Gradient of the Posterior Probability

To take gradient steps, we need the gradient of the log marginal likelihood with respect to the hyper-parameters. Let $\theta$ denote all the hyper-parameters of the covariance function.

$$ \begin{align} \log p(\mathbf{y}| X, \mathbf{\bar{f}}; \theta) & = - \frac{1}{2} \log |K| - \frac{1}{2} (\mathbf{y} - \mathbf{\bar{f}})^T K^{-1} (\mathbf{y} - \mathbf{\bar{f}}) + c\\ \nabla_{\theta_i} \log p(\mathbf{y}| X, \mathbf{\bar{f}}; \theta) & = - \frac{1}{2} \mathrm{Tr} \left( K^{-1} \frac{\partial K}{\partial \theta_i} \right) + \frac{1}{2} (\mathbf{y} - \mathbf{\bar{f}})^T K^{-1} \frac{\partial K}{\partial \theta_i} K^{-1} (\mathbf{y} - \mathbf{\bar{f}}) \end{align} $$
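Below is a small sketch of the exact computation using a Cholesky factorization; the helper name `log_marginal_likelihood_and_grad` and its arguments are illustrative, and `dK_dtheta` is assumed to be the precomputed element-wise derivative $\partial K / \partial \theta_i$.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def log_marginal_likelihood_and_grad(K, dK_dtheta, y, f_bar):
    """Exact log p(y | X; theta) and its gradient w.r.t. one hyper-parameter."""
    n = y.shape[0]
    resid = y - f_bar
    L, lower = cho_factor(K, lower=True)         # O(n^3), reused for every solve below
    alpha = cho_solve((L, lower), resid)         # alpha = K^{-1} (y - f_bar)

    # log|K| = 2 * sum(log(diag(L))) for the Cholesky factor L.
    log_lik = (-np.sum(np.log(np.diag(L)))
               - 0.5 * resid @ alpha
               - 0.5 * n * np.log(2.0 * np.pi))

    K_inv_dK = cho_solve((L, lower), dK_dtheta)  # K^{-1} dK/dtheta, n triangular solves
    grad = -0.5 * np.trace(K_inv_dK) + 0.5 * alpha @ dK_dtheta @ alpha
    return log_lik, grad
```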

Given the gradient, we can use a gradient-based optimizer of our choice to learn the hyper-parameters (or simply parameters) of a Gaussian process.
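As a rough usage sketch (not the method of any specific library), a plain gradient-ascent loop could look like the following, where `build_K_and_grads` is a hypothetical helper that returns $K(\theta)$ and the list of $\partial K / \partial \theta_i$ matrices for the current hyper-parameters:

```python
# Hypothetical training loop; X, y, f_bar and build_K_and_grads are assumed
# to be defined elsewhere. In practice one would run several random restarts.
theta = np.array([1.0, 1.0])                  # e.g. [c, sigma]
learning_rate = 1e-2
for step in range(200):
    K, dK_list = build_K_and_grads(theta, X)
    grads = np.array([log_marginal_likelihood_and_grad(K, dK, y, f_bar)[1]
                      for dK in dK_list])
    theta += learning_rate * grads            # ascend the log marginal likelihood
```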

Scalability of the Gradient

In the previous section, we assumed that we can compute the gradient exactly. However, as the dimension $n$ of the vector $\mathbf{y}$ grows, it might not be possible to compute the above gradient at a reasonable time and cost. Let's analyze the computational complexity of each term.

First, note that computing $K^{-1}\mathbf{y}$ requires solving a linear system, which takes $O(n^3)$ time with a decomposition-based method or $O(\sqrt{\kappa} n^2)$ with an iterative method like Conjugate Gradient, where $\kappa$ is the condition number of $K$.
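To make the iterative route concrete, here is a minimal sketch using SciPy's conjugate gradient solver; wrapping $K$ in a `LinearOperator` emphasizes that each iteration only needs an $O(n^2)$ matrix-vector product (the wrapper name `cg_solve` is mine).

```python
import numpy as np
from scipy.sparse.linalg import cg, LinearOperator

def cg_solve(K, y):
    """Solve K x = y iteratively with conjugate gradient."""
    n = K.shape[0]
    op = LinearOperator((n, n), matvec=lambda v: K @ v)  # only mat-vecs are required
    x, info = cg(op, y)
    if info != 0:
        raise RuntimeError(f"CG did not converge (info={info})")
    return x
```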

Now we can work out the complexity of each term. The first term requires $K^{-1} \frac{\partial K}{\partial \theta_i}$, which takes $O(\sqrt{\kappa} n^3)$ with an iterative method (one solve per column of $\frac{\partial K}{\partial \theta_i}$) or $O(n^3)$ if we can cache the decomposition. The second term only takes $O(\sqrt{\kappa} n^2)$, since solving the single linear system dominates its cost.

Sampling the Gradient

As the dimension of the problem gets larger, it becomes impractical to solve the system with a matrix decomposition, and we need to resort to an approximate method. The paper by Filippone and Engler [1] proposes to compute an unbiased estimate of the gradient using $N_s$ i.i.d. random vectors. For example, let $r^j$ be the $j$th element of a vector $\mathbf{r}$. If we set $r^j \in \{-1, 1\}$ with equal probability, then $\mathbb{E}(\mathbf{r}\mathbf{r}^T) = I$ and

$$ \begin{align} \mathrm{Tr}\left(K^{-1}\frac{\partial K}{\partial \theta_i}\right) & =\mathrm{Tr}\left( K^{-1}\frac{\partial K}{\partial \theta_i} \mathbb{E} \left[ \mathbf{r} \mathbf{r}^T \right] \right) \\ & = \mathbb{E} \left[ \mathbf{r}^T K^{-1}\frac{\partial K}{\partial \theta_i} \mathbf{r} \right] \end{align} $$

We can solve $K^{-1}\mathbf{r}$ easily using Conjugate Gradient, so the complexity of the above estimator is $O(\sqrt{\kappa}n^2 N_s)$, where $N_s$ is the number of samples. Finally, the gradient becomes

$$ \nabla_{\theta_i} \log p(\mathbf{y}| X, \mathbf{\bar{f}}; \theta) \approx - \frac{1}{2N_s} \sum_{j=1}^{N_s} \mathbf{r}_j^T K^{-1} \frac{\partial K}{\partial \theta_i} \mathbf{r}_j + \frac{1}{2} (\mathbf{y} - \mathbf{\bar{f}})^T K^{-1} \frac{\partial K}{\partial \theta_i} K^{-1} (\mathbf{y} - \mathbf{\bar{f}}) $$
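A minimal sketch of the stochastic trace term, assuming the dense derivative matrix `dK_dtheta` is available and reusing conjugate gradient for the solves (the function name and default probe count are my own choices):

```python
import numpy as np
from scipy.sparse.linalg import cg, LinearOperator

def stochastic_trace(K, dK_dtheta, num_probes=20, seed=0):
    """Monte Carlo estimate of Tr(K^{-1} dK/dtheta) with Rademacher probe vectors."""
    rng = np.random.default_rng(seed)
    n = K.shape[0]
    op = LinearOperator((n, n), matvec=lambda v: K @ v)
    estimate = 0.0
    for _ in range(num_probes):
        r = rng.choice([-1.0, 1.0], size=n)    # entries are +-1, so E[r r^T] = I
        K_inv_r, info = cg(op, r)              # K^{-1} r via conjugate gradient
        estimate += K_inv_r @ (dK_dtheta @ r)  # r^T K^{-1} (dK/dtheta) r
    return estimate / num_probes
```

Note that the paper replaces the exact solve with its unbiased linear system solver (ULISSE); plain conjugate gradient is used here, following the text above, only to keep the sketch self-contained.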

Conclusion

In this post, we covered how to train the covariance function of a Gaussian process using gradient-based methods. Since the exact gradient does not scale well, we also discussed how to use random samples to approximate it.

References

  1. M. Filippone and R. Engler, Enabling scalable stochastic gradient-based inference for Gaussian processes by employing the Unbiased LInear System SolvEr (ULISSE), ICML’15 
