Gentle Introduction to Gaussian Process Regression
Parametric regression fits the data with a predefined functional form (i.e., we make an assumption about how the data are distributed by modeling them as linear, quadratic, etc.).
However, this approach fails as the number of dimensions of the data grows and as its distribution gets more complex.
Instead of coming up with complex parametric functions, we can simply let the data speak for themselves. We let each datum be a random variable correlated with all the other data through a predefined correlation (covariance) function. In this way, any finite set of these random variables has a joint Gaussian distribution.
A formal definition of a Gaussian process is: a Gaussian process is a collection of random variables, any finite number of which have a joint Gaussian distribution.
In other words, a Gaussian process is completely specified by its mean function $m(x)$ and covariance function $k(x, x')$. Since the function values at the training points and at the query points are jointly Gaussian, we can write
$$ \left[ \begin{array}{c} f \\ f_* \end{array} \right] \sim \mathcal{N}\left(0, \left[ \begin{array}{cc} K(X, X) & K(X, X_*) \\ K(X_*, X) & K(X_*, X_*) \end{array} \right] \right) $$
Here, $X$ is the set of training inputs and $X_*$ is the set of query points; $f$ and $f_*$ are the function values at $X$ and $X_*$ respectively.
Then we can write the distribution of the function values at the query points (i.e., the predictions) in terms of $X_*$, $X$, $K(\cdot, \cdot)$, and $f$.
\[f_* | X_*, X, f \sim \mathcal{N}(K(X_*, X) K(X, X)^{-1} f, \; K(X_*, X_*) - K(X_*, X) K(X, X)^{-1} K(X, X_*))\]
Thus, the predictions form another joint Gaussian distribution with mean $\bar{f}_*$ and covariance $cov(f_*)$:
\[f_* | X_*, X, f \sim \mathcal{N}(\bar{f}_*, cov(f_*))\]
Covariance Functions
One common covariance function is $k(x, x_*) = \exp( -\frac{1}{2} | x - x_* |^2)$, the squared exponential (SE) function. The function is stationary, $k(x, x_*) = k(x + \tau, x_* + \tau)$, and isotropic (spherical). Since this is clearly a limited family of functions, more complex covariance functions can be used as well. For instance, Bessel-function-based kernels, the Matérn function, or even a neural network can be used as the covariance function of a Gaussian process.
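As a concrete reference, here is a minimal NumPy sketch of the SE kernel above; the function name and the `length_scale` parameter are my own additions (the formula in the text fixes the length scale to 1).

```python
import numpy as np

def squared_exponential(X1, X2, length_scale=1.0):
    """SE covariance: k(x, x') = exp(-||x - x'||^2 / (2 * length_scale^2))."""
    # Pairwise squared Euclidean distances between the rows of X1 and X2.
    sq_dists = (np.sum(X1**2, axis=1)[:, None]
                + np.sum(X2**2, axis=1)[None, :]
                - 2.0 * X1 @ X2.T)
    return np.exp(-0.5 * sq_dists / length_scale**2)
```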
Prediction using Noisy Observations
When the observations contain Gaussian noise $\epsilon \sim \mathcal{N}(0, \sigma_n^2)$, each observation is $y = f(x) + \epsilon$.
In this case, we can include the uncertainty in the prediction as follows.
$$ \begin{align} \bar{f}_* & = K(X_*, X) [K(X, X) + \sigma_n^2 \mathbf{I}]^{-1} y\\ cov(f_*) & = K(X_*, X_*) - K(X_*, X) [K(X, X) + \sigma_n^2 \mathbf{I}]^{-1} K(X, X_*) \end{align} $$
Decision Theory for Regression
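These equations translate almost directly into code. Below is a minimal sketch, assuming a kernel function like the SE kernel sketched earlier; `gp_predict` and `noise_var` are illustrative names, not part of any standard API.

```python
import numpy as np

def gp_predict(X_train, y_train, X_test, kernel, noise_var=0.1):
    """Posterior mean and covariance of a GP with Gaussian observation noise.

    mean = K(X*, X) [K(X, X) + sigma_n^2 I]^{-1} y
    cov  = K(X*, X*) - K(X*, X) [K(X, X) + sigma_n^2 I]^{-1} K(X, X*)
    """
    K = kernel(X_train, X_train) + noise_var * np.eye(len(X_train))
    K_s = kernel(X_test, X_train)      # K(X*, X)
    K_ss = kernel(X_test, X_test)      # K(X*, X*)

    # Cholesky factorization is more stable than forming the inverse explicitly.
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))  # K^{-1} y
    mean = K_s @ alpha
    v = np.linalg.solve(L, K_s.T)
    cov = K_ss - v.T @ v
    return mean, cov
```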
In cases where we want to choose a single optimal point prediction $y_{guess}$ with respect to a given loss function, rather than reporting the full predictive distribution, we can minimize the expected loss under the model as follows.
\[\tilde{R}_\mathcal{L}(y_{guess} | x_*) = \int \mathcal{L}(y_*, y_{guess}) p (y_* | x_*, \mathcal{D}) dy_*\]
Here the predictive distribution $p(y_* | x_*, \mathcal{D})$ is supplied by the Gaussian process; when a GP is used for regression or for such decisions, we call it Gaussian Process Regression. For squared loss, for example, the optimal point prediction is simply the predictive mean $\bar{f}_*$.
We can define the Expected Improvement $a_{EI}(x_{N+1} | \mathcal{D}_N)$ as
\[\int_{\hat{f}}^\infty (f - \hat{f}) \, p(f | x_{N+1}, \mathcal{D}_N) \, df\]
where $\hat{f} = \max_{1 \le i \le N} f_i$ is the best observation so far.
Since this is a Gaussian process, the predictive distribution $p(f | x_{N+1}, \mathcal{D}_N)$ is also Gaussian, so the expected improvement can be evaluated in closed form.
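Concretely, the integral reduces to a well-known closed form in terms of the standard normal pdf and cdf. Here is a minimal sketch; the function name and the small variance guard are my own additions.

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, f_best):
    """Closed-form expected improvement for a Gaussian predictive distribution.

    mu, sigma : posterior mean and standard deviation at the candidate point
    f_best    : best observation so far (the f-hat in the text)
    """
    sigma = np.maximum(sigma, 1e-12)   # guard against zero predictive variance
    z = (mu - f_best) / sigma
    return (mu - f_best) * norm.cdf(z) + sigma * norm.pdf(z)
```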
Example Code
In this section, I'll provide simple demo code and results.
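Below is a minimal self-contained sketch of such a demo, assuming noisy observations of a sine function, an SE kernel with unit length scale, and the noisy-prediction equations from above; the target function, hyperparameters, and variable names are illustrative choices.

```python
import numpy as np
import matplotlib.pyplot as plt

# Toy data: a handful of noisy observations of sin(x).
rng = np.random.default_rng(0)
X = rng.uniform(-4, 4, size=(8, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(8)
X_star = np.linspace(-5, 5, 200)[:, None]   # query points

def se_kernel(A, B, ell=1.0):
    """Squared exponential kernel between the rows of A and B."""
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2.0 * A @ B.T
    return np.exp(-0.5 * d2 / ell**2)

# Posterior mean and covariance with noisy observations.
noise_var = 0.01
K = se_kernel(X, X) + noise_var * np.eye(len(X))
K_s = se_kernel(X_star, X)
K_ss = se_kernel(X_star, X_star)

mean = K_s @ np.linalg.solve(K, y)
cov = K_ss - K_s @ np.linalg.solve(K, K_s.T)
std = np.sqrt(np.clip(np.diag(cov), 0.0, None))

# Plot the posterior mean with a 95% confidence band.
plt.plot(X_star[:, 0], mean, label="posterior mean")
plt.fill_between(X_star[:, 0], mean - 2 * std, mean + 2 * std,
                 alpha=0.2, label="95% confidence band")
plt.scatter(X[:, 0], y, color="k", zorder=3, label="observations")
plt.legend()
plt.show()
```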