<p>Computer Vision'er' (https://chrischoy.github.io/), Chris Choy, chrischoy@ai.stanford.edu. Jekyll feed, updated 2018-08-28T23:41:58-07:00.</p>
<h1>Short Note on Matrix Differentials and Backpropagation</h1>
<p>Published 2018-01-10T18:55:12-08:00, https://chrischoy.github.io/research/Matric-Calculus</p>
<p>Mathematical notation is the convention that we all use to denote a concept
in a concise mathematical formulation, yet sometimes there is more than one
way to express the same equation. For example, we can use Leibniz’s notation
$\frac{dy}{dx}$ to denote a derivative, but in physics we often write $\dot{y},
\ddot{y}$ for time derivatives. Similarly, to solve differential equations,
we use the Laplace transformation $F(s) = \int f(t) e^{-st}dt$, but instead of
using the definition, we can use the frequency domain representations and
simply solve differential equations using basic algebra.
In this post, I’ll cover a matrix differential notation and how to use
differentials to derive backpropagation functions easily.</p>
<h2 id="differentials">Differentials</h2>
<p>Let’s first define the differential. Let a vector function $f(x)$ be
differentiable at $c$. Its first-order Taylor approximation is</p>
$$
f(c + u) = f(c) + f'(c)u + r(u)
$$
<p>where $r$ denotes the remainder. We denote $\mathsf{d}f(c;u) = f'(c)u$, the
differential of $f$ at $c$ with increment $u$. This of course can also be
denoted simply using partial derivatives.<sup id="fnref:1"><a href="#fn:1" class="footnote">1</a></sup></p>
$$
\mathsf{d}f(c; u) = (\mathsf{D}f(c))u
$$
<p>where $\mathsf{D}_j f_i(c)$ denotes the partial derivative of $f_i$ with
respect to the $j$-th coordinate at $c$. The matrix $\mathsf{D}f(c)$ is the
Jacobian matrix and it’s the transpose of the gradient of $f$ at $c$.</p>
$$
\nabla f(c) = (\mathsf{D}f(c))^T
$$
<h3 id="chain-rule">Chain Rule</h3>
<p>Let $h = g \circ f$, the differential of $h$ at $c$ is</p>
\begin{align}
\mathsf{d}h(c; u) & = (\mathsf{D}h(c))u \\
& = (\mathsf{D}g(f(c)))(\mathsf{D}f(c))u = (\mathsf{D}g(f(c)))\mathsf{d}f(c;u) \\
& = \mathsf{d}g(f(c); \mathsf{d}f(c;u))
\end{align}
<p>We can further simplify the notation by replacing $\mathsf{d}f(c;u)$ with
$\mathsf{d}f$ when it unambiguously represents the differential concerning
the input variable.</p>
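<p>The chain rule above can be checked numerically. The sketch below uses two small hand-picked functions <code>f</code> and <code>g</code> (hypothetical, chosen only for illustration) with their Jacobians written out by hand, and compares $\mathsf{D}g(f(c))\,\mathsf{D}f(c)\,u$ against the actual change $h(c+u) - h(c)$ for a small increment $u$.</p>

```python
import numpy as np


def f(x):
    """f: R^3 -> R^2 (illustrative choice)."""
    return np.array([x[0] * x[1], np.sin(x[2])])


def Df(x):
    """Jacobian of f, shape (2, 3)."""
    return np.array([[x[1], x[0], 0.0],
                     [0.0, 0.0, np.cos(x[2])]])


def g(y):
    """g: R^2 -> R^2 (illustrative choice)."""
    return np.array([np.exp(y[0]), y[0] + y[1] ** 2])


def Dg(y):
    """Jacobian of g, shape (2, 2)."""
    return np.array([[np.exp(y[0]), 0.0],
                     [1.0, 2.0 * y[1]]])


c = np.array([0.3, -0.7, 1.1])
u = np.array([1e-6, -2e-6, 0.5e-6])   # small increment

# dh(c; u) = Dg(f(c)) Df(c) u = dg(f(c); df(c; u))
df = Df(c) @ u
dh = Dg(f(c)) @ df

# Actual change of h = g o f; differs from dh only by the remainder r(u).
actual_change = g(f(c + u)) - g(f(c))
```

<p>Because the remainder is second order in $u$, the two vectors agree to many more digits than the size of $u$ itself.</p>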
<h3 id="matrix-function">Matrix Function</h3>
<p>Now let’s extend this to a matrix function. Let $F: S \rightarrow
\mathbb{R}^{m\times p}$ be a matrix function defined on $S \subseteq \mathbb{R}^{n
\times q}$.</p>
$$
\text{vec}F(C+U) = \text{vec} F(C) + F'(C) \text{vec}U + \text{vec}R(U)
$$
<p>We can denote the differential of $F$ at $C$ as</p>
$$
\text{vec}\; \mathsf{d}F(C;U) = F'(C) \text{vec}U
$$
<h2 id="matrix-backpropagation">Matrix Backpropagation</h2>
<p>Let $A : B = \text{Tr}(A^TB) = \text{vec}(A)^T \text{vec}(B)$, the sum of all
elements of the elementwise product of $A$ and $B$. If we let $f(X) = Y$ and $L$ be the final loss,</p>
\begin{align}
\mathsf{d}(L \circ f) & = \mathsf{D} L : \mathsf{d}f \\
& = \mathsf{D} L : \mathcal{L}(\mathsf{d}X) \\
& = \mathcal{L}^*(\mathsf{D} L) : \mathsf{d}X
\end{align}
<p>where we denote $\mathsf{d}Y = \mathcal{L}(\mathsf{d}X)$ and $\mathsf{D}L =
\frac{\partial L}{\partial Y}$. So given gradients from the upper layer, the
gradient with respect to $X$ can easily be computed by finding the function
$\mathcal{L}^*$, the adjoint of $\mathcal{L}$.</p>
<h3 id="example-1-linear-function">Example 1: Linear Function</h3>
<p>Let $Y = f(X) = AX + b$, then,</p>
\begin{align}
\mathsf{d}(L \circ f) & = \mathsf{D} L : \mathsf{d}(AX + b) \\
& = \mathsf{D} L : A\mathsf{d}X = \text{Tr}(\mathsf{D}L (A \mathsf{d}X)^T) \\
& = \text{Tr}(\mathsf{D}L \mathsf{d}X^T A^T) = \text{Tr}(A^T \mathsf{D}L \mathsf{d}X^T) \\
& = A^T \mathsf{D}L : \mathsf{d}X
\end{align}
<p>Thus $\mathcal{L}^*(Y) = A^TY$.</p>
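<p>As a sanity check of the result $\mathcal{L}^*(Y) = A^TY$, the following sketch compares $A^T \mathsf{D}L$ with a finite-difference gradient. The quadratic loss and the target <code>T</code> are assumptions introduced only to have a concrete $L$; any differentiable loss would do.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 3))
b = rng.normal(size=(4, 1))
X = rng.normal(size=(3, 5))
T = rng.normal(size=(4, 5))  # hypothetical target, only used to define a loss


def loss(X):
    """L = 0.5 * ||AX + b - T||_F^2, a stand-in for the final loss."""
    Y = A @ X + b
    return 0.5 * np.sum((Y - T) ** 2)


# Gradient from the upper layer, DL = dL/dY, pushed through the adjoint.
DL = A @ X + b - T
grad_X = A.T @ DL

# Central finite differences, one entry of X at a time.
eps = 1e-6
grad_fd = np.zeros_like(X)
for i in range(X.shape[0]):
    for j in range(X.shape[1]):
        E = np.zeros_like(X)
        E[i, j] = eps
        grad_fd[i, j] = (loss(X + E) - loss(X - E)) / (2 * eps)

max_err = np.max(np.abs(grad_X - grad_fd))

# Adjoint identity: DL : (A dX) equals (A^T DL) : dX for any dX.
dX = rng.normal(size=X.shape)
lhs = np.sum(DL * (A @ dX))
rhs = np.sum((A.T @ DL) * dX)
```

<p>Here <code>np.sum(P * Q)</code> plays the role of the product $P : Q$, so <code>lhs</code> and <code>rhs</code> are the two sides of the adjoint identity.</p>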
<h3 id="example-2-constrained-optimization">Example 2: Constrained Optimization</h3>
<p>We would like to solve the following constrained optimization problem.</p>
\begin{equation*}
\begin{aligned}
& \underset{x}{\text{minimize}}
& & f(x) \\
& \text{subject to}
& & Ax = b.
\end{aligned}
\end{equation*}
<p>The Lagrangian and the primal and dual feasibility equations are</p>
$$
\mathcal{L}(x, \nu) = f(x) + \nu^T(Ax - b) \\
Ax^* = b, \;\; \nabla f(x^*) + A^T \nu^* = 0
$$
<p>If we take the first order approximation of the primal and dual feasibility
equations,</p>
\begin{align}
Ax + \mathsf{d}(Ax) & = b + \mathsf{d}b\\
Ax + \mathsf{d}Ax + A\mathsf{d}x & = b + \mathsf{d}b\\
\nabla f(x) + A^T \nu + \mathsf{d}(\nabla f(x) + A^T \nu) & = 0 \\
\nabla f(x) + A^T \nu + \nabla^2 f(x) \mathsf{d}x + \mathsf{d}A^T \nu + A^T \mathsf{d}\nu & = 0
\end{align}
<p>Or more concisely,</p>
$$
\begin{bmatrix} \nabla^2 f(x) & A^T \\ A & 0 \end{bmatrix} \begin{bmatrix} \mathsf{d}x \\ \mathsf{d}\nu \end{bmatrix} = - \begin{bmatrix} \nabla f(x) + A^T \nu + \mathsf{d}A^T\nu \\ Ax - b + \mathsf{d}Ax - \mathsf{d}b \end{bmatrix}
$$
<p>This is the same as the infeasible start Newton method <sup id="fnref:2"><a href="#fn:2" class="footnote">2</a></sup>.</p>
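<p>To make the linear system concrete, here is a minimal sketch of one such step with $\mathsf{d}A = 0$ and $\mathsf{d}b = 0$. The objective $f(x) = \frac{1}{2}x^TPx + q^Tx$ and the random problem data are assumptions for illustration; for a quadratic objective a single step from an infeasible start already satisfies both feasibility equations.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 5, 2

# Hypothetical problem data: f(x) = 0.5 x^T P x + q^T x, subject to Ax = b.
M = rng.normal(size=(n, n))
P = M @ M.T + n * np.eye(n)      # positive definite Hessian
q = rng.normal(size=n)
A = rng.normal(size=(m, n))
b = rng.normal(size=m)


def grad_f(x):
    return P @ x + q


x = rng.normal(size=n)           # infeasible start: Ax != b in general
nu = np.zeros(m)

# [[H, A^T], [A, 0]] [dx; dnu] = -[grad f + A^T nu; Ax - b]
KKT = np.block([[P, A.T],
                [A, np.zeros((m, m))]])
rhs = -np.concatenate([grad_f(x) + A.T @ nu, A @ x - b])
step = np.linalg.solve(KKT, rhs)

x_new = x + step[:n]
nu_new = nu + step[n:]

primal_residual = np.linalg.norm(A @ x_new - b)
dual_residual = np.linalg.norm(grad_f(x_new) + A.T @ nu_new)
```

<p>For a non-quadratic $f$ the same system would be solved repeatedly, usually with a line search, until both residuals are small.</p>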
<h2 id="conclusion">Conclusion</h2>
<p>In this post, we covered the notation for matrix differentials and matrix
backpropagation. Simple notation can ease the burden of derivation and also
lead to fewer mistakes.</p>
<h2 id="references">References</h2>
<div class="footnotes">
<ol>
<li id="fn:1">
<p><a href="http://www.janmagnus.nl/misc/mdc2007-3rdedition">J. Magnus, Matrix Differential Calculus with Applications in Statistics</a> <a href="#fnref:1" class="reversefootnote">↩</a></p>
</li>
<li id="fn:2">
<p><a href="https://web.stanford.edu/~boyd/cvxbook/bv_cvxbook.pdf">S. Boyd and L. Vandenberghe, Convex Optimization</a> <a href="#fnref:2" class="reversefootnote">↩</a></p>
</li>
</ol>
</div>
<p>Author: Chris Choy</p>
<h1>Regression vs. Classification: Distance and Divergence</h1>
<p>Published 2018-01-05T15:06:57-08:00, https://chrischoy.github.io/research/Regression-Classification</p>
<p>In Machine Learning, supervised problems can be categorized into regression or
classification problems. The categorization is quite intuitive, as the names
indicate. For instance, if the output, or the target value, is a continuous
value, the model tries to regress on the value; and if it is discrete, we want
to predict a discrete value as well. A well-known example of a
classification problem is binary classification such as spam vs. non-spam.
Stock price prediction, or temperature prediction would be good examples of
regression.</p>
<p>To solve such problems, we have to use different methods. First, for regression
problems, the most widely used approach is to minimize the L1 or L2
distance between our prediction and the ground truth target. For classification
problems, 1-vs-all SVMs, multinomial logistic regression, decision forest, or
minimizing the cross entropy are popular choices.</p>
<p>Because the two are treated so differently, it is easy to regard
them as completely separate problems. However, we can think of the
classification problem as a regression problem in a non-Euclidean space and
extend this concept to Wasserstein and Cramer distances.</p>
<h2 id="designing-an-objective-risk-function-for-machine-learning-models">Designing an Objective (Risk) Function for Machine Learning Models</h2>
<p>To make a system that behaves as we expect, we have to design a loss (risk)
function that captures the behavior that we would like to see and quantifies the
<strong>risk</strong> associated with failures.</p>
<p>For example, let’s look at a typical image classification problem where we
classify an image into a semantic class such as car, person etc. Most datasets
use a mapping from a string (“Car”) to a numeric value so that we can handle
the dataset in a computer easily. For instance, we can assign 0 to “Bird”; 1 to
“Person”; 2 to “Car” etc.</p>
<p>However, the numbers do not have intrinsic meaning. The annotators use such
numbering since it is easy to process on a computer, not because “Person” +
“Person” gives you “Car”, nor because a person is “greater” (&gt;) than a bird.
So, in this case, training a machine learning model to regress values that
do not have intrinsic meaning would not make much sense.</p>
<p>On the other hand, if the number that we are trying to predict has actual
physical meaning and ordering makes sense (e.g., price, weight, the intensity of
light (pixel value) etc.), it would be reasonable to use the numbers directly
for prediction.</p>
<p>To state this notion clearly, let $y$ be the target value (label, supervision)
associated with an input $x$ and $f(\cdot)$ be a (parametric) function or a
machine learning model. That is, when we feed $x$ to the function, $\hat{y}=f(x)$,
we want the output $\hat{y}$ to approximate the target value $y$. So we need a
measure of <strong>how different</strong> the generated values are from the supervision.
Naturally, we use a distance function to measure how close a target is to the
prediction and we use the distance as the loss function (objective function or
a risk function).</p>
$$
\begin{align}
L(x, y) & = D(\hat{y}, y) \\
\end{align}
$$
<p>where $D(\cdot, \cdot)$ denotes a distance function.</p>
<h2 id="regression-vs-classification">Regression vs. Classification</h2>
<p>Now let’s look at the simplest regression problem: linear regression using
least squares fitting. In this setting, we have noisy observations around the
ground truth line, and our task is to estimate the line. In this case, $f(x) =
Ax + b$ and $D(a, b) = ||a - b||_2^2$, the square of the L2 norm. This
can also be interpreted as maximum likelihood estimation under Gaussian noise.
However, the L2 norm is not the only distance measure used in regression problems.
The L1 norm is sometimes used to enforce sparsity, and the Huber loss is used for
regression problems where outliers do not follow the Gaussian distribution.</p>
$$
D(\hat{y}, y) = \|\hat{y} - y\|_2^2
$$
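<p>The three distance functions mentioned above are easy to write down. The sketch below fits a line by least squares on synthetic data; the slope 2, intercept 1, noise level 0.1, and the Huber threshold <code>delta</code> are assumptions chosen for illustration.</p>

```python
import numpy as np


def l2_loss(y_hat, y):
    """Squared L2 distance, the least-squares objective."""
    return float(np.sum((y_hat - y) ** 2))


def l1_loss(y_hat, y):
    """L1 distance."""
    return float(np.sum(np.abs(y_hat - y)))


def huber_loss(y_hat, y, delta=1.0):
    """Quadratic for small residuals, linear for residuals larger than delta."""
    r = np.abs(y_hat - y)
    return float(np.sum(np.where(r <= delta,
                                 0.5 * r ** 2,
                                 delta * (r - 0.5 * delta))))


# Noisy observations around a ground truth line y = 2x + 1.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 50)
y = 2.0 * x + 1.0 + 0.1 * rng.normal(size=x.size)

# Least squares: minimize ||design @ [a, b] - y||_2^2.
design = np.stack([x, np.ones_like(x)], axis=1)
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
a_hat, b_hat = coef
y_fit = a_hat * x + b_hat
```

<p>By construction, <code>y_fit</code> has an L2 loss no larger than that of the ground truth line on this sample, which is what minimizing the distance means here.</p>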
<p>Let’s go back to the previous classification problem. Regressing the arbitrary
numeric values (labels) clearly is not the best way to train a machine learning
model as the numeric values do not have intrinsic meaning. Instead, we can use
the probability distribution rather than the arbitrary numeric values. For
example, for an image of a bird, which was class 0, we assign $P_0 = 1$ and 0
for the others: $P_{bird} = [1, 0, 0]$ where the elements are the probability
of the input being a bird, person, and car respectively. Using this
representation, we can train a multinomial logistic regression model or a multi-class SVM.</p>
<h2 id="cross-entropy-and-f-divergence">Cross Entropy and f-Divergence</h2>
<p>However, how should we measure the “distance” between the ground truth label
distribution and the prediction distribution? Is there a concept of distance
between two distributions? One family of functions that measures such a
difference is the <strong>Ali-Silvey distances</strong>, more widely known as
<strong>f-divergences</strong>. One member of the
f-divergence family is used far more widely than the others: the
Kullback-Leibler divergence. Formally, given two distributions $P_\hat{y}$ and
$P_y$, the KL divergence is defined as</p>
$$
\begin{align}
D(P_y || P_{\hat{y}}) & = \sum_{i \in \mathcal{Y}} P(y = i) \log \frac{P(y = i)}{P(\hat{y} = i)} \\
& = - H(P_y) + H(P_y, P_{\hat{y}})
\end{align}
$$
<p>where $H(\cdot)$ is the entropy and $H(\cdot, \cdot)$ is the cross entropy.
In classification problems, where $P_\hat{y}, P_y$ denote prediction and
ground truth respectively, the first term is a constant, so we drop the entropy
and train our prediction model with the cross entropy only. That’s where you
get your cross entropy loss.</p>
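<p>A small numerical sketch of this decomposition, taking the ground truth as the first argument of the divergence. The prediction vector is a made-up example; the one-hot label is the bird label from the earlier section. For a one-hot label the entropy term is zero, so the KL divergence and the cross entropy coincide.</p>

```python
import numpy as np


def entropy(p):
    """H(p) = -sum_i p_i log p_i, with 0 log 0 treated as 0."""
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return float(-np.sum(p[nz] * np.log(p[nz])))


def cross_entropy(p, q):
    """H(p, q) = -sum_i p_i log q_i (q must be positive wherever p is)."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    nz = p > 0
    return float(-np.sum(p[nz] * np.log(q[nz])))


def kl_divergence(p, q):
    """D(p || q) = -H(p) + H(p, q)."""
    return cross_entropy(p, q) - entropy(p)


p_y = np.array([1.0, 0.0, 0.0])     # ground truth: bird
p_hat = np.array([0.7, 0.2, 0.1])   # hypothetical model prediction

ce = cross_entropy(p_y, p_hat)      # equals -log(0.7)
kl = kl_divergence(p_y, p_hat)      # same value, since H(p_y) = 0
```

<p>With soft labels the two quantities differ by the constant $H(P_y)$, which does not depend on the model, so minimizing either one gives the same prediction.</p>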
<p>However, the KL divergence is not the only divergence. In fact, any convex
function $f: (0, \infty) \rightarrow \mathbb{R}$ such that $f(1) = 0$ can
define a divergence function.</p>
$$
D_f(P || Q) = \mathbb{E}_Q \left[ f \left( \frac{dP}{dQ} \right) \right]
$$
<p>For example, if we use $f(x) = \frac{1}{2}|x - 1|$, we have the Total Variation divergence.</p>
$$
\begin{align}
D_{TV}(P || Q) & = \frac{1}{2} \mathbb{E}_Q \left[ \left| \frac{dP}{dQ} - 1 \right| \right] \\
& = \frac{1}{2} \int |P - Q| = \frac{1}{2} ||P - Q||_1
\end{align}
$$
<p>One thing to note is that the KL divergence is not a proper <em>metric</em> as it is
asymmetric and violates the triangle inequality.</p>
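<p>The general definition can be sketched for discrete distributions, where $\frac{dP}{dQ}$ is simply the ratio of probabilities (assuming $Q$ is positive everywhere). The two example distributions are made up for illustration.</p>

```python
import numpy as np


def f_divergence(p, q, f):
    """D_f(P || Q) = E_Q[f(dP/dQ)] for discrete P, Q with q > 0 everywhere."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(q * f(p / q)))


def f_tv(t):
    """Generator of the total variation divergence."""
    return 0.5 * np.abs(t - 1.0)


def f_kl(t):
    """Generator of the KL divergence (valid for t > 0)."""
    return t * np.log(t)


p = np.array([0.5, 0.3, 0.2])
q = np.array([0.1, 0.6, 0.3])

tv = f_divergence(p, q, f_tv)        # equals 0.5 * ||p - q||_1
kl_pq = f_divergence(p, q, f_kl)     # equals sum_i p_i log(p_i / q_i)
kl_qp = f_divergence(q, p, f_kl)     # differs from kl_pq: KL is asymmetric
```

<p>Swapping the arguments leaves <code>tv</code> unchanged but changes the KL value, which is the asymmetry noted above.</p>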
<h2 id="wasserstein-distance-cramer-distance">Wasserstein Distance, Cramer Distance</h2>
<p>However, f-divergence is not the only way to measure the difference between two
distributions. In <sup id="fnref:1"><a href="#fn:1" class="footnote">1</a></sup>, the authors argue that f-divergences do not capture
our regular notion of distance accurately and propose a different
distance, which led to an interesting discussion in adversarial training.
Let’s first look at other “distance” functions that do not belong to the
f-divergence family.</p>
<p>First, the Wasserstein distance, also known as the probabilistic Earth Mover’s
Distance, computes the minimum cost (mass times distance moved) of transporting one
probability distribution onto another.
\begin{align}
W_1(P, Q) = \inf \mathbb{E} [|x - y|]
\end{align}
</p>
<p>The infimum is over the joint distributions whose marginals are $P$ and $Q$. $x$ and
$y$ are defined over the space where $P$ and $Q$ have nonzero support. A
notable follow-up work <sup id="fnref:2"><a href="#fn:2" class="footnote">2</a></sup> proposed yet another distance
function, the Cramer
distance, to remove sampling bias in the gradients of the distance function. For one-dimensional
distributions with CDFs $F_P$ and $F_Q$, $W_1$ equals $\int |F_P(x) - F_Q(x)| dx$, and the Cramer
distance is simply the squared version of it</p>
\begin{align}
l_2^2(P, Q) = \int_{-\infty}^{\infty} \left( F_P(x) - F_Q(x) \right)^2 dx
\end{align}
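<p>For one-dimensional distributions both quantities can be computed from cumulative distribution functions. The sketch below assumes two histograms on a shared grid and uses the CDF forms $W_1 = \int |F_P - F_Q| dx$ and, following Bellemare et al., the Cramer distance $\int (F_P - F_Q)^2 dx$; the function name and the grid are hypothetical.</p>

```python
import numpy as np


def cdf_distances(p, q, grid):
    """Return (W1, Cramer) for two histograms p, q on a shared sorted 1D grid."""
    p, q, grid = (np.asarray(a, dtype=float) for a in (p, q, grid))
    cdf_gap = np.cumsum(p) - np.cumsum(q)   # F_P - F_Q at each grid point
    widths = np.diff(grid)                  # length of each interval
    # The CDFs are piecewise constant between grid points.
    w1 = float(np.sum(np.abs(cdf_gap[:-1]) * widths))
    cramer = float(np.sum(cdf_gap[:-1] ** 2 * widths))
    return w1, cramer


grid = np.arange(5.0)                       # support points 0, 1, 2, 3, 4
p = np.array([1.0, 0.0, 0.0, 0.0, 0.0])     # all mass at 0
q = np.array([0.0, 0.0, 0.0, 0.0, 1.0])     # all mass at 4
q_half = np.array([0.5, 0.0, 0.0, 0.0, 0.5])

w1_far, cramer_far = cdf_distances(p, q, grid)
w1_half, cramer_half = cdf_distances(p, q_half, grid)
```

<p>Moving all of the mass a distance of 4 gives $W_1 = 4$, while moving half of it gives $W_1 = 2$, matching the mass-times-distance picture of the Earth Mover’s Distance.</p>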
<h2 id="conclusion">Conclusion</h2>
<p>Categorizing supervised problems into classification or regression can help us clearly understand the
problem, but sometimes it can limit our imagination and also limit the set of distance
functions that we can use.
Rather, in this post, we discussed how classification and regression can be understood
in terms of how we measure differences: classification measures the difference using an
f-divergence or even a probabilistic distance, and regression uses a Euclidean
distance. They are merely distances that measure the difference between a target
and a prediction. Some distance functions are more popular than others, but the
set of distance functions is not set in stone. Sometimes, by defining the
distance function in a clever way, we can improve our ML model!</p>
<h1 id="references">References</h1>
<div class="footnotes">
<ol>
<li id="fn:1">
<p>Arjovsky et al., Wasserstein Generative Adversarial Networks, 2017 <a href="#fnref:1" class="reversefootnote">↩</a></p>
</li>
<li id="fn:2">
<p>Bellemare et al., The Cramer Distance as a Solution to Biased Wasserstein Gradients, 2017 <a href="#fnref:2" class="reversefootnote">↩</a></p>
</li>
</ol>
</div>
<p>Author: Chris Choy</p>
<h1>Data Processing Inequality and Unsurprising Implications</h1>
<p>Published 2018-01-04T21:11:58-08:00, https://chrischoy.github.io/research/data-processing-inequality-and-unsurprising-implications</p>
<p>We have heard enough about the great success of neural networks and how they
are used in real problems. Today, I want to talk about why they have been so successful
(partially) from an information theoretic perspective and some lessons that we all
should be aware of.</p>
<h2 id="traditional-feature-based-learning">Traditional Feature Based Learning</h2>
<p>Before we figured out how to train a large neural network efficiently,
traditional methods (such as hand-designed features + shallow models like
random forests or SVMs) dominated Computer Vision. As you may have guessed, a
traditional method first extracts features from an image, such
as the Histogram of Oriented Gradients (HOG) or Scale-Invariant
Feature Transform (SIFT) features. Then, we use the supervised method of our choice
to train the second part of the model for prediction. So the only thing we learn
is the mapping from the extracted features to the prediction.</p>
$$
\text{Image} \rightarrow \text{Features} \underset{f(\cdot; \theta)}{\rightarrow} \hat{y}
$$
<p>The information from the image is bottlenecked by the quality of the features, and thus
much research went into better, faster features. Here, to illustrate that the
learnable parameters are only in the second stage, I put $\theta$ in a function
below the second arrow.</p>
<h2 id="neural-network-as-an-end-to-end-system">Neural Network as an End-to-End system</h2>
<p>Unlike the traditional approach, the neural network based method starts
directly from the original inputs (of course, there is some preprocessing like centering
and normalization, but these steps are reversible). We assume that the neural network
is a universal function approximator and optimize the parameters inside it to
approximate a complex function, such as the mapping from pixel colors to a semantic class!</p>
$$
\text{Image} \underset{f(\cdot; \theta)}{\rightarrow} \hat{y}
$$
<p>Unlike before, we are making a system that does not involve an intermediate
representation. Then, the natural questions that follow are: why is such a system
strictly better than one that involves an intermediate representation, and is
it always the case?</p>
<h2 id="data-processing-inequality">Data Processing Inequality</h2>
<p>To generalize our discussion, let $X, Y, Z$ be random variables
that form a Markov chain.</p>
$$
X \rightarrow Y \rightarrow Z
$$
<p>You can think of each arrow as a complex system that generates the best approximation
of whatever we want for each step. According to the data processing inequality,
the mutual information between $X$ and $Z$, $I(X; Z)$ cannot be greater than
that between $X$ and $Y$, $I(X; Y)$.</p>
$$
I(X;Y) \ge I(X;Z)
$$
<p>In other words, information can only be lost, never gained, as we
process it. For example, in the traditional method, we extract a feature $Y$ from an image $X$ with
a deterministic function. Given the feature, we
estimate the outcome $Z$. So, if we lose some information in the feature
extraction stage, we cannot regain it in the second stage.</p>
<p>However, in an end-to-end system, we do not enforce an intermediate
representation and thus remove $Y$ altogether.</p>
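<p>The inequality is easy to verify on a small discrete Markov chain. In the sketch below the distribution of $X$ and the two random transition matrices are made-up values; the joint of $(X, Z)$ is obtained by marginalizing $Y$, which is exactly what the Markov assumption allows.</p>

```python
import numpy as np


def mutual_information(joint):
    """I(A; B) in nats from a joint probability table of shape (|A|, |B|)."""
    joint = np.asarray(joint, dtype=float)
    pa = joint.sum(axis=1, keepdims=True)
    pb = joint.sum(axis=0, keepdims=True)
    nz = joint > 0
    return float(np.sum(joint[nz] * np.log(joint[nz] / (pa @ pb)[nz])))


rng = np.random.default_rng(0)

# Hypothetical chain X -> Y -> Z with 3, 4 and 2 states.
p_x = np.array([0.5, 0.3, 0.2])
p_y_given_x = rng.dirichlet(np.ones(4), size=3)   # rows sum to 1, shape (3, 4)
p_z_given_y = rng.dirichlet(np.ones(2), size=4)   # rows sum to 1, shape (4, 2)

joint_xy = p_x[:, None] * p_y_given_x             # p(x, y)
joint_xz = joint_xy @ p_z_given_y                 # p(x, z) = sum_y p(x, y) p(z | y)

i_xy = mutual_information(joint_xy)
i_xz = mutual_information(joint_xz)               # never larger than i_xy
```

<p>Whatever transition matrices are drawn, <code>i_xz</code> cannot exceed <code>i_xy</code>: the second stage only sees $Y$.</p>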
<h2 id="case-studies">Case Studies</h2>
<p>Now that we are equipped with the knowledge, let’s delve into some scenarios where you
should swing your big knowledge around. Can you tell your friendly colleague
ML what went wrong or how to improve the model?</p>
<h3 id="case-1-rgb-rightarrow-thermal-image-rightarrow-pedestrian-detection">Case 1: RGB $\rightarrow$ Thermal Image $\rightarrow$ Pedestrian Detection</h3>
<p>ML wants to localize pedestrians from RGB images.</p>
<p>ML: It is easier to predict pedestrians from thermal images, but thermal
images are difficult to acquire as the thermal cameras are not as common as
regular RGB cameras. So I will first predict thermal images from regular
images, then it would be easier to find pedestrians.</p>
<h3 id="case-2-monocular-image-rightarrow-3d-shape-prediction-rightarrow-weight">Case 2: Monocular Image $\rightarrow$ 3D shape prediction $\rightarrow$ Weight</h3>
<p>Again, ML is working on weight prediction from a monocular image (just a
regular image).</p>
<p>ML: Weight is a property associated with the shape of the object. If we can
predict the shape of an object first from an image, then predicting weight from
a 3D shape would be easier!</p>
<p>You can probably guess what went wrong. However, if we slightly tweak the
setting, we could improve the model. For example, in Case 1, instead of
feeding only the RGB image, feeding both, RGB + Thermal $\rightarrow$ Pedestrian Detection,
would easily improve the performance.</p>
<h2 id="conclusion">Conclusion</h2>
<p>We discussed how the data processing inequality could shed light on the success
of the neural network and the importance of an end-to-end system. However,
problems that you want to solve might not be as clear-cut as I
illustrated here. There are a lot of hair-splitting details that make the
difference. However, it is always important to remind yourself what is theoretically
possible, and maybe such a split-second thought could save you a week of
implementation!</p>
<p>Author: Chris Choy</p>
<h1>Learning Gaussian Process Covariances</h1>
<p>Published 2017-12-15T14:49:11-08:00, https://chrischoy.github.io/research/learning-gaussian-process-covariances</p>
<p>A Gaussian process is a non-parametric model which can represent a complex
function using a growing set of data. Unlike a neural network, which can also
learn complex functions, a Gaussian process can also provide the variance
(uncertainty) of a prediction since the model is based on a simple Gaussian
distribution.</p>
<p>However, like many machine learning models, we have to define a set of
functions to define a Gaussian process. In a Gaussian process, the function of
utmost importance is the covariance function. It is common to use a
predetermined function with fixed constants for the covariance, but it is more
pragmatic to learn the function than to search a high dimensional space using
sampling based methods to find the best set of parameters for the covariance
function.</p>
<p>In this post, I summarize a simple gradient based method and a scalable version
of learning a covariance function.</p>
<h2 id="brief-summary-of-a-gaussian-process">Brief Summary of a Gaussian Process</h2>
<p>In a Gaussian process, we assume that all observations are samples from a
Gaussian distribution and any subset of the random variables (observations or
predictions at a new data point) will follow a Gaussian distribution with
specific mean $m(\mathbf{x})$ and covariance $K(X, X)$.</p>
<p>Let $\mathcal{D} = \{ (\mathbf{x}_i, y_i) \}_{i=1}^n$ be a dataset of inputs
$\mathbf{x}_i$ and corresponding outputs $y_i$. We assume that the observation
is noisy and the noise-free output at $\mathbf{x}$ is
$\mathbf{f}(\mathbf{x})$, i.e., $\mathbf{y}(\mathbf{x}) =
\mathbf{f}(\mathbf{x}) + \epsilon$.</p>
<p>Given the Gaussian process assumption, all subsets follow a Gaussian distribution and thus, the entire dataset can be represented using a single Gaussian distribution.</p>
$$
\mathbf{f} | X, \mathbf{y} \sim \mathcal{N}(\mathbf{\bar{f}}, K(X, X))
$$
<p>Please refer to <a href="https://chrischoy.github.io/research/gaussian-process-regression/">the previous post</a> about a Gaussian process for details.</p>
<h2 id="learning-the-covariance-kx-x">Learning the Covariance $K(X, X)$</h2>
<p>In many cases, the covariance function $K(X, X)$ is predefined as a simple
function such as a squared exponential.
There are many variants of the function, but in its simplest form, the squared
exponential function contains at least two hyper-parameters, $c$ and $\sigma$</p>
$$
k(\mathbf{x}_1, \mathbf{x}_2) = c \exp \left( - \frac{|\mathbf{x}_1 - \mathbf{x}_2|^2}{\sigma^2} \right).
$$
<p>We can use simple grid search or MCMC to find the optimal hyper-parameters.
However, as the function gets more complex, finding optimal hyper-parameters
can become a daunting task pretty quickly as the dimension gets larger.</p>
<p>Instead, we can use a simple gradient descent based method with multiple
initializations to find the optimal hyper-parameters.</p>
<h3 id="gradient-of-the-posterior-probability">Gradient of the Posterior Probability</h3>
<p>To take gradient steps w.r.t. the hyper-parameters, we need to compute the
gradient of the log likelihood w.r.t. them. Let $\theta$ denote all the hyper-parameters in a covariance
function.</p>
$$
\begin{align}
\log p(\mathbf{y}| X, \mathbf{\bar{f}}; \theta) & = - \frac{1}{2} \log |K| - \frac{1}{2} (\mathbf{y} - \mathbf{\bar{f}})^T K^{-1} (\mathbf{y} - \mathbf{\bar{f}}) + \text{const}\\
\nabla_{\theta_i} \log p(\mathbf{y}| X, \mathbf{\bar{f}}; \theta) & = - \frac{1}{2} \mathrm{Tr} \left( K^{-1} \frac{\partial K}{\partial \theta_i} \right) + \frac{1}{2} (\mathbf{y} - \mathbf{\bar{f}})^T K^{-1} \frac{\partial K}{\partial \theta_i} K^{-1} (\mathbf{y} - \mathbf{\bar{f}})
\end{align}
$$
<p>Given the gradient, we can use a gradient based optimizer of our choice to
learn the hyper-parameters (or simply parameters) of a Gaussian process.</p>
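<p>A minimal sketch of this gradient for a squared exponential covariance $k = c \exp(-|x_1 - x_2|^2/\sigma^2)$ with $\theta = (c, \sigma)$, checked against finite differences. The zero mean ($\mathbf{\bar{f}} = 0$), the diagonal <code>noise</code> term added for numerical stability, and the synthetic data are assumptions. The quadratic term enters with a positive sign because $\partial K^{-1} = -K^{-1} (\partial K) K^{-1}$.</p>

```python
import numpy as np


def kernel(X, c, sigma, noise=0.1):
    """Squared exponential covariance plus its derivatives w.r.t. c and sigma."""
    d2 = (X[:, None] - X[None, :]) ** 2
    E = np.exp(-d2 / sigma ** 2)
    K = c * E + noise * np.eye(len(X))   # diagonal term is an assumption
    dK_dc = E
    dK_dsigma = c * E * 2.0 * d2 / sigma ** 3
    return K, (dK_dc, dK_dsigma)


def log_lik(X, y, c, sigma):
    """-0.5 log|K| - 0.5 y^T K^{-1} y, dropping the constant term."""
    K, _ = kernel(X, c, sigma)
    _, logdet = np.linalg.slogdet(K)
    return -0.5 * logdet - 0.5 * y @ np.linalg.solve(K, y)


def grad_log_lik(X, y, c, sigma):
    """Gradient w.r.t. (c, sigma) using the trace and quadratic terms."""
    K, dKs = kernel(X, c, sigma)
    alpha = np.linalg.solve(K, y)        # K^{-1} y
    return np.array([
        -0.5 * np.trace(np.linalg.solve(K, dK)) + 0.5 * alpha @ dK @ alpha
        for dK in dKs
    ])


rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0.0, 5.0, size=15))
y = np.sin(X) + 0.1 * rng.normal(size=X.size)
c, sigma = 1.0, 1.0

grad = grad_log_lik(X, y, c, sigma)

# Central finite differences as an independent check.
eps = 1e-5
grad_fd = np.array([
    (log_lik(X, y, c + eps, sigma) - log_lik(X, y, c - eps, sigma)) / (2 * eps),
    (log_lik(X, y, c, sigma + eps) - log_lik(X, y, c, sigma - eps)) / (2 * eps),
])
```

<p>The vector <code>grad</code> can be handed to any gradient based optimizer; in practice one would usually optimize $\log c$ and $\log \sigma$ to keep both parameters positive.</p>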
<h2 id="scalability-of-the-gradient">Scalability of the Gradient</h2>
<p>In the previous section, we assumed that we can compute the gradient exactly.
However, as $n$, the dimension of the vector $\mathbf{y}$, increases, it might not be
possible to compute the above gradient in a reasonable time and cost. Let’s
analyze the computational complexity of each term.</p>
<p>First, note that computing $K^{-1}\mathbf{y}$ requires solving a linear system, which takes
$O(n^3)$ time if we use a decomposition based method or $O(\sqrt{\kappa}
n^2)$ if we use an iterative method like Conjugate Gradient, where $\kappa$ is
the condition number of $K$.</p>
<p>Now, we can compute the complexity of each term. The first term, $K^{-1}
\frac{\partial K}{\partial \theta_i}$, takes $O(\sqrt{\kappa} n^3)$ if we
use an iterative method (one solve per column) or $O(n^3)$ if we can cache the decomposition. The second
term only takes $O(\sqrt{\kappa} n^2)$, as solving the linear system takes
the most time.</p>
<h2 id="sampling-the-gradient">Sampling the Gradient</h2>
<p>As the dimension of the problem gets larger, it becomes impractical to solve
the system using a matrix decomposition and we need to resort to an approximate
method. Filippone and Engler <sup id="fnref:2"><a href="#fn:2" class="footnote">1</a></sup> propose to sample an unbiased
estimate of the gradient using $N_s$ i.i.d. random vectors. For example, let $r^j$ be the $j$th
element of the vector $\mathbf{r}$. If we set $r^j \in \{-1, 1\}$ with equal
probability, $\mathbb{E}(\mathbf{r}\mathbf{r}^T) = I$ and</p>
$$
\begin{align}
\mathrm{Tr}\left(K^{-1}\frac{\partial K}{\partial \theta_i}\right)
& =\mathrm{Tr}\left( K^{-1}\frac{\partial K}{\partial \theta_i} \mathbb{E} \left[ \mathbf{r} \mathbf{r}^T \right] \right) \\
& = \mathbb{E} \left[ \mathbf{r}^T K^{-1}\frac{\partial K}{\partial \theta_i} \mathbf{r} \right]
\end{align}
$$
<p>We can solve $K^{-1}\mathbf{r}$ easily using Conjugate Gradient and thus, the
complexity of the above equation is $O(\sqrt{\kappa}n^2 N_s)$ where $N_s$ is
the number of samples. Writing $\mathbf{r}_s$ for the $s$-th sampled vector, the gradient becomes</p>
$$
\nabla_{\theta_i} \log p(\mathbf{y}| X, \mathbf{\bar{f}}; \theta) \approx - \frac{1}{2N_s} \sum_{s=1}^{N_s} \mathbf{r}_s^T K^{-1} \frac{\partial K}{\partial \theta_i} \mathbf{r}_s + \frac{1}{2} (\mathbf{y} - \mathbf{\bar{f}})^T K^{-1} \frac{\partial K}{\partial \theta_i} K^{-1} (\mathbf{y} - \mathbf{\bar{f}})
$$
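<p>The sampled trace term can be sketched as follows. For brevity the linear systems are solved with <code>np.linalg.solve</code> rather than Conjugate Gradient, which is a substitution made only for this illustration; the kernel matrix, the diagonal term, and the sample count are also assumptions.</p>

```python
import numpy as np


def sampled_trace(K, dK, num_samples, rng):
    """Estimate Tr(K^{-1} dK) as the mean of r^T K^{-1} dK r over random sign vectors r."""
    n = K.shape[0]
    R = rng.choice([-1.0, 1.0], size=(n, num_samples))   # one sample per column
    # K is symmetric, so r^T K^{-1} dK r = (K^{-1} r)^T (dK r).
    K_inv_R = np.linalg.solve(K, R)   # stand-in for Conjugate Gradient solves
    return float(np.mean(np.sum(K_inv_R * (dK @ R), axis=0)))


rng = np.random.default_rng(0)
X = np.linspace(0.0, 5.0, 30)
E = np.exp(-((X[:, None] - X[None, :]) ** 2))   # squared exponential, c = sigma = 1
K = E + 0.5 * np.eye(X.size)                    # diagonal term is an assumption
dK_dc = E                                       # derivative of c * E w.r.t. c

exact = float(np.trace(np.linalg.solve(K, dK_dc)))
estimate = sampled_trace(K, dK_dc, num_samples=20000, rng=rng)
```

<p>The estimate is unbiased for any number of samples; more samples only reduce its variance, so in stochastic gradient settings a handful of vectors per step is often enough.</p>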
<h2 id="conclusion">Conclusion</h2>
<p>In this post, we covered how to train a covariance function in a Gaussian
process using gradient based methods. As the exact gradient does not scale well, we
also discussed how to use random samples to approximate the gradient.</p>
<h2 id="references">References</h2>
<div class="footnotes">
<ol>
<li id="fn:2">
<p><cite>M. Filippone and R. Engler, Enabling scalable stochastic gradient-based inference for Gaussian processes by employing the Unbiased LInear System SolvEr (ULISSE), ICML’15</cite> <a href="#fnref:2" class="reversefootnote">↩</a></p>
</li>
</ol>
</div>
<p>Author: Chris Choy</p>
<h1>DeformNet: Free-Form Deformation Network for 3D Shape Reconstruction from a Single Image</h1>
<p>Published 2017-08-18T01:18:24-07:00, https://chrischoy.github.io/preprint/deformnet</p>
<h2 id="abstract">Abstract</h2>
<p>3D reconstruction from a single image is a key problem in multiple applications ranging from robotic manipulation to augmented reality. Prior methods have tackled this problem through generative models which predict 3D reconstructions as voxels or point clouds. However, these methods can be computationally expensive and miss fine shape details. We introduce a new differentiable layer for 3D data deformation and use it in DeformNet to learn free-form deformations usable on multiple 3D data formats. DeformNet takes an image input, searches the nearest shape template from the database, and deforms the template to match the query image. We evaluate our approach on the ShapeNet database and show that - (a) Free-Form Deformation is a powerful new building block for Deep Learning models that manipulate 3D data (b) DeformNet uses this FFD layer combined with shape retrieval for smooth and detail-preserving 3D reconstruction of qualitatively plausible point clouds with respect to a single query image (c) compared to other state-of-the-art 3D reconstruction methods, DeformNet quantitatively matches or outperforms their benchmarks by significant margins.</p>
<ul>
<li><a href="https://deformnet-site.github.io/DeformNet-website/">Project page</a></li>
<li><a href="https://arxiv.org/abs/1708.04672">ArXiv</a></li>
</ul>
<p>Author: Andrey Kurenkov</p>
<h1>Expectation Maximization and Variational Inference (Part 2)</h1>
<p>Published 2017-03-23T09:05:51-07:00, https://chrischoy.github.io/research/Expectation-Maximization-and-Variational-Inference-2</p>
<p>In the <a href="https://chrischoy.github.io/research/Expectation-Maximization-and-Variational-Inference/">previous post</a>, we covered
variational inference and how to derive update equations. In this post, we will
go over a simple Gaussian Mixture Model with the Dirichlet prior distribution
over the mixture weight.</p>
<p>Let $x_n$ be a datum and $z_n$ be the latent variable that indicates the
assignment of the datum $x_n$ to a cluster $k$, $z_{nk} = I(z_n = k)$. We
denote the weight of a cluster $k$ with $\pi_k$ and the natural parameter of
the cluster as $\eta_k$.</p>
<p>The graphical model of the mixtures looks like the following.</p>
<figure>
<img style="width:30%" class="align-center" src="https://chrischoy.github.io/images/research/graphical_model.png" />
</figure>
<p>Formally, we define the generative process
$p(\pi|\alpha_0), p(z_n | \pi), p(x_n | z_n, \eta)$.
Unlike Bishop <sup id="fnref:1"><a href="#fn:1" class="footnote">1</a></sup> and Blei et al. <sup id="fnref:2"><a href="#fn:2" class="footnote">2</a></sup>, we will not use a prior over the natural
parameter $\eta$ for simplicity. The notation and the model are similar to those
used in Blei et al. <sup id="fnref:2:1"><a href="#fn:2" class="footnote">2</a></sup>. Overloading the notation,</p>
$$
\begin{align}
p(\pi | \alpha_0) & = \mathrm{Dir}(\pi; \alpha_0) \\
p(z_n | \pi) & = \prod_k \pi_k^{z_{nk}} \\
p(x_n | z_n, \eta) & = \prod_k \mathcal{N}(x_n ; \eta_k)^{z_{nk}}
\end{align}
$$
<p>And the log joint probability is</p>
$$
\log p(\mathbf{x}, \mathbf{z}, \pi ; \eta, \alpha_0) = \sum_n \sum_k z_{nk} [\log \pi_k + \log \mathcal{N}(x_n ; \eta_k)] + \log \mathrm{Dir}(\pi; \alpha_0)
$$
<h2 id="meanfield-approximation">Meanfield Approximation</h2>
<p>In this example, let’s use the mean field approximation and make the posterior
distributions of the latent variables $z$ and $\pi$ independent, i.e.,</p>
$$
q(z, \pi) = q(z)q(\pi)
$$
<p>From the <a href="https://chrischoy.github.io/research/Expectation-Maximization-and-Variational-Inference/">previous post</a>, we know that
the optimal distribution $q(\cdot)$ that maximizes the evidence lower bound
is</p>
$$
\log q(w_i) = \mathbb{E}_{w_{j}, j\neq i} \log p(x, \mathbf{w}) + \mathrm{const}
$$
<p>where $w_i$ is an arbitrary latent variable. Thus, we can use the same
technique and find $q(z)$ and $q(\pi)$.</p>
$$
\begin{align*}
\log q(z) & = \sum_n \sum_k z_{nk} [\mathbb{E}\log \pi_k + \log \mathcal{N}(x_n ; \eta_k)] + \mathbb{E}\log \mathrm{Dir}(\pi; \alpha_0) \\
& = \sum_n \sum_k z_{nk} [\mathbb{E}\log \pi_k + \log \mathcal{N}(x_n ; \eta_k)] + C_1 \\
\log q(\pi) & = \sum_n \sum_k \mathbb{E}z_{nk} [\log \pi_k + \log \mathcal{N}(x_n ; \eta_k)] + \log \mathrm{Dir}(\pi; \alpha_0) \\
& = \sum_n \sum_k \mathbb{E}z_{nk} \log \pi_k + \log \mathrm{Dir}(\pi; \alpha_0) + C_2
\end{align*}
$$
<p>We can easily compute the expectations of the latent variables.</p>
$$
\begin{align*}
\mathbb{E}\log \pi_k & = \psi(\alpha_k) - \psi\Big(\sum_{k'} \alpha_{k'}\Big) = \log \tilde{\pi}_k \\
q(z_{nk}=1) & \propto \exp\left\{\log \tilde{\pi}_k + \log \mathcal{N}(x_n; \eta_k)\right\} = \rho_{nk} \\
\mathbb{E}z_{nk} & = q(z_{nk}=1) = \frac{\rho_{nk}}{\sum_l \rho_{nl}} = r_{nk}
\end{align*}
$$
<p>where $\alpha_k$ are the parameters of the variational Dirichlet distribution
$q(\pi)$ and $\psi$ is the digamma function. The first equation is a standard
property of the Dirichlet distribution. Given the expectations, we can simplify
the equations and get update rules.</p>
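<p>As a rough illustration of the expectations above, the following sketch computes $\log \tilde{\pi}_k$ and $r_{nk}$ for a single datum. The function names <code>digamma</code>, <code>expectedLogPi</code>, and <code>responsibilities</code> are hypothetical helpers introduced here, the digamma routine is a simple series approximation rather than a tuned implementation, and the per-cluster log-likelihoods $\log \mathcal{N}(x_n; \eta_k)$ are assumed to be given.</p>

```javascript
/* Digamma via the recurrence psi(x) = psi(x + 1) - 1/x followed by the
   standard asymptotic series; adequate for a sketch, assumes x > 0. */
function digamma(x) {
  var result = 0;
  while (x < 6) {
    result -= 1 / x;
    x += 1;
  }
  var inv = 1 / x, inv2 = inv * inv;
  return result + Math.log(x) - 0.5 * inv
    - inv2 * (1 / 12 - inv2 * (1 / 120 - inv2 / 252));
}

/* E[log pi_k] = psi(alpha_k) - psi(sum_j alpha_j) */
function expectedLogPi(alpha) {
  var total = alpha.reduce(function (s, a) { return s + a; }, 0);
  var psiTotal = digamma(total);
  return alpha.map(function (a) { return digamma(a) - psiTotal; });
}

/* r_nk for one datum: normalize rho_nk = exp(log pi~_k + log N(x_n; eta_k)).
   The maximum score is subtracted before exponentiating for stability. */
function responsibilities(logPiTilde, logLik) {
  var scores = logPiTilde.map(function (lp, k) { return lp + logLik[k]; });
  var maxScore = Math.max.apply(null, scores);
  var rho = scores.map(function (s) { return Math.exp(s - maxScore); });
  var norm = rho.reduce(function (s, r) { return s + r; }, 0);
  return rho.map(function (r) { return r / norm; });
}
```

<p>Subtracting the maximum score does not change $r_{nk}$ because the normalization cancels any common factor; it only keeps the exponentials in a safe numerical range.</p>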
<h2 id="expectation-and-maximization">Expectation and Maximization</h2>
<p>First, let’s examine the $\log q(\pi)$.</p>
$$
\begin{align*}
\log q(\pi) & = \sum_n \sum_k r_{nk} \log \pi_k + \log \mathrm{Dir}(\pi; \alpha_0) + C_2 \\
& = \sum_n \sum_k r_{nk} \log \pi_k + \sum_k (\alpha_0 - 1) \log \pi_k + C_3 \\
& = \sum_k (\alpha_0 + \sum_n r_{nk} - 1) \log \pi_k + C_3 \\
& = \log \mathrm{Dir}(\pi| \alpha)
\end{align*}
$$
<p>Thus, $\alpha_k = \alpha_0 + \sum_n r_{nk}$. The $z$ update equation is given
above. Finally, for $\eta_k = (\mu_k, \Lambda_k)$, the mean and the precision matrix
of cluster $k$, we differentiate the ELBO $F(q, \eta)$ with respect to $\eta$ and
set the gradient to zero to find the update rule.</p>
$$
\begin{align*}
F(q, \eta) & = \mathop{\mathbb{E}}_{z, \pi} \log p(x, z, \pi; \eta) + H(q) \\
& = \sum_n \sum_k \mathbb{E} z_{nk} [\mathbb{E}\log \pi_k + \log \mathcal{N}(x_n ; \eta_k)] + \mathbb{E}\log \mathrm{Dir}(\pi; \alpha_0) + H(q) \\
\nabla_{\eta_k} F(q, \eta) & = \sum_n r_{nk} \nabla_{\eta_k} \log \mathcal{N}(x_n ; \eta_k) \\
& = \sum_n r_{nk} \nabla_{\eta_k} \left( \frac{1}{2} \log |\Lambda_k| - \frac{1}{2} \mathrm{Tr}\left(\Lambda_k (x_n - \mu_k)(x_n - \mu_k)^T \right) \right) \\
\nabla_{\mu_k} F(q, \eta) & = \sum_n r_{nk} \Lambda_k (x_n - \mu_k) = 0 \\
\nabla_{\Lambda_k} F(q, \eta) & = \frac{1}{2} \sum_n \left( r_{nk} \nabla_{\Lambda_k} \log |\Lambda_k| - r_{nk} \nabla_{\Lambda_k} \mathrm{Tr}\left(\Lambda_k (x_n - \mu_k)(x_n - \mu_k)^T \right) \right) \\
& = \frac{1}{2} \sum_n \left( r_{nk} \Lambda_k^{-1} - r_{nk} (x_n - \mu_k)(x_n - \mu_k)^T \right) = 0 \\
\end{align*}
$$
<p>From the above equations, we can get</p>
$$
\begin{align}
N_k & = \sum_n r_{nk} \\
\mu_k & = \frac{1}{N_k} \sum_n r_{nk} x_n \\
\Lambda_k^{-1} & = \frac{1}{N_k} \sum_n r_{nk} (x_n - \mu_k)(x_n - \mu_k)^T
\end{align}
$$
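<p>The updates can be sketched for scalar data as follows. This is a minimal illustration, not a full implementation: <code>updateParameters</code> is a hypothetical name, the data are assumed one-dimensional so the weighted scatter is a plain variance, and every cluster is assumed to receive non-zero total responsibility.</p>

```javascript
/* x: array of N scalars; r: N x K responsibilities; alpha0: symmetric prior. */
function updateParameters(x, r, alpha0) {
  var K = r[0].length;
  var alpha = [], mu = [], variance = [];
  for (var k = 0; k < K; k++) {
    var Nk = 0, weightedSum = 0;
    for (var n = 0; n < x.length; n++) {
      Nk += r[n][k];                      /* N_k = sum_n r_nk */
      weightedSum += r[n][k] * x[n];
    }
    var muK = weightedSum / Nk;           /* responsibility-weighted mean */
    var scatter = 0;
    for (var m = 0; m < x.length; m++) {
      scatter += r[m][k] * (x[m] - muK) * (x[m] - muK);
    }
    alpha[k] = alpha0 + Nk;               /* Dirichlet parameter update */
    mu[k] = muK;
    variance[k] = scatter / Nk;           /* weighted scatter around mu_k */
  }
  return {alpha: alpha, mu: mu, variance: variance};
}
```

<p>For instance, with data <code>[0, 2, 10]</code>, hard assignments of the first two points to cluster 0 and the last to cluster 1, and $\alpha_0 = 1$, the call returns $\alpha = (3, 2)$ and $\mu = (1, 10)$.</p>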
<h2 id="evidence-lower-bound">Evidence Lower Bound</h2>
<p>Given the final solutions $r_{nk}$, $\log \tilde{\pi}_k$, and $\alpha'$ (the
updated Dirichlet parameters, $\alpha'_k = \alpha_0 + N_k$), we can derive the
negative of the variational free energy, or the Evidence Lower Bound (ELBO).</p>
$$
\begin{align*}
ELBO & = \mathbb{E}_z \mathbb{E}_\pi \log \frac{p(x, z, \pi)}{q(z, \pi)} \\
& = \mathbb{E}_z \mathbb{E}_\pi \log p(x | z) p(z| \pi) p(\pi) - \mathbb{E}_z\mathbb{E}_\pi \log q(z)q(\pi) \\
& = \underbrace{\mathbb{E}_z \log p(x | z)}_{\mbox{(a)}}
+ \underbrace{\mathbb{E}_z \mathbb{E}_\pi \log p(z | \pi) p(\pi) }_{\mbox{(b)}}
+ \underbrace{H(q(z))}_{\mbox{(c)}}
+ \underbrace{H(q(\pi))}_{\mbox{(d)}}
\end{align*}
$$
<p>where $H(\cdot)$ is the entropy. Each of the terms can be computed</p>
$$
\begin{align*}
\mbox{(a)} & = \mathbb{E}_z \log p(x | z) \\
& = \mathbb{E}_z \mathbb{E}_\pi \sum_n \sum_k z_{nk} \log \mathcal{N}_k(x_n) \\
& = \sum_n \sum_k r_{nk} \log \mathcal{N}_k(x_n) \\
\mbox{(b)} & = \mathbb{E}_z \mathbb{E}_\pi \log p(z | \pi) p(\pi) \\
& = \mathbb{E}_z \mathbb{E}_\pi \left[ \sum_n \sum_k z_{nk} \log \pi_k + \log \frac{1}{B(\mathbb{\alpha}_0)} \prod_k \pi_k^{\alpha_0 - 1} \right] \\
& = \mathbb{E}_z \mathbb{E}_\pi \sum_k \left( \sum_n z_{nk} + \alpha_0 - 1 \right) \log \pi_k - \log B(\mathbb{\alpha}_0) \\
& = \sum_k \left( \sum_n \mathbb{E}_z z_{nk} + \alpha_0 - 1 \right) \mathbb{E}_\pi \log \pi_k - \log B(\mathbb{\alpha}_0) \\
& = \sum_k \left( \sum_n r_{nk} + \alpha_0 - 1 \right) \log \tilde{\pi}_k - \log B(\mathbb{\alpha}_0) \\
\mbox{(c)} & = - \mathbb{E}_z \log q(z) \\
& = - \mathbb{E}_z \sum_n \sum_k z_{nk} \log r_{nk} \\
& = - \sum_n \sum_k r_{nk} \log r_{nk} \\
\mbox{(d)} & = - \mathbb{E}_\pi \log q(\pi) \\
& = - \mathbb{E}_\pi \log \frac{1}{B(\mathbb{\alpha}')} \prod_k \pi_k^{\alpha'_k - 1} \\
& = - \sum_k (\alpha'_k - 1) \mathbb{E}_\pi \log \pi_k + \log B(\mathbb{\alpha}') \\
& = - \sum_k (\alpha'_k - 1) \log \tilde{\pi}_k + \log B(\mathbb{\alpha}')
\end{align*}
$$
<p>Since $\log r_{nk} = \log \tilde{\pi}_k + \log \mathcal{N}_k(x_n) - \log \left( \sum_l \exp \{\log \tilde{\pi}_l + \log \mathcal{N}_l(x_n) \} \right)$,</p>
$$
\begin{align*}
\mbox{(a) + (c)} & = \sum_n \sum_k r_{nk} \left(\log \mathcal{N}_k(x_n) - \log r_{nk} \right) \\
& = \sum_n \sum_k r_{nk} \left(- \log \tilde{\pi}_k + \log \left( \sum_l \exp \{ \log \tilde{\pi}_l + \log \mathcal{N}_l(x_n) \} \right) \right)\\
& = - \sum_k N_k \log \tilde{\pi}_k + \sum_n \log \left( \sum_l \exp \{ \log \tilde{\pi}_l + \log \mathcal{N}_l(x_n) \} \right) \\
\mbox{(b) + (d)} & = \sum_k \left( \sum_n r_{nk} + \alpha_0 - 1 \right) \log \tilde{\pi}_k - \log B(\mathbb{\alpha}_0) \\
& - \sum_k (\alpha'_k - 1) \log \tilde{\pi}_k + \log B(\mathbb{\alpha}') \\
& = \sum_k \left( \sum_n r_{nk} + \alpha_0 - \alpha'_k \right) \log \tilde{\pi}_k - \log B(\mathbb{\alpha}_0) + \log B(\mathbb{\alpha}') \\
& = \log B(\mathbb{\alpha}') - \log B(\mathbb{\alpha}_0)
\end{align*}
$$
<p>Thus,</p>
$$
\begin{align*}
ELBO = & \mathbb{E}_z \mathbb{E}_\pi \log \frac{p(x, z, \pi)}{q(z, \pi)} \\
= & - \sum_k N_k \log \tilde{\pi}_k + \sum_n \log \left( \sum_l \exp \{ \log \tilde{\pi}_l + \log \mathcal{N}_l(x_n) \} \right) \\
& + \log B(\mathbb{\alpha}') - \log B(\mathbb{\alpha}_0) \\
\end{align*}
$$
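<p>To evaluate this expression numerically, the log normalizer of the Dirichlet distribution can be written with log-gamma functions. This is a standard identity, restated here for convenience, with $K$ denoting the number of clusters and the symmetric prior having all $K$ entries equal to $\alpha_0$:</p>
$$
\begin{align*}
\log B(\mathbb{\alpha}') & = \sum_k \log \Gamma(\alpha'_k) - \log \Gamma\Big(\sum_k \alpha'_k\Big) \\
\log B(\mathbb{\alpha}_0) & = K \log \Gamma(\alpha_0) - \log \Gamma(K \alpha_0)
\end{align*}
$$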
<h1 id="references">References</h1>
<div class="footnotes">
<ol>
<li id="fn:1">
<p>C. Bishop, Pattern Recognition and Machine Learning. Springer, 2006 <a href="#fnref:1" class="reversefootnote">↩</a></p>
</li>
<li id="fn:2">
<p>D. M. Blei and M. I. Jordan, <a href="http://www.cs.columbia.edu/~blei/papers/BleiJordan2004.pdf">Variational Inference for Dirichlet Process Mixtures, Bayesian Analysis 2006</a> <a href="#fnref:2" class="reversefootnote">↩</a> <a href="#fnref:2:1" class="reversefootnote">↩<sup>2</sup></a></p>
</li>
</ol>
</div>Chris Choychrischoy@ai.stanford.eduIn the previous post, we covered variational inference and how to derive update equations. In this post, we will go over a simple Gaussian Mixture Model with the Dirichlet prior distribution over the mixture weight.Scene Graph Generation by Iterative Message Passing2017-03-14T02:22:45-07:002017-03-14T02:22:45-07:00https://chrischoy.github.io/publication/scene-graph<h2 id="abstract">Abstract</h2>
<p>Understanding a visual scene goes beyond recognizing individual objects in isolation. Relationships between objects also constitute rich semantic information about the scene. In this work, we explicitly model the objects and their relationships using scene graphs, a visually-grounded graphical structure of an image. We propose a novel end-to-end model that generates such a structured scene representation from an input image. The model solves the scene graph inference problem using standard RNNs and learns to iteratively improve its predictions via message passing. Our joint inference model can take advantage of contextual cues to make better predictions on objects and their relationships. The experiments show that our model significantly outperforms previous methods on generating scene graphs using the Visual Genome dataset and inferring support relations with the NYU Depth v2 dataset.</p>
<ul>
<li><a href="https://arxiv.org/abs/1701.02426">ArXiv</a></li>
</ul>Danfei XuAbstractDESIRE: Deep Stochastic IOC RNN Encoder-decoder for Distant Future Prediction in Dynamic Scenes with Multiple Interacting Agents2017-03-14T02:22:45-07:002017-03-14T02:22:45-07:00https://chrischoy.github.io/publication/desire<h2 id="abstract">Abstract</h2>
<p>We introduce a Deep Stochastic IOC RNN Encoder-decoder framework, DESIRE, with a conditional Variational Auto-Encoder and multiple RNNs for the task of future predictions of multiple interacting agents in dynamic scenes. Accurately predicting the location of objects in the future is an extremely challenging task. An effective prediction model must be able to 1) account for the multi-modal nature of the future prediction (i.e., given the same context, the future may vary), 2) foresee the potential future outcomes and make a strategic prediction based on that, and 3) reason not only from the past motion history, but also from the scene context as well as the interactions among the agents.
DESIRE can address all aforementioned challenges in a single end-to-end trainable neural network model, while being computationally efficient. The model first obtains a diverse set of hypothetical future prediction samples employing a conditional variational auto-encoder, which are ranked and refined via the following RNN scoring-regression module. We evaluate our model on two publicly available datasets: KITTI and Stanford Drone Dataset. Our experiments show that the proposed model significantly improves the prediction accuracy compared to other baseline methods.</p>Namhoon LeeAbstractExpectation Maximization and Variational Inference (Part 1)2017-02-26T02:32:56-08:002017-02-26T02:32:56-08:00https://chrischoy.github.io/research/Expectation-Maximization-and-Variational-Inference<p>Statistical inference involves finding the right model and parameters that represent
the distribution of observations well. Let $\mathbf{x}$ be the observations and
$\theta$ be the unknown parameters of an ML model. In maximum likelihood
estimation, we try to find the $\hat{\theta}_{ML}$ that maximizes the probability of
the observations using the ML model with the parameters:</p>
$$
\hat{\theta}_{ML} = \underset{\theta}{\arg\!\max} \; p(\mathbf{x}; \theta)
$$
<p>Typically, solving the above optimization efficiently requires a few
assumptions. One trick is to introduce latent variables $\mathbf{z}$ that
break down the problem into smaller subproblems. For instance, in the <a href="https://en.wikipedia.org/wiki/Mixture_model#Gaussian_mixture_model">Gaussian
Mixture
Model</a>, we
can introduce the cluster membership assignment as random variables $z_i$ for
each datum $x_i$, which greatly simplifies the model ($p(x_i | z_i=k) \sim
\mathcal{N}(\mu_k, \sigma_k)$).</p>
$$
p(\mathbf{x};\theta) = \int p(\mathbf{x}, \mathbf{z}; \theta) d\mathbf{z}
$$
<p>However, the above integration is, in many cases, intractable. We can either
approximate it using stochastic sampling (Monte Carlo methods) or bypass
the computation using a few assumptions. The second method is called
variational inference, named after the <a href="https://en.wikipedia.org/wiki/Calculus_of_variations">calculus of
variations</a>, which we will
go over in this post.</p>
<h2 id="evidence-lower-bound-elbo">Evidence Lower Bound (ELBO)</h2>
<p>There are many great tutorials for variational inference, but I found the
tutorial by Tzikas et al.<sup id="fnref:1"><a href="#fn:1" class="footnote">1</a></sup> to be the most helpful. It follows the steps of
Bishop<sup id="fnref:2"><a href="#fn:2" class="footnote">2</a></sup> and Neal and Hinton<sup id="fnref:3"><a href="#fn:3" class="footnote">3</a></sup> and starts the introduction by formulating
the inference as Expectation Maximization. Here, we will summarize the steps in Tzikas
et al.<sup id="fnref:1:1"><a href="#fn:1" class="footnote">1</a></sup> and elaborate on some steps missing in the paper. Let $q(z)$ be a
probability distribution on $z$. Then,</p>
$$
\begin{align*}
\ln p(x; \theta) & = \int q(z) \ln p(x; \theta) dz \\
& = \int q(z) \ln \Big( \frac{p(x; \theta) p(z | x; \theta)}{p(z | x; \theta)} \Big) dz \\
& = \int q(z) \ln \Big( \frac{p(x, z; \theta)}{p(z | x; \theta)} \Big) dz \\
& = \int q(z) \ln \Big( \frac{p(x, z; \theta) q(z)}{p(z | x; \theta) q(z)} \Big) dz \\
& = \int q(z) \ln \Big( \frac{p(x, z; \theta)}{q(z)} \Big) dz
- \int q(z) \ln \Big( \frac{p(z | x; \theta)}{q(z)} \Big) dz \\
& = F(q, \theta) + KL(q || p)
\end{align*}
$$
<p>where $F(q, \theta)$ is known as the evidence lower bound (ELBO), or the negative
of the variational free energy, and $KL(\cdot || \cdot)$ is the Kullback-Leibler
divergence. Since the KL-divergence is non-negative,</p>
$$
\ln p(x; \theta) \ge F(q, \theta)
$$
<p>The ELBO provides a lower bound for the marginal likelihood. Instead of
maximizing the marginal likelihood directly, the Expectation
Maximization (EM) and variational inference maximize the variational lower bound.</p>
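<p>The decomposition $\ln p(x; \theta) = F(q, \theta) + KL(q || p)$ can be checked numerically on a toy discrete model where the integral becomes a sum. In this sketch, <code>elboDecomposition</code> is a hypothetical helper, <code>joint[k]</code> stands for $p(x, z = k; \theta)$ at a fixed $x$, and <code>q</code> is assumed to be strictly positive.</p>

```javascript
/* joint[k] = p(x, z = k) for a fixed observation x; q[k] = q(z = k) > 0. */
function elboDecomposition(joint, q) {
  var evidence = joint.reduce(function (s, p) { return s + p; }, 0);
  var elbo = 0, kl = 0;
  for (var k = 0; k < joint.length; k++) {
    var posterior = joint[k] / evidence;          /* p(z = k | x) */
    elbo += q[k] * Math.log(joint[k] / q[k]);     /* F(q, theta) */
    kl += q[k] * Math.log(q[k] / posterior);      /* KL(q || p(z | x)) */
  }
  return {logEvidence: Math.log(evidence), elbo: elbo, kl: kl};
}

var check = elboDecomposition([0.1, 0.3], [0.5, 0.5]);
/* check.elbo + check.kl agrees with check.logEvidence up to rounding,
   and check.elbo never exceeds check.logEvidence. */
```

<p>Choosing <code>q</code> equal to the posterior, here <code>[0.25, 0.75]</code>, drives the KL term to zero and makes the bound tight.</p>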
<h2 id="expectation-maximization">Expectation Maximization</h2>
<p>Let’s assume that we can find $p(z | x; \theta^{OLD})$ analytically (for
the Gaussian Mixture Model, this is just a softmax). Then, we can simply
substitute $q(z) = p(z | x; \theta^{OLD})$. The ELBO becomes</p>
$$
\begin{align*}
F(q, \theta) & = \int q(z) \ln \Big( \frac{p(x, z; \theta)}{q(z)} \Big) dz \\
& = \int q(z) \ln p(x, z; \theta) dz - \int q(z) \ln q(z) dz \\
& = \int p(z | x; \theta^{OLD}) \ln p(x, z; \theta) dz \\
& \quad - \int p(z | x; \theta^{OLD}) \ln p(z | x; \theta^{OLD}) dz \\
& = Q(\theta, \theta^{OLD}) + H(q)
\end{align*}
$$
<p>The second term $H(q)$ is the entropy of $z$ given $x$ and is a function of
$\theta^{OLD}$. It is a constant with respect to $\theta$ and we do not take
the term into account while maximizing the ELBO.</p>
<p>The EM algorithm can be succinctly summarized as ${\arg\!\max}_\theta Q(\theta,
\theta^{OLD})$.</p>
<ul>
<li>E-step: compute $p(z | x; \theta^{OLD})$</li>
<li>M-step: evaluate ${\arg\!\max}_\theta \int p(z | x; \theta^{OLD}) \ln p(x, z; \theta) dz$</li>
</ul>
<p>For example, the EM for the Gaussian Mixture Model consists of an expectation step
where you compute the soft assignment of each datum to K clusters, and a maximization
step which computes the parameters of each cluster using the assignment.
However, for complex models, the posterior $p(z | x; \theta^{OLD})$ is not
available in closed form, so we cannot use the EM algorithm directly.</p>
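<p>A minimal sketch of these two steps for a one-dimensional Gaussian Mixture Model is shown below. The function names <code>logNormal</code> and <code>emGmm1d</code>, the fixed iteration count, and the small variance floor of <code>1e-6</code> are assumptions made for this illustration and are not part of the derivation above.</p>

```javascript
/* Log density of a scalar Gaussian with the given mean and variance. */
function logNormal(x, mu, variance) {
  return -0.5 * Math.log(2 * Math.PI * variance)
    - 0.5 * (x - mu) * (x - mu) / variance;
}

/* x: data; mu, variance, weight: initial parameters (updated in place). */
function emGmm1d(x, mu, variance, weight, iterations) {
  var K = mu.length;
  for (var it = 0; it < iterations; it++) {
    /* E-step: r[n][k] = p(z_n = k | x_n; theta_old), a softmax over k */
    var r = x.map(function (xn) {
      var s = mu.map(function (m, k) {
        return Math.log(weight[k]) + logNormal(xn, m, variance[k]);
      });
      var mx = Math.max.apply(null, s);
      var e = s.map(function (v) { return Math.exp(v - mx); });
      var z = e.reduce(function (a, b) { return a + b; }, 0);
      return e.map(function (v) { return v / z; });
    });
    /* M-step: closed-form maximizer of Q(theta, theta_old) */
    for (var k = 0; k < K; k++) {
      var Nk = 0, sum = 0, scatter = 0;
      for (var n = 0; n < x.length; n++) {
        Nk += r[n][k];
        sum += r[n][k] * x[n];
      }
      mu[k] = sum / Nk;
      for (var m = 0; m < x.length; m++) {
        scatter += r[m][k] * (x[m] - mu[k]) * (x[m] - mu[k]);
      }
      variance[k] = scatter / Nk + 1e-6;  /* floor avoids a zero variance */
      weight[k] = Nk / x.length;
    }
  }
  return {mu: mu, variance: variance, weight: weight};
}

var fit = emGmm1d([-2.1, -1.9, -2.0, 1.9, 2.0, 2.1],
                  [-1, 1], [1, 1], [0.5, 0.5], 20);
```

<p>On this toy data set, which has two well-separated groups, the estimated means settle close to the two group centers. In general EM only finds a local optimum, so the result depends on the initial parameters.</p>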
<h2 id="variational-expectation-maximization">Variational Expectation Maximization</h2>
<p>For a simple model, an analytical solution for $p(z | x; \theta^{OLD})$ exists and
thus computing $q(z) = p(z | x; \theta^{OLD})$ is tractable.
However, it is not possible in general as the model gets more complex.
Instead, we approximate the posterior probability using a simpler model. For
example, we assume that a set of latent variables is independent of the rest of
the latent variables given $x$. Such independence reduces complexity and allows
us to deduce the analytic form of the EM.</p>
<p>We can even enforce full independence among all latent variables given $x$,
i.e., $z_i \perp z_j$ for $i \neq j$. This assumption, known as the mean field
approximation, allows us to compute the update rules for each latent variable
in isolation and has been successful in many problems. We will go over
variational inference using the mean field approximation, but the following
technique can be used for models with more complex dependency.</p>
<p>Let $q(z) = \prod_i q(z_i)$. Then, the ELBO can be
factorized into $z_j$ and the rest of the latent variables.</p>
$$
\begin{align*}
F(q, \theta) & = \int q(z) \ln \Big( \frac{p(x, z; \theta)}{q(z)} \Big) dz \\
& = \int \prod_i q(z_i) \ln p(x, z; \theta) dz - \sum_i \int q(z_i) \ln q(z_i) dz_i \\
& = \int q(z_j) \int \Big( \prod_{i \neq j} q(z_i) \ln p(x, z; \theta) \Big) \prod_{i \neq j} dz_i dz_j \\
& \quad - \int q(z_j) \ln q(z_j) dz_j - \sum_{i \neq j} \int q(z_i) \ln q(z_i) dz_i \\
& = \int q(z_j) \ln \Big( \frac{\exp(\langle \ln p(x, z; \theta)\rangle_{i \neq j})}{q(z_j)} \Big) dz_j \\
& \quad - \sum_{i \neq j} \int q(z_i) \ln q(z_i) dz_i \\
& = \int q(z_j) \ln \Big( \frac{\tilde{p}_{i\neq j}}{q(z_j)} \Big) dz_j + H(z_{i\neq j}) + c\\
& = - KL(q_j || \tilde{p}_{i\neq j}) + H(z_{i\neq j}) + c
\end{align*}
$$
<p>where $\langle \cdot \rangle_{i \neq j}$ indicates the expectation over all latent
variables $z_i$ with $i \neq j$. Since $\exp(\langle \ln p(x, z; \theta)\rangle_{i \neq j})$ is
not a proper pdf, the constant $c$ is added to adjust it to become a proper
pdf. Since the KL-divergence is non-negative, the ELBO
is maximized when $KL(\cdot || \cdot) = 0$ which happens when $q(z_j) =
\tilde{p}_{i\neq j} = \frac{1}{Z} \exp \langle \ln p(x, z; \theta)\rangle_{i
\neq j}$.</p>
<p>Similarly, in the variational EM,</p>
<ul>
<li>E-step: evaluate $q^*(z_j) = \frac{1}{Z} \exp \langle \ln p(x, z;
\theta)\rangle_{i \neq j}$ for all $j$,
<ul>
<li>$q^{NEW} = \prod_i q_i^*$</li>
</ul>
</li>
<li>M-step: find $\theta = {\arg\!\max}_\theta F(q^{NEW}, \theta)$</li>
</ul>
<p>In practice, $q^*$ is the optimal distribution that maximizes $F(q, \theta)$,
and it often has the form of a known probability distribution. In that case,
$\theta^{NEW}$ is simply the set of parameters of that distribution,
read off after factoring out the other distributions. However, if the
function $q^*$ cannot be simplified into a known form, setting the derivative
of the ELBO to zero and solving the KKT conditions would give you a
solution.</p>
<h2 id="conclusion">Conclusion</h2>
<p>The variational EM gives us a way to bypass computing the partition function and
allows us to infer the parameters of a complex model using a deterministic
optimization step. In the <a href="https://chrischoy.github.io/research/Expectation-Maximization-and-Variational-Inference-2/">next post</a>, I will
give a concrete example with a simple Gaussian Mixture Model.</p>
<h1 id="references">References</h1>
<div class="footnotes">
<ol>
<li id="fn:1">
<p>D. G. Tzikas, A. C. Likas, and N. P. Galatsanos, The Variational Approximation for Bayesian Inference, IEEE Signal Processing Magazine, Nov 2008 <a href="#fnref:1" class="reversefootnote">↩</a> <a href="#fnref:1:1" class="reversefootnote">↩<sup>2</sup></a></p>
</li>
<li id="fn:2">
<p>C. Bishop, Pattern Recognition and Machine Learning. Springer, 2006 <a href="#fnref:2" class="reversefootnote">↩</a></p>
</li>
<li id="fn:3">
<p>R.M. Neal and G.E. Hinton, A view of the EM algorithm that justifies incremental, sparse and other variants, Learning in Graphical Models, 1998 <a href="#fnref:3" class="reversefootnote">↩</a></p>
</li>
</ol>
</div>Chris Choychrischoy@ai.stanford.eduStatistical inference involves finding the right model and parameters that represent the distribution of observations well. Let $\mathbf{x}$ be the observations and $\theta$ be the unknown parameters of a ML model. In maximum likelihood estimation, we try to find the $\theta_{ML}$ that maximizes the probability of the observations using the ML model with the parameters:Dirichlet Process Mixtures and Inference (Part 1)2016-12-27T13:14:52-08:002016-12-27T13:14:52-08:00https://chrischoy.github.io/research/Dirichlet-Process<figure>
<img src="https://chrischoy.github.io/images/research/dirichlet-process-mixtures.png" />
</figure>
<p>Statistical inference often requires modeling the distribution of data.
There are two branches of statistical modeling: parametric and non-parametric methods.
The former one specifies the data distribution using a family of distributions
with a finite number of parameters. In non-parametric methods, there is no
limit on the number of parameters, which makes the name <em>non-parametric</em> a bit
misleading.</p>
<p>One family of the non-parametric methods is well known and has been studied for
a long time for its application in clustering: the Dirichlet Process Mixtures.
Specifically, when you are dealing with a mixture model, the number of clusters
is left as a hyper-parameter that requires tuning. However, in the Dirichlet
Process Mixtures (DPM), we can also <em>infer</em> the number of clusters.
Before we discuss the DPM, we will cover the Dirichlet Process (DP).</p>
<h2 id="dirichlet-process">Dirichlet Process</h2>
<p>In short, the Dirichlet Process is a generalization of Dirichlet distributions
where a sample from the Dirichlet Process generates a Dirichlet distribution.
Since its sample is a distribution, we also call it a distribution over
distributions. Interestingly, the generalization allows the Dirichlet Process
to have an infinite number of components (or clusters). For this reason, the DP is a
non-parametric method, which means that there is no limit on the number of
parameters. In practice, however, due to limitations on
memory, computation, and time, we use the Truncated Dirichlet Process (TDP) during
inference, which puts a limit on the number of clusters and therefore parameters.
Also in real data, you will only get as many clusters as the number of data
points :)</p>
<p>In this post, we will only consider the definition and one particular sampling procedure of the DP. We
will cover the DPM and inference processes (inferring the number of clusters, as well as
the parameters of the clusters) in the following posts.</p>
<h3 id="dirichlet-distribution">Dirichlet Distribution</h3>
<p>First, let the Dirichlet distribution with parameter $\mathbf{a} \in
\mathbb{R}_{++}^K$ (K categories) be $Dir(\mathbf{a})$. If you haven’t picked up
a probability book in years and you are fuzzy on the details, you only need to know
that one of the properties of the Dirichlet distribution is that it is a
conjugate prior of a multinomial distribution and thus has the following
property. Let $A_k$ be the set of data from the $k$th category.</p>
$$
\begin{align}
P(\theta \in A) & \sim Dir(\mathbf{a}) \\
\boldsymbol{\theta} & = (\theta_1, ..., \theta_N) \quad N \text{ samples} \\
n_k & = |\{i : \theta_i \in A_k\}| \quad \text{number of samples in category } k \\
\mathbf{n} & = (n_1, ..., n_K) \\
P(\theta_{N+1} | \boldsymbol{\theta}) & \sim Dir(\mathbf{a} + \mathbf{n})
\end{align}
$$
<p>This is too much formality, but it will be helpful for the notation which I will use later.
Basically, it says that if you observe $N$ samples and $n_k$ of them fall into
the $k$th class, the posterior distribution after you observe the $N$ samples
will be skewed to favor classes with more samples and the contribution is
simply additive.</p>
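<p>The additive update can be sketched in a few lines. The names <code>dirichletPosterior</code> and <code>predictive</code> are hypothetical helpers introduced here; the second one returns the posterior mean, which is the probability that the next sample falls into each category.</p>

```javascript
/* Posterior parameters a + n after observing counts n = (n_1, ..., n_K). */
function dirichletPosterior(a, counts) {
  return a.map(function (ak, k) { return ak + counts[k]; });
}

/* Probability that sample N + 1 falls in category k:
   (a_k + n_k) / (sum_j a_j + N), the mean of Dir(a + n). */
function predictive(a, counts) {
  var post = dirichletPosterior(a, counts);
  var total = post.reduce(function (s, v) { return s + v; }, 0);
  return post.map(function (v) { return v / total; });
}

var p = predictive([1, 1, 1], [3, 0, 1]);   /* [4/7, 1/7, 2/7] */
```

<p>With a uniform prior $\mathbf{a} = (1, 1, 1)$ and counts $(3, 0, 1)$, the first category is favored, but the unobserved second category still keeps non-zero probability because of the prior.</p>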
<h3 id="definition-of-the-dirichlet-process">Definition of the Dirichlet Process</h3>
<p>The formal definition of the Dirichlet Process is similar to that of many
stochastic processes: the marginals of a distribution or a partition of a space follow a certain
distribution. Here, as the name indicates, the partition of a space follows the
Dirichlet distribution.</p>
<p>For example, let the space $A$ be the real line; a $K$-partition of the space
is equivalent to making $K$ clusters in the space (this clustering effect gives
us another way to generate samples by de Finetti’s theorem. I will not cover
this in this post). If I denote the Dirichlet Process as $G$, then</p>
<script type="math/tex; mode=display">(G(A_1), ..., G(A_K)) \sim Dir(a_1, ..., a_K)</script>
<p>Think of $G(A_i)$s as random variables that follow the Dirichlet distribution.
More formally, let the base distribution be $H$ and the concentration parameter
be $\alpha$.</p>
<script type="math/tex; mode=display">(G(A_1), ..., G(A_K)) \sim Dir(\alpha H(A_1), ..., \alpha H(A_K))</script>
<p>For example, let $H\sim \mathcal{N}(0, 1)$ and $\alpha$ be an arbitrary
positive number. In this case, since the base distribution is a Gaussian, we
will sample from the Gaussian when we sample a new cluster. If we sample
multiple distributions from the Dirichlet Process, the average of the whole
process will be the Gaussian, i.e.
$E[G(A)] = \mathcal{N}(0, 1)$. $\alpha$ also plays an interesting role.
The variance of the Dirichlet process will be
smaller as we choose a larger $\alpha$. If you are interested in more details,
please refer to a great tutorial by Teh et al. 2010 <sup id="fnref:1"><a href="#fn:1" class="footnote">1</a></sup>.</p>
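<p>Both statements can be made precise for any measurable set $A_k$. The following moments are standard results for the Dirichlet Process, restated here as a reference; they follow from the mean and variance of a Dirichlet-distributed coordinate with parameters $\alpha H(A_k)$ and total mass $\alpha$:</p>
$$
\begin{align*}
\mathbb{E}[G(A_k)] & = H(A_k) \\
\mathrm{Var}[G(A_k)] & = \frac{H(A_k)(1 - H(A_k))}{\alpha + 1}
\end{align*}
$$
<p>As $\alpha$ grows, the variance shrinks and each sampled $G$ stays closer to the base distribution $H$.</p>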
<p>In light of all this, we can generate distributions from the Dirichlet process.</p>
<h3 id="samples-from-a-dirichlet-process">Samples from a Dirichlet Process</h3>
<p>From the above definition, and from the conjugacy of the Dirichlet
distribution, we can elicit a posterior distribution given $N$ observations
$\boldsymbol{\theta}$.</p>
$$
\begin{align}
(G(A_1), ..., G(A_K)) | \boldsymbol{\theta} & \sim Dir(\alpha H(A_1) + n_1, ..., \alpha H(A_K) + n_K)\\
G | \boldsymbol{\theta} & \sim Dir((\alpha + N)H'(A_1), ..., (\alpha + N) H'(A_K))
\end{align}
$$
<p>where $H' = \frac{1}{\alpha +N} \left(\alpha H + \sum_{i=1}^N \delta_{\theta_i}\right)$.
The $\delta_{\theta_i}$ is a point mass on a sample $\theta_i$.
In sum, the posterior distribution will be a new Dirichlet process with
concentration parameter $\alpha + N$ and base distribution $H'$.</p>
<p>If we dissect the new base distribution, we can observe that</p>
<ul>
<li>With probability $\frac{\alpha}{\alpha + N}$, we sample from $H$.</li>
<li>With probability $\frac{n_k}{\alpha + N}$, we sample from $A_k$.</li>
</ul>
<h3 id="blackwell-macqueen-urn-scheme">Blackwell-MacQueen Urn Scheme</h3>
<p>We know from above that the posterior distribution follows the base distribution $H$
with a certain probability or we otherwise sample from the existing pool. We
can then generate a set of samples using the posterior distribution.</p>
<p>The sampling strategy that we just generated is called the Blackwell-MacQueen
Urn scheme where the space $A$ is the space of color and we are drawing colored balls.</p>
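<p>The urn scheme can be sketched as a short sampling loop. In this illustration, <code>urnSamples</code> is a hypothetical name, and the two callbacks are assumptions introduced for the sketch: <code>rand</code> returns a uniform number in $[0, 1)$ and <code>drawBase</code> returns one draw from the base distribution $H$.</p>

```javascript
/* Draw N samples theta_1, ..., theta_N with the urn scheme. */
function urnSamples(N, alpha, rand, drawBase) {
  var samples = [];
  for (var n = 0; n < N; n++) {
    if (rand() * (n + alpha) < alpha) {
      /* with probability alpha / (n + alpha): a new value from H */
      samples.push(drawBase());
    } else {
      /* otherwise copy one of the n earlier draws uniformly at random,
         which picks an existing value with probability n_k / (n + alpha) */
      samples.push(samples[Math.floor(rand() * n)]);
    }
  }
  return samples;
}

/* Example call with the built-in generator and a placeholder base draw. */
var draws = urnSamples(10, 5, Math.random, function () { return Math.random(); });
```

<p>Copying a uniformly chosen earlier draw is equivalent to choosing an existing value in proportion to how often it has already appeared, which is what produces the rich-get-richer behavior.</p>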
<p>In the following SVG, I implemented a simple Blackwell-MacQueen Urn Scheme in
JavaScript and d3.</p>
<p>I chose $\alpha = 5$ and a base distribution $H = \mathcal{N}(0, 1)$. To start,
press the start button. It starts sampling each point. Since the base
distribution is the standard normal distribution, the expectation of all of the
processes is 0 and thus we will be able to see most samples around 0. To see
different samples from the Dirichlet Process, refresh the window and press
start.</p>
<div id="dp_gaussian">
</div>
<div style="text-align:center;">
<input id="start_dp_n_button" type="button" value="Start" onclick="start_dp_n()" />
</div>
<p>Another interesting property of the Dirichlet Process is the clustering effect.
To see the clustering effect more closely, I visualized each cluster according
to the cluster ID. If a new datum is sampled from a base distribution, not from
the existing clusters, then it will be given a new cluster ID.</p>
<div id="dp">
</div>
<div style="text-align:center;">
<input id="start_dp_button" type="button" value="Start" onclick="start_dp()" />
</div>
<p>You will see a few dominant ‘rich’ clusters getting ‘richer’ and taking up major
portions of the data.</p>
<p>In the next post, I’ll go over the Dirichlet Process Mixtures.</p>
<script type="text/javascript">
/* global states */
var states = [], clusters = [], svg = [], svg_n = [];
var speed = 30, alpha=5, N=1000;
var margin = 0, width = 0, height = 0, height_s = 0, width_s = 0;
var min_x = 0, max_x = 0, x_width = 0;
window.onload = function run3D() {
var page_content = d3.select(".page__content");
var w = page_content.style('width');
var f_w = parseFloat(w), f_h = 0.6 * f_w;
var h = f_h + "px";
margin = {top: f_h * 0.05, right: f_w * 0.05, bottom: f_h * 0.1, left: f_w * 0.07};
width = f_w - margin.left - margin.right;
height = f_h - margin.top - margin.bottom;
/* Standard Normal variate using Box-Muller transform. */
function randn_bm() {
var u = 1 - Math.random();
var v = 1 - Math.random();
var s = Math.sqrt( -2.0 * Math.log( u ) );
return [s * Math.cos( 2.0 * Math.PI * v ),
s * Math.sin( 2.0 * Math.PI * v )];
}
/* Dirichlet Process */
var data = [], card_cluster = [];
for(var n=0; n < N; n++){
var cur_prob = Math.random() * (n + alpha);
/* http://www.cs.cmu.edu/~./kbe/dp_tutorial.pdf page 17
New class with prob alpha/(n + alpha) [0-based indexing]
Existing class k with prob num_k / (n + alpha) */
var cluster_id = 0;
if(cur_prob >= n){
/* make a new cluster */
cluster_id = card_cluster.length;
card_cluster[cluster_id] = 0; /* will add after the if/else clause */
/* Draw the location of the new cluster from the base distribution
H = N(0, 1); randn_bm() returns two independent standard normal
variates and only the first one is used. */
clusters[cluster_id] = randn_bm()[0];
} else {
/* sample from a cluster */
var accum_card_cluster = 0, last_cluster = 0;
for(let num_per_cluster of card_cluster){
if(accum_card_cluster <= cur_prob
&& cur_prob < accum_card_cluster + num_per_cluster){
cluster_id = last_cluster;
break;
}
accum_card_cluster += num_per_cluster;
last_cluster += 1;
}
}
/* once found, accumulate the current cluster */
card_cluster[cluster_id] += 1;
states[n] = [cluster_id, card_cluster[cluster_id]];
}
var num_cluster = card_cluster.length,
barWidth = Math.floor(width / num_cluster) - 1;
var color = d3.scaleLinear().domain([0, num_cluster - 1])
.interpolate(d3.interpolateHcl)
.range([d3.rgb("#007AFF"), d3.rgb('#FFF500')]);
var max_card = d3.max(card_cluster);
height_s = height / max_card;
width_s = width / num_cluster;
/* Axes */
var x = d3.scaleLinear()
.range([barWidth / 2, width - barWidth / 2]);
var y = d3.scaleLinear()
.range([height, 0]);
/* Scale the range of the data */
x.domain([1, num_cluster]);
y.domain([0, max_card]);
/* An SVG element with a bottom-right origin. */
svg = d3.select("#dp").append("svg")
.attr("width", width + margin.left + margin.right)
.attr("height", height + margin.top + margin.bottom)
.append("g")
.attr("transform", "translate(" + margin.left + "," + margin.top + ")");
/* Add the x Axis */
svg.append("g")
.attr("transform", "translate(0," + height + ")")
.call(d3.axisBottom(x));
/* Add the y Axis */
svg.append("g")
.call(d3.axisLeft(y));
/* text label for the x axis */
svg.append("text")
.attr("transform",
"translate(" + (width/2) + " ," + (height + margin.top + f_h * 0.03) + ")")
.style("text-anchor", "middle")
.text("Cluster ID");
/* text label for the y axis */
svg.append("text")
.attr("transform", "rotate(-90)")
.attr("y", 0 - margin.left)
.attr("x",0 - (height / 2))
.attr("dy", "1em")
.style("text-anchor", "middle")
.text("Number of samples");
var rect = svg.selectAll("rect")
.data(card_cluster)
.enter().append("rect")
.attr("class", function(d, i){return ('dp-' + i);})
.attr("width", 0.98 * width_s)
.attr("height", 0)
.attr("fill", function(d, i){return color(i);})
.attr("transform", function(d, i) { return "translate(" + i * width_s + "," + (height - height_s * d) + ")"; });
/* An SVG element with a bottom-right origin. */
svg_n = d3.select("#dp_gaussian").append("svg")
.attr("width", width + margin.left + margin.right)
.attr("height", height + margin.top + margin.bottom)
.append("g")
.attr("transform", "translate(" + margin.left + "," + margin.top + ")");
min_x = d3.min(clusters);
max_x = d3.max(clusters);
x_width = max_x - min_x;
var x_n = d3.scaleLinear()
.range([barWidth / 2, width - barWidth / 2]);
var y_n = d3.scaleLinear()
.range([height, 0]);
x_n.domain([min_x, max_x]);
y_n.domain([0, max_card]);
svg_n.append("g")
.attr("transform", "translate(0," + height + ")")
.call(d3.axisBottom(x_n));
svg_n.append("g")
.call(d3.axisLeft(y_n));
svg_n.append("text")
.attr("transform",
"translate(" + (width/2) + " ," + (height + margin.top + f_h * 0.03) + ")")
.style("text-anchor", "middle")
.text("x");
svg_n.append("text")
.attr("transform", "rotate(-90)")
.attr("y", 0 - margin.left)
.attr("x",0 - (height / 2))
.attr("dy", "1em")
.style("text-anchor", "middle")
.text("Number of samples");
var rect = svg_n.selectAll("rect")
.data(card_cluster)
.enter().append("rect")
.attr("class", function(d, i){return ('dp-n-' + i);})
.attr("width", 0.98 * width_s)
.attr("height", 0)
.attr("fill", function(d, i){return color(i);})
.attr("transform", function(d, i) { return "translate(" + width * (clusters[i] - min_x) / x_width + "," + (height - height_s * d) + ")"; });
}; /* Onload */
function start_dp() {
d3.select("#start_dp_button").attr('disabled', true);
increment(0);
}
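function increment(step) {
/* Stop once every recorded assignment has been animated. */
if(step >= states.length){
return 0;
}
/* Keep these local: as implicit globals they could be overwritten by
increment_n() before this transition starts when both animations run. */
var cluster_id = states[step][0],
cardinality = states[step][1];
/* Grow the bar of the selected cluster to its new cardinality. */
svg.select('.dp-' + cluster_id)
.transition()
.duration(speed)
.attr("height", height_s * cardinality)
.attr("transform", "translate(" + cluster_id * width_s + "," + (height - height_s * cardinality) + ")")
/* Chain the next step once this transition has finished. */
.on("end", () => increment(step + 1));
}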
function start_dp_n() {
d3.select("#start_dp_n_button").attr('disabled', true);
increment_n(0);
}
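function increment_n(step) {
/* Stop once every recorded assignment has been animated. */
if(step >= states.length){
return 0;
}
/* Keep these local: as implicit globals they could be overwritten by
increment() before this transition starts when both animations run. */
var cluster_id = states[step][0],
cardinality = states[step][1];
/* Select within svg_n, the chart that actually holds the .dp-n-* bars. */
svg_n.select('.dp-n-' + cluster_id)
.transition()
.duration(speed)
.attr("height", height_s * cardinality)
.attr("transform", "translate(" + width * (clusters[cluster_id] - min_x) / x_width + "," + (height - height_s * cardinality) + ")")
/* Chain the next step once this transition has finished. */
.on("end", () => increment_n(step + 1));
}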
</script>
<h2 id="edits">Edits</h2>
<ul>
<li>2017/Mar/1: Fixed the MathJax \mathbf{\theta} rendering problem and corrected grammatical errors</li>
</ul>
<h1 id="references">References</h1>
<div class="footnotes">
<ol>
<li id="fn:1">
<p><a href="https://www.stats.ox.ac.uk/~teh/research/npbayes/Teh2010a.pdf">Y. W. Teh, Dirichlet Process, 2010</a> <a href="#fnref:1" class="reversefootnote">↩</a></p>
</li>
</ol>
</div>