<p>Computer Vision'er' · Chris Choy (chrischoy@ai.stanford.edu)</p>
<h1>DeformNet: Free-Form Deformation Network for 3D Shape Reconstruction from a Single Image</h1>
<p>2017-08-18 · <a href="https://chrischoy.github.io/preprint/deformnet">Permalink</a></p>
<h2 id="abstract">Abstract</h2>
<p>3D reconstruction from a single image is a key problem in multiple applications ranging from robotic manipulation to augmented reality. Prior methods have tackled this problem through generative models which predict 3D reconstructions as voxels or point clouds. However, these methods can be computationally expensive and miss fine shape details. We introduce a new differentiable layer for 3D data deformation and use it in DeformNet to learn free-form deformations usable on multiple 3D data formats. DeformNet takes an image input, searches the nearest shape template from the database, and deforms the template to match the query image. We evaluate our approach on the ShapeNet database and show that: (a) Free-Form Deformation is a powerful new building block for Deep Learning models that manipulate 3D data; (b) DeformNet uses this FFD layer combined with shape retrieval for smooth, detail-preserving 3D reconstruction of qualitatively plausible point clouds with respect to a single query image; and (c) DeformNet quantitatively matches or outperforms other state-of-the-art 3D reconstruction methods by significant margins.</p>
<ul>
<li><a href="https://deformnet-site.github.io/DeformNet-website/">Project page</a></li>
<li><a href="https://arxiv.org/abs/1708.04672">ArXiv</a></li>
</ul>
<p>Author: Andrey Kurenkov</p>
<h1>Weakly Supervised 3D Reconstruction with Manifold Constraint</h1>
<p>2017-06-01 · <a href="https://chrischoy.github.io/preprint/weakly-supervised-reconstruction">Permalink</a></p>
<h2 id="abstract">Abstract</h2>
<p>Volumetric 3D reconstruction has witnessed significant progress in performance through the use of deep neural network based methods that address some of the limitations of traditional reconstruction algorithms. However, this increase in performance requires large-scale annotation of 2D/3D data. This paper introduces a novel generative model for volumetric 3D reconstruction, the Weakly supervised Generative Adversarial Network (WS-GAN), which reduces reliance on expensive 3D supervision. WS-GAN takes an input image, a sparse set of 2D object masks with respective camera parameters, and an unmatched 3D model as inputs during training. WS-GAN uses a learned encoding as input to a conditional 3D-model generator trained alongside a discriminator, which is constrained to the manifold of realistic 3D shapes. We bridge the representation gap between 2D masks and 3D volumes through a perspective raytrace pooling layer that enables perspective projection and allows backpropagation. We evaluate WS-GAN on the ShapeNet, ObjectNet, and Stanford Online Product datasets for reconstruction with single-view and multi-view cases in both synthetic and real images. We compare our method to voxel carving and prior work with full 3D supervision. We also demonstrate that the learned feature representation is semantically meaningful through interpolation and manipulation in the input space.</p>
<ul>
<li><a href="https://arxiv.org/abs/1705.10904">ArXiv</a></li>
</ul>
<p>Author: Christopher B. Choy*</p>
<h1>Expectation Maximization and Variational Inference (Part 2)</h1>
<p>2017-03-23 · <a href="https://chrischoy.github.io/research/Expectation-Maximization-and-Variational-Inference-2">Permalink</a></p>
<p>In the <a href="https://chrischoy.github.io/research/Expectation-Maximization-and-Variational-Inference/">previous post</a>, we covered
variational inference and how to derive update equations. In this post, we will
go over a simple Gaussian Mixture Model with the Dirichlet prior distribution
over the mixture weight.</p>
<p>Let $x_n$ be a datum and $z_n$ be the latent variable that indicates the
assignment of the datum $x_n$ to a cluster $k$, $z_{nk} = I(z_n = k)$. We
denote the weight of a cluster $k$ with $\pi_k$ and the natural parameter of
the cluster as $\eta_k$.</p>
<p>The graphical model of the mixtures looks like the following.</p>
<figure>
<img style="width:30%" class="align-center" src="https://chrischoy.github.io/images/research/graphical_model.png" />
</figure>
<p>Formally, we define the generative process
$p(\pi|\alpha_0), p(z_n \mid \pi), p(x_n \mid z_n, \eta)$.
Unlike Bishop <sup id="fnref:1"><a href="#fn:1" class="footnote">1</a></sup> and Blei et al. <sup id="fnref:2"><a href="#fn:2" class="footnote">2</a></sup>, we will not use a prior over the natural
parameter $\eta$, for simplicity. The notation and the model are similar to those
used in Blei et al. <sup id="fnref:2:1"><a href="#fn:2" class="footnote">2</a></sup>. Overloading notation slightly,</p>
$$
\begin{align}
p(\pi | \alpha_0) & = \mathrm{Dir}(\pi; \alpha_0) \\
p(z_n | \pi) & = \prod_k \pi_k^{z_{nk}} \\
p(x_n | z_n, \eta) & = \prod_k \mathcal{N}(x_n ; \eta_k)^{z_{nk}}
\end{align}
$$
<p>And the log joint probability is</p>
$$
\log p(\mathbf{x}, \mathbf{z}, \pi ; \eta, \alpha_0) = \sum_n \sum_k z_{nk} [\log \pi_k + \log \mathcal{N}(x_n ; \eta_k)] + \log \mathrm{Dir}(\pi; \alpha_0)
$$
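To make the model concrete, the log joint above can be evaluated numerically. Below is a minimal sketch using SciPy; all parameter values, data, and assignments are hypothetical.

```python
import numpy as np
from scipy.stats import dirichlet, norm

# Toy instantiation (all values hypothetical): K = 2 clusters, N = 4 data points.
alpha0 = np.array([1.0, 1.0])          # symmetric Dirichlet prior
pi = np.array([0.3, 0.7])              # mixture weights
mu, sigma = np.array([-2.0, 2.0]), np.array([1.0, 1.0])
x = np.array([-1.5, 2.2, 1.8, -2.4])
z = np.array([0, 1, 1, 0])             # hard assignments z_n

# log p(x, z, pi) = sum_n [log pi_{z_n} + log N(x_n; eta_{z_n})] + log Dir(pi; alpha0)
log_joint = (np.log(pi[z]).sum()
             + norm.logpdf(x, loc=mu[z], scale=sigma[z]).sum()
             + dirichlet.logpdf(pi, alpha0))
```

With the flat prior $\alpha_0 = (1, 1)$ the Dirichlet term vanishes and only the assignment and likelihood terms contribute.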
<h2 id="meanfield-approximation">Mean-Field Approximation</h2>
<p>In this example, let’s use the mean-field approximation and make the posterior
distribution of the latent variables $z$ and $\pi$ independent, i.e.,</p>
$$
q(z, \pi) = q(z)q(\pi)
$$
<p>From the <a href="https://chrischoy.github.io/research/Expectation-Maximization-and-Variational-Inference/">previous post</a>, we know that
the optimal distribution $q(\cdot)$ that maximizes the evidence lower bound
is</p>
$$
\log q(w_i) = \mathbb{E}_{w_{j}, j\neq i} \log p(x, \mathbf{w})
$$
<p>where $w_i$ is an arbitrary latent variable. Thus, we can use the same
technique and find $q(z)$ and $q(\pi)$.</p>
$$
\begin{align*}
\log q(z) & = \sum_n \sum_k z_{nk} [\mathbb{E}\log \pi_k + \log \mathcal{N}(x_n ; \eta_k)] + \mathbb{E}\log \mathrm{Dir}(\pi; \alpha_0) \\
& = \sum_n \sum_k z_{nk} [\mathbb{E}\log \pi_k + \log \mathcal{N}(x_n ; \eta_k)] + C_1 \\
\log q(\pi) & = \sum_n \sum_k \mathbb{E}z_{nk} [\log \pi_k + \log \mathcal{N}(x_n ; \eta_k)] + \log \mathrm{Dir}(\pi; \alpha_0) \\
& = \sum_n \sum_k \mathbb{E}z_{nk} \log \pi_k + \log \mathrm{Dir}(\pi; \alpha_0) + C_2
\end{align*}
$$
<p>We can easily compute the expectations of the latent variables.</p>
$$
\begin{align*}
\mathbb{E}\log \pi_k & = \psi(\alpha_k) - \psi(\sum_k \alpha_k) = \log \tilde{\pi}_k \\
\rho_{nk} & = \exp\left\{\log \tilde{\pi}_k + \log \mathcal{N}(x_n; \eta_k)\right\} \\
\mathbb{E}z_{nk} & = q(z_{nk}=1) = \frac{\rho_{nk}}{\sum_l \rho_{nl}} = r_{nk}
\end{align*}
$$
<p>where $\alpha_k$ are the parameters of the latent variable $\pi_k$ and $\psi$
is the digamma function. We get the first equation from the property of the
Dirichlet distribution. Given the expectations, we can simplify the equations
and get update rules.</p>
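These expectations map directly onto a few lines of NumPy: the digamma function gives $\log \tilde{\pi}_k$, and normalizing $\rho_{nk}$ over $k$ gives the responsibilities $r_{nk}$. A sketch with hypothetical variational parameters:

```python
import numpy as np
from scipy.special import digamma, logsumexp
from scipy.stats import norm

# Hypothetical current variational parameters for K = 2 clusters, N = 3 points.
alpha = np.array([3.0, 5.0])                      # Dirichlet parameters alpha_k
mu, sigma = np.array([-2.0, 2.0]), np.array([1.0, 1.0])
x = np.array([-1.9, 0.1, 2.3])

# E[log pi_k] = psi(alpha_k) - psi(sum_k alpha_k)
log_pi_tilde = digamma(alpha) - digamma(alpha.sum())

# log rho_{nk} = E[log pi_k] + log N(x_n; eta_k); normalize over k to get r_{nk}
log_rho = log_pi_tilde[None, :] + norm.logpdf(x[:, None], loc=mu, scale=sigma)
r = np.exp(log_rho - logsumexp(log_rho, axis=1, keepdims=True))
```

Working in log space and normalizing with `logsumexp` avoids underflow when the per-cluster likelihoods are tiny.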
<h2 id="expectation-and-maximization">Expectation and Maximization</h2>
<p>First, let’s examine the $\log q(\pi)$.</p>
$$
\begin{align*}
\log q(\pi) & = \sum_n \sum_k r_{nk} \log \pi_k + \log \mathrm{Dir}(\pi; \alpha_0) + C_2 \\
& = \sum_n \sum_k r_{nk} \log \pi_k + \sum_k (\alpha_0 - 1) \log \pi_k + C_3 \\
& = \sum_k (\alpha_0 + \sum_n r_{nk} - 1) \log \pi_k + C_3 \\
& = \log \mathrm{Dir}(\pi; \alpha) + C_4
\end{align*}
$$
<p>Thus, $\alpha_k = \alpha_0 + \sum_n r_{nk}$. The $z$ update equation is given
above. Finally, for $\eta$, we differentiate the expected complete-data log likelihood
$Q(\eta) = \mathbb{E}_{z,\pi} \log p(x, z, \pi; \eta)$ with respect to
$\eta$ to find the update rule.</p>
$$
\begin{align*}
Q(\eta) & = \mathop{\mathbb{E}}_{z, \pi} \log p(x, z, \pi; \eta) \\
& = \sum_n \sum_k \mathbb{E} z_{nk} [\mathbb{E}\log \pi_k + \log \mathcal{N}(x_n ; \eta_k)] + \mathbb{E}\log \mathrm{Dir}(\pi; \alpha_0) \\
\nabla_{\eta_k} Q(\eta) & = \sum_n r_{nk} \nabla_{\eta_k} \log \mathcal{N}(x_n ; \eta_k) \\
& = \sum_n r_{nk} \nabla_{\eta_k} \left( \frac{1}{2} \log |\Lambda_k| - \frac{1}{2} \mathrm{Tr}\left(\Lambda_k (x_n - \mu_k)(x_n - \mu_k)^T \right) \right) \\
\nabla_{\mu_k} Q(\eta) & = \sum_n r_{nk} \Lambda_k (x_n - \mu_k) = 0 \\
\nabla_{\Lambda_k} Q(\eta) & = \frac{1}{2} \sum_n r_{nk} \left( \nabla_{\Lambda_k} \log |\Lambda_k| - \nabla_{\Lambda_k} \mathrm{Tr}\left(\Lambda_k (x_n - \mu_k)(x_n - \mu_k)^T \right) \right) \\
& = \frac{1}{2} \sum_n r_{nk} \left( \Lambda_k^{-1} - (x_n - \mu_k)(x_n - \mu_k)^T \right) = 0 \\
\end{align*}
$$
<p>From the above equations, we can get</p>
$$
\begin{align}
N_k & = \sum_n r_{nk} \\
\mu_k & = \frac{1}{N_k} \sum_n r_{nk} x_n \\
\Lambda_k^{-1} & = \frac{1}{N_k} \sum_n r_{nk} (x_n - \mu_k)(x_n - \mu_k)^T
\end{align}
$$
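These update equations are just responsibility-weighted averages. A minimal NumPy sketch (the responsibilities and data below are hypothetical; the third quantity is the empirical covariance of each cluster):

```python
import numpy as np

# Hypothetical responsibilities r (N = 4, K = 2) and 1-D data x as column vectors.
r = np.array([[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]])
x = np.array([-2.1, -1.8, 2.2, 1.7])[:, None]
alpha0 = 1.0

N_k = r.sum(axis=0)                              # N_k = sum_n r_nk
mu = (r.T @ x) / N_k[:, None]                    # mu_k = (1/N_k) sum_n r_nk x_n
# (1/N_k) sum_n r_nk (x_n - mu_k)(x_n - mu_k)^T  -- the cluster covariance
cov = np.stack([
    (r[:, k, None] * (x - mu[k])).T @ (x - mu[k]) / N_k[k]
    for k in range(2)
])
alpha = alpha0 + N_k                             # Dirichlet update alpha_k = alpha_0 + N_k
```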
<h2 id="evidence-lower-bound">Evidence Lower Bound</h2>
<p>Given the final solutions $r_{nk}$, $\log \tilde{\pi}_k$, and $\alpha'$, we can
derive the negative of the variational free energy, also known as the Evidence Lower Bound (ELBO).</p>
$$
\begin{align*}
ELBO & = \mathbb{E}_z \mathbb{E}_\pi \log \frac{p(x, z, \pi)}{q(z, \pi)} \\
& = \mathbb{E}_z \mathbb{E}_\pi \log p(x | z) p(z| \pi) p(\pi) - \mathbb{E}_z\mathbb{E}_\pi \log q(z)q(\pi) \\
& = \underbrace{\mathbb{E}_z \log p(x | z)}_{\mbox{(a)}}
+ \underbrace{\mathbb{E}_z \mathbb{E}_\pi \log p(z | \pi) p(\pi) }_{\mbox{(b)}}
+ \underbrace{H(q(z))}_{\mbox{(c)}}
+ \underbrace{H(q(\pi))}_{\mbox{(d)}}
\end{align*}
$$
<p>where $H(\cdot)$ is the entropy. Each of the terms can be computed</p>
$$
\begin{align*}
\mbox{(a)} & = \mathbb{E}_z \log p(x | z) \\
& = \mathbb{E}_z \mathbb{E}_\pi \sum_n \sum_k z_{nk} \log \mathcal{N}_k(x_n) \\
& = \sum_n \sum_k r_{nk} \log \mathcal{N}_k(x_n) \\
\mbox{(b)} & = \mathbb{E}_z \mathbb{E}_\pi \log p(z | \pi) p(\pi) \\
& = \mathbb{E}_z \mathbb{E}_\pi \left[ \sum_n \sum_k z_{nk} \log \pi_k + \sum_k (\alpha_0 - 1) \log \pi_k \right] - \log B(\boldsymbol{\alpha}_0) \\
& = \sum_k \left( \sum_n \mathbb{E}_z z_{nk} + \alpha_0 - 1 \right) \mathbb{E}_\pi \log \pi_k - \log B(\boldsymbol{\alpha}_0) \\
& = \sum_k \left( \sum_n r_{nk} + \alpha_0 - 1 \right) \log \tilde{\pi}_k - \log B(\boldsymbol{\alpha}_0) \\
\mbox{(c)} & = - \mathbb{E}_z \log q(z) \\
& = - \mathbb{E}_z \sum_n \sum_k z_{nk} \log r_{nk} \\
& = - \sum_n \sum_k r_{nk} \log r_{nk} \\
\mbox{(d)} & = - \mathbb{E}_\pi \log q(\pi) \\
& = - \mathbb{E}_\pi \log \frac{1}{B(\boldsymbol{\alpha}')} \prod_k \pi_k^{\alpha'_k - 1} \\
& = - \sum_k (\alpha'_k - 1) \mathbb{E}_\pi \log \pi_k + \log B(\boldsymbol{\alpha}') \\
& = - \sum_k (\alpha'_k - 1) \log \tilde{\pi}_k + \log B(\boldsymbol{\alpha}')
\end{align*}
$$
<p>Since $\log r_{nk} = \log \tilde{\pi}_k + \log \mathcal{N}_k(x_n) - \log \left( \sum_l \exp \{\log \tilde{\pi}_l + \log \mathcal{N}_l(x_n) \} \right)$,</p>
$$
\begin{align*}
\mbox{(a) + (c)} & = \sum_n \sum_k r_{nk} \left(\log \mathcal{N}_k(x_n) - \log r_{nk} \right) \\
& = \sum_n \sum_k r_{nk} \left(- \log \tilde{\pi}_k + \log \left( \sum_l \exp \{ \log \tilde{\pi}_l + \log \mathcal{N}_l(x_n) \} \right) \right)\\
& = - \sum_k N_k \log \tilde{\pi}_k + \sum_n \log \left( \sum_l \exp \{ \log \tilde{\pi}_l + \log \mathcal{N}_l(x_n) \} \right) \\
\mbox{(b) + (d)} & = \sum_k \left( \sum_n r_{nk} + \alpha_0 - 1 \right) \log \tilde{\pi}_k - \log B(\boldsymbol{\alpha}_0) \\
& - \sum_k (\alpha'_k - 1) \log \tilde{\pi}_k + \log B(\boldsymbol{\alpha}') \\
& = \sum_k \left( \sum_n r_{nk} + \alpha_0 - \alpha'_k \right) \log \tilde{\pi}_k - \log B(\boldsymbol{\alpha}_0) + \log B(\boldsymbol{\alpha}') \\
& = \log B(\boldsymbol{\alpha}') - \log B(\boldsymbol{\alpha}_0)
\end{align*}
$$
<p>Thus,</p>
$$
\begin{align*}
ELBO = & \mathbb{E}_z \mathbb{E}_\pi \log \frac{p(x, z, \pi)}{q(z, \pi)} \\
= & - \sum_k N_k \log \tilde{\pi}_k + \sum_n \log \left( \sum_l \exp \{ \log \tilde{\pi}_l + \log \mathcal{N}_l(x_n) \} \right) \\
& + \log B(\boldsymbol{\alpha}') - \log B(\boldsymbol{\alpha}_0) \\
\end{align*}
$$
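The final ELBO expression can be evaluated directly, which is useful for monitoring convergence; $\log B(\cdot)$ is computed with `gammaln`. A sketch with hypothetical variational parameters:

```python
import numpy as np
from scipy.special import digamma, gammaln, logsumexp
from scipy.stats import norm

def log_B(a):
    # log of the multivariate Beta function B(alpha)
    return gammaln(a).sum() - gammaln(a.sum())

# Hypothetical variational state for K = 2 after an update pass.
alpha0 = np.array([1.0, 1.0])
alpha = np.array([3.0, 3.0])
mu, sigma = np.array([-1.4, 1.4]), np.array([0.5, 0.5])
x = np.array([-2.1, -1.8, 2.2, 1.7])

log_pi_tilde = digamma(alpha) - digamma(alpha.sum())
log_rho = log_pi_tilde[None, :] + norm.logpdf(x[:, None], loc=mu, scale=sigma)
r = np.exp(log_rho - logsumexp(log_rho, axis=1, keepdims=True))
N_k = r.sum(axis=0)

# ELBO = -sum_k N_k log pi~_k + sum_n logsumexp_l {log pi~_l + log N_l(x_n)}
#        + log B(alpha') - log B(alpha_0)
elbo = (-(N_k * log_pi_tilde).sum()
        + logsumexp(log_rho, axis=1).sum()
        + log_B(alpha) - log_B(alpha0))
```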
<h1 id="references">References</h1>
<div class="footnotes">
<ol>
<li id="fn:1">
<p>C. Bishop, Pattern Recognition and Machine Learning. Springer, 2006 <a href="#fnref:1" class="reversefootnote">↩</a></p>
</li>
<li id="fn:2">
<p>Blei, <a href="http://www.cs.columbia.edu/~blei/papers/BleiJordan2004.pdf">Variational Inference for Dirichlet Process Mixtures, Bayesian Analysis 2006</a> <a href="#fnref:2" class="reversefootnote">↩</a> <a href="#fnref:2:1" class="reversefootnote">↩<sup>2</sup></a></p>
</li>
</ol>
</div>
<p>Author: Chris Choy (chrischoy@ai.stanford.edu)</p>
<h1>Scene Graph Generation by Iterative Message Passing</h1>
<p>2017-03-14 · <a href="https://chrischoy.github.io/publication/scene-graph">Permalink</a></p>
<h2 id="abstract">Abstract</h2>
<p>Understanding a visual scene goes beyond recognizing individual objects in isolation. Relationships between objects also constitute rich semantic information about the scene. In this work, we explicitly model the objects and their relationships using scene graphs, a visually-grounded graphical structure of an image. We propose a novel end-to-end model that generates such a structured scene representation from an input image. The model solves the scene graph inference problem using standard RNNs and learns to iteratively improve its predictions via message passing. Our joint inference model can take advantage of contextual cues to make better predictions on objects and their relationships. The experiments show that our model significantly outperforms previous methods on generating scene graphs using the Visual Genome dataset and inferring support relations with the NYU Depth v2 dataset.</p>
<ul>
<li><a href="https://arxiv.org/abs/1701.02426">ArXiv</a></li>
</ul>
<p>Author: Danfei Xu</p>
<h1>DESIRE: Deep Stochastic IOC RNN Encoder-decoder for Distant Future Prediction in Dynamic Scenes with Multiple Interacting Agents</h1>
<p>2017-03-14 · <a href="https://chrischoy.github.io/publication/desire">Permalink</a></p>
<h2 id="abstract">Abstract</h2>
<p>We introduce a Deep Stochastic IOC RNN Encoder-decoder framework, DESIRE, with a conditional Variational Auto-Encoder and multiple RNNs for the task of future prediction of multiple interacting agents in dynamic scenes. Accurately predicting the location of objects in the future is an extremely challenging task. An effective prediction model must be able to 1) account for the multi-modal nature of future prediction (i.e., given the same context, the future may vary), 2) foresee the potential future outcomes and make a strategic prediction based on them, and 3) reason not only from the past motion history, but also from the scene context as well as the interactions among the agents.
DESIRE can address all of the aforementioned challenges in a single end-to-end trainable neural network model, while being computationally efficient. The model first obtains a diverse set of hypothetical future prediction samples employing a conditional variational auto-encoder, which are ranked and refined via the following RNN scoring-regression module. We evaluate our model on two publicly available datasets: KITTI and the Stanford Drone Dataset. Our experiments show that the proposed model significantly improves prediction accuracy compared to other baseline methods.</p>
<p>Author: Namhoon Lee</p>
<h1>Expectation Maximization and Variational Inference (Part 1)</h1>
<p>2017-02-26 · <a href="https://chrischoy.github.io/research/Expectation-Maximization-and-Variational-Inference">Permalink</a></p>
<p>Statistical inference involves finding the right model and parameters that represent
the distribution of observations well. Let $\mathbf{x}$ be the observations and
$\theta$ be the unknown parameters of an ML model. In maximum likelihood
estimation, we try to find the $\theta_{ML}$ that maximizes the probability of
the observations under the model:</p>
$$
\hat{\theta}_{ML} = \underset{\theta}{\arg\!\max} \; p(\mathbf{x}; \theta)
$$
<p>Typically, solving the above optimization efficiently requires a few
simplifying assumptions. One trick is to introduce latent variables $\mathbf{z}$ that
break the problem down into smaller subproblems. For instance, in the <a href="https://en.wikipedia.org/wiki/Mixture_model#Gaussian_mixture_model">Gaussian
Mixture Model</a>, we
can introduce the cluster membership assignment as a random variable $z_i$ for
each datum $x_i$, which greatly simplifies the model ($p(x_i \mid z_i=k) =
\mathcal{N}(x_i; \mu_k, \sigma_k)$).</p>
$$
p(\mathbf{x};\theta) = \int p(\mathbf{x}, \mathbf{z}; \theta) d\mathbf{z}
$$
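For a Gaussian Mixture Model the integral over $\mathbf{z}$ reduces to a finite sum over cluster assignments, so the marginal likelihood can be computed exactly. A sketch in NumPy (all parameter values hypothetical):

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import norm

# p(x_n; theta) = sum_k pi_k N(x_n; mu_k, sigma_k); hypothetical parameters.
pi = np.array([0.4, 0.6])
mu, sigma = np.array([-2.0, 2.0]), np.array([1.0, 1.0])
x = np.array([-1.5, 0.3, 2.1])

# log p(x_n; theta) = logsumexp_k [log pi_k + log N(x_n; mu_k, sigma_k)]
log_px = logsumexp(np.log(pi)[None, :] + norm.logpdf(x[:, None], loc=mu, scale=sigma),
                   axis=1)
log_likelihood = log_px.sum()
```

When $\mathbf{z}$ is continuous or the model is more complex, this sum becomes an intractable integral, which is exactly the situation the rest of the post addresses.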
<p>However, the above integration is, in many cases, intractable. It can either be
approximated using stochastic sampling (Monte Carlo methods) or bypassed
altogether with a few assumptions. The second approach is called
variational inference, coined after the <a href="https://en.wikipedia.org/wiki/Calculus_of_variations">calculus of
variations</a>, and is what we will
go over in this post.</p>
<h2 id="evidence-lower-bound-elbo">Evidence Lower Bound (ELBO)</h2>
<p>There are many great tutorials for variational inference, but I found the
tutorial by Tzikas et al.<sup id="fnref:1"><a href="#fn:1" class="footnote">1</a></sup> to be the most helpful. It follows the steps of
Bishop<sup id="fnref:2"><a href="#fn:2" class="footnote">2</a></sup> and Neal and Hinton<sup id="fnref:3"><a href="#fn:3" class="footnote">3</a></sup> and introduces
inference through Expectation Maximization. Here, we will summarize the steps in Tzikas
et al.<sup id="fnref:1:1"><a href="#fn:1" class="footnote">1</a></sup> and elaborate on some steps missing from the paper. Let $q(z)$ be a
probability distribution over $z$. Then,</p>
$$
\begin{align*}
\ln p(x; \theta) & = \int q(z) \ln p(x; \theta) dz \\
& = \int q(z) \ln \Big( \frac{p(x; \theta) p(z | x; \theta)}{p(z | x; \theta)} \Big) dz \\
& = \int q(z) \ln \Big( \frac{p(x, z; \theta)}{p(z | x; \theta)} \Big) dz \\
& = \int q(z) \ln \Big( \frac{p(x, z; \theta) q(z)}{p(z | x; \theta) q(z)} \Big) dz \\
& = \int q(z) \ln \Big( \frac{p(x, z; \theta)}{q(z)} \Big) dz
- \int q(z) \ln \Big( \frac{p(z | x; \theta)}{q(z)} \Big) dz \\
& = F(q, \theta) + KL(q || p)
\end{align*}
$$
<p>where $F(q, \theta)$ is known as the evidence lower bound or ELBO, or the negative
of the variational free energy, and $KL(\cdot || \cdot)$ is the Kullback-Leibler
divergence. Since the KL-divergence is non-negative,</p>
$$
\ln p(x; \theta) \ge F(q, \theta)
$$
<p>The ELBO provides a lower bound for the marginal likelihood. Instead of
maximizing the marginal likelihood directly, the Expectation
Maximization (EM) and variational inference maximize the variational lower bound.</p>
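Since the decomposition $\ln p(x; \theta) = F(q, \theta) + KL(q || p)$ holds for any $q$, it can be checked numerically on a toy model with a discrete latent variable. The sketch below uses made-up joint probabilities (all numbers hypothetical):

```python
import numpy as np

# A tiny discrete model: z in {0, 1, 2}; p(x, z) for the observed x is a vector.
p_xz = np.array([0.10, 0.25, 0.05])      # p(x, z) for z = 0, 1, 2 (hypothetical)
p_x = p_xz.sum()                          # marginal likelihood p(x)
post = p_xz / p_x                         # exact posterior p(z | x)

q = np.array([0.5, 0.3, 0.2])            # an arbitrary distribution q(z)

F = (q * np.log(p_xz / q)).sum()         # ELBO: sum_z q(z) log [p(x, z) / q(z)]
KL = (q * np.log(q / post)).sum()        # KL(q || p(z | x))

assert np.isclose(np.log(p_x), F + KL)   # the decomposition holds exactly
assert KL >= 0                            # so F is a lower bound on log p(x)
```

Setting `q = post` drives the KL term to zero and makes the bound tight, which is precisely what the E-step of EM does.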
<h2 id="expectation-maximization">Expectation Maximization</h2>
<p>Let’s assume that we can find $p(z | x; \theta^{OLD})$ analytically (for
the Gaussian Mixture Model, this is just a softmax). Then, we can simply
substitute $q(z) = p(z | x; \theta^{OLD})$. The ELBO becomes</p>
$$
\begin{align*}
F(q, \theta) & = \int q(z) \ln \Big( \frac{p(x, z; \theta)}{q(z)} \Big) dz \\
& = \int q(z) \ln p(x, z; \theta) dz - \int q(z) \ln q(z) dz \\
& = \int p(z | x; \theta^{OLD}) \ln p(x, z; \theta) dz \\
& \quad - \int p(z | x; \theta^{OLD}) \ln p(z | x; \theta^{OLD}) dz \\
& = Q(\theta, \theta^{OLD}) + H(q)
\end{align*}
$$
<p>The second term, $H(q)$, is the entropy of $q(z) = p(z | x; \theta^{OLD})$ and is a function of
$\theta^{OLD}$. It is a constant with respect to $\theta$, so we do not take
the term into account while maximizing the ELBO.</p>
<p>The EM algorithm can be succinctly summarized as ${\arg\!\max}_\theta Q(\theta,
\theta^{OLD})$.</p>
<ul>
<li>E-step: compute $p(z | x; \theta^{OLD})$</li>
<li>M-step: evaluate ${\arg\!\max}_\theta \int p(z | x; \theta^{OLD}) \ln p(x, z; \theta) dz$</li>
</ul>
<p>For example, the EM for the Gaussian Mixture Model consists of an expectation step
where you compute the soft assignment of each datum to K clusters, and a maximization
step which computes the parameters of each cluster using the assignment.
However, for complex models, we cannot use the EM algorithm.</p>
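The two steps can be sketched for a 1-D, two-component GMM in a few lines of NumPy (synthetic data; all initial values hypothetical):

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import norm

rng = np.random.default_rng(0)
# Synthetic 1-D data from two well-separated components.
x = np.concatenate([rng.normal(-2, 1, 100), rng.normal(2, 1, 100)])

# Hypothetical initial parameters for K = 2.
pi, mu, sigma = np.array([0.5, 0.5]), np.array([-1.0, 1.0]), np.array([1.0, 1.0])

for _ in range(50):
    # E-step: soft assignments p(z_n = k | x_n; theta_old)
    log_r = np.log(pi) + norm.logpdf(x[:, None], loc=mu, scale=sigma)
    r = np.exp(log_r - logsumexp(log_r, axis=1, keepdims=True))
    # M-step: maximize Q(theta, theta_old) with closed-form updates
    N_k = r.sum(axis=0)
    pi = N_k / len(x)
    mu = (r * x[:, None]).sum(axis=0) / N_k
    sigma = np.sqrt((r * (x[:, None] - mu) ** 2).sum(axis=0) / N_k)
```

After a few iterations the means settle near the true component centers.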
<h2 id="variational-expectation-maximization">Variational Expectation Maximization</h2>
<p>For a simple model, an analytical solution for $p(z | x; \theta^{OLD})$ exists and
thus computing $q(z) = p(z | x; \theta^{OLD})$ is tractable.
However, it is not possible in general as the model gets more complex.
Instead, we approximate the posterior probability using a simpler model. For
example, we assume that a set of latent variables is independent of the rest of
the latent variables given $x$. Such independence reduces complexity and allows
us to deduce the analytic form of the EM.</p>
<p>We can even enforce full independence among all latent variables given $x$,
i.e., $z_i \perp z_j$ for $i \neq j$. This assumption, known as the mean field
approximation, allows us to compute the update rules for each latent variable
in isolation and has been successful in many problems. We will go over
variational inference using the mean field approximation, but the following
technique can be used for models with more complex dependency.</p>
<p>Let $q(z) = \prod_i q(z_i)$. Then, the ELBO can be
factorized into $z_j$ and the rest of the latent variables.</p>
$$
\begin{align*}
F(q, \theta) & = \int q(z) \ln \Big( \frac{p(x, z; \theta)}{q(z)} \Big) dz \\
& = \int \prod_i q(z_i) \ln p(x, z; \theta) dz - \sum_i \int q(z_i) \ln q(z_i) dz_i \\
& = \int q(z_j) \int \Big( \prod_{i \neq j} q(z_i) \ln p(x, z; \theta) \Big) \prod_{i \neq j} dz_i dz_j \\
& \quad - \int q(z_j) \ln q(z_j) dz_j - \sum_{i \neq j} \int q(z_i) \ln q(z_i) dz_i \\
& = \int q(z_j) \ln \Big( \frac{\exp(\langle \ln p(x, z; \theta)\rangle_{i \neq j})}{q(z_j)} \Big) dz_j \\
& \quad - \sum_{i \neq j} \int q(z_i) \ln q(z_i) dz_i \\
& = \int q(z_j) \ln \Big( \frac{\tilde{p}_{i\neq j}}{q(z_j)} \Big) dz_j + H(z_{i\neq j}) + c\\
& = - KL(q_j || \tilde{p}_{i\neq j}) + H(z_{i\neq j}) + c
\end{align*}
$$
<p>where $\langle \cdot \rangle_i$ indicates the expectation over the latent
variable $z_i$. Since $\exp(\langle \ln p(x, z; \theta)\rangle_{i \neq j})$ is
not a proper pdf, the constant $c$ is added to adjust it to become a proper
pdf. Since the KL-divergence is non-negative, the ELBO
is maximized when $KL(\cdot || \cdot) = 0$ which happens when $q(z_j) =
\tilde{p}_{i\neq j} = \frac{1}{Z} \exp \langle \ln p(x, z; \theta)\rangle_{i
\neq j}$.</p>
<p>Similarly, in the variational EM,</p>
<ul>
<li>E-step: evaluate $q^*(z_j) = \frac{1}{Z} \exp \langle \ln p(x, z;
\theta)\rangle_{i \neq j}$ for all $j$,
<ul>
<li>$q^{NEW} = \prod_i q_i^*$</li>
</ul>
</li>
<li>M-step: find $\theta = {\arg\!\max}_\theta F(q^{NEW}, \theta)$</li>
</ul>
<p>Here, $q^*$ is the optimal distribution that maximizes $F(q, \theta)$, and in
practice it often takes the form of a known probability distribution. In that case,
$\theta^{NEW}$ is simply read off as the parameters of that distribution.
However, if $q^*$ cannot be simplified into a known form, solving the KKT
conditions and setting the derivative of the ELBO to zero will give you a
solution.</p>
<h2 id="conclusion">Conclusion</h2>
<p>The variational EM gives us a way to bypass computing the partition function and
allows us to infer the parameters of a complex model using a deterministic
optimization step. In the <a href="https://chrischoy.github.io/research/Expectation-Maximization-and-Variational-Inference-2/">next post</a>, I will
give a concrete example with a simple Gaussian Mixture Model.</p>
<h1 id="references">References</h1>
<div class="footnotes">
<ol>
<li id="fn:1">
<p>D. G. Tzikas, A. C. Likas, and N. P. Galatsanos, The Variational Approximation for Bayesian Inference, IEEE Signal Processing Magazine, Nov 2008 <a href="#fnref:1" class="reversefootnote">↩</a> <a href="#fnref:1:1" class="reversefootnote">↩<sup>2</sup></a></p>
</li>
<li id="fn:2">
<p>C. Bishop, Pattern Recognition and Machine Learning. Springer, 2006 <a href="#fnref:2" class="reversefootnote">↩</a></p>
</li>
<li id="fn:3">
<p>R.M. Neal and G.E. Hinton, A view of the EM algorithm that justifies incremental, sparse and other variants, Learning in Graphical Models, 1998 <a href="#fnref:3" class="reversefootnote">↩</a></p>
</li>
</ol>
</div>
<p>Author: Chris Choy (chrischoy@ai.stanford.edu)</p>
<h1>Dirichlet Process Mixtures and Inference (Part 1)</h1>
<p>2016-12-27 · <a href="https://chrischoy.github.io/research/Dirichlet-Process">Permalink</a></p>
<figure>
<img src="https://chrischoy.github.io/images/research/dirichlet-process-mixtures.png" />
</figure>
<p>Statistical inference often requires modeling the distribution of data.
There are two branches of statistical modeling: parametric and non-parametric methods.
The former specifies the data distribution using a family of distributions
with a finite number of parameters. In non-parametric methods, there is no
limit on the number of parameters, which makes the name <em>non-parametric</em> a bit
misleading.</p>
<p>One family of non-parametric methods is well known and has long been studied
for its application in clustering: Dirichlet Process Mixtures.
In a standard mixture model, the number of clusters
is left as a hyper-parameter that requires tuning. In Dirichlet
Process Mixtures (DPM), however, we can instead <em>infer</em> the number of clusters.
Before we discuss the DPM, we will cover the Dirichlet Process (DP).</p>
<h2 id="dirichlet-process">Dirichlet Process</h2>
<p>In short, the Dirichlet Process is a generalization of the Dirichlet distribution
in which each sample from the process is itself a distribution.
Since its samples are distributions, we also call it a distribution over
distributions. Interestingly, the generalization allows the Dirichlet Process
to have an infinite number of components (or clusters). For this reason, the DP is a
non-parametric method: there is no limit on the number of
parameters. In practice, however, due to practical limitations on
memory, computation, and time, we use the Truncated Dirichlet Process (TDP) during
inference, which puts a limit on the number of clusters and therefore parameters.
Also, on real data, you will only ever get at most as many clusters as data
points :)</p>
<p>In this post, we will only consider the definition and one particular sampling procedure of the DP. We
will cover the DPM and inference processes (inferring the number of clusters, as well as
the parameters of the clusters) in the following posts.</p>
<h3 id="dirichlet-distribution">Dirichlet Distribution</h3>
<p>First, let the Dirichlet distribution with parameter $\mathbf{a} \in
\mathbb{R}_{++}^K$ (K categories) be $Dir(\mathbf{a})$. If you haven’t picked up
a probability book in years and you are fuzzy on the details, you only need to know
that one of the properties of the Dirichlet distribution is that it is a
conjugate prior of a multinomial distribution and thus has the following
property. Let $A_k$ be the set of data from the $k$th category.</p>
$$
\begin{align}
\mathbf{p} & \sim Dir(\mathbf{a}), \quad \theta_i \,|\, \mathbf{p} \sim \mathbf{p} \\
\boldsymbol{\theta} & = (\theta_1, ..., \theta_N) \quad N \text{ samples} \\
n_k & = |\{i : \theta_i \in A_k\}| \quad \text{number of samples in category } k \\
\mathbf{n} & = (n_1, ..., n_K) \\
\mathbf{p} \,|\, \boldsymbol{\theta} & \sim Dir(\mathbf{a} + \mathbf{n})
\end{align}
$$
<p>This is too much formality, but it will be helpful for the notation which I will use later.
Basically, it says that if you observe $N$ samples and $n_k$ of them fall into
the $k$th class, the posterior distribution after you observe the $N$ samples
will be skewed to favor classes with more samples and the contribution is
simply additive.</p>
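The additive-counts property is easy to see numerically; a small sketch with a hypothetical prior and observations:

```python
import numpy as np

a = np.array([2.0, 2.0, 2.0])            # prior Dir(a), K = 3 categories (hypothetical)
observations = [0, 0, 1, 0, 2, 0]        # category of each of N = 6 samples
n = np.bincount(observations, minlength=3)

a_post = a + n                           # posterior is Dir(a + n): counts just add
predictive = a_post / a_post.sum()       # probability the next sample lands in A_k
```

Category 0 received four of the six samples, so the posterior predictive is skewed toward it, exactly the rich-get-richer behavior described above.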
<h3 id="definition-of-the-dirichlet-process">Definition of the Dirichlet Process</h3>
<p>The formal definition of the Dirichlet Process is similar to that of many
stochastic processes: the marginals of a distribution or a partition of a space follow a certain
distribution. Here, as the name indicates, the partition of a space follows the
Dirichlet distribution.</p>
<p>For example, let the space $A$ be a real number and a $K$-partition of the space
is equivalent to making $K$ clusters in the space (this clustering effect gives
us another way to generate samples by de Finetti’s theorem. I will not cover
this in this post). If I denote the Dirichlet Process as $G$, then</p>
$$
(G(A_1), ..., G(A_K)) \sim Dir(a_1, ..., a_K)
$$
<p>Think of $G(A_i)$s as random variables that follow the Dirichlet distribution.
More formally, let the base distribution be $H$ and the concentration parameter
be $\alpha$.</p>
$$
(G(A_1), ..., G(A_K)) \sim Dir(\alpha H(A_1), ..., \alpha H(A_K))
$$
<p>For example, let $H = \mathcal{N}(0, 1)$ and $\alpha$ be an arbitrary
positive number. In this case, since the base distribution is a Gaussian, we
will sample from the Gaussian whenever we sample a new cluster. If we sample
multiple distributions from the Dirichlet Process, the average of the whole
process will be the base Gaussian, i.e.,
$E[G(A)] = H(A)$. $\alpha$ also plays an interesting role:
the variance of the Dirichlet process gets
smaller as we choose a larger $\alpha$. If you are interested in more details,
please refer to the great tutorial by Teh 2010 <sup id="fnref:1"><a href="#fn:1" class="footnote">1</a></sup>.</p>
<p>In light of all this, we can generate distributions from the Dirichlet process.</p>
<h3 id="samples-from-a-dirichlet-process">Samples from a Dirichlet Process</h3>
<p>From the above definition, and from the conjugacy of the Dirichlet
distribution, we can elicit a posterior distribution given $N$ observations
$\boldsymbol{\theta}$.</p>
$$
\begin{align}
(G(A_1), ..., G(A_K)) | \boldsymbol{\theta} & \sim Dir(\alpha H(A_1) + n_1, ..., \alpha H(A_K) + n_K)\\
(G(A_1), ..., G(A_K)) \,|\, \boldsymbol{\theta} & \sim Dir((\alpha + N)H'(A_1), ..., (\alpha + N) H'(A_K))
\end{align}
$$
<p>where $H' = \frac{1}{\alpha +N} \left(\alpha H + \sum_{i=1}^N \delta_{\theta_i}\right)$.
The $\delta_{\theta_i}$ is a point mass at the sample $\theta_i$.
In sum, the posterior distribution will be a new Dirichlet process with
concentration parameter $\alpha + N$ and base distribution $H'$.</p>
<p>If we dissect the new base distribution, we can observe that</p>
<ul>
<li>With probability $\frac{\alpha}{\alpha + N}$, we sample from $H$.</li>
<li>With probability $\frac{n_k}{\alpha + N}$, we redraw one of the $n_k$ existing samples in $A_k$.</li>
</ul>
<h3 id="blackwell-macqueen-urn-scheme">Blackwell-MacQueen Urn Scheme</h3>
<p>We know from above that the posterior distribution follows the base distribution $H$
with a certain probability or we otherwise sample from the existing pool. We
can then generate a set of samples using the posterior distribution.</p>
<p>The sampling strategy we just described is called the Blackwell-MacQueen
urn scheme, where the space $A$ is a space of colors and we are drawing colored balls.</p>
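The urn scheme takes only a few lines in any language; here is a Python sketch mirroring the two probabilities above, using the same hypothetical settings as the interactive demo ($\alpha = 5$, $H = \mathcal{N}(0, 1)$):

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = 5.0                                  # concentration parameter

def sample_base():
    # base distribution H = N(0, 1)
    return rng.normal(0.0, 1.0)

draws = []
for n in range(1000):
    if rng.random() < alpha / (alpha + n):
        # with prob alpha / (alpha + n): a brand-new value from H
        draws.append(sample_base())
    else:
        # with prob n_k / (alpha + n): repeat an existing value; picking a
        # past draw uniformly at random realizes exactly these probabilities
        draws.append(draws[rng.integers(n)])
```

Because repeated values are drawn uniformly from the history, popular values get repeated more often, which is the rich-get-richer clustering visible in the demo.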
<p>In the following SVG, I implemented a simple Blackwell-MacQueen urn scheme in
JavaScript and D3.</p>
<p>I chose $\alpha = 5$ and the base distribution $H = \mathcal{N}(0, 1)$. To start,
press the start button; it samples one point at a time. Since the base
distribution is the standard normal distribution, the expectation of the whole
process is 0, and thus most samples fall around 0. To see
different samples from the Dirichlet Process, refresh the window and press
start again.</p>
<div id="dp_gaussian">
</div>
<div style="text-align:center;">
<input id="start_dp_n_button" type="button" value="Start" onclick="start_dp_n()" />
</div>
<p>Another interesting property of the Dirichlet Process is the clustering effect.
To see the clustering effect more closely, I visualized each cluster according
to the cluster ID. If a new datum is sampled from a base distribution, not from
the existing clusters, then it will be given a new cluster ID.</p>
<div id="dp">
</div>
<div style="text-align:center;">
<input id="start_dp_button" type="button" value="Start" onclick="start_dp()" />
</div>
<p>You will see a few dominant ‘rich’ clusters getting ‘richer’ and taking up major
portions of the data.</p>
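<p>The rich-get-richer behavior also means new clusters appear ever more rarely: draw $n$ opens a new cluster with probability $\alpha / (\alpha + n)$, so the expected number of distinct clusters grows only logarithmically in the number of samples. A quick sketch (a standard Dirichlet Process fact, not derived in this post):</p>

```javascript
// Expected number of distinct clusters after N draws:
// sum_{n=0}^{N-1} alpha / (alpha + n), which behaves like
// alpha * ln(1 + N / alpha) for large N.
function expectedClusters(alpha, N) {
  var e = 0;
  for (var n = 0; n < N; n++) e += alpha / (alpha + n);
  return e;
}
```

<p>With the settings above ($\alpha = 5$, $N = 1000$), this works out to roughly 27 clusters, so a handful of clusters must absorb most of the 1000 samples.</p>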
<p>In the next post, I’ll go over the Dirichlet Process Mixtures.</p>
<script type="text/javascript">
/* global states */
var states = [], clusters = [], svg = [], svg_n = [];
var speed = 30, alpha=5, N=1000;
var margin = 0, width = 0, height = 0, height_s = 0, width_s = 0;
var min_x = 0, max_x = 0, x_width = 0;
window.onload = function run3D() {
var page_content = d3.select(".page__content");
var w = page_content.style('width');
var f_w = parseFloat(w), f_h = 0.6 * f_w;
var h = f_h + "px";
margin = {top: f_h * 0.05, right: f_w * 0.05, bottom: f_h * 0.1, left: f_w * 0.07};
width = f_w - margin.left - margin.right;
height = f_h - margin.top - margin.bottom;
/* Standard Normal variate using Box-Muller transform. */
function randn_bm() {
var u = 1 - Math.random();
var v = 1 - Math.random();
var s = Math.sqrt( -2.0 * Math.log( u ) );
return [s * Math.cos( 2.0 * Math.PI * v ),
s * Math.sin( 2.0 * Math.PI * v )];
}
/* Dirichlet Process */
var data = [], card_cluster = [];
for(var n=0; n < N; n++){
var cur_prob = Math.random() * (n + alpha);
/* http://www.cs.cmu.edu/~./kbe/dp_tutorial.pdf page 17
New class with prob alpha/(n + alpha) [0-based indexing]
Existing class k with prob num_k / (n + alpha) */
var cluster_id = 0;
if(cur_prob > n){
/* make a new cluster */
cluster_id = card_cluster.length;
card_cluster[cluster_id] = 0; /* will add after the if/else clause */
/* Sample a new cluster center from the base distribution H = N(0, 1). */
clusters[cluster_id] = randn_bm()[0];
} else {
/* sample from a cluster */
var accum_card_cluster = 0, last_cluster = 0;
for(let num_per_cluster of card_cluster){
if(accum_card_cluster < cur_prob
&& cur_prob < accum_card_cluster + num_per_cluster){
cluster_id = last_cluster;
break;
}
accum_card_cluster += num_per_cluster;
last_cluster += 1;
}
}
/* once found, accumulate the current cluster */
card_cluster[cluster_id] += 1;
states[n] = [cluster_id, card_cluster[cluster_id]];
}
var num_cluster = card_cluster.length,
barWidth = Math.floor(width / num_cluster) - 1;
var color = d3.scaleLinear().domain([1, num_cluster])
.interpolate(d3.interpolateHcl)
.range([d3.rgb("#007AFF"), d3.rgb('#FFF500')]);
var max_card = d3.max(card_cluster);
height_s = height / max_card;
width_s = width / num_cluster;
/* Axes */
var x = d3.scaleLinear()
.range([barWidth / 2, width - barWidth / 2]);
var y = d3.scaleLinear()
.range([height, 0]);
/* Scale the range of the data */
x.domain([1, num_cluster]);
y.domain([0, max_card]);
/* An SVG element with a bottom-right origin. */
svg = d3.select("#dp").append("svg")
.attr("width", width + margin.left + margin.right)
.attr("height", height + margin.top + margin.bottom)
.append("g")
.attr("transform", "translate(" + margin.left + "," + margin.top + ")");
/* Add the x Axis */
svg.append("g")
.attr("transform", "translate(0," + height + ")")
.call(d3.axisBottom(x));
/* Add the y Axis */
svg.append("g")
.call(d3.axisLeft(y));
/* text label for the x axis */
svg.append("text")
.attr("transform",
"translate(" + (width/2) + " ," + (height + margin.top + f_h * 0.03) + ")")
.style("text-anchor", "middle")
.text("Cluster ID");
/* text label for the y axis */
svg.append("text")
.attr("transform", "rotate(-90)")
.attr("y", 0 - margin.left)
.attr("x",0 - (height / 2))
.attr("dy", "1em")
.style("text-anchor", "middle")
.text("Number of samples");
var rect = svg.selectAll("rect")
.data(card_cluster)
.enter().append("rect")
.attr("class", function(d, i){return ('dp-' + i);})
.attr("width", 0.98 * width_s)
.attr("height", 0)
.attr("fill", function(d, i){return color(i);})
.attr("transform", function(d, i) { return "translate(" + i * width_s + "," + (height - height_s * d) + ")"; });
/* An SVG element with a bottom-right origin. */
svg_n = d3.select("#dp_gaussian").append("svg")
.attr("width", width + margin.left + margin.right)
.attr("height", height + margin.top + margin.bottom)
.append("g")
.attr("transform", "translate(" + margin.left + "," + margin.top + ")");
min_x = d3.min(clusters);
max_x = d3.max(clusters);
x_width = max_x - min_x;
var x_n = d3.scaleLinear()
.range([barWidth / 2, width - barWidth / 2]);
var y_n = d3.scaleLinear()
.range([height, 0]);
x_n.domain([min_x, max_x]);
y_n.domain([0, max_card]);
svg_n.append("g")
.attr("transform", "translate(0," + height + ")")
.call(d3.axisBottom(x_n));
svg_n.append("g")
.call(d3.axisLeft(y_n));
svg_n.append("text")
.attr("transform",
"translate(" + (width/2) + " ," + (height + margin.top + f_h * 0.03) + ")")
.style("text-anchor", "middle")
.text("x");
svg_n.append("text")
.attr("transform", "rotate(-90)")
.attr("y", 0 - margin.left)
.attr("x",0 - (height / 2))
.attr("dy", "1em")
.style("text-anchor", "middle")
.text("Number of samples");
var rect = svg_n.selectAll("rect")
.data(card_cluster)
.enter().append("rect")
.attr("class", function(d, i){return ('dp-n-' + i);})
.attr("width", 0.98 * width_s)
.attr("height", 0)
.attr("fill", function(d, i){return color(i);})
.attr("transform", function(d, i) { return "translate(" + width * (clusters[i] - min_x) / x_width + "," + (height - height_s * d) + ")"; });
}; /* Onload */
function start_dp() {
d3.select("#start_dp_button").attr('disabled', true);
increment(0);
}
function increment(step) {
if(step >= states.length){
return 0;
}
var cluster_id = states[step][0];
var cardinality = states[step][1];
d3.select('.dp-'+states[step][0])
.transition()
.duration(speed)
.attr("height", height_s * cardinality)
.attr("transform", function(d) { return "translate(" + cluster_id * width_s + "," + (height - height_s * cardinality) + ")"; })
.on("end", () => increment(++step));
}
function start_dp_n() {
d3.select("#start_dp_n_button").attr('disabled', true);
increment_n(0);
}
function increment_n(step) {
if(step >= states.length){
return 0;
}
var cluster_id = states[step][0];
var cardinality = states[step][1];
d3.select('.dp-n-'+states[step][0])
.transition()
.duration(speed)
.attr("height", height_s * cardinality)
.attr("transform", function(d) { return "translate(" + width * (clusters[cluster_id] - min_x) / x_width + "," + (height - height_s * cardinality) + ")"; })
.on("end", () => increment_n(++step));
}
</script>
<h2 id="edits">Edits</h2>
<ul>
<li>2017/Mar/1 Fixed MathJax \mathbf{\theta} rendering problem and corrected grammatical errors</li>
</ul>
<h1 id="references">References</h1>
<div class="footnotes">
<ol>
<li id="fn:1">
<p><a href="https://www.stats.ox.ac.uk/~teh/research/npbayes/Teh2010a.pdf">Teh et al., Dirichlet Process, 2010</a> <a href="#fnref:1" class="reversefootnote">↩</a></p>
</li>
</ol>
</div>Chris Choychrischoy@ai.stanford.eduUniversal Correspondence Network2016-09-26T00:09:01-07:002016-09-26T00:09:01-07:00https://chrischoy.github.io/publication/universal-correspondence-network<p><img src="https://chrischoy.github.io/images/publication/ucn/overview_sm.png" alt="Overview" /></p>
<p>In this post, we give a high-level overview of the paper in layman’s terms.
I’ve received some questions regarding what the Universal Correspondence
Network (UCN) is and what its limitations are. This post answers some of those
questions and will hopefully facilitate research on derivative applications of correspondences.</p>
<h2 id="patch-similarity">Patch Similarity</h2>
<p>Measuring the similarity of image patches is a basic building block of high-level
operations such as 3D reconstruction, tracking, and registration.
The most common and widely used notion of patch similarity is probably “geometric similarity”:
finding image patches, taken from two different cameras, that correspond to the
same 3D point.</p>
<p>For instance, in stereo vision we observe a scene from two different
cameras. Since the cameras are placed a certain distance apart, the two images
give us different observations of the same scene, and we are interested in
finding image patches from the respective viewpoints that correspond to the same 3D
point in the scene.</p>
<p>Another type of similarity is “semantic correspondence”. As the name suggests,
in this problem we are interested in finding the same semantic parts. For
instance, the left paw of a dog and the left paw of a cat are semantically and
functionally equivalent. In semantic similarity, we are interested in
finding image patches of the same semantic object.</p>
<h2 id="measuring-patch-similarity">Measuring Patch Similarity</h2>
<p>Traditionally, similarity has been measured as the distance between features
extracted from the corresponding image patches. However, such features
require a lot of hand design and heuristics, which results in sub-optimal
performance. After the series of successes of CNNs in replacing
hand-designed steps in computer vision applications, CNN-based similarity
measures have been introduced as well.</p>
<p>In a CNN-based patch similarity measure, a convolutional neural network takes two
image patches as inputs and generates a score that measures the likelihood that
the patches match. However, since the network has to take both patches, the
time complexity of the comparison is $O(N^2)$, where $N$ is the number of
patches.</p>
<p>Instead, some methods cache the CNN outputs for each patch and only run the fully
connected layers $O(N^2)$ times. Still, these feed-forward
passes are expensive compared to simple distance operations.</p>
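<p>To make the contrast concrete, here is a toy sketch (all names are mine, not from the paper) of the cheap regime: the expensive network runs once per patch, its outputs are cached, and matching reduces to plain distance comparisons.</p>

```javascript
// Toy sketch: with an embedding network, the expensive forward pass
// runs O(N) times and is cached; matching is then a cheap distance
// computation rather than another network pass per pair.
function l2(a, b) {
  var s = 0;
  for (var i = 0; i < a.length; i++) s += Math.pow(a[i] - b[i], 2);
  return Math.sqrt(s);
}

function nearestByEmbedding(patches, embed) {
  var feats = patches.map(embed); // N network passes, cached
  return feats.map(function (f, i) {
    var best = -1, bestDist = Infinity;
    for (var j = 0; j < feats.length; j++) {
      if (j === i) continue;
      var d = l2(f, feats[j]); // O(D) distance, no network pass
      if (d < bestDist) { bestDist = d; best = j; }
    }
    return best; // index of each patch's nearest neighbor
  });
}
```

<p>With a toy identity <code>embed</code> on 2-D “patches”, <code>nearestByEmbedding([[0, 0], [0, 1], [10, 10]], x =&gt; x)</code> pairs each point with its nearest neighbor. The pairwise loop is still $O(N^2)$, but each comparison is a vector distance rather than a network evaluation.</p>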
<p>Another type of CNN uses intermediate FC outputs as surrogate features, but metric operations (distances) on this space are not defined. In other words, the target task
of such neural networks is based on a metric operation (distance), but the
neural network is never exposed to this concept during training.</p>
<h2 id="universal-correspondence-network">Universal Correspondence Network</h2>
<p>To address all of the points mentioned in the previous section, we propose incorporating three components into a neural network.</p>
<ol>
<li>Deep Metric Learning for Patch Similarity</li>
<li>Fully Convolutional Feature Extraction</li>
<li>Convolutional Spatial Transformer</li>
</ol>
<h3 id="deep-metric-learning-for-patch-similarity">Deep Metric Learning for Patch Similarity</h3>
<p>To minimize the number of CNN feed-forward passes, many researchers have introduced various techniques. In this paper,
we propose using deep metric learning for patch similarity. Metric learning is a
family of learning algorithms that lets a model form a metric space
in which metric operations, such as distance, are interpretable.
In essence, metric learning starts from a set of constraints that force
similar objects to be close to each other and dissimilar objects to be at least a margin apart. Since the distance operation is encoded in the learning objective, using distances at test time (the target task) yields meaningful results.</p>
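<p>The margin constraint can be written as a contrastive loss. A minimal sketch of one common formulation (the paper’s exact loss may differ):</p>

```javascript
// Contrastive (margin) loss for a pair of feature vectors.
// Similar pairs are penalized by their squared distance; dissimilar
// pairs are penalized only when they are closer than `margin`, and
// contribute zero loss once pushed at least `margin` apart.
function contrastiveLoss(fa, fb, similar, margin) {
  var d = 0;
  for (var i = 0; i < fa.length; i++) d += Math.pow(fa[i] - fb[i], 2);
  d = Math.sqrt(d);
  return similar ? d * d : Math.pow(Math.max(0, margin - d), 2);
}
```

<p>Minimizing this loss over many pairs is what shapes the feature space so that a plain nearest-neighbor search becomes meaningful at test time.</p>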
<p>(Disclaimer) During the review process, Yi et al.<sup id="fnref:1"><a href="#fn:1" class="footnote">1</a></sup> independently proposed combining metric learning with a neural network for patch similarity. However, their network is geared toward a reconstruction framework and uses patch-wise feature extraction, whereas the UCN uses fully convolutional feature extraction.</p>
<h3 id="fully-convolutional-feature-extraction">Fully Convolutional Feature Extraction</h3>
<p>Unlike previous approaches, where a CNN only takes a pair of fixed-size patches,
we propose a fully convolutional neural network to speed up feature extraction.
The advantage of a fully convolutional neural network is that it can reuse
computation across overlapping image patches (Long et al.<sup id="fnref:2"><a href="#fn:2" class="footnote">2</a></sup>).</p>
<p>For example, if we extract image patches and use a patch-based CNN, we have to
compute all activations from scratch for every patch, even when the patches overlap. With a
fully convolutional neural network, we can instead reuse the computation for all
overlapping regions and thus speed it up.</p>
<p>However, this leads to a fixed fovea size and orientation, which can be addressed by
incorporating the spatial transformer network (next section).</p>
<h3 id="convolutional-spatial-transformer-layer">Convolutional Spatial Transformer Layer</h3>
<p>One of the most successful hand-designed features in computer vision is probably the SIFT feature (D. Lowe<sup id="fnref:3"><a href="#fn:3" class="footnote">3</a></sup>). Though the
feature itself is based on pooling and aggregating simple edge strengths, the way the feature is extracted is also
part of the feature and greatly affects performance.</p>
<p>The extraction process tries to normalize the patch so that the viewpoint does not affect the feature too much.
This process is patch normalization. Specifically, given a strong
gradient direction in an image patch, the extractor finds the optimal rotation and scale and then computes the feature in the resulting normalized coordinate frame.
We implement the same idea in a neural network by adopting the Spatial
Transformer Network (Jaderberg et al.<sup id="fnref:4"><a href="#fn:4" class="footnote">4</a></sup>) into the UCN. The UCN, however, is fully convolutional and
the features are dense. For this, we propose the Convolutional Spatial Transformer, which applies an <strong>independent</strong> transformation to each and every feature.</p>
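<p>To make the idea concrete, here is a toy sketch (all names mine; the real layer <em>learns</em> its transformation parameters and is differentiable) of sampling an independently rotated and scaled local patch with bilinear interpolation:</p>

```javascript
// Bilinear lookup into a 2-D grayscale image with zero padding.
function bilinear(img, y, x) {
  var h = img.length, w = img[0].length;
  var y0 = Math.floor(y), x0 = Math.floor(x);
  var dy = y - y0, dx = x - x0;
  var at = function (r, c) {
    return r >= 0 && r < h && c >= 0 && c < w ? img[r][c] : 0;
  };
  return (1 - dy) * (1 - dx) * at(y0, x0) +
         (1 - dy) * dx * at(y0, x0 + 1) +
         dy * (1 - dx) * at(y0 + 1, x0) +
         dy * dx * at(y0 + 1, x0 + 1);
}

// Sample a k x k patch around (cy, cx), rotated by theta and scaled
// by s -- the per-location transform a convolutional spatial
// transformer would apply before feature extraction.
function samplePatch(img, cy, cx, k, theta, s) {
  var half = (k - 1) / 2, out = [];
  for (var i = 0; i < k; i++) {
    var row = [];
    for (var j = 0; j < k; j++) {
      var u = s * (j - half), v = s * (i - half);
      var x = cx + u * Math.cos(theta) - v * Math.sin(theta);
      var y = cy + u * Math.sin(theta) + v * Math.cos(theta);
      row.push(bilinear(img, y, x));
    }
    out.push(row);
  }
  return out;
}
```

<p>With $\theta = 0$ and $s = 1$, the sampled patch is simply the original neighborhood; the layer’s job is to choose a different $(\theta, s)$ at every feature location.</p>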
<h2 id="what-ucn-can-and-cannot-do">What UCN can and cannot do</h2>
<p>We designed a network for basic feature extraction and <strong>basic feature
extraction only</strong>. It is not a replacement for high-level filtering processes that use
extra supervision/inputs such as epipolar geometry. Rather, we provide a base
network for future research, and this base network <strong>should not be confused
with a full-blown framework for a specific application</strong>.</p>
<p>One of the questions that I got at NIPS was why we did not compare with stereo
vision systems. This question might have arisen because we used stereo benchmarks to
measure the performance of the <strong>raw features</strong> on the geometric correspondence task.
However, as mentioned before, the raw features do not use extra inputs and are
not equivalent to a full system that exploits inputs and constraints
specific to stereo vision: epipolar line search, which reduces
the search space from an entire image to a line, and post-processing filtering
stages. A more detailed explanation is provided in the following section.</p>
<h3 id="stereo-vision">Stereo Vision</h3>
<p>For the baseline comparisons, we did not use full systems that make use of
extra inputs: <strong>camera intrinsic and extrinsic parameters</strong>. If you remember
how painful camera calibration was when you took a computer
vision class, you can appreciate that camera extrinsic and intrinsic
parameters are strong extra inputs. Moreover, such extra inputs allow you to
drastically reduce the search space for geometric correspondence.</p>
<p>This is known as the epipolar constraint: a powerful
constraint that comes at a price. Since our framework is not specialized for
stereo vision, we did not compare with systems that make use of this extra
input/supervision.</p>
<p>Instead, we compared against the latest hand-designed features, FC-layer features, and deep
matching<sup id="fnref:5"><a href="#fn:5" class="footnote">5</a></sup>, since none of these use extra camera parameters.
In this paper, we focus <strong>only on the quality of the features</strong> that can be used
in more complex systems.</p>
<p>In addition, to show the quality of the features, the UCN simply uses the
most naive way to find correspondences: nearest-neighbor search over the entire
image, which is very inefficient but exposes the quality of the raw features.
To compare against stereo vision systems, extra constraints
(epipolar geometry + CRF filtering) would have to be incorporated into both training
and testing.</p>
<h2 id="conclusion">Conclusion</h2>
<p>We propose an efficient feature-learning method for various correspondence
problems. We minimized the number of feed-forward passes, built a metric
space into a neural network, and
proposed a convolutional spatial transformer that mimics the behavior of one of the most successful hand-designed
features. However, this is not a replacement for a more complex system.
Rather, it is a novel way to generate base features for complex systems that
require correspondences as an input. I hope this blog post has resolved some of
the questions regarding the UCN and will facilitate future research.</p>
<h2 id="additional-resources">Additional Resources</h2>
<ul>
<li><a href="https://1drv.ms/p/s!AjAGaOEcFeieiQmKg-dicMrAJzPo">Slides</a></li>
<li><a href="http://cvgl.stanford.edu/projects/ucn/">Project page</a></li>
</ul>
<h2 id="references">References</h2>
<div class="footnotes">
<ol>
<li id="fn:1">
<p>Yi et al., Learned Invariant Feature Transform, 2016 <a href="#fnref:1" class="reversefootnote">&#8617;</a></p>
</li>
<li id="fn:2">
<p>Long et al., Fully Convolutional Networks for Semantic Segmentation, 2014 <a href="#fnref:2" class="reversefootnote">&#8617;</a></p>
</li>
<li id="fn:3">
<p>D. Lowe, Distinctive Image Features from Scale-Invariant Keypoints, 2004 <a href="#fnref:3" class="reversefootnote">&#8617;</a></p>
</li>
<li id="fn:4">
<p>Jaderberg et al., Spatial Transformer Networks, 2015 <a href="#fnref:4" class="reversefootnote">↩</a></p>
</li>
<li id="fn:5">
<p>Revaud et al., Deep Matching, 2013 <a href="#fnref:5" class="reversefootnote">↩</a></p>
</li>
</ol>
</div>Christopher B. Choy3D-R2N2: A Unified Approach for Single and Multi-view 3D Object Reconstruction2016-09-23T00:17:02-07:002016-09-23T00:17:02-07:00https://chrischoy.github.io/publication/r2n2<h2 id="abstract">Abstract</h2>
<p>Inspired by the recent success of methods that employ shape priors to achieve robust 3D reconstructions, we propose a novel recurrent neural network architecture that we call the 3D Recurrent Reconstruction Neural Network (3D-R2N2). The network learns a mapping from images of objects to their underlying 3D shapes from a large collection of synthetic data (ShapeNet). Our network takes in one or more images of an object instance
from arbitrary viewpoints and outputs a reconstruction of the object in the form of a 3D occupancy grid. Unlike most previous works, our network does not require any image annotations or object class labels for training or testing.
Our extensive experimental analysis shows that our reconstruction framework i) outperforms the state-of-the-art methods for single-view reconstruction, and ii) enables the 3D reconstruction of objects in situations where traditional SFM/SLAM methods fail (because of lack of texture and/or wide baseline).</p>
<ul>
<li><a href="https://arxiv.org/abs/1604.00449">arXiv paper</a></li>
<li><a href="https://docs.google.com/presentation/d/1RFGh0HMKNGH4vqTkzl11rSkMQbJuIybdQpDInLjk8sM/edit?usp=sharing">Slides</a></li>
<li><a href="http://github.com/chrischoy/3d-r2n2/">Code</a></li>
</ul>
<h2 id="3d-r2n2-3d-recurrent-reconstruction-neural-network">3D-R2N2: 3D Recurrent Reconstruction Neural Network</h2>
<p>This repository contains the source code for the paper <a href="http://arxiv.org/abs/1604.00449">Choy et al., 3D-R2N2: A Unified Approach for Single and Multi-view 3D Object Reconstruction, ECCV 2016</a>. Given one or more views of an object, the network generates a voxelized reconstruction (a voxel is the 3D equivalent of a pixel) of the object in 3D.</p>
<h2 id="citing-this-work">Citing this work</h2>
<p>If you find this work useful in your research, please consider citing:</p>
<div class="highlighter-rouge"><pre class="highlight"><code>@inproceedings{choy20163d,
title={3D-R2N2: A Unified Approach for Single and Multi-view 3D Object Reconstruction},
author={Choy, Christopher B and Xu, Danfei and Gwak, JunYoung and Chen, Kevin and Savarese, Silvio},
booktitle = {Proceedings of the European Conference on Computer Vision ({ECCV})},
year={2016}
}
</code></pre>
</div>
<h2 id="overview">Overview</h2>
<p><img src="https://chrischoy.github.io/images/publication/r2n2/overview.png" alt="Overview" />
<em>Left: images found on Ebay, Amazon, Right: overview of <code class="highlighter-rouge">3D-R2N2</code></em></p>
<p>Traditionally, single view reconstruction and multi-view reconstruction are disjoint problems that have been dealt using different approaches. In this work, we first propose a unified framework for both single and multi-view reconstruction using a <code class="highlighter-rouge">3D Recurrent Reconstruction Neural Network</code> (3D-R2N2).</p>
<table>
<thead>
<tr>
<th style="text-align: center">3D-Convolutional LSTM</th>
<th style="text-align: center">3D-Convolutional GRU</th>
<th style="text-align: center">Inputs (red cells + feature) for each cell (purple)</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: center"><img src="https://chrischoy.github.io/images/publication/r2n2/lstm.png" alt="3D-LSTM" /></td>
<td style="text-align: center"><img src="https://chrischoy.github.io/images/publication/r2n2/gru.png" alt="3D-GRU" /></td>
<td style="text-align: center"><img src="https://chrischoy.github.io/images/publication/r2n2/lstm_time.png" alt="3D-LSTM" /></td>
</tr>
</tbody>
</table>
<p>We can feed in images in random order since the network is trained to be invariant to the order. The critical component that enables the network to be invariant to the order is the <code class="highlighter-rouge">3D-Convolutional LSTM</code> which we first proposed in this work. The <code class="highlighter-rouge">3D-Convolutional LSTM</code> selectively updates parts that are visible and keeps the parts that are self-occluded.</p>
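<p>The selective update can be sketched with a toy gated update rule (scalar gates over a flattened voxel grid, and with gate values given rather than learned; the actual 3D-Convolutional LSTM computes its gates convolutionally from the current view):</p>

```javascript
// Toy version of the selective update in the 3D-Convolutional LSTM:
// each voxel's state is overwritten only to the extent its input gate
// opens (e.g., where the current view is informative); closed gates
// preserve the previous, possibly self-occluded, state.
function gatedUpdate(hidden, candidate, gate) {
  return hidden.map(function (h, i) {
    return gate[i] * candidate[i] + (1 - gate[i]) * h;
  });
}
```

<p>Because each voxel keeps its old state wherever the gate stays closed, the order in which views arrive matters much less than it would for a network that overwrote its whole state at every step.</p>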
<p><img src="https://chrischoy.github.io/images/publication/r2n2/analysis.png" alt="LSTM Analysis" />
<em>Visualization of the 3D-Convolutional LSTM input gate activations. The images are fed into the network sequentially from left to right (top row). Visualization of input gate activations. The input gates corresponding to the parts that are visible and mismatch prediction open and update its hidden state (middle row). Corresponding prediction at each time step (bottom row).</em></p>
<p><img src="https://chrischoy.github.io/images/publication/r2n2/full_network.png" alt="Networks" />
<em>We used two different types of networks for the experiments: a shallow network (top) and a deep residual network (bottom).</em></p>
<h2 id="datasets">Datasets</h2>
<p>We used <a href="http://shapenet.cs.stanford.edu">ShapeNet</a> models to generate the rendered images and voxelized models, which are available below (you can follow the installation instructions to extract them to the default directory).</p>
<ul>
<li>ShapeNet rendered images <a href="ftp://cs.stanford.edu/cs/cvgl/ShapeNetRendering.tgz">ftp://cs.stanford.edu/cs/cvgl/ShapeNetRendering.tgz</a></li>
<li>ShapeNet voxelized models <a href="ftp://cs.stanford.edu/cs/cvgl/ShapeNetVox32.tgz">ftp://cs.stanford.edu/cs/cvgl/ShapeNetVox32.tgz</a></li>
</ul>
<h2 id="codes">Codes</h2>
<p>The source code for the project can be found at the <a href="http://github.com/chrischoy/3d-r2n2/">GitHub repository</a>.</p>Christopher B. ChoyAbstractObjectNet3D: A Large Scale Database for 3D Object Recognition2016-09-21T00:09:01-07:002016-09-21T00:09:01-07:00https://chrischoy.github.io/publication/objectnet3d<p><img src="https://chrischoy.github.io/images/publication/objectnet3d/ObjectNet3D.png" alt="Overview" /></p>
<h2 id="abstract">Abstract</h2>
<p>We contribute a large scale database for 3D object recognition, named ObjectNet3D, that consists of 100 categories, 90,127 images, 201,888 objects in these images and 44,147 3D shapes. Objects in the images in our database are aligned with the 3D shapes, and the alignment provides both accurate 3D pose annotation and the closest 3D shape annotation for each 2D object. Consequently, our database is useful for recognizing the 3D pose and 3D shape of objects from 2D images. We also provide baseline experiments on four tasks: region proposal generation, 2D object detection, joint 2D detection and 3D object pose estimation, and image-based 3D shape retrieval, which can serve as baselines for future research using our database.</p>
<ul>
<li><a href="http://cvgl.stanford.edu/papers/xiang_eccv16.pdf">Paper</a></li>
<li><a href="http://cvgl.stanford.edu/projects/objectnet3d/">Project Page</a></li>
<li><a href="https://github.com/yuxng/ObjectNet3D_toolbox">Toolbox</a></li>
</ul>
<h2 id="acknowledgement">Acknowledgement</h2>
<p>We acknowledge the support of NSF grants IIS-1528025 and DMS-1546206, a Google Focused Research award, and grant SPO # 124316 and 1191689-1-UDAWF from the Stanford AI Lab-Toyota Center for Artificial Intelligence Research.</p>Yu Xiang