<h1 id="pytorch-extension-with-a-makefile">Pytorch Extension with a Makefile</h1>
<p>Chris Choy (chrischoy@ai.stanford.edu), Computer Vision'er'. Published 2018-12-28 at <a href="https://chrischoy.github.io/research/pytorch-extension-with-makefile">https://chrischoy.github.io/research/pytorch-extension-with-makefile</a>.</p>
<p>Pytorch is a great neural network library that has both flexibility and power. Personally, I think it is the best neural network library for prototyping (advanced) dynamic neural networks fast and deploying them to applications.</p>
<p>Recently, pytorch was upgraded to version 1.0 and introduced the ATen tensor library for all backends and c++ custom extensions. Before the c++ extension, it supported CFFI (C Foreign Function Interface) for custom extensions.</p>
<p>As an avid CUDA developer, I created multiple projects to speed up custom pytorch layers using the CFFI interface. However, wrapping functions in a non-object-oriented language (C) sometimes led to ridiculous overhead when complex objects were required. Now that pytorch supports the latest technology from 2011, c++11, we can use object-oriented programming for pytorch extensions!</p>
<p>In this tutorial, I will cover some drawbacks of the current setuptools and will show you how to use a Makefile for pytorch cpp extension development. The source code for the tutorial can be found <a href="https://github.com/chrischoy/MakePytorchPlusPlus">here</a>.</p>
<p>Before you proceed, please read the <a href="https://pytorch.org/tutorials/advanced/cpp_extension.html">official Pytorch CPP extension guide</a> which provides an extensive and useful tutorial for how to create a C++ extension with ATen.</p>
<h2 id="drawbacks-of-setuptools-for-development">Drawbacks of Setuptools for Development</h2>
<p>However, <code class="highlighter-rouge">setuptools</code> is not very flexible, as it primarily focuses on the deployment of a project. Thus, it lacks a lot of features that are essential for fast development. Let’s delve into a few scenarios that I encountered while I was porting my pytorch cffi extensions to cpp extensions.</p>
<h3 id="compile-only-updated-files">Compile only updated files</h3>
<p>When you develop a huge project, you don’t want to compile the entire project every time you make a small change. However, if you use setuptools, it creates objects for ALL source files every time you make a change. This becomes extremely cumbersome, especially when your project gets larger.</p>
<p>However, <code class="highlighter-rouge">Makefile</code> allows you to cache all object files as you have control over all files and compile only the files that are updated. This is extremely useful if you made a small change to one file and want to quickly debug your project.</p>
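<p>The timestamp rule that make applies can be sketched in a few lines of Python. This is only an illustration of the idea, not how make is implemented; the helper name <code class="highlighter-rouge">needs_rebuild</code> and the file names are hypothetical.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import os
import tempfile


def needs_rebuild(source, target):
    """Return True if target is missing or older than source."""
    if not os.path.exists(target):
        return True
    return os.path.getmtime(source) > os.path.getmtime(target)


with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "foo.cpp")  # hypothetical source file
    obj = os.path.join(tmp, "foo.o")    # hypothetical object file
    open(src, "w").close()
    print(needs_rebuild(src, obj))      # the object file does not exist yet
    open(obj, "w").close()
    os.utime(src, (1, 1))               # make the source look old
    os.utime(obj, (2, 2))               # and the object file newer
    print(needs_rebuild(src, obj))      # the cached object can be reused
</code></pre></div></div>
<p>Only the files for which this check succeeds are recompiled, which is why a one-file change stays cheap.</p>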
<h3 id="parallel-compilation">Parallel Compilation</h3>
<p>Another problem with setuptools is that it compiles files sequentially. When your project gets huge, you might want to compile a lot of files in parallel. With a Makefile, you can parallelize compilation with the <code class="highlighter-rouge">-j#</code> flag. For example, if you type <code class="highlighter-rouge">make -j8</code>, it compiles up to 8 files in parallel.</p>
<h3 id="debugging">Debugging</h3>
<p>The current pytorch c++ extension does not allow debugging even with the debug flag. With a Makefile, however, you can pass <code class="highlighter-rouge">-g</code> (or <code class="highlighter-rouge">-g -G</code> for nvcc) with ease.
To enable it, uncomment line 3 (<code class="highlighter-rouge">DEBUG=1</code>) of the <a href="https://github.com/chrischoy/MakePytorchPlusPlus/blob/master/Makefile">Makefile</a> and line 20 of <a href="https://github.com/chrischoy/MakePytorchPlusPlus/blob/master/setup.py">setup.py</a>.</p>
<h2 id="making-a-pytorch-extension-with-a-makefile">Making a pytorch extension with a Makefile</h2>
<p>Now that we have covered some of the advantages of using a Makefile for a pytorch cpp extension, let’s get into the details of how to make one.</p>
<h3 id="creating-objects-and-functions">Creating Objects and Functions</h3>
<p>As an example, in this tutorial, we will create a class and a cuda function that are callable in python. First, let’s make a simple class that provides a setter and a getter for a private variable <code class="highlighter-rouge">key_</code>.</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">Foo</span> <span class="p">{</span>
<span class="k">private</span><span class="o">:</span>
<span class="kt">uint64_t</span> <span class="n">key_</span><span class="p">;</span>
<span class="k">public</span><span class="o">:</span>
<span class="kt">void</span> <span class="n">setKey</span><span class="p">(</span><span class="kt">uint64_t</span> <span class="n">key</span><span class="p">);</span>
<span class="kt">uint64_t</span> <span class="n">getKey</span><span class="p">();</span>
<span class="n">std</span><span class="o">::</span><span class="n">string</span> <span class="n">toString</span><span class="p">()</span> <span class="k">const</span> <span class="p">{</span>
<span class="k">return</span> <span class="s">"< Foo, key: "</span> <span class="o">+</span> <span class="n">std</span><span class="o">::</span><span class="n">to_string</span><span class="p">(</span><span class="n">key_</span><span class="p">)</span> <span class="o">+</span> <span class="s">" > "</span><span class="p">;</span>
<span class="p">};</span>
<span class="p">};</span>
</code></pre></div></div>
<p>We will fill out the setter and the getter functions in <a href="https://github.com/chrischoy/MakePytorchPlusPlus/blob/master/src/foo.cpp">foo.cpp</a>. Next, I created a simple CUDA function that adds two vectors and returns results in a new <code class="highlighter-rouge">at::Tensor</code>.</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">template</span> <span class="o"><</span><span class="k">typename</span> <span class="n">Dtype</span><span class="o">></span>
<span class="n">__global__</span> <span class="kt">void</span> <span class="n">sum</span><span class="p">(</span><span class="n">Dtype</span> <span class="o">*</span><span class="n">a</span><span class="p">,</span> <span class="n">Dtype</span> <span class="o">*</span><span class="n">b</span><span class="p">,</span> <span class="n">Dtype</span> <span class="o">*</span><span class="n">c</span><span class="p">,</span> <span class="kt">int</span> <span class="n">N</span><span class="p">)</span> <span class="p">{</span>
<span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="n">blockIdx</span><span class="p">.</span><span class="n">x</span> <span class="o">*</span> <span class="n">blockDim</span><span class="p">.</span><span class="n">x</span> <span class="o">+</span> <span class="n">threadIdx</span><span class="p">.</span><span class="n">x</span><span class="p">;</span>
<span class="k">if</span> <span class="p">(</span><span class="n">i</span> <span class="o"><</span> <span class="n">N</span><span class="p">)</span> <span class="p">{</span>
<span class="n">c</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="n">a</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">+</span> <span class="n">b</span><span class="p">[</span><span class="n">i</span><span class="p">];</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="k">template</span> <span class="o"><</span><span class="k">typename</span> <span class="n">Dtype</span><span class="o">></span>
<span class="kt">void</span> <span class="n">AddGPUKernel</span><span class="p">(</span><span class="n">Dtype</span> <span class="o">*</span><span class="n">in_a</span><span class="p">,</span> <span class="n">Dtype</span> <span class="o">*</span><span class="n">in_b</span><span class="p">,</span> <span class="n">Dtype</span> <span class="o">*</span><span class="n">out_c</span><span class="p">,</span> <span class="kt">int</span> <span class="n">N</span><span class="p">,</span>
<span class="n">cudaStream_t</span> <span class="n">stream</span><span class="p">)</span> <span class="p">{</span>
<span class="n">sum</span><span class="o"><</span><span class="n">Dtype</span><span class="o">></span>
<span class="o"><<<</span><span class="n">GET_BLOCKS</span><span class="p">(</span><span class="n">N</span><span class="p">),</span> <span class="n">CUDA_NUM_THREADS</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="n">stream</span><span class="o">>>></span><span class="p">(</span><span class="n">in_a</span><span class="p">,</span> <span class="n">in_b</span><span class="p">,</span> <span class="n">out_c</span><span class="p">,</span> <span class="n">N</span><span class="p">);</span>
<span class="n">cudaError_t</span> <span class="n">err</span> <span class="o">=</span> <span class="n">cudaGetLastError</span><span class="p">();</span>
<span class="k">if</span> <span class="p">(</span><span class="n">cudaSuccess</span> <span class="o">!=</span> <span class="n">err</span><span class="p">)</span>
<span class="k">throw</span> <span class="n">std</span><span class="o">::</span><span class="n">runtime_error</span><span class="p">(</span><span class="n">Formatter</span><span class="p">()</span>
<span class="o"><<</span> <span class="s">"CUDA kernel failed : "</span> <span class="o"><<</span> <span class="n">std</span><span class="o">::</span><span class="n">to_string</span><span class="p">(</span><span class="n">err</span><span class="p">));</span>
<span class="p">}</span>
</code></pre></div></div>
<p>Note that I throw a <code class="highlighter-rouge">std::runtime_error</code> when the kernel gives an error. Pybind11 automatically converts <code class="highlighter-rouge">std</code> exceptions to python exception types. For example, <code class="highlighter-rouge">std::runtime_error</code> will be mapped to <code class="highlighter-rouge">RuntimeError</code> in python. This prevents the system from crashing and allows python to handle errors gracefully. More on error handling with pybind11 can be found <a href="https://pybind11.readthedocs.io/en/stable/advanced/exceptions.html">here</a>.</p>
<h3 id="bridging-cpp-with-pybind">Bridging CPP with Pybind</h3>
<p>Pytorch passes tensors as the <code class="highlighter-rouge">at::Tensor</code> type. To extract the mutable raw pointer, use <code class="highlighter-rouge">.data<Dtype>()</code>. For example, if you want to extract the raw pointer from a variable <code class="highlighter-rouge">A</code> of type <code class="highlighter-rouge">float</code>, use <code class="highlighter-rouge">A.data<float>()</code>. In addition, if you want to use the CUDA stream for the current context, use the function <code class="highlighter-rouge">at::cuda::getCurrentCUDAStream()</code>.</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>template <typename Dtype>
void AddGPU(at::Tensor in_a, at::Tensor in_b, at::Tensor out_c) {
int N = in_a.numel();
if (N != in_b.numel())
throw std::invalid_argument(Formatter()
<< "Size mismatch A.numel(): " << in_a.numel()
<< ", B.numel(): " << in_b.numel());
out_c.resize_({N});
AddGPUKernel<Dtype>(in_a.data<Dtype>(), in_b.data<Dtype>(),
out_c.data<Dtype>(), N, at::cuda::getCurrentCUDAStream());
}
</code></pre></div></div>
<p>The above function can be directly called from python with pybind11. Now, let’s bind the cpp function and the class with python.</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">namespace</span> <span class="n">py</span> <span class="o">=</span> <span class="n">pybind11</span><span class="p">;</span>
<span class="n">PYBIND11_MODULE</span><span class="p">(</span><span class="n">TORCH_EXTENSION_NAME</span><span class="p">,</span> <span class="n">m</span><span class="p">){</span>
<span class="n">std</span><span class="o">::</span><span class="n">string</span> <span class="n">name</span> <span class="o">=</span> <span class="n">std</span><span class="o">::</span><span class="n">string</span><span class="p">(</span><span class="s">"Foo"</span><span class="p">);</span>
<span class="n">py</span><span class="o">::</span><span class="n">class_</span><span class="o"><</span><span class="n">Foo</span><span class="o">></span><span class="p">(</span><span class="n">m</span><span class="p">,</span> <span class="n">name</span><span class="p">.</span><span class="n">c_str</span><span class="p">())</span>
<span class="p">.</span><span class="n">def</span><span class="p">(</span><span class="n">py</span><span class="o">::</span><span class="n">init</span><span class="o"><></span><span class="p">())</span>
<span class="p">.</span><span class="n">def</span><span class="p">(</span><span class="s">"setKey"</span><span class="p">,</span> <span class="o">&</span><span class="n">Foo</span><span class="o">::</span><span class="n">setKey</span><span class="p">)</span>
<span class="p">.</span><span class="n">def</span><span class="p">(</span><span class="s">"getKey"</span><span class="p">,</span> <span class="o">&</span><span class="n">Foo</span><span class="o">::</span><span class="n">getKey</span><span class="p">)</span>
<span class="p">.</span><span class="n">def</span><span class="p">(</span><span class="s">"__repr__"</span><span class="p">,</span> <span class="p">[](</span><span class="k">const</span> <span class="n">Foo</span> <span class="o">&</span><span class="n">a</span><span class="p">)</span> <span class="p">{</span> <span class="k">return</span> <span class="n">a</span><span class="p">.</span><span class="n">toString</span><span class="p">();</span> <span class="p">});</span>
<span class="n">m</span><span class="p">.</span><span class="n">def</span><span class="p">(</span><span class="s">"AddGPU"</span><span class="p">,</span> <span class="o">&</span><span class="n">AddGPU</span><span class="o"><</span><span class="kt">float</span><span class="o">></span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>
<p>For classes, you need to use <code class="highlighter-rouge">py::class_<CLASS></code> to let pybind know that it is a class. Then, define the functions that you want to expose to python with <code class="highlighter-rouge">.def</code>.
For functions, you can simply attach the function using <code class="highlighter-rouge">.def</code> directly.</p>
<h3 id="compiling-the-project">Compiling the project</h3>
<p>Now that we have all source files ready, let’s compile them. First, we will make an archive library that contains all classes and functions. Then, we can use setuptools to compile the file that binds all functions and classes with pybind11, and load it in python.</p>
<h4 id="finding-the-arguments-and-the-include-paths">Finding the Arguments and the Include Paths</h4>
<p>First, we have to find the right arguments used to compile the pytorch extension. They are actually easy to find: when you compile your project using setuptools, you can see the actual compilation command that it invokes, and from it we can deduce what is required to build the project from a Makefile. The extra arguments that it uses for pytorch extensions are</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-DTORCH_API_INCLUDE_EXTENSION_H -DTORCH_EXTENSION_NAME=$(EXTENSION_NAME) -D_GLIBCXX_USE_CXX11_ABI=0
</code></pre></div></div>
<p>In addition, we need to find the headers. Pytorch builds CPP extensions with setuptools, so we can look at how it finds the headers and libraries. In <code class="highlighter-rouge">torch.utils.cpp_extension</code> you can find the function <code class="highlighter-rouge">include_paths</code>, which provides all header paths. We only need to pass them to the Makefile. Within a Makefile, we can run a python command and get the paths like the following.</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>PYTHON_HEADER_DIR := $(shell python -c 'from distutils.sysconfig import get_python_inc; print(get_python_inc())')
</code></pre></div></div>
<p>Note that the command above returns only the Python header directory. A similar command that prints the result of <code class="highlighter-rouge">include_paths</code> line by line gives all the pytorch header paths, so, in the end, we can iterate over the paths in the Makefile to prepend <code class="highlighter-rouge">-I</code>.
The final makefile can be found <a href="https://github.com/chrischoy/MakePytorchPlusPlus/blob/master/Makefile">here</a>.</p>
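<p>The path-to-flag step can also be done on the python side. The following is a minimal sketch, assuming pytorch is installed; the helper name <code class="highlighter-rouge">to_include_flags</code> and the fallback paths are hypothetical and only there so that the sketch runs without pytorch.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def to_include_flags(paths):
    """Turn a list of header directories into compiler -I flags."""
    return " ".join("-I" + path for path in paths)


if __name__ == "__main__":
    try:
        # include_paths() is the function mentioned above; it needs pytorch.
        from torch.utils.cpp_extension import include_paths
        print(to_include_flags(include_paths()))
    except ImportError:
        # Hypothetical paths, used only when pytorch is not available.
        print(to_include_flags(["/usr/include/python3", "/opt/torch/include"]))
</code></pre></div></div>
<p>Saved as a script, its output can be captured in the Makefile with <code class="highlighter-rouge">$(shell ...)</code> in the same way as <code class="highlighter-rouge">PYTHON_HEADER_DIR</code> above.</p>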
<h4 id="archive-libraries">Archive Libraries</h4>
<p>Once we build the object files, we create an archive file and link it to the main pybind entries.
We can do so by</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code> ar rc $(STATIC_LIB) $(OBJS) $(CU_OBJS)
</code></pre></div></div>
<h4 id="compiling-the-bind-file">Compiling the bind file</h4>
<p>When the archive library is ready, we can finally compile the bind file that links the classes and functions with the binding.</p>
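<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">setuptools</span> <span class="kn">import</span> <span class="n">setup</span>
<span class="kn">from</span> <span class="nn">torch.utils.cpp_extension</span> <span class="kn">import</span> <span class="n">BuildExtension</span><span class="p">,</span> <span class="n">CUDAExtension</span>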
<span class="n">setup</span><span class="p">(</span>
<span class="o">...</span>
<span class="n">ext_modules</span><span class="o">=</span><span class="p">[</span>
<span class="n">CUDAExtension</span><span class="p">(</span>
<span class="n">name</span><span class="o">=</span><span class="s">'MakePytorchBackend'</span><span class="p">,</span>
<span class="n">include_dirs</span><span class="o">=</span><span class="p">[</span><span class="s">'./'</span><span class="p">],</span>
<span class="n">sources</span><span class="o">=</span><span class="p">[</span>
<span class="s">'pybind/bind.cpp'</span><span class="p">,</span>
<span class="p">],</span>
<span class="n">libraries</span><span class="o">=</span><span class="p">[</span><span class="s">'make_pytorch'</span><span class="p">],</span>
<span class="n">library_dirs</span><span class="o">=</span><span class="p">[</span><span class="s">'objs'</span><span class="p">],</span>
<span class="c"># extra_compile_args=['-g']</span>
<span class="p">)</span>
<span class="p">],</span>
<span class="n">cmdclass</span><span class="o">=</span><span class="p">{</span><span class="s">'build_ext'</span><span class="p">:</span> <span class="n">BuildExtension</span><span class="p">},</span>
<span class="o">...</span>
<span class="p">)</span>
</code></pre></div></div>
<p>Finally, if the Makefile calls <code class="highlighter-rouge">setup.py</code> automatically, we only need to issue <code class="highlighter-rouge">make -j8</code> to compile all files, build the binding, and install it as a python library.</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>all: $(STATIC_LIB)
	python setup.py install --force
</code></pre></div></div>
<h1 id="short-note-on-matrix-differentials-and-backpropagation">Short Note on Matrix Differentials and Backpropagation</h1>
<p>Published 2018-01-10 at <a href="https://chrischoy.github.io/research/Matric-Calculus">https://chrischoy.github.io/research/Matric-Calculus</a>.</p>
<p>Mathematical notation is the convention that we all use to denote a concept
in a concise mathematical formulation, yet sometimes there is more than one
way to express the same equation. For example, we can use Leibniz’s notation
$\frac{dy}{dx}$ to denote a derivative, but in physics, we often use $\dot{y},
\ddot{y}$ to simplify time derivatives. Similarly, to solve differential equations,
we use the Laplace transform $F(s) = \int f(t) e^{-st}dt$, but instead of
using the definition, we can use the frequency domain representations and
simply solve differential equations using basic algebra.
In this post, I’ll cover a matrix differential notation and how to use
differentials to derive backpropagation functions easily.</p>
<h2 id="differentials">Differentials</h2>
<p>Let’s first define the differential. Let a vector function $f$ be
differentiable at $c$. Its first-order Taylor approximation is</p>
$$
f(c + u) = f(c) + f'(c)u + r(u)
$$
<p>where $r$ denotes the remainder. We denote $\mathsf{d}f(c;u) = f'(c)u$, the
differential of $f$ at $c$ with increment $u$. This of course can also be
denoted simply using partial derivatives.<sup id="fnref:1"><a href="#fn:1" class="footnote">1</a></sup></p>
$$
\mathsf{d}f(c; u) = (\mathsf{D}f(c))u
$$
<p>where $\mathsf{D}_j f_i(c)$ denotes the partial derivative of $f_i$ with
respect to the $j$-th coordinate at $c$. The matrix $\mathsf{D}f(c)$ is the
Jacobian matrix and it’s the transpose of the gradient of $f$ at $c$.</p>
$$
\nabla f(c) = (\mathsf{D}f(c))^T
$$
<h3 id="chain-rule">Chain Rule</h3>
<p>Let $h = g \circ f$, the differential of $h$ at $c$ is</p>
\begin{align}
\mathsf{d}h(c; u) & = (\mathsf{D}h(c))u \\
& = (\mathsf{D}g(f(c)))(\mathsf{D}f(c))u = (\mathsf{D}g(f(c)))\mathsf{d}f(c;u) \\
& = \mathsf{d}g(f(c); \mathsf{d}f(c;u))
\end{align}
<p>We can further simplify the notation by replacing $\mathsf{d}f(c;u)$ with
$\mathsf{d}f$ when it unambiguously represents the differential concerning
the input variable.</p>
<h3 id="matrix-function">Matrix Function</h3>
<p>Now let’s extend this to a matrix function. Let $F: S \rightarrow
\mathbb{R}^{m\times p}$ be a matrix function defined on $S \subseteq \mathbb{R}^{n
\times q}$. Its first-order Taylor approximation at $C$ with increment $U$ is</p>
$$
\text{vec}F(C+U) = \text{vec} F(C) + F'(C) \text{vec}U + \text{vec}R(U)
$$
<p>We can denote the differential of $F$ at $C$ as</p>
$$
\text{vec}\; \mathsf{d}F(C;U) = F'(C) \text{vec}U
$$
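<p>As a small worked example (the constant matrices $A$ and $B$ are introduced here only for illustration), take the linear map $F(X) = AXB$. Using the identity $\text{vec}(AXB) = (B^T \otimes A)\,\text{vec}X$, where $\otimes$ is the Kronecker product, the remainder vanishes and the differential and the derivative are</p>
$$
\mathsf{d}F(C;U) = AUB, \qquad \text{vec}\; \mathsf{d}F(C;U) = (B^T \otimes A)\, \text{vec}U, \qquad F'(C) = B^T \otimes A
$$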
<h2 id="matrix-backpropagation">Matrix Backpropagation</h2>
<p>Let $A : B = \text{Tr}(A^TB) = \text{vec}(A)^T \text{vec}(B)$, the sum of all
elements in the elementwise product $A \circ B$. If we let $f(X) = Y$ and $L$ be the final loss,</p>
\begin{align}
\mathsf{d} L \circ f & = \mathsf{D} L : \mathsf{d}f \\
& = \mathsf{D} L : \mathcal{L}(\mathsf{d}X) \\
& = \mathcal{L}^*(\mathsf{D} L) : \mathsf{d}X
\end{align}
<p>where we denote $\mathsf{d}Y = \mathcal{L}(\mathsf{d}X)$ and $\mathsf{D}L =
\frac{\partial L}{\partial Y}$. So given gradients from the upper layer, the
gradient with respect to $X$ can easily be computed by finding the function
$\mathcal{L}^*$, the adjoint of $\mathcal{L}$.</p>
<h3 id="example-1-linear-function">Example 1: Linear Function</h3>
<p>Let $Y = f(X) = AX + b$, then,</p>
\begin{align}
\mathsf{d} L \circ f & = \mathsf{D} L : \mathsf{d}(AX + b) \\
& = \mathsf{D} L : A\mathsf{d}X = \text{Tr}(\mathsf{D}L (A \mathsf{d}X)^T) \\
& = \text{Tr}(\mathsf{D}L \mathsf{d}X^T A^T) = \text{Tr}(A^T \mathsf{D}L \mathsf{d}X^T) \\
& = A^T \mathsf{D}L : \mathsf{d}X
\end{align}
<p>Thus $\mathcal{L}^*(Y) = A^TY$.</p>
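<p>As a sanity check, the identity $\mathsf{D}L : A\mathsf{d}X = A^T\mathsf{D}L : \mathsf{d}X$ can be verified numerically. The sketch below uses plain Python lists; the matrices are arbitrary values chosen only for illustration and the helper names are hypothetical.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    """Transpose a matrix stored as a list of rows."""
    return [list(row) for row in zip(*a)]


def inner(a, b):
    """A : B = Tr(A^T B), i.e. the sum of elementwise products."""
    return sum(x * y for ra, rb in zip(a, b) for x, y in zip(ra, rb))


A = [[1.0, 2.0], [0.0, -1.0], [3.0, 1.0]]    # 3 x 2, arbitrary
dX = [[0.5, -1.0], [2.0, 0.0]]               # 2 x 2 increment, arbitrary
DL = [[1.0, 0.0], [-2.0, 1.0], [0.5, 3.0]]   # 3 x 2 upstream gradient, arbitrary

lhs = inner(DL, matmul(A, dX))               # DL : (A dX)
rhs = inner(matmul(transpose(A), DL), dX)    # (A^T DL) : dX
print(lhs, rhs)
</code></pre></div></div>
<p>Both sides agree, which is exactly the statement that multiplying by $A^T$ is the adjoint of multiplying by $A$.</p>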
<h3 id="example-2-constrained-optimization">Example 2: Constrained Optimization</h3>
<p>We would like to solve the following constrained optimization problem.</p>
\begin{equation*}
\begin{aligned}
& \underset{x}{\text{minimize}}
& & f(x) \\
& \text{subject to}
& & Ax = b.
\end{aligned}
\end{equation*}
<p>The Lagrangian and the primal and dual feasibility equations are</p>
$$
\mathcal{L}(x, \nu) = f(x) + \nu^T(Ax - b) \\
Ax^* = b, \;\; \nabla f(x^*) + A^T \nu^* = 0
$$
<p>If we take the first order approximation of the primal and dual feasibility
equations,</p>
\begin{align}
Ax + \mathsf{d}(Ax) & = b + \mathsf{d}b\\
Ax + \mathsf{d}Ax + A\mathsf{d}x & = b + \mathsf{d}b\\
\nabla f(x) + A^T \nu + \mathsf{d}(\nabla f(x) + A^T \nu) & = 0 \\
\nabla f(x) + A^T \nu + \nabla^2 f(x) \mathsf{d}x + \mathsf{d}A^T \nu + A^T \mathsf{d}\nu & = 0
\end{align}
<p>Or more concisely,</p>
$$
\begin{bmatrix} \nabla^2 f(x) & A^T \\ A & 0 \end{bmatrix} \begin{bmatrix} \mathsf{d}x \\ \mathsf{d}\nu \end{bmatrix} = - \begin{bmatrix} \nabla f(x) + A^T \nu + \mathsf{d}A^T\nu \\ Ax - b + \mathsf{d}Ax - \mathsf{d}b \end{bmatrix}
$$
<p>When $A$ and $b$ are held fixed ($\mathsf{d}A = 0$, $\mathsf{d}b = 0$), this is the same as the infeasible start Newton method <sup id="fnref:2"><a href="#fn:2" class="footnote">2</a></sup>.</p>
<h2 id="conclusion">Conclusion</h2>
<p>In this post, we covered the notation for matrix differentials and matrix
backpropagation. Simple notation can ease the burden of derivation and also
lead to fewer mistakes.</p>
<h2 id="references">References</h2>
<div class="footnotes">
<ol>
<li id="fn:1">
<p><a href="http://www.janmagnus.nl/misc/mdc2007-3rdedition">J. Magnus, Matrix Differential Calculus with Applications in Statistics</a> <a href="#fnref:1" class="reversefootnote">↩</a></p>
</li>
<li id="fn:2">
<p><a href="https://web.stanford.edu/~boyd/cvxbook/bv_cvxbook.pdf">S. Boyd and L. Vandenberghe, Convex Optimization</a> <a href="#fnref:2" class="reversefootnote">↩</a></p>
</li>
</ol>
</div>
<h1 id="regression-vs-classification-distance-and-divergence">Regression vs. Classification: Distance and Divergence</h1>
<p>Published 2018-01-05 at <a href="https://chrischoy.github.io/research/Regression-Classification">https://chrischoy.github.io/research/Regression-Classification</a>.</p>
<p>In Machine Learning, supervised problems can be categorized into regression or
classification problems. The categorization is quite intuitive, as the names
indicate. For instance, if the output, or the target value, is a continuous
value, the model tries to regress on the value; and if it is discrete, we want
to predict a discrete value as well. A well-known example of such a
classification problem is binary classification such as spam vs. non-spam.
Stock price prediction or temperature prediction would be good examples of
regression.</p>
<p>To solve such problems, we have to use different methods. First, for regression
problems, the most widely used approach is to minimize the L1 or L2
distance between our prediction and the ground truth target. For classification
problems, 1-vs-all SVMs, multinomial logistic regression, decision forest, or
minimizing the cross entropy are popular choices.</p>
<p>Because they are treated so differently, it is easy to think of
them as completely separate problems. However, we can think of the
classification problem as a regression problem in a non-Euclidean space and
extend this concept to Wasserstein and Cramer distances.</p>
<h2 id="designing-an-objective-risk-function-for-machine-learning-models">Designing an Objective (Risk) Function for Machine Learning Models</h2>
<p>To make a system that behaves as we expect, we have to design a loss
function that captures the behavior that we would like to see and defines the
<strong>Risk</strong> associated with failures.</p>
<p>For example, let’s look at a typical image classification problem where we
classify an image into a semantic class such as car, person etc. Most datasets
use a mapping from a string (“Car”) to a numeric value so that we can handle
the dataset in a computer easily. For instance, we can assign 0 to “Bird”; 1 to
“Person”; 2 to “Car” etc.</p>
<p>However, the numbers do not have intrinsic meaning. The annotators use such
numbering since it is easy to process on a computer, not because “Person” +
“Person” gives you “Car” or because a person is “greater” (>) than a bird.
So, in this case, making a machine learning model to regress such values that
do not have intrinsic meaning would not make much sense.</p>
<p>On the other hand, if the number that we are trying to predict has actual
physical meaning and ordering makes sense (e.g., price, weight, the intensity of
light (pixel value) etc.), it would be reasonable to use the numbers directly
for prediction.</p>
<p>To state this notion clearly, let $y$ be the target value (label, supervision)
associated with an input $x$ and $f(\cdot)$ be a (parametric) function or a
machine learning model. That is, when we feed $x$ to the function, $\hat{y}=f(x)$,
we want the output $\hat{y}$ to approximate the target value $y$. So we need a
measure of <strong>how different</strong> the generated values are from the supervision.
Naturally, we use a distance function to measure how close a target is to the
prediction and we use the distance as the loss function (objective function or
a risk function).</p>
$$
\begin{align}
L(x, y) & = D(\hat{y}, y) \\
\end{align}
$$
<p>where $D(\cdot, \cdot)$ denotes a distance function.</p>
<h2 id="regression-vs-classification">Regression vs. Classification</h2>
<p>Now let’s look at the simplest regression problem: linear regression using
least squares fitting. In this setting, we have noisy observations around the
ground truth line, and our task is to estimate the line. In this case, $f(x) =
Ax + b$ and $D(a, b) = ||a - b||_2^2$, the square of the L2 norm. This also
can be interpreted as maximum likelihood estimation under Gaussian noise.
However, the L2 norm is not the only distance measure used in regression problems.
The L1 norm is sometimes used to enforce sparsity, and the Huber loss is used for
regression problems where outliers do not follow the Gaussian distribution.</p>
$$
D(\hat{y}, y) = \|\hat{y} - y\|_2^2
$$
<p>Let’s go back to the previous classification problem. Regressing the arbitrary
numeric values (labels) clearly is not the best way to train a machine learning
model as the numeric values do not have intrinsic meaning. Instead, we can use
the probability distribution rather than the arbitrary numeric values. For
example, for an image of a bird, which was class 0, we assign $P_0 = 1$ and 0
for the others: $P_{bird} = [1, 0, 0]$ where the elements are the probability
of the input being a bird, person, and car respectively. Using this
representation, we can train multinomial logistic regression or a multi-class SVM.</p>
<h2 id="cross-entropy-and-f-divergence">Cross Entropy and f-Divergence</h2>
<p>However, how should we measure the “distance” between the ground truth label
distribution and the prediction distribution? Or is there a concept of distance
between two distributions? One family of functions that measures the
difference is known as the <strong>Ali-Silvey distances</strong>, more widely known as
<strong>f-divergences</strong>. One member of the
f-divergence family is more widely used than the others: the
Kullback-Leibler divergence. Formally, given two distributions $P_\hat{y}$ and
$P_y$, the KL divergence is defined as</p>
$$
\begin{align}
D(P_y || P_{\hat{y}}) & = \sum_{i \in \mathcal{Y}} P(y = i) \log \frac{P(y = i)}{P(\hat{y} = i)} \\
    & = - H(P_y) + H(P_y, P_{\hat{y}})
\end{align}
$$
<p>where $H(\cdot)$ is the entropy and $H(\cdot, \cdot)$ is the cross entropy.
In classification problems, where $P_\hat{y}, P_y$ denote prediction and
ground truth respectively, the first term is a constant, so we drop the entropy
and train our prediction model with the cross entropy only. That’s where you
get your cross entropy loss.</p>
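<p>The decomposition can be checked numerically. The sketch below assumes the divergence is taken from the ground truth $P_y$ to the prediction $P_{\hat{y}}$ and uses the bird example with a made-up prediction vector; the function names are hypothetical.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import math


def entropy(p):
    """H(p) = -sum_i p_i log p_i, with 0 log 0 taken as 0."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)


def cross_entropy(p, q):
    """H(p, q) = -sum_i p_i log q_i."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)


def kl_divergence(p, q):
    """D(p || q) = sum_i p_i log(p_i / q_i)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


p_y = [1.0, 0.0, 0.0]      # ground truth: the image is a bird
p_hat = [0.7, 0.2, 0.1]    # made-up model prediction

print(kl_divergence(p_y, p_hat))
print(cross_entropy(p_y, p_hat) - entropy(p_y))
</code></pre></div></div>
<p>For a one-hot label the entropy term is zero, so the KL divergence and the cross entropy coincide and both reduce to the negative log-probability of the correct class.</p>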
<p>However, the KL divergence is not the only divergence. In fact, any convex
function $f: (0, \infty) \rightarrow \mathbb{R}$ such that $f(1) = 0$ can
define a divergence function.</p>
$$
D_f(P || Q) = \mathbb{E}_Q \left[ f \left( \frac{dP}{dQ} \right) \right]
$$
<p>For example, if we use $f(x) = \frac{1}{2}|x - 1|$, we have the Total Variation divergence.</p>
$$
\begin{align}
D_{TV}(P || Q) & = \frac{1}{2} \mathbb{E}_Q \left[ \left| \frac{dP}{dQ} - 1 \right| \right] \\
& = \frac{1}{2} \int |P - Q| = \frac{1}{2} ||P - Q||_1
\end{align}
$$
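<p>For discrete distributions the expectation in $D_f$ is a finite sum, so the definition can be tried out directly. The following is a minimal sketch, assuming $Q$ puts nonzero mass on every outcome; the function names and the two example distributions are illustrative only.</p>

```python
import numpy as np

def f_divergence(p, q, f):
    """D_f(P || Q) = E_Q[f(dP/dQ)] for discrete P, Q with q_i > 0 everywhere."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return np.sum(q * f(p / q))

def f_tv(x):
    # f(x) = |x - 1| / 2 gives the Total Variation divergence
    return 0.5 * np.abs(x - 1.0)

def f_kl(x):
    # f(x) = x log x gives the KL divergence; f(0) = 0 by continuity
    return np.where(x > 0, x * np.log(np.where(x > 0, x, 1.0)), 0.0)

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.2, 0.3, 0.5])

tv = f_divergence(p, q, f_tv)  # equals 0.5 * sum |p - q|
kl = f_divergence(p, q, f_kl)  # equals sum p * log(p / q)
print(tv, kl)
```

<p>Both generators are convex and satisfy $f(1) = 0$, so the divergence is zero whenever the two distributions are identical.</p>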
<p>One thing to note is that the KL divergence is not a proper <em>metric</em> as it is
asymmetric and violates the triangle inequality.</p>
<h2 id="wasserstein-distance-cramer-distance">Wasserstein Distance, Cramer Distance</h2>
<p>However, f-divergences are not the only way to measure the difference between two
distributions. In <sup id="fnref:1"><a href="#fn:1" class="footnote">1</a></sup>, the authors argue that f-divergences do not capture
our regular notion of distance accurately, propose a different
distance, and start an interesting discussion on adversarial training.
Let’s first look at other “distance” functions that do not belong to the
f-divergence family.</p>
<p>First, the Wasserstein distance, also known as the probabilistic Earth Mover’s
Distance, computes the minimum cost (mass times the distance it travels) of moving mass
to turn one probability distribution into another.</p>
$$
\begin{align}
W_1(P, Q) = \inf \mathbb{E} [|x - y|]
\end{align}
$$
<p>The infimum is over all joint distributions of $(x, y)$ whose marginals are $P$ and $Q$. $x$ and
$y$ range over the space where $P$ and $Q$ have nonzero support. For one-dimensional
distributions with cumulative distribution functions $F_P$ and $F_Q$, this distance equals
$\int |F_P(x) - F_Q(x)| dx$. One
of the great follow-up works <sup id="fnref:2"><a href="#fn:2" class="footnote">2</a></sup> proposed yet another distance
function, the Cramer
distance, to avoid the biased sample gradients of the Wasserstein distance. The Cramer
distance is simply the squared version of this integral (Bellemare et al. work with its square, $l_2^2$)</p>
$$
\begin{align}
l_2(P, Q) = \left( \int \left( F_P(x) - F_Q(x) \right)^2 dx \right)^{1/2}
\end{align}
$$
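<p>For one-dimensional distributions on a common, evenly spaced grid, both distances can be computed from the cumulative distribution functions. The sketch below assumes the CDF form of the two distances used by Bellemare et al. (the absolute CDF difference for $W_1$, the squared CDF difference for the Cramer distance) and compares them with the Total Variation divergence; the function name and the three toy distributions are hypothetical.</p>

```python
import numpy as np

def cdf_distances(p, q, dx=1.0):
    """Return (W_1, Cramer l_2, Total Variation) for histograms p, q on a 1D grid."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    diff = np.cumsum(p) - np.cumsum(q)        # F_P - F_Q on the grid
    w1 = np.sum(np.abs(diff)) * dx            # integral of |F_P - F_Q|
    cramer = np.sqrt(np.sum(diff ** 2) * dx)  # sqrt of integral of (F_P - F_Q)^2
    tv = 0.5 * np.sum(np.abs(p - q))
    return w1, cramer, tv

source = [1.0, 0.0, 0.0]
near = [0.0, 1.0, 0.0]   # all mass moved by one bin
far = [0.0, 0.0, 1.0]    # all mass moved by two bins

near_result = cdf_distances(source, near)
far_result = cdf_distances(source, far)
print(near_result, far_result)
```

<p>The Total Variation divergence is the same for both pairs because the supports do not overlap, whereas the two CDF-based distances grow as the mass moves farther away. This is the sense in which they capture our regular notion of distance.</p>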
<h2 id="conclusion">Conclusion</h2>
<p>Categorizing supervised problems into classification or regression can help us clearly understand the
problem, but sometimes it can limit our imagination and also limit the set of distance
functions that we can use.
Rather, in this post, we discussed how classification and regression could be understood
from how we measure differences: classification measures the difference using an
f-divergence or even a probabilistic distance, and regression uses the Euclidean
distance. They are merely distances that measure the difference between a target
and a prediction. Some distance functions are more popular than others, but the
set of distance functions is not set in stone. Sometimes, by defining the
distance function in a clever way, we can improve our ML model!</p>
<h1 id="references">References</h1>
<div class="footnotes">
<ol>
<li id="fn:1">
<p>Arjovsky et al., Wasserstein Generative Adversarial Networks, 2017 <a href="#fnref:1" class="reversefootnote">↩</a></p>
</li>
<li id="fn:2">
<p>Bellemare et al., The Cramer Distance as a Solution to Biased Wasserstein Gradients, 2017 <a href="#fnref:2" class="reversefootnote">↩</a></p>
</li>
</ol>
</div>Chris Choychrischoy@ai.stanford.eduIn Machine Learning, supervised problems can be categorized into regression or classification problems. The categorization is quite intuitive as the names indicate. For instance, if the output, or the target value is a continuous value, the model tries to regress on the value; and if it is discrete, we want to predict a discrete value as well. A well-known example of such a classification problem is binary classification such as spam vs. non-spam. Stock price prediction, or temperature prediction would be good examples of regression.Data Processing Inequality and Unsurprising Implications2018-01-04T21:11:58-08:002018-01-04T21:11:58-08:00https://chrischoy.github.io/research/data-processing-inequality-and-unsurprising-implications<p>We have heard enough about the great success of neural networks and how they
are used in real problems. Today, I want to talk about why they have been so successful
(partially) from an information theoretic perspective, and about some lessons that we all
should be aware of.</p>
<h2 id="traditional-feature-based-learning">Traditional Feature Based Learning</h2>
<p>Before we figured out how to train a large neural network efficiently and fast,
traditional methods (such as hand-designed features + shallow models like
random forests or SVMs) dominated Computer Vision. As you might have guessed,
a traditional method first extracts features from an image, such
as the Histogram of Oriented Gradients (HOG) or Scale-Invariant
Feature Transform (SIFT) features. Then, we use the supervised method of our choice
to train the second part of the model for prediction. So, what we learn
is only the mapping from the extracted features to the prediction.</p>
$$
\text{Image} \rightarrow \text{Features} \underset{f(\cdot; \theta)}{\rightarrow} \hat{y}
$$
<p>The information from the image is bottlenecked by the quality of the features, and thus
much research went into better, faster features. Here, to illustrate that the
learnable parameters are only in the second stage, I put $\theta$ in a function
below the second arrow.</p>
<h2 id="neural-network-as-an-end-to-end-system">Neural Network as an End-to-End system</h2>
<p>Unlike the traditional approach, the neural network based method starts
directly from the original inputs (of course, there is some preprocessing such as centering
and normalization, but it is reversible). We assume that the neural network
is a universal function approximator and optimize the parameters inside it to
approximate a complex function such as the mapping from pixel colors to a semantic class!</p>
$$
\text{Image} \underset{f(\cdot; \theta)}{\rightarrow} \hat{y}
$$
<p>Unlike before, we are making a system that does not involve an intermediate
representation. Then, the natural questions that follow are: why would such a system be
better than one that involves an intermediate representation, and is
that always the case?</p>
<h2 id="data-processing-inequality">Data Processing Inequality</h2>
<p>To generalize our discussion, let $X, Y, Z$ be random variables
that form a Markov chain.</p>
$$
X \rightarrow Y \rightarrow Z
$$
<p>You can think of each arrow as a complex system that generates the best approximation
of whatever we want for each step. According to the data processing inequality,
the mutual information between $X$ and $Z$, $I(X; Z)$ cannot be greater than
that between $X$ and $Y$, $I(X; Y)$.</p>
$$
I(X;Y) \ge I(X;Z)
$$
<p>In other words, information can only be lost and never increases as we
process it. For example, in the traditional method, we extract a feature $Y$ from an image $X$ with
a deterministic function. Given the feature, we
estimate the outcome $Z$. So, if we lose some information in the feature
extraction stage, we cannot regain it in the second stage.</p>
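<p>The inequality is easy to check numerically for small discrete variables. The sketch below builds a Markov chain $X \rightarrow Y \rightarrow Z$ from two randomly drawn conditional probability tables and compares $I(X; Y)$ with $I(X; Z)$. The table sizes, the uniform distribution over $X$, and the helper names are arbitrary choices made for illustration.</p>

```python
import numpy as np

def mutual_information(joint):
    """I(A; B) in nats from a joint probability table p(a, b)."""
    joint = np.asarray(joint, dtype=float)
    pa = joint.sum(axis=1, keepdims=True)   # marginal p(a), shape (A, 1)
    pb = joint.sum(axis=0, keepdims=True)   # marginal p(b), shape (1, B)
    nz = joint > 0
    return np.sum(joint[nz] * np.log(joint[nz] / (pa @ pb)[nz]))

rng = np.random.default_rng(0)

def random_channel(n_in, n_out):
    """Random row-stochastic matrix: row i is p(out | in = i)."""
    m = rng.random((n_in, n_out))
    return m / m.sum(axis=1, keepdims=True)

p_x = np.full(4, 0.25)
p_y_given_x = random_channel(4, 3)  # first stage, X -> Y (e.g. feature extraction)
p_z_given_y = random_channel(3, 4)  # second stage, Y -> Z (e.g. the predictor)

joint_xy = p_x[:, None] * p_y_given_x  # p(x, y)
joint_xz = joint_xy @ p_z_given_y      # p(x, z) = sum_y p(x, y) p(z | y)

i_xy = mutual_information(joint_xy)
i_xz = mutual_information(joint_xz)
print(i_xy, i_xz)  # the first value is never smaller than the second
```

<p>Whatever the second stage does, it only sees $Y$, so $Z$ cannot carry more information about $X$ than $Y$ already does.</p>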
<p>However, in an end-to-end system, we do not enforce an intermediate
representation and thus remove $Y$ altogether.</p>
<h2 id="case-studies">Case Studies</h2>
<p>Now that we are equipped with the knowledge, let’s delve into some scenarios where you
should swing your big knowledge around. Can you tell your friendly colleague
ML what went wrong or how to improve the model?</p>
<h3 id="case-1-rgb-rightarrow-thermal-image-rightarrow-pedestrian-detection">Case 1: RGB $\rightarrow$ Thermal Image $\rightarrow$ Pedestrian Detection</h3>
<p>ML wants to localize pedestrians from RGB images.</p>
<p>ML: It is easier to detect pedestrians in thermal images, but thermal
images are difficult to acquire as thermal cameras are not as common as
regular RGB cameras. So I will first predict thermal images from regular
images; then it will be easier to find pedestrians.</p>
<h3 id="case-2-monocular-image-rightarrow-3d-shape-prediction-rightarrow-weight">Case 2: Monocular Image $\rightarrow$ 3D shape prediction $\rightarrow$ Weight</h3>
<p>Again, ML is working on weight prediction from a monocular image (just a
regular image).</p>
<p>ML: Weight is a property associated with the shape of the object. If we can
predict the shape of an object first from an image, then predicting weight from
a 3D shape would be easier!</p>
<p>You can probably guess what went wrong. However, if we slightly tweak the
setting, we could improve the model. For example, in Case 1, instead of
feeding only the RGB image, feeding both the RGB and thermal images (RGB + Thermal $\rightarrow$ Pedestrian Detection)
would easily improve the performance.</p>
<h2 id="conclusion">Conclusion</h2>
<p>We discussed how the data processing inequality could shed light on the success
of the neural network and the importance of an end-to-end system. However,
problems that you want to solve might not be as clear-cut as I
illustrated here. There are a lot of hair-splitting details that make the
difference. However, it is always important to keep in mind what is theoretically
possible, and maybe such a split-second thought could save you a week of
implementation!</p>Chris Choychrischoy@ai.stanford.eduWe have heard enough about the great success of neural networks and how they are used in real problems. Today, I want to talk about how it was so successful (partially) from an information theoretic perspective and some lessons that we all should be aware of.Learning Gaussian Process Covariances2017-12-15T14:49:11-08:002017-12-15T14:49:11-08:00https://chrischoy.github.io/research/learning-gaussian-process-covariances<p>A Gaussian process is a non-parametric model which can represent a complex
function using a growing set of data. Unlike a neural network, which can also
learn complex functions, a Gaussian process can also provide the variance
(uncertainty) of a prediction, since the model is based on a simple Gaussian
distribution.</p>
<p>However, like many machine learning models, we have to define a set of
functions to define a Gaussian process. In a Gaussian process, the function of
utmost importance is the covariance function. It is common to use a
predetermined function with fixed constants for the covariance, but it is more
pragmatic to learn the function than to search a high-dimensional space with
sampling based methods for the best set of parameters of the covariance
function.</p>
<p>In this post, I summarize a simple gradient based method and a scalable version
of learning a covariance function.</p>
<h2 id="brief-summary-of-a-gaussian-process">Brief Summary of a Gaussian Process</h2>
<p>In a Gaussian process, we assume that all observations are sampled from a
Gaussian distribution and that any subset of the random variables (observations or
predictions at a new data point) follows a Gaussian distribution with a
specific mean $m(\mathbf{x})$ and covariance $K(X, X)$.</p>
<p>Let $\mathcal{D} = \{ (\mathbf{x}_i, y_i) \}_{i=1}^n$ be a dataset of inputs
$\mathbf{x}_i$ and corresponding outputs $y_i$. We assume that the observation
is noisy and the noise-free output at $\mathbf{x}$ is
$\mathbf{f}(\mathbf{x})$, i.e., $\mathbf{y}(\mathbf{x}) =
\mathbf{f}(\mathbf{x}) + \epsilon$.</p>
<p>Given the Gaussian process assumption, all subsets follow a Gaussian distribution and thus, the entire dataset can be represented using a single Gaussian distribution.</p>
$$
\mathbf{f} | X, \mathbf{y} \sim \mathcal{N}(\mathbf{\bar{f}}, K(X, X))
$$
<p>Please refer to <a href="https://chrischoy.github.io/research/gaussian-process-regression/">the previous post</a> about a Gaussian process for details.</p>
<h2 id="learning-the-covariance-kx-x">Learning the Covariance $K(X, X)$</h2>
<p>In many cases, the covariance function $K(X, X)$ is predefined as a simple
function such as a squared exponential.
There are many variants of the function, but in its simplest form, the squared
exponential function contains at least two hyper-parameters, $c$ and $\sigma$</p>
$$
k(\mathbf{x}_1, \mathbf{x}_2) = c \exp \left( - \frac{|\mathbf{x}_1 - \mathbf{x}_2|^2}{\sigma^2} \right).
$$
<p>We can use simple grid search or MCMC to find the optimal hyper-parameters.
However, as the function gets more complex, finding optimal hyper-parameters
can become a daunting task pretty quickly as the dimension gets larger.</p>
<p>Instead, we can use a simple gradient descent based method with multiple
initializations to find the optimal hyper-parameters.</p>
<h3 id="gradient-of-the-posterior-probability">Gradient of the Posterior Probability</h3>
<p>To take gradient steps, we need the gradient of the log likelihood with respect to the
hyper-parameters. Let $\theta$ denote all the hyper-parameters of the covariance
function.</p>
$$
\begin{align}
\log p(\mathbf{y}| X, \mathbf{\bar{f}}; \theta) & = - \frac{1}{2} \log |K| - \frac{1}{2} (\mathbf{y} - \mathbf{\bar{f}})^T K^{-1} (\mathbf{y} - \mathbf{\bar{f}}) + \mathrm{const}\\
\nabla_{\theta_i} \log p(\mathbf{y}| X, \mathbf{\bar{f}}; \theta) & = - \frac{1}{2} \mathrm{Tr} \left( K^{-1} \frac{\partial K}{\partial \theta_i} \right) + \frac{1}{2} (\mathbf{y} - \mathbf{\bar{f}})^T K^{-1} \frac{\partial K}{\partial \theta_i} K^{-1} (\mathbf{y} - \mathbf{\bar{f}})
\end{align}
$$
<p>Given the gradient, we can use a gradient based optimizer of our choice to
learn the hyper-parameters (or simply parameters) of a Gaussian process.</p>
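<p>To make the gradient concrete, here is a minimal sketch for the squared exponential covariance $k(\mathbf{x}_1, \mathbf{x}_2) = c \exp(-|\mathbf{x}_1 - \mathbf{x}_2|^2 / \sigma^2)$ with a zero mean. The small diagonal <code class="highlighter-rouge">noise</code> term, the toy data, and the function names are assumptions added for illustration. Note that the data-fit term enters the gradient with a positive sign because $\partial K^{-1} / \partial \theta_i = -K^{-1} \frac{\partial K}{\partial \theta_i} K^{-1}$.</p>

```python
import numpy as np

def sq_exp_kernel(X, c, sigma):
    """Return K with K_ij = c * exp(-|x_i - x_j|^2 / sigma^2) and the squared distances."""
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return c * np.exp(-d2 / sigma ** 2), d2

def log_likelihood_and_grad(X, y, c, sigma, noise=1e-2):
    """Log marginal likelihood (zero mean assumed) and its gradient w.r.t. c and sigma."""
    n = len(y)
    K_f, d2 = sq_exp_kernel(X, c, sigma)
    K = K_f + noise * np.eye(n)          # assumed observation noise, keeps K well conditioned
    K_inv = np.linalg.inv(K)             # O(n^3); fine for a small example
    alpha = K_inv @ y                    # K^{-1} y
    _, log_det = np.linalg.slogdet(K)
    ll = -0.5 * log_det - 0.5 * y @ alpha - 0.5 * n * np.log(2.0 * np.pi)
    dK = {"c": K_f / c, "sigma": K_f * 2.0 * d2 / sigma ** 3}
    grad = {
        name: -0.5 * np.trace(K_inv @ dK_i) + 0.5 * alpha @ dK_i @ alpha
        for name, dK_i in dK.items()
    }
    return ll, grad

rng = np.random.default_rng(0)
X = rng.uniform(-3.0, 3.0, size=(20, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(20)

ll, grad = log_likelihood_and_grad(X, y, c=1.0, sigma=1.0)
print(ll, grad)
```

<p>A finite-difference check on <code class="highlighter-rouge">c</code> and <code class="highlighter-rouge">sigma</code> is a cheap way to validate the analytic gradient before handing it to an optimizer.</p>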
<h2 id="scalability-of-the-gradient">Scalability of the Gradient</h2>
<p>In the previous section, we assumed that we can compute the gradient exactly.
However, as $n$, the dimension of the vector $\mathbf{y}$, increases, it might not be
possible to compute the above gradient in a reasonable time and cost. Let’s
analyze the computational complexity of each term.</p>
<p>First, note that $K^{-1}y$ requires solving a linear system which takes
$O(n^3)$ complexity if we use a decomposition based method or $O(\sqrt{\kappa}
n^2)$ if we use an iterative method like Conjugate Gradient, where $\kappa$ is
the condition number of $K$.</p>
<p>Now, we can compute the complexity of each term. The first term, $K^{-1}
\frac{\partial K}{\partial \theta_i}$, takes $O(\sqrt{\kappa} n^3)$ if we
use an iterative method, since we solve one linear system per column of
$\frac{\partial K}{\partial \theta_i}$, or $O(n^3)$ if we can cache the decomposition. The second
term only takes $O(\sqrt{\kappa} n^2)$, as solving the linear system takes
the most time.</p>
<h2 id="sampling-the-gradient">Sampling the Gradient</h2>
<p>As the dimension of the problem gets larger, it becomes impractical to solve
the system using a matrix decomposition and we need to resort to an approximate
method. The paper by Filippone and Engler <sup id="fnref:2"><a href="#fn:2" class="footnote">1</a></sup> proposes to sample an unbiased
estimate of the gradient using $N_s$ i.i.d. random vectors. For example, let $r^j$ be the $j$th
element of the vector $\mathbf{r}$. If we set $r^j \in \{-1, 1\}$ with equal
probability, $\mathbb{E}(\mathbf{r}\mathbf{r}^T) = I$ and</p>
$$
\begin{align}
\mathrm{Tr}\left(K^{-1}\frac{\partial K}{\partial \theta_i}\right)
& =\mathrm{Tr}\left( K^{-1}\frac{\partial K}{\partial \theta_i} \mathbb{E} \left[ \mathbf{r} \mathbf{r}^T \right] \right) \\
& = \mathbb{E} \left[ \mathbf{r}^T K^{-1}\frac{\partial K}{\partial \theta_i} \mathbf{r} \right]
\end{align}
$$
<p>We can solve $K^{-1}\mathbf{r}$ easily using Conjugate Gradient and thus, the
complexity of the above equation is $O(\sqrt{\kappa}n^2 N_s)$ where $N_s$ is
the number of samples. Writing $\mathbf{r}_s$ for the $s$th sampled vector, the gradient becomes</p>
$$
\nabla_{\theta_i} \log p(\mathbf{y}| X, \mathbf{\bar{f}}; \theta) \approx - \frac{1}{2N_s} \sum_{s=1}^{N_s} \mathbf{r}_s^T K^{-1} \frac{\partial K}{\partial \theta_i} \mathbf{r}_s + \frac{1}{2} (\mathbf{y} - \mathbf{\bar{f}})^T K^{-1} \frac{\partial K}{\partial \theta_i} K^{-1} (\mathbf{y} - \mathbf{\bar{f}})
$$
$$
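<p>The stochastic trace term can be sketched in a few lines. The example below draws random $\pm 1$ probe vectors and averages $\mathbf{r}^T K^{-1} \frac{\partial K}{\partial \theta_i} \mathbf{r}$. It uses a direct solve as a stand-in for Conjugate Gradient, and the kernel matrix, the diagonal noise term, and the function name are assumptions made for illustration rather than the setup of the paper.</p>

```python
import numpy as np

def stochastic_trace(K, dK, n_samples, rng):
    """Estimate Tr(K^{-1} dK) with probe vectors r whose entries are -1 or +1."""
    n = K.shape[0]
    R = rng.choice([-1.0, 1.0], size=(n, n_samples))  # one probe vector per column
    K_inv_R = np.linalg.solve(K, R)                   # direct solve, stand-in for conjugate gradient
    # r^T K^{-1} dK r = (K^{-1} r)^T (dK r) because K is symmetric
    return np.mean(np.sum(K_inv_R * (dK @ R), axis=0))

rng = np.random.default_rng(0)
x = np.linspace(-3.0, 3.0, 50)
d2 = (x[:, None] - x[None, :]) ** 2
K_f = np.exp(-d2)                # squared exponential with c = 1, sigma = 1
K = K_f + 0.1 * np.eye(50)       # assumed noise term for conditioning
dK = K_f * 2.0 * d2              # dK / d sigma evaluated at sigma = 1

exact = np.trace(np.linalg.solve(K, dK))
estimate = stochastic_trace(K, dK, n_samples=200, rng=rng)
print(exact, estimate)
```

<p>The estimate is unbiased for any number of probes; more probes only reduce its variance, which is what makes it usable inside a stochastic gradient optimizer.</p>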
<h2 id="conclusion">Conclusion</h2>
<p>In this post, we covered how to train a covariance function in a Gaussian
process using gradient based methods. As the exact gradient does not scale well, we
also discussed how to use random samples to approximate the gradient.</p>
<h2 id="references">References</h2>
<div class="footnotes">
<ol>
<li id="fn:2">
<p><cite>M. Filippone and R. Engler, Enabling scalable stochastic gradient-based inference for Gaussian processes by employing the Unbiased LInear System SolvEr (ULISSE), ICML’15</cite> <a href="#fnref:2" class="reversefootnote">↩</a></p>
</li>
</ol>
</div>Chris Choychrischoy@ai.stanford.eduA Gaussian process is a non-parametric model which can represent a complex function using a growing set of data. Unlike a neural network, which can also learn complex functions, a Gaussian process can also provide variance (uncertainty) of a data since the model is based on a simple Gaussian distribution.DeformNet: Free-Form Deformation Network for 3D Shape Reconstruction from a Single Image2017-08-18T01:18:24-07:002017-08-18T01:18:24-07:00https://chrischoy.github.io/preprint/deformnet<h2 id="abstract">Abstract</h2>
<p>3D reconstruction from a single image is a key problem in multiple applications ranging from robotic manipulation to augmented reality. Prior methods have tackled this problem through generative models which predict 3D reconstructions as voxels or point clouds. However, these methods can be computationally expensive and miss fine shape details. We introduce a new differentiable layer for 3D data deformation and use it in DeformNet to learn free-form deformations usable on multiple 3D data formats. DeformNet takes an image input, searches the nearest shape template from the database, and deforms the template to match the query image. We evaluate our approach on the ShapeNet database and show that - (a) Free-Form Deformation is a powerful new building block for Deep Learning models that manipulate 3D data (b) DeformNet uses this FFD layer combined with shape retrieval for smooth and detail-preserving 3D reconstruction of qualitatively plausible point clouds with respect to a single query image (c) compared to other state-of-the-art 3D reconstruction methods, DeformNet quantitatively matches or outperforms their benchmarks by significant margins.</p>
<ul>
<li><a href="https://deformnet-site.github.io/DeformNet-website/">Project page</a></li>
<li><a href="https://arxiv.org/abs/1708.04672">ArXiv</a></li>
</ul>Andrey KurenkovAbstractWeakly Supervised 3D Reconstruction with Manifold Constraint2017-06-01T23:39:17-07:002017-06-01T23:39:17-07:00https://chrischoy.github.io/publication/weakly-supervised-reconstruction<h2 id="abstract">Abstract</h2>
<p>Volumetric 3D reconstruction has witnessed a significant progress in performance through the use of deep neural network based methods that address some of the limitations of traditional reconstruction algorithms. However, this increase in performance requires large scale annotations of 2D/3D data. This paper introduces a novel generative model for volumetric 3D reconstruction, Weakly supervised Generative Adversarial Network (WS-GAN) which reduces reliance on expensive 3D supervision. WS-GAN takes an input image, a sparse set of 2D object masks with respective camera parameters, and an unmatched 3D model as inputs during training. WS-GAN uses a learned encoding as input to a conditional 3D-model generator trained alongside a discriminator, which is constrained to the manifold of realistic 3D shapes. We bridge the representation gap between 2D masks and 3D volumes through a perspective raytrace pooling layer, that enables perspective projection and allows backpropagation. We evaluate WS-GAN on ShapeNet, ObjectNet and Stanford Online Product dataset for reconstruction with single-view and multi-view cases in both synthetic and real images. We compare our method to voxel carving and prior work with full 3D supervision. Additionally, we also demonstrate that the learned feature representation is semantically meaningful through interpolation and manipulation in input space.</p>
<ul>
<li><a href="https://arxiv.org/abs/1705.10904">ArXiv</a></li>
</ul>Christopher B. Choy*AbstractExpectation Maximization and Variational Inference (Part 2)2017-03-23T09:05:51-07:002017-03-23T09:05:51-07:00https://chrischoy.github.io/research/Expectation-Maximization-and-Variational-Inference-2<p>In the <a href="https://chrischoy.github.io/research/Expectation-Maximization-and-Variational-Inference/">previous post</a>, we covered
variational inference and how to derive update equations. In this post, we will
go over a simple Gaussian Mixture Model with a Dirichlet prior distribution
over the mixture weights.</p>
<p>Let $x_n$ be a datum and $z_n$ be the latent variable that indicates the
assignment of the datum $x_n$ to a cluster $k$, $z_{nk} = I(z_n = k)$. We
denote the weight of a cluster $k$ with $\pi_k$ and the natural parameter of
the cluster as $\eta_k$.</p>
<p>The graphical model of the mixtures looks like the following.</p>
<figure>
<img style="width:30%" class="align-center" src="https://chrischoy.github.io/images/research/graphical_model.png" />
</figure>
<p>Formally, we define the generative process
$p(\pi | \alpha_0), p(z_n | \pi), p(x_n | z_n, \eta)$.
Unlike Bishop <sup id="fnref:1"><a href="#fn:1" class="footnote">1</a></sup> and Blei et al. <sup id="fnref:2"><a href="#fn:2" class="footnote">2</a></sup>, we will not place a prior over the natural
parameter $\eta$ for simplicity. The notation and the model are similar to those
used in Blei et al. <sup id="fnref:2:1"><a href="#fn:2" class="footnote">2</a></sup>. With some abuse of notation,</p>
$$
\begin{align}
p(\pi | \alpha_0) & = \mathrm{Dir}(\pi; \alpha_0) \\
p(z_n | \pi) & = \prod_k \pi_k^{z_{nk}} \\
p(x_n | z_n, \eta) & = \prod_k \mathcal{N}(x_n ; \eta_k)^{z_{nk}}
\end{align}
$$
<p>And the log joint probability is</p>
$$
\log p(\mathbf{x}, \mathbf{z}, \pi ; \eta, \alpha_0) = \sum_n \sum_k z_{nk} [\log \pi_k + \log \mathcal{N}(x_n ; \eta_k)] + \log \mathrm{Dir}(\pi; \alpha_0)
$$
<h2 id="meanfield-approximation">Meanfield Approximation</h2>
<p>In this example, let’s use the meanfield approximation and assume that the approximate posterior
distribution of the latent variables $z$ and $\pi$ factorizes, i.e.,</p>
$$
q(z, \pi) = q(z)q(\pi)
$$
<p>From the <a href="https://chrischoy.github.io/research/Expectation-Maximization-and-Variational-Inference/">previous post</a>, we know that
the optimal distribution $q(\cdot)$ that maximizes the evidence lower bound
is</p>
$$
\log q(w_i) = \mathbb{E}_{w_{j}, j\neq i} \log p(x, \mathbf{w}) + \mathrm{const}
$$
<p>where $w_i$ is an arbitrary latent variable. Thus, we can use the same
technique and find $q(z)$ and $q(\pi)$.</p>
$$
\begin{align*}
\log q(z) & = \sum_n \sum_k z_{nk} [\mathbb{E}\log \pi_k + \log \mathcal{N}(x_n ; \eta_k)] + \mathbb{E}\log \mathrm{Dir}(\pi; \alpha_0) \\
& = \sum_n \sum_k z_{nk} [\mathbb{E}\log \pi_k + \log \mathcal{N}(x_n ; \eta_k)] + C_1 \\
\log q(\pi) & = \sum_n \sum_k \mathbb{E}z_{nk} [\log \pi_k + \log \mathcal{N}(x_n ; \eta_k)] + \log \mathrm{Dir}(\pi; \alpha_0) \\
& = \sum_n \sum_k \mathbb{E}z_{nk} \log \pi_k + \log \mathrm{Dir}(\pi; \alpha_0) + C_2
\end{align*}
$$
<p>We can easily compute the expectations of the latent variables.</p>
$$
\begin{align*}
\mathbb{E}\log \pi_k & = \psi(\alpha_k) - \psi\left(\sum_l \alpha_l\right) = \log \tilde{\pi}_k \\
\mathbb{E}z_{nk} & = q(z_{nk}=1) \propto \exp\left\{\log \tilde{\pi}_k + \log \mathcal{N}(x_n; \eta_k)\right\} = \rho_{nk} \\
\mathbb{E}z_{nk} & = \frac{\rho_{nk}}{\sum_l \rho_{nl}} = r_{nk}
\end{align*}
$$
<p>where $\alpha_k$ are the parameters of the Dirichlet distribution $q(\pi)$ and $\psi$
is the digamma function. We get the first equation from the property of the
Dirichlet distribution. Given the expectations, we can simplify the equations
and get update rules.</p>
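<p>The expectations above translate almost line by line into code. The following sketch computes the responsibilities $r_{nk}$ from the Dirichlet parameters and the Gaussian parameters, using SciPy's <code class="highlighter-rouge">digamma</code> and <code class="highlighter-rouge">logsumexp</code>. The function name <code class="highlighter-rouge">e_step</code>, the mean and covariance parameterization of $\eta_k$, and the toy data are assumptions made for illustration.</p>

```python
import numpy as np
from scipy.special import digamma, logsumexp
from scipy.stats import multivariate_normal

def e_step(X, alpha, means, covs):
    """Return r[n, k] = q(z_n = k) given Dirichlet parameters and Gaussian parameters."""
    log_pi_tilde = digamma(alpha) - digamma(np.sum(alpha))  # E[log pi_k]
    log_lik = np.stack(
        [multivariate_normal.logpdf(X, mean=m, cov=S) for m, S in zip(means, covs)],
        axis=1,
    )                                                       # log N(x_n; eta_k), shape (N, K)
    log_rho = log_pi_tilde[None, :] + log_lik
    # normalize in log space to avoid underflow
    return np.exp(log_rho - logsumexp(log_rho, axis=1, keepdims=True))

rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(-2.0, 0.5, size=(50, 2)),
    rng.normal(2.0, 0.5, size=(50, 2)),
])
alpha = np.array([1.0, 1.0])
means = [np.array([-2.0, -2.0]), np.array([2.0, 2.0])]
covs = [np.eye(2), np.eye(2)]

r = e_step(X, alpha, means, covs)
print(r[:3], r[-3:])
```

<p>Working with $\log \rho_{nk}$ and normalizing with <code class="highlighter-rouge">logsumexp</code> keeps the computation stable when a datum is far from every cluster.</p>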
<h2 id="expectation-and-maximization">Expectation and Maximization</h2>
<p>First, let’s examine the $\log q(\pi)$.</p>
$$
\begin{align*}
\log q(\pi) & = \sum_n \sum_k r_{nk} \log \pi_k + \log \mathrm{Dir}(\pi; \alpha_0) + C_2 \\
& = \sum_n \sum_k r_{nk} \log \pi_k + \sum_k (\alpha_0 - 1) \log \pi_k + C_3 \\
& = \sum_k (\alpha_0 + \sum_n r_{nk} - 1) \log \pi_k + C_3 \\
& = \log \mathrm{Dir}(\pi| \alpha)
\end{align*}
$$
<p>Thus, $\alpha_k = \alpha_0 + \sum_n r_{nk}$. The $z$ update equation is given
above. Finally, for $\eta$, we differentiate the expected log joint probability, written as
$\log p(x;\eta)$ with some abuse of notation, with respect to
$\eta$ to find the update rule.</p>
$$
\begin{align*}
\log p(x; \eta) & = \mathop{\mathbb{E}}_{z, \pi} \log p(x, z, \pi; \eta) \\
& = \sum_n \sum_k \mathbb{E} z_{nk} [\mathbb{E}\log \pi_k + \log \mathcal{N}(x_n ; \eta_k)] + \mathbb{E}\log \mathrm{Dir}(\pi; \alpha_0) \\
\nabla_{\eta_k} \log p(x; \eta) & = \sum_n r_{nk} \nabla_{\eta_k} \log \mathcal{N}(x_n ; \eta_k) \\
& = \sum_n r_{nk} \nabla_{\eta_k} \left( \frac{1}{2} \log |\Lambda_k| - \frac{1}{2} \mathrm{Tr}\left(\Lambda_k (x_n - \mu_k)(x_n - \mu_k)^T \right) \right) \\
\nabla_{\mu_k} \log p(x; \eta) & = \sum_n r_{nk} \Lambda_k (x_n - \mu_k) = 0 \\
\nabla_{\Lambda_k} \log p(x; \eta) & = \frac{1}{2} \sum_n \left( r_{nk} \nabla_{\Lambda_k} \log |\Lambda_k| - r_{nk} \nabla_{\Lambda_k} \mathrm{Tr}\left(\Lambda_k (x_n - \mu_k)(x_n - \mu_k)^T \right) \right) \\
& = \frac{1}{2} \sum_n \left( r_{nk} \Lambda_k^{-1} - r_{nk} (x_n - \mu_k)(x_n - \mu_k)^T \right) = 0 \\
\end{align*}
$$
<p>From the above equations, we can get</p>
$$
\begin{align}
N_k & = \sum_n r_{nk} \\
\mu_k & = \frac{1}{N_k} \sum_n r_{nk} x_n \\
\Lambda_k^{-1} & = \frac{1}{N_k} \sum_n r_{nk} (x_n - \mu_k)(x_n - \mu_k)^T
\end{align}
$$
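<p>These updates can be sketched as a single function. The code below assumes that $\Lambda_k$ is a precision matrix, so the responsibility-weighted scatter matrix is an estimate of $\Lambda_k^{-1}$ (the covariance); the function name <code class="highlighter-rouge">m_step</code> and the toy data with hard assignments are made up for illustration.</p>

```python
import numpy as np

def m_step(X, r, alpha0):
    """Update the Dirichlet parameters and the Gaussian parameters from responsibilities r."""
    N_k = r.sum(axis=0)                    # effective number of points per cluster
    alpha = alpha0 + N_k                   # parameters of q(pi)
    mu = (r.T @ X) / N_k[:, None]          # weighted means, shape (K, D)
    diff = X[:, None, :] - mu[None, :, :]  # x_n - mu_k, shape (N, K, D)
    # weighted scatter per cluster; an estimate of Lambda_k^{-1} when Lambda_k is a precision
    cov = np.einsum("nk,nkd,nke->kde", r, diff, diff) / N_k[:, None, None]
    return N_k, alpha, mu, cov

X = np.array([[0.0, 0.0], [2.0, 0.0], [10.0, 10.0], [12.0, 10.0]])
r = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])  # hard assignments

N_k, alpha, mu, cov = m_step(X, r, alpha0=1.0)
print(N_k, alpha)
print(mu)
print(cov[0])
```

<p>Alternating this step with the computation of $r_{nk}$ gives the full coordinate ascent loop; a small ridge on the diagonal of <code class="highlighter-rouge">cov</code> is a common safeguard when a cluster collapses onto a few points.</p>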
<h2 id="evidence-lower-bound">Evidence Lower Bound</h2>
<p>Given the final solutions $r_{nk}$, $\log \tilde{\pi}_k$, $\alpha'$, we can
derive the negative of the variational free energy, or the Evidence Lower Bound (ELBO).</p>
$$
\begin{align*}
ELBO & = \mathbb{E}_z \mathbb{E}_\pi \log \frac{p(x, z, \pi)}{q(z, \pi)} \\
& = \mathbb{E}_z \mathbb{E}_\pi \log p(x | z) p(z| \pi) p(\pi) - \mathbb{E}_z\mathbb{E}_\pi \log q(z)q(\pi) \\
& = \underbrace{\mathbb{E}_z \log p(x | z)}_{\mbox{(a)}}
+ \underbrace{\mathbb{E}_z \mathbb{E}_\pi \log p(z | \pi) p(\pi) }_{\mbox{(b)}}
+ \underbrace{H(q(z))}_{\mbox{(c)}}
+ \underbrace{H(q(\pi))}_{\mbox{(d)}}
\end{align*}
$$
<p>where $H(\cdot)$ is the entropy. Each of the terms can be computed as follows.</p>
$$
\begin{align*}
\mbox{(a)} & = \mathbb{E}_z \log p(x | z) \\
& = \mathbb{E}_z \mathbb{E}_\pi \sum_n \sum_k z_{nk} \log \mathcal{N}_k(x_n) \\
& = \sum_n \sum_k r_{nk} \log \mathcal{N}_k(x_n) \\
\mbox{(b)} & = \mathbb{E}_z \mathbb{E}_\pi \log p(z | \pi) p(\pi) \\
& = \mathbb{E}_z \mathbb{E}_\pi \log \left[ \frac{1}{B(\mathbb{\alpha}_0)} \prod_k \pi_k^{\sum_n z_{nk}} \pi_k^{\alpha_0 - 1} \right] \\
& = \mathbb{E}_z \mathbb{E}_\pi \sum_k \left( \sum_n z_{nk} + \alpha_0 - 1 \right) \log \pi_k - \log B(\mathbb{\alpha}_0) \\
& = \sum_k \left( \sum_n \mathbb{E}_z z_{nk} + \alpha_0 - 1 \right) \mathbb{E}_\pi \log \pi_k - \log B(\mathbb{\alpha}_0) \\
& = \sum_k \left( \sum_n r_{nk} + \alpha_0 - 1 \right) \log \tilde{\pi}_k - \log B(\mathbb{\alpha}_0) \\
\mbox{(c)} & = - \mathbb{E}_z \log q(z) \\
& = - \mathbb{E}_z \sum_n \sum_k z_{nk} \log r_{nk} \\
& = - \sum_n \sum_k r_{nk} \log r_{nk} \\
\mbox{(d)} & = - \mathbb{E}_\pi \log q(\pi) \\
& = - \mathbb{E}_\pi \log \frac{1}{B(\mathbb{\alpha}')} \prod_k \pi_k^{\alpha'_k - 1} \\
& = - \sum_k (\alpha'_k - 1) \mathbb{E}_\pi \log \pi_k + \log B(\mathbb{\alpha}') \\
& = - \sum_k (\alpha'_k - 1) \log \tilde{\pi}_k + \log B(\mathbb{\alpha}')
\end{align*}
$$
<p>Since $\log r_{nk} = \log \tilde{\pi}_k + \log \mathcal{N}_k(x_n) - \log \left( \sum_l \exp \{\log \tilde{\pi}_l + \log \mathcal{N}_l(x_n) \} \right)$,</p>
$$
\begin{align*}
\mbox{(a) + (c)} & = \sum_n \sum_k r_{nk} \left(\log \mathcal{N}_k(x_n) - \log r_{nk} \right) \\
& = \sum_n \sum_k r_{nk} \left(- \log \tilde{\pi}_k + \log \left( \sum_l \exp \{ \log \tilde{\pi}_l + \log \mathcal{N}_l(x_n) \} \right) \right)\\
& = - \sum_k N_k \log \tilde{\pi}_k + \sum_n \log \left( \sum_l \exp \{ \log \tilde{\pi}_l + \log \mathcal{N}_l(x_n) \} \right) \\
\mbox{(b) + (d)} & = \sum_k \left( \sum_n r_{nk} + \alpha_0 - 1 \right) \log \tilde{\pi}_k - \log B(\mathbb{\alpha}_0) \\
& - \sum_k (\alpha'_k - 1) \log \tilde{\pi}_k + \log B(\mathbb{\alpha}') \\
& = \sum_k \left( \sum_n r_{nk} + \alpha_0 - \alpha'_k \right) \log \tilde{\pi}_k - \log B(\mathbb{\alpha}_0) + \log B(\mathbb{\alpha}') \\
& = \log B(\mathbb{\alpha}') - \log B(\mathbb{\alpha}_0)
\end{align*}
$$
<p>Thus,</p>
$$
\begin{align*}
ELBO = & \mathbb{E}_z \mathbb{E}_\pi \log \frac{p(x, z, \pi)}{q(z, \pi)} \\
= & - \sum_k N_k \log \tilde{\pi}_k + \sum_n \log \left( \sum_l \exp \{ \log \tilde{\pi}_l + \log \mathcal{N}_l(x_n) \} \right) \\
& + \log B(\mathbb{\alpha}') - \log B(\mathbb{\alpha}_0) \\
\end{align*}
$$
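<p>As a sanity check on the derivation, the sketch below evaluates the ELBO as the sum of the four terms (a) through (d) while iterating the $q(z)$ and $q(\pi)$ updates, and then compares the result with the closed form above. It assumes a one-dimensional mixture with two fixed, known components and $\alpha_0 = 1$; the function names and the data are hypothetical. The closed form relies on $r_{nk}$ and $\alpha'$ being mutually consistent, so it is only compared after the updates have converged.</p>

```python
import numpy as np
from scipy.special import digamma, entr, gammaln, logsumexp
from scipy.stats import norm

def log_B(alpha):
    """Log of the Dirichlet normalizer B(alpha)."""
    return np.sum(gammaln(alpha)) - gammaln(np.sum(alpha))

def elbo(log_lik, r, alpha, alpha0):
    """(a) + (b) + (c) + (d) for q(z) = r and q(pi) = Dir(alpha)."""
    log_pi_tilde = digamma(alpha) - digamma(np.sum(alpha))
    N_k = r.sum(axis=0)
    a = np.sum(r * log_lik)
    b = np.sum((N_k + alpha0 - 1.0) * log_pi_tilde) - log_B(alpha0)
    c = np.sum(entr(r))                                    # entr(x) = -x log x
    d = -np.sum((alpha - 1.0) * log_pi_tilde) + log_B(alpha)
    return a + b + c + d

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2.0, 1.0, 60), rng.normal(2.0, 1.0, 40)])
log_lik = norm.logpdf(x[:, None], loc=np.array([-2.0, 2.0]), scale=1.0)  # shape (N, K)

alpha0 = np.ones(2)
alpha = alpha0.copy()
history = []
for _ in range(200):
    log_pi_tilde = digamma(alpha) - digamma(alpha.sum())
    log_rho = log_pi_tilde + log_lik
    r = np.exp(log_rho - logsumexp(log_rho, axis=1, keepdims=True))  # update q(z)
    alpha = alpha0 + r.sum(axis=0)                                   # update q(pi)
    history.append(elbo(log_lik, r, alpha, alpha0))

log_pi_tilde = digamma(alpha) - digamma(alpha.sum())
closed_form = (
    -np.sum(r.sum(axis=0) * log_pi_tilde)
    + np.sum(logsumexp(log_pi_tilde + log_lik, axis=1))
    + log_B(alpha) - log_B(alpha0)
)
print(history[0], history[-1], closed_form)
```

<p>Tracking the ELBO this way doubles as a debugging tool: each coordinate update maximizes the bound with respect to one factor, so the recorded values should never decrease.</p>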
<h1 id="references">References</h1>
<div class="footnotes">
<ol>
<li id="fn:1">
<p>C. Bishop, Pattern Recognition and Machine Learning. Springer, 2006 <a href="#fnref:1" class="reversefootnote">↩</a></p>
</li>
<li id="fn:2">
<p>Blei, <a href="http://www.cs.columbia.edu/~blei/papers/BleiJordan2004.pdf">Variational Inference for Dirichlet Process Mixtures, Bayesian Analysis 2006</a> <a href="#fnref:2" class="reversefootnote">↩</a> <a href="#fnref:2:1" class="reversefootnote">↩<sup>2</sup></a></p>
</li>
</ol>
</div>Chris Choychrischoy@ai.stanford.eduIn the previous post, we covered variational inference and how to derive update equations. In this post, we will go over a simple Gaussian Mixture Model with the Dirichlet prior distribution over the mixture weight.DESIRE: Deep Stochastic IOC RNN Encoder-decoder for Distant Future Prediction in Dynamic Scenes with Multiple Interacting Agents2017-03-14T02:22:45-07:002017-03-14T02:22:45-07:00https://chrischoy.github.io/publication/desire<h2 id="abstract">Abstract</h2>
<p>We introduce a Deep Stochastic IOC RNN Encoder-decoder framework, DESIRE, with a conditional Variational Auto-Encoder and multiple RNNs for the task of future predictions of multiple interacting agents in dynamic scenes. Accurately predicting the location of objects in the future is an extremely challenging task. An effective prediction model must be able to 1) account for the multi-modal nature of the future prediction (i.e., given the same context, future may vary), 2) foresee the potential future outcomes and make a strategic prediction based on that, and 3) reason not only from the past motion history, but also from the scene context as well as the interactions among the agents.
DESIRE can address all aforementioned challenges in a single end-to-end trainable neural network model, while being computationally efficient. The model first obtains a diverse set of hypothetical future prediction samples employing a conditional variational auto-encoder, which are ranked and refined via the following RNN scoring-regression module. We evaluate our model on two publicly available datasets: KITTI and Stanford Drone Dataset. Our experiments show that the proposed model significantly improves the prediction accuracy compared to other baseline methods.</p>Namhoon LeeAbstractScene Graph Generation by Iterative Message Passing2017-03-14T02:22:45-07:002017-03-14T02:22:45-07:00https://chrischoy.github.io/publication/scene-graph<h2 id="abstract">Abstract</h2>
<p>Understanding a visual scene goes beyond recognizing individual objects in isolation. Relationships between objects also constitute rich semantic information about the scene. In this work, we explicitly model the objects and their relationships using scene graphs, a visually-grounded graphical structure of an image. We propose a novel end-to-end model that generates such structured scene representation from an input image. The model solves the scene graph inference problem using standard RNNs and learns to iteratively improve its predictions via message passing. Our joint inference model can take advantage of contextual cues to make better predictions on objects and their relationships. The experiments show that our model significantly outperforms previous methods on generating scene graphs using Visual Genome dataset and inferring support relations with NYU Depth v2 dataset.</p>
<ul>
<li><a href="https://arxiv.org/abs/1701.02426">ArXiv</a></li>
</ul>Danfei XuAbstract