<p>Computer Vision'er', by Chris Choy (chrischoy@ai.stanford.edu)</p>
<h1 id="deep-global-registration">Deep Global Registration</h1>
<p>2020-04-23, Christopher Choy · <a href="https://chrischoy.github.io/publication/dgr">https://chrischoy.github.io/publication/dgr</a></p>
<p><img src="https://chrischoy.github.io/images/publication/dgr/vid100.gif" alt="Pipeline" /></p>
<h2 id="abstract">Abstract</h2>
<p>We present Deep Global Registration, a differentiable framework for pairwise registration of real-world 3D scans. Deep global registration is based on three modules: a 6-dimensional convolutional network for correspondence confidence prediction, a differentiable Weighted Procrustes algorithm for closed-form pose estimation, and a robust gradient-based SE(3) optimizer for pose refinement. Experiments demonstrate that our approach outperforms state-of-the-art methods, both learning-based and classical, on real-world data.</p>
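To make the Weighted Procrustes module concrete, here is a minimal NumPy sketch of the closed-form weighted alignment it solves (an illustration only, not the paper's differentiable implementation; all names are mine):

```python
import numpy as np

def weighted_procrustes(x, y, w):
    """Closed-form solution of min_{R,t} sum_i w_i ||R x_i + t - y_i||^2.

    A weighted Kabsch/Procrustes solver in the spirit of the Weighted
    Procrustes module above; the paper's version is differentiable with
    respect to the weights, which plain NumPy does not provide.
    """
    w = w / w.sum()                              # normalize weights
    mx = w @ x                                   # weighted centroid of x
    my = w @ y                                   # weighted centroid of y
    H = (x - mx).T @ (w[:, None] * (y - my))     # weighted cross-covariance
    U, _, Vt = np.linalg.svd(H)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    R = Vt.T @ D @ U.T                           # nearest proper rotation
    t = my - R @ mx
    return R, t
```

With uniform weights on clean correspondences this reduces to the classical Kabsch algorithm; the confidence weights predicted by the 6-dimensional ConvNet serve to down-weight outlier correspondences.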
<h2 id="paper">Paper</h2>
<p><a class="paper-thumbnail" href="https://node1.chrischoy.org/data/publications/dgr/DGR.pdf">
<img src="https://chrischoy.github.io/images/publication/dgr/paper-0.png" />
<img src="https://chrischoy.github.io/images/publication/dgr/paper-1.png" />
<img src="https://chrischoy.github.io/images/publication/dgr/paper-2.png" />
<img src="https://chrischoy.github.io/images/publication/dgr/paper-3.png" />
<img src="https://chrischoy.github.io/images/publication/dgr/paper-4.png" />
<img src="https://chrischoy.github.io/images/publication/dgr/paper-5.png" />
<img src="https://chrischoy.github.io/images/publication/dgr/paper-6.png" />
<img src="https://chrischoy.github.io/images/publication/dgr/paper-7.png" />
<img src="https://chrischoy.github.io/images/publication/dgr/paper-8.png" />
<img src="https://chrischoy.github.io/images/publication/dgr/paper-9.png" />
<img src="https://chrischoy.github.io/images/publication/dgr/paper-10.png" />
</a></p>
<p><a href="https://node1.chrischoy.org/data/publications/dgr/DGR.pdf">paper</a></p>
<h2 id="oral-presentation">Oral Presentation</h2>
<iframe width="560" height="315" src="https://www.youtube.com/embed/Iy17wvo07BU" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen=""></iframe>
<h2 id="1-min-video">1-min Video</h2>
<iframe width="560" height="315" src="https://www.youtube.com/embed/stzgn6DkozA" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen=""></iframe>
<h2 id="quick-pipeline-video">Quick Pipeline Video</h2>
<p><img src="https://chrischoy.github.io/images/publication/dgr/text_100.gif" alt="Pipeline" /></p>
<h2 id="supplementary-materials">Supplementary Materials</h2>
<ul>
<li><a href="https://node1.chrischoy.org/data/publications/dgr/DGR_supp.pdf">Supplementary paper</a></li>
<li><a href="https://github.com/chrischoy/DeepGlobalRegistration">Code</a></li>
<li><a href="https://chrischoy.github.io/images/publication/dgr/DGR_Poster_CVPR20.pdf">Poster</a></li>
</ul>
<figure>
<a href="https://chrischoy.github.io/images/publication/dgr/DGR_Poster_CVPR20.pdf"><img src="https://chrischoy.github.io/images/publication/dgr/dgr_poster.png" /></a>
</figure>
<ul>
<li>KITTI registration visualization</li>
</ul>
<p><img src="https://chrischoy.github.io/images/publication/dgr/kitti1_optimized.gif" alt="" /></p>
<h2 id="registration-results">Registration Results</h2>
<ul>
<li>All registration results on every 100th frame of the 3DMatch benchmark</li>
</ul>
<iframe width="560" height="315" src="https://www.youtube.com/embed/kZfj3N4g8w8" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen=""></iframe>
<h2 id="bibtex">Bibtex</h2>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>@inproceedings{choy2020deep,
title={Deep Global Registration},
author={Choy, Christopher and Dong, Wei and Koltun, Vladlen},
booktitle={Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition},
year={2020}
}
</code></pre></div></div>
<h1 id="high-dimensional-convolutional-networks-for-geometric-pattern-recognition">High-dimensional Convolutional Networks for Geometric Pattern Recognition</h1>
<p>2020-04-23, Christopher Choy · <a href="https://chrischoy.github.io/publication/highdimconvnet">https://chrischoy.github.io/publication/highdimconvnet</a></p>
<h2 id="abstract">Abstract</h2>
<p>Many problems in science and engineering can be formulated in terms of geometric patterns in high-dimensional spaces. We present high-dimensional convolutional networks (ConvNets) for pattern recognition problems that arise in the context of geometric registration. We first study the effectiveness of convolutional networks in detecting linear subspaces in high-dimensional spaces with up to 32 dimensions: much higher dimensionality than prior applications of ConvNets. We then apply high-dimensional ConvNets to 3D registration under rigid motions and image correspondence estimation. Experiments indicate that our high-dimensional ConvNets outperform prior approaches that relied on deep networks based on global pooling operators.</p>
<h2 id="paper">Paper</h2>
<p><a class="paper-thumbnail" href="https://node1.chrischoy.org/data/publications/highdimconvnets/highdimconvnets.pdf">
<img src="https://chrischoy.github.io/images/publication/highdimconvnets/paper-0.png" />
<img src="https://chrischoy.github.io/images/publication/highdimconvnets/paper-1.png" />
<img src="https://chrischoy.github.io/images/publication/highdimconvnets/paper-2.png" />
<img src="https://chrischoy.github.io/images/publication/highdimconvnets/paper-3.png" />
<img src="https://chrischoy.github.io/images/publication/highdimconvnets/paper-4.png" />
<img src="https://chrischoy.github.io/images/publication/highdimconvnets/paper-5.png" />
<img src="https://chrischoy.github.io/images/publication/highdimconvnets/paper-6.png" />
<img src="https://chrischoy.github.io/images/publication/highdimconvnets/paper-7.png" />
<img src="https://chrischoy.github.io/images/publication/highdimconvnets/paper-8.png" />
<img src="https://chrischoy.github.io/images/publication/highdimconvnets/paper-9.png" />
</a></p>
<p><a href="https://node1.chrischoy.org/data/publications/highdimconvnets/highdimconvnets.pdf">paper</a></p>
<h2 id="supplementary-materials">Supplementary Materials</h2>
<ul>
<li><a href="https://github.com/chrischoy/HighDimConvNets">Code</a></li>
</ul>
<h2 id="oral-presentation">Oral Presentation</h2>
<iframe width="560" height="315" src="https://www.youtube.com/embed/bsPGPRrAJOY" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen=""></iframe>
<h2 id="1-min-video">1-min Video</h2>
<iframe width="560" height="315" src="https://www.youtube.com/embed/IJI0uUjmQPA" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen=""></iframe>
<h2 id="bibtex">Bibtex</h2>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>@inproceedings{choy2020high,
title={High-dimensional Convolutional Networks for Geometric Pattern Recognition},
author={Choy, Christopher and Lee, Junha and Ranftl, Rene and Park, Jaesik and Koltun, Vladlen},
booktitle={Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition},
year={2020}
}
</code></pre></div></div>
<h1 id="high-dimensional-convolutional-neural-networks-for-3d-perception">High-dimensional Convolutional Neural Networks for 3D Perception</h1>
<p>2020-04-07, Christopher Choy · <a href="https://chrischoy.github.io/thesis/thesis">https://chrischoy.github.io/thesis/thesis</a></p>
<h2 id="abstract">Abstract</h2>
<p>The automation of mechanical tasks brought the modern world unprecedented
prosperity and comfort. However, the majority of automated tasks have been
simple mechanical tasks that only require repetitive motion. Tasks that require
visual perception and high-level cognition remain the last frontiers of
automation. Examples include automated warehouses, where robots must package
items in disarray, and autonomous driving, where agents must localize
themselves and identify and track other dynamic objects in the 3D world. This
ability to represent, identify, and interpret three-dimensional visual data in
order to understand the underlying three-dimensional structure of the real
world is known as 3D perception. In this dissertation, we propose
learning-based approaches to tackle challenges in 3D perception. Specifically,
we propose a set of high-dimensional convolutional neural
networks for three categories of problems in 3D perception: reconstruction,
representation learning, and registration.</p>
<p>Reconstruction is the first step that generates 3D point clouds or meshes from
a set of sensory inputs. We present supervised reconstruction methods using 3D
convolutional neural networks that take a set of images as input and generate
3D occupancy patterns in a grid as output. We train the networks on a large-scale 3D
shape dataset with images rendered from various viewpoints and
validate the approach on real image datasets.
However, supervised reconstruction requires 3D shapes as labels for all images, which are expensive to generate.
Instead, we propose weaker supervision: training the reconstruction network with a set of foreground masks and unlabeled real 3D shapes.
Combined with the learned constraint, we train the reconstruction system with as few as one image and show that the
proposed model learns to reconstruct without direct 3D supervision.</p>
<p>In the second part of the dissertation, we present sparse tensor networks,
neural networks for spatially sparse tensors. As we increase the spatial
dimension, the sparsity of input data decreases drastically as the volume of
the space increases exponentially. Sparse tensor networks exploit such inherent
sparsity in the input data and efficiently process them.
With the sparse tensor network, we create a 4-dimensional convolutional network for spatio-temporal perception for 3D scans or a sequence of 3D scans (3D video).
We show that 4-dimensional convolutional neural networks can effectively make use of temporal consistency and improve the accuracy of segmentation.
Next, we use the
sparse tensor networks for geometric representation learning to capture both local and global 3D structures accurately for correspondences and registration. We
propose fully convolutional networks and new types of metric learning losses that allow neurons to capture broad spatial context while preserving local geometry.
We experimentally validate our approach on both indoor and outdoor datasets and show that the network outperforms the state-of-the-art method while being a few orders of magnitude faster.</p>
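The sparse-tensor representation described above can be sketched in a few lines (a toy NumPy illustration of the coordinates-plus-features idea, not the actual sparse tensor network library):

```python
import numpy as np

# A spatially sparse tensor stores only the occupied sites: an N x D
# integer coordinate matrix plus an N x C feature matrix.
rng = np.random.default_rng(0)
points = rng.uniform(0.0, 1.0, (10000, 3))         # a toy 3D scan
voxel_size = 0.05
coords = np.floor(points / voxel_size).astype(np.int64)
coords = np.unique(coords, axis=0)                 # one entry per voxel
feats = np.ones((len(coords), 1))                  # e.g. occupancy feature

# Fraction of the dense 20x20x20 grid that is actually occupied; a sparse
# convolution over (coords, feats) visits only these sites, and the saving
# over a dense grid grows exponentially with the spatial dimension.
density = len(coords) / 20**3
```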
<p>In the third and the last part of the dissertation, we discuss high-dimensional pattern
recognition problems in image and 3D registration. We first propose
high-dimensional convolutional networks from 4 to 32-dimensional spaces and analyze the
geometric pattern recognition capacity of these high-dimensional convolutional
networks for linear regression problems.
Next, we show that the 3D correspondences form a hyper-surface in
6-dimensional space; and 2D correspondences form a 4-dimensional hyper-conic section,
which we detect using high-dimensional convolutional networks. We extend the
proposed high-dimensional convolutional networks for differentiable 3D registration and
propose three core modules for this: a 6-dimensional convolutional neural
network for correspondence confidence prediction; a differentiable Weighted
Procrustes method for closed-form pose estimation; and a robust gradient-based
3D rigid transformation optimizer for pose refinement. Experiments demonstrate that our
approach outperforms state-of-the-art learning-based and classical methods
on real-world data while maintaining efficiency.</p>
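The hyper-surface claim above can be checked numerically (a toy NumPy illustration, not from the dissertation): for a single fixed rigid motion, the inlier pairs (x, y) stacked as 6-D points span only a 3-dimensional subspace of R^6.

```python
import numpy as np

c, s = np.cos(0.3), np.sin(0.3)
R = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])  # rotation about z
t = np.array([1.0, -2.0, 0.5])

rng = np.random.default_rng(0)
x = rng.standard_normal((1000, 3))   # source points
y = x @ R.T + t                      # their correspondences under (R, t)

corr = np.hstack([x, y])             # each pair as a single 6-D point
# Inlier pairs satisfy y - R x - t = 0 (three linear constraints), so after
# centering the 6-D points have rank 3, not 6.
rank = np.linalg.matrix_rank(corr - corr.mean(axis=0))
print(rank)  # 3
```

For a fixed rigid motion the surface is in fact an affine subspace; detecting such structures amid outliers is what the high-dimensional ConvNets are trained for.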
<h2 id="thesis">Thesis</h2>
<p>The thesis is posted on the Stanford Digital Repository: <a href="http://purl.stanford.edu/fg022dx0979">Thesis</a>.</p>
<h2 id="chapters">Chapters</h2>
<p>You can access each chapter without downloading the full thesis from the following list.</p>
<ul>
<li><a href="https://node1.chrischoy.org/data/publications/thesis/ch1_introduction.pdf">Chapter 1: Introduction</a></li>
<li><a href="https://node1.chrischoy.org/data/publications/thesis/ch2_supervised_reconstruction.pdf">Chapter 2: Supervised Reconstruction</a></li>
<li><a href="https://node1.chrischoy.org/data/publications/thesis/ch3_weakly_supervised_reconstruction.pdf">Chapter 3: Weakly-supervised Reconstruction</a></li>
<li><a href="https://node1.chrischoy.org/data/publications/thesis/ch4_sparse_tensor_network.pdf">Chapter 4: Sparse Tensor Networks</a></li>
<li><a href="https://node1.chrischoy.org/data/publications/thesis/ch5_spatio_temporal_segmentation.pdf">Chapter 5: Spatio-Temporal Segmentation</a></li>
<li><a href="https://node1.chrischoy.org/data/publications/thesis/ch6_geometric_features.pdf">Chapter 6: Geometric Features</a></li>
<li><a href="https://node1.chrischoy.org/data/publications/thesis/ch7_geometric_pattern_recognition.pdf">Chapter 7: Geometric Pattern Recognition</a></li>
<li><a href="https://node1.chrischoy.org/data/publications/thesis/ch8_global_registration.pdf">Chapter 8: Global Registration</a></li>
<li><a href="https://node1.chrischoy.org/data/publications/thesis/ch9_conclusion.pdf">Chapter 9: Conclusion</a></li>
</ul>
<h2 id="thesis-defense-slides">Thesis Defense Slides</h2>
<p>Slides for my PhD oral defense are available at: <a href="https://node1.chrischoy.org/data/publications/thesis/slides.pdf">Slides</a></p>
<h1 id="fully-convolutional-geometric-features">Fully Convolutional Geometric Features</h1>
<p>2019-07-22, Christopher Choy · <a href="https://chrischoy.github.io/publication/fcgf">https://chrischoy.github.io/publication/fcgf</a></p>
<p><img src="https://raw.githubusercontent.com/chrischoy/FCGF/master/assets/fps_acc.png" alt="Comparison" /></p>
<p>Speed vs. accuracy: the Pareto-optimal frontier of previous methods and ours.</p>
<h2 id="abstract">Abstract</h2>
<p>Extracting geometric features from 3D scans or point clouds is the first step in applications such as registration, reconstruction, and tracking. State-of-the-art methods require computing low-level features as input or extracting patch-based features with limited receptive field. In this work, we present fully-convolutional geometric features, computed in a single pass by a 3D fully-convolutional network. We also present new metric learning losses that dramatically improve performance. Fully-convolutional geometric features are compact, capture broad spatial context, and scale to large scenes. We experimentally validate our approach on both indoor and outdoor datasets. Fully-convolutional geometric features achieve state-of-the-art accuracy without requiring preprocessing, are compact (32 dimensions), and are 600 times faster than the most accurate prior method.</p>
<p><img src="https://chrischoy.github.io/images/publication/fcgf/table.png" alt="Comparison" /></p>
<h2 id="paper">Paper</h2>
<p><a class="paper-thumbnail" href="https://node1.chrischoy.org/data/publications/fcgf/fcgf.pdf">
<img src="https://chrischoy.github.io/images/publication/fcgf/thumb-0.png" />
<img src="https://chrischoy.github.io/images/publication/fcgf/thumb-1.png" />
<img src="https://chrischoy.github.io/images/publication/fcgf/thumb-2.png" />
<img src="https://chrischoy.github.io/images/publication/fcgf/thumb-3.png" />
<img src="https://chrischoy.github.io/images/publication/fcgf/thumb-4.png" />
<img src="https://chrischoy.github.io/images/publication/fcgf/thumb-5.png" />
<img src="https://chrischoy.github.io/images/publication/fcgf/thumb-6.png" />
<img src="https://chrischoy.github.io/images/publication/fcgf/thumb-7.png" />
<img src="https://chrischoy.github.io/images/publication/fcgf/thumb-8.png" />
</a></p>
<p><a href="https://node1.chrischoy.org/data/publications/fcgf/fcgf.pdf">paper</a></p>
<h2 id="supplementary-materials">Supplementary Materials</h2>
<ul>
<li><a href="https://github.com/chrischoy/FCGF">Github</a></li>
<li><a href="https://node1.chrischoy.org/data/publications/fcgf/fcgf_supp.pdf">Supplementary material</a></li>
<li>Visualization of correspondences
<iframe width="560" height="315" src="https://www.youtube.com/embed/d0p0eTaB50k" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen=""></iframe>
<p>Visualization of 500 randomly subsampled correspondences out of ~5k correspondences.</p>
</li>
</ul>
<h2 id="bibtex">Bibtex</h2>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>@inproceedings{choy2019fully,
title={Fully Convolutional Geometric Features},
author={Choy, Christopher and Park, Jaesik and Koltun, Vladlen},
booktitle={Proceedings of the IEEE International Conference on Computer Vision},
pages={8958--8966},
year={2019}
}
</code></pre></div></div>
<h1 id="pytorch-extension-with-a-makefile">Pytorch Extension with a Makefile</h1>
<p>2018-12-28, Christopher Choy · <a href="https://chrischoy.github.io/research/pytorch-extension-with-makefile">https://chrischoy.github.io/research/pytorch-extension-with-makefile</a></p>
<p>Pytorch is a great neural network library that offers both flexibility and power. Personally, I think it is the best neural network library for quickly prototyping (advanced) dynamic neural networks and deploying them to applications.</p>
<p>Recently, pytorch was upgraded to version 1.0, introducing the ATen tensor library for all backends and C++ custom extensions. Before C++ extensions, it supported CFFI (C Foreign Function Interface) for custom extensions.</p>
<p>As an avid CUDA developer, I created multiple projects that speed up custom pytorch layers using the CFFI interface. However, wrapping functions in a non-object-oriented language (C) sometimes led to ridiculous overhead when complex objects were required. Now that pytorch supports the latest technology from 2011, C++11, we can use object-oriented programming for pytorch extensions!</p>
<p>In this tutorial, I will cover some drawbacks of the current setuptools approach and show you how to use a Makefile for pytorch cpp extension development. The source code for the tutorial can be found <a href="https://github.com/chrischoy/MakePytorchPlusPlus">here</a>.</p>
<p>Before you proceed, please read the <a href="https://pytorch.org/tutorials/advanced/cpp_extension.html">official Pytorch CPP extension guide</a>, which provides an extensive and useful tutorial on creating a C++ extension with ATen.</p>
<h2 id="drawbacks-of-setuptools-for-development">Drawbacks of Setuptools for Development</h2>
<p>However, <code class="language-plaintext highlighter-rouge">setuptools</code> is not really flexible, as it primarily focuses on the deployment of a project. Thus, it lacks many features that are essential for fast development. Let’s delve into a few scenarios I encountered while porting my pytorch CFFI extensions to cpp extensions.</p>
<h3 id="compile-only-updated-files">Compile only updated files</h3>
<p>When you develop a huge project, you don’t want to recompile everything every time you make a small change. However, if you use setuptools, it rebuilds objects for ALL source files every time you make a change. This becomes extremely cumbersome as your project grows.</p>
<p>A <code class="language-plaintext highlighter-rouge">Makefile</code>, by contrast, lets you cache all object files and recompile only the files that changed. This is extremely useful when you make a small change to one file and want to debug quickly.</p>
<h3 id="parallel-compilation">Parallel Compilation</h3>
<p>Another problem with setuptools is that it compiles files sequentially. When your project gets huge, you might want to compile many files in parallel. With a Makefile, you can parallelize compilation with the <code class="language-plaintext highlighter-rouge">-j#</code> flag. For example, if you type <code class="language-plaintext highlighter-rouge">make -j8</code>, it compiles up to 8 files in parallel.</p>
<h3 id="debugging">Debugging</h3>
<p>The current pytorch c++ extension does not allow debugging even with the debug flag. With a Makefile, you can instead pass <code class="language-plaintext highlighter-rouge">-g</code> (or <code class="language-plaintext highlighter-rouge">-g -G</code> for nvcc) with ease.
In the <a href="https://github.com/chrischoy/MakePytorchPlusPlus/blob/master/Makefile">Makefile</a>, uncomment line 3, <code class="language-plaintext highlighter-rouge">DEBUG=1</code>, and line 20 of <a href="https://github.com/chrischoy/MakePytorchPlusPlus/blob/master/setup.py">setup.py</a>.</p>
<h2 id="making-a-pytorch-extension-with-a-makefile">Making a pytorch extension with a Makefile</h2>
<p>Now that we have covered some of the advantages of using a Makefile for a pytorch cpp extension, let’s get into the details of how to make one.</p>
<h3 id="creating-objects-and-functions">Creating Objects and Functions</h3>
<p>As an example, in this tutorial, we will create a class and a cuda function that are callable in python. First, let’s make a simple class that provides a setter and a getter for a private variable <code class="language-plaintext highlighter-rouge">key_</code>.</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">Foo</span> <span class="p">{</span>
<span class="nl">private:</span>
<span class="kt">uint64_t</span> <span class="n">key_</span><span class="p">;</span>
<span class="nl">public:</span>
<span class="kt">void</span> <span class="n">setKey</span><span class="p">(</span><span class="kt">uint64_t</span> <span class="n">key</span><span class="p">);</span>
<span class="kt">uint64_t</span> <span class="n">getKey</span><span class="p">();</span>
<span class="n">std</span><span class="o">::</span><span class="n">string</span> <span class="n">toString</span><span class="p">()</span> <span class="k">const</span> <span class="p">{</span>
<span class="k">return</span> <span class="s">"< Foo, key: "</span> <span class="o">+</span> <span class="n">std</span><span class="o">::</span><span class="n">to_string</span><span class="p">(</span><span class="n">key_</span><span class="p">)</span> <span class="o">+</span> <span class="s">" > "</span><span class="p">;</span>
<span class="p">};</span>
<span class="p">};</span>
</code></pre></div></div>
<p>We will fill out the setter and the getter functions in <a href="https://github.com/chrischoy/MakePytorchPlusPlus/blob/master/src/foo.cpp">foo.cpp</a>. Next, let’s create a simple CUDA function that adds two vectors and returns the result in a new <code class="language-plaintext highlighter-rouge">at::Tensor</code>.</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">template</span> <span class="o"><</span><span class="k">typename</span> <span class="n">Dtype</span><span class="o">></span>
<span class="n">__global__</span> <span class="kt">void</span> <span class="nf">sum</span><span class="p">(</span><span class="n">Dtype</span> <span class="o">*</span><span class="n">a</span><span class="p">,</span> <span class="n">Dtype</span> <span class="o">*</span><span class="n">b</span><span class="p">,</span> <span class="n">Dtype</span> <span class="o">*</span><span class="n">c</span><span class="p">,</span> <span class="kt">int</span> <span class="n">N</span><span class="p">)</span> <span class="p">{</span>
<span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="n">blockIdx</span><span class="p">.</span><span class="n">x</span> <span class="o">*</span> <span class="n">blockDim</span><span class="p">.</span><span class="n">x</span> <span class="o">+</span> <span class="n">threadIdx</span><span class="p">.</span><span class="n">x</span><span class="p">;</span>
<span class="k">if</span> <span class="p">(</span><span class="n">i</span> <span class="o"><</span> <span class="n">N</span><span class="p">)</span> <span class="p">{</span>
<span class="n">c</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="n">a</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">+</span> <span class="n">b</span><span class="p">[</span><span class="n">i</span><span class="p">];</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="k">template</span> <span class="o"><</span><span class="k">typename</span> <span class="n">Dtype</span><span class="o">></span>
<span class="kt">void</span> <span class="nf">AddGPUKernel</span><span class="p">(</span><span class="n">Dtype</span> <span class="o">*</span><span class="n">in_a</span><span class="p">,</span> <span class="n">Dtype</span> <span class="o">*</span><span class="n">in_b</span><span class="p">,</span> <span class="n">Dtype</span> <span class="o">*</span><span class="n">out_c</span><span class="p">,</span> <span class="kt">int</span> <span class="n">N</span><span class="p">,</span>
<span class="n">cudaStream_t</span> <span class="n">stream</span><span class="p">)</span> <span class="p">{</span>
<span class="n">sum</span><span class="o"><</span><span class="n">Dtype</span><span class="o">></span>
<span class="o"><<<</span><span class="n">GET_BLOCKS</span><span class="p">(</span><span class="n">N</span><span class="p">),</span> <span class="n">CUDA_NUM_THREADS</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="n">stream</span><span class="o">>>></span><span class="p">(</span><span class="n">in_a</span><span class="p">,</span> <span class="n">in_b</span><span class="p">,</span> <span class="n">out_c</span><span class="p">,</span> <span class="n">N</span><span class="p">);</span>
<span class="n">cudaError_t</span> <span class="n">err</span> <span class="o">=</span> <span class="n">cudaGetLastError</span><span class="p">();</span>
<span class="k">if</span> <span class="p">(</span><span class="n">cudaSuccess</span> <span class="o">!=</span> <span class="n">err</span><span class="p">)</span>
<span class="k">throw</span> <span class="n">std</span><span class="o">::</span><span class="n">runtime_error</span><span class="p">(</span><span class="n">Formatter</span><span class="p">()</span>
<span class="o"><<</span> <span class="s">"CUDA kernel failed : "</span> <span class="o"><<</span> <span class="n">std</span><span class="o">::</span><span class="n">to_string</span><span class="p">(</span><span class="n">err</span><span class="p">));</span>
<span class="p">}</span>
</code></pre></div></div>
<p>Note that I throw <code class="language-plaintext highlighter-rouge">std::runtime_error</code> when there is an error. Pybind11 automatically converts <code class="language-plaintext highlighter-rouge">std</code> exceptions to python exception types. For example, <code class="language-plaintext highlighter-rouge">std::runtime_error</code> is mapped to <code class="language-plaintext highlighter-rouge">RuntimeError</code> in python. This prevents the system from crashing and allows python to handle errors gracefully. More on error handling with pybind11 can be found <a href="https://pybind11.readthedocs.io/en/stable/advanced/exceptions.html">here</a>.</p>
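This behavior can be sketched from the python side with a plain stand-in function (hypothetical; in practice the exception would come from the compiled extension itself):

```python
def add_gpu(a, b):
    """Stand-in for the compiled extension: raises RuntimeError, just as
    pybind11 does when the C++ code throws std::runtime_error."""
    if len(a) != len(b):
        raise RuntimeError("CUDA kernel failed : size mismatch")
    return [x + y for x, y in zip(a, b)]

# The caller can recover instead of crashing the interpreter:
try:
    add_gpu([1, 2, 3], [4, 5])
except RuntimeError as e:
    print("recovered:", e)
```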
<h3 id="bridging-cpp-with-pybind">Bridging CPP with Pybind</h3>
<p>Pytorch passes tensors as the <code class="language-plaintext highlighter-rouge">at::Tensor</code> type. To extract the mutable raw pointer, use <code class="language-plaintext highlighter-rouge">.data<Dtype>()</code>. For example, if you want to extract the raw pointer from a variable <code class="language-plaintext highlighter-rouge">A</code> of type <code class="language-plaintext highlighter-rouge">float</code>, use <code class="language-plaintext highlighter-rouge">A.data<float>()</code>. In addition, if you want to use the CUDA stream for the current context, use the function <code class="language-plaintext highlighter-rouge">at::cuda::getCurrentCUDAStream()</code>.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>template <typename Dtype>
void AddGPU(at::Tensor in_a, at::Tensor in_b, at::Tensor out_c) {
int N = in_a.numel();
if (N != in_b.numel())
throw std::invalid_argument(Formatter()
<< "Size mismatch A.numel(): " << in_a.numel()
<< ", B.numel(): " << in_b.numel());
out_c.resize_({N});
AddGPUKernel<Dtype>(in_a.data<Dtype>(), in_b.data<Dtype>(),
out_c.data<Dtype>(), N, at::cuda::getCurrentCUDAStream());
}
</code></pre></div></div>
<p>The above function can be directly called from python with pybind11. Now, let’s bind the cpp function and the class with python.</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">namespace</span> <span class="n">py</span> <span class="o">=</span> <span class="n">pybind11</span><span class="p">;</span>
<span class="n">PYBIND11_MODULE</span><span class="p">(</span><span class="n">TORCH_EXTENSION_NAME</span><span class="p">,</span> <span class="n">m</span><span class="p">){</span>
<span class="n">std</span><span class="o">::</span><span class="n">string</span> <span class="n">name</span> <span class="o">=</span> <span class="n">std</span><span class="o">::</span><span class="n">string</span><span class="p">(</span><span class="s">"Foo"</span><span class="p">);</span>
<span class="n">py</span><span class="o">::</span><span class="n">class_</span><span class="o"><</span><span class="n">Foo</span><span class="o">></span><span class="p">(</span><span class="n">m</span><span class="p">,</span> <span class="n">name</span><span class="p">.</span><span class="n">c_str</span><span class="p">())</span>
<span class="p">.</span><span class="n">def</span><span class="p">(</span><span class="n">py</span><span class="o">::</span><span class="n">init</span><span class="o"><></span><span class="p">())</span>
<span class="p">.</span><span class="n">def</span><span class="p">(</span><span class="s">"setKey"</span><span class="p">,</span> <span class="o">&</span><span class="n">Foo</span><span class="o">::</span><span class="n">setKey</span><span class="p">)</span>
<span class="p">.</span><span class="n">def</span><span class="p">(</span><span class="s">"getKey"</span><span class="p">,</span> <span class="o">&</span><span class="n">Foo</span><span class="o">::</span><span class="n">getKey</span><span class="p">)</span>
<span class="p">.</span><span class="n">def</span><span class="p">(</span><span class="s">"__repr__"</span><span class="p">,</span> <span class="p">[](</span><span class="k">const</span> <span class="n">Foo</span> <span class="o">&</span><span class="n">a</span><span class="p">)</span> <span class="p">{</span> <span class="k">return</span> <span class="n">a</span><span class="p">.</span><span class="n">toString</span><span class="p">();</span> <span class="p">});</span>
<span class="n">m</span><span class="p">.</span><span class="n">def</span><span class="p">(</span><span class="s">"AddGPU"</span><span class="p">,</span> <span class="o">&</span><span class="n">AddGPU</span><span class="o"><</span><span class="kt">float</span><span class="o">></span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>
<p>For classes, you need to use <code class="language-plaintext highlighter-rouge">py::class_<CLASS></code> to let pybind11 know it is a class. Then, define the member functions that you want to expose to python with <code class="language-plaintext highlighter-rouge">.def</code>.
For free functions, you can simply attach the function using <code class="language-plaintext highlighter-rouge">.def</code> directly.</p>
<h3 id="compiling-the-project">Compiling the project</h3>
<p>Now that we have all source files ready, let’s compile them. First, we will make an archive library that contains all classes and functions. Then, we compile the file that binds all functions and classes with pybind11 using setuptools and load it in python.</p>
<h4 id="finding-the-arguments-and-the-include-paths">Finding the Arguments and the Include Paths</h4>
<p>First, we have to find the right arguments used to compile a PyTorch extension. These are easy to find: when you compile your project using setuptools, you can see the actual compilation commands that it invokes, and from them we can deduce what a Makefile would require. The extra arguments it uses for PyTorch extensions are</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-DTORCH_API_INCLUDE_EXTENSION_H -DTORCH_EXTENSION_NAME=$(EXTENSION_NAME) -D_GLIBCXX_USE_CXX11_ABI=0
</code></pre></div></div>
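<p>As a small sketch, the flag list above can be assembled programmatically; the extension name used below is only a hypothetical placeholder for whatever <code class="language-plaintext highlighter-rouge">$(EXTENSION_NAME)</code> expands to in your Makefile.</p>

```python
# Assemble the extra compile flags used for PyTorch extensions.
# "MakePytorchBackend" is a hypothetical placeholder for $(EXTENSION_NAME).
ext_name = "MakePytorchBackend"
flags = [
    "-DTORCH_API_INCLUDE_EXTENSION_H",
    f"-DTORCH_EXTENSION_NAME={ext_name}",
    "-D_GLIBCXX_USE_CXX11_ABI=0",
]
print(" ".join(flags))
```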
<p>In addition, we need to find the headers. PyTorch builds C++ extensions with setuptools, so we can see how it finds the headers and libraries. In <code class="language-plaintext highlighter-rouge">torch.utils.cpp_extension</code> you can find the function <code class="language-plaintext highlighter-rouge">include_paths</code>, which returns all the header paths. We only need to pass them to the Makefile. Within a Makefile, we can run a python command and get the paths like the following. (2019-07-03: PyTorch now exposes the ABI flag explicitly. See below.)</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>PYTHON_HEADER_DIR := $(shell python -c 'from distutils.sysconfig import get_python_inc; print(get_python_inc())')
TORCH_INCLUDE_DIRS := $(shell python -c 'from torch.utils.cpp_extension import include_paths; print("\n".join(include_paths()))')
</code></pre></div></div>
<p>Note that the commands above print out the paths line by line, so, in the end, we can iterate over the paths in the Makefile to prepend <code class="language-plaintext highlighter-rouge">-I</code>.
The final Makefile can be found <a href="https://github.com/chrischoy/MakePytorchPlusPlus/blob/master/Makefile">here</a>.</p>
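<p>The Makefile loop that prepends <code class="language-plaintext highlighter-rouge">-I</code> can be sketched in plain Python; the two paths below are made-up examples of what the commands above might print.</p>

```python
# Prepend -I to each include path (one path per line), as the Makefile does.
# The paths are hypothetical examples.
paths_output = (
    "/usr/include/python3.8\n"
    "/opt/conda/lib/python3.8/site-packages/torch/include"
)
include_flags = " ".join("-I" + p for p in paths_output.splitlines())
print(include_flags)
```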
<h4 id="archive-libraries">Archive Libraries</h4>
<p>Once we build the object files, we create an archive file and link it against the main pybind entries.
We can do so with</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> ar rc $(STATIC_LIB) $(OBJS) $(CU_OBJS)
</code></pre></div></div>
<h4 id="compiling-the-bind-file">Compiling the bind file</h4>
<p>When the archive library is ready, we can finally compile the bind file that will link the classes and functions with the binding.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">setuptools</span> <span class="kn">import</span> <span class="n">setup</span>
<span class="n">setup</span><span class="p">(</span>
<span class="o">...</span>
<span class="n">ext_modules</span><span class="o">=</span><span class="p">[</span>
<span class="n">CUDAExtension</span><span class="p">(</span>
<span class="n">name</span><span class="o">=</span><span class="s">'MakePytorchBackend'</span><span class="p">,</span>
<span class="n">include_dirs</span><span class="o">=</span><span class="p">[</span><span class="s">'./'</span><span class="p">],</span>
<span class="n">sources</span><span class="o">=</span><span class="p">[</span>
<span class="s">'pybind/bind.cpp'</span><span class="p">,</span>
<span class="p">],</span>
<span class="n">libraries</span><span class="o">=</span><span class="p">[</span><span class="s">'make_pytorch'</span><span class="p">],</span>
<span class="n">library_dirs</span><span class="o">=</span><span class="p">[</span><span class="s">'objs'</span><span class="p">],</span>
<span class="c1"># extra_compile_args=['-g']
</span> <span class="p">)</span>
<span class="p">],</span>
<span class="n">cmdclass</span><span class="o">=</span><span class="p">{</span><span class="s">'build_ext'</span><span class="p">:</span> <span class="n">BuildExtension</span><span class="p">},</span>
<span class="o">...</span>
<span class="p">)</span>
</code></pre></div></div>
<p>Finally, if the Makefile calls <code class="language-plaintext highlighter-rouge">setup.py</code> automatically, we only need to issue <code class="language-plaintext highlighter-rouge">make -j8</code> to compile all the files, build the binding, and install it as a python library.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>all: $(STATIC_LIB)
python setup.py install --force
</code></pre></div></div>
<h2 id="update-2019-07-03">Update 2019-07-03</h2>
<p>PyTorch v1.1 now provides the ABI flag explicitly. You can access it via <code class="language-plaintext highlighter-rouge">torch._C._GLIBCXX_USE_CXX11_ABI</code>. You may refer to its use in the <a href="https://github.com/StanfordVL/MinkowskiEngine/blob/master/Makefile#L15">MinkowskiEngine Makefile</a>.</p>Chris Choychrischoy@ai.stanford.eduPytorch is a great neural network library that has both flexibility and power. Personally, I think it is the best neural network library for prototyping (advanced) dynamic neural networks fast and deploying it to applications.4D Spatio-Temporal ConvNets: Minkowski Convolutional Neural Networks2018-12-25T23:09:01-08:002018-12-25T23:09:01-08:00https://chrischoy.github.io/publication/minkowskinet<h2 id="abstract">Abstract</h2>
<p>In many robotics and VR/AR applications, 3D-videos are readily-available sources of input (a continuous sequence of depth images, or LIDAR scans). However, those 3D-videos are processed frame-by-frame either through 2D convnets or 3D perception algorithms. In this work, we propose 4-dimensional convolutional neural networks for spatio-temporal perception that can directly process such 3D-videos using high-dimensional convolutions. For this, we adopt sparse tensors and propose the generalized sparse convolution that encompasses all discrete convolutions. To implement the generalized sparse convolution, we create an open-source auto-differentiation library for sparse tensors that provides extensive functions for high-dimensional convolutional neural networks. We create 4D spatio-temporal convolutional neural networks using the library and validate them on various 3D semantic segmentation benchmarks and proposed 4D datasets for 3D-video perception. To overcome challenges in the 4D space, we propose the hybrid kernel, a special case of the generalized sparse convolution, and the trilateral-stationary conditional random field that enforces spatio-temporal consistency in the 7D space-time-chroma space. Experimentally, we show that convolutional neural networks with only generalized 3D sparse convolutions can outperform 2D or 2D-3D hybrid methods by a large margin. Also, we show that on 3D-videos, 4D spatio-temporal convolutional neural networks are robust to noise, outperform 3D convolutional neural networks and are faster than the 3D counterpart in some cases.</p>
<h2 id="videos">Videos</h2>
<h3 id="visualizations-of-scannet-input-prediction-ground-truth-and-ground-truth---prediction">Visualizations of ScanNet input, prediction, ground truth, and (ground truth - prediction)</h3>
<iframe width="560" height="315" src="https://www.youtube.com/embed/aB2dupOhgJk" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen=""></iframe>
<h3 id="visualizations-of-synthia-input-prediction-ground-truth-and-ground-truth---prediction">Visualizations of Synthia input, prediction, ground truth, and (ground truth - prediction)</h3>
<iframe width="560" height="315" src="https://www.youtube.com/embed/jal-qQ6exm8" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen=""></iframe>
<h2 id="scannet-semantic-segmentation-challenge">ScanNet Semantic Segmentation Challenge</h2>
<ul>
<li>The Minkowski Net was the <a href="http://www.scan-net.org/cvpr2019workshop/">Winner of the 2019 ScanNet Semantic Segmentation Challenge</a></li>
</ul>
<h2 id="our-paper-on-media">Our paper on Media</h2>
<ul>
<li><a href="https://news.developer.nvidia.com/nvidia-nvail-partners-present-their-research-at-cvpr-2019/">Nvidia AI Blog</a></li>
</ul>
<h2 id="external-links">External links</h2>
<ul>
<li><a href="https://arxiv.org/abs/1904.08755">paper on arXiv</a></li>
<li>4D Spatio Temporal Segmentation: <a href="https://github.com/chrischoy/SpatioTemporalSegmentation">Code</a></li>
<li>Pretrained weights: <a href="https://github.com/chrischoy/SpatioTemporalSegmentation#model-zoo">Model Zoo</a></li>
<li>Minkowski Engine: <a href="https://github.com/StanfordVL/MinkowskiEngine">Code</a>, <a href="https://stanfordvl.github.io/MinkowskiEngine/">API</a></li>
</ul>
<h2 id="bibtex">Bibtex</h2>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>@inproceedings{choy20194d,
title={4d spatio-temporal convnets: Minkowski convolutional neural networks},
author={Choy, Christopher and Gwak, JunYoung and Savarese, Silvio},
booktitle={Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition},
pages={3075--3084},
year={2019}
}
</code></pre></div></div>Christopher B. ChoyText2Shape: Generating Shapes from Natural Language by Learning Joint Embeddings2018-03-02T00:00:00-08:002018-03-02T00:00:00-08:00https://chrischoy.github.io/publication/text2shape<h2 id="abstract">Abstract</h2>
<p>We present a method for generating colored 3D shapes from natural language. To this end, we first learn joint embeddings of freeform text descriptions and colored 3D shapes. Our model combines and extends learning by association and metric learning approaches to learn implicit cross-modal connections, and produces a joint representation that captures the many-to-many relations between language and physical properties of 3D shapes such as color and shape. To evaluate our approach, we collect a large dataset of natural language descriptions for physical 3D objects in the ShapeNet dataset. With this learned joint embedding we demonstrate text-to-shape retrieval that outperforms baseline approaches. Using our embeddings with a novel conditional Wasserstein GAN framework, we generate colored 3D shapes from text. Our method is the first to connect natural language text with realistic 3D objects exhibiting rich variations in color, texture, and shape detail.</p>
<h1 id="video-summary">Video Summary</h1>
<center>
<iframe width="560" height="315" src="https://www.youtube.com/embed/zraPvRdl13Q?rel=0" frameborder="0" allow="autoplay; encrypted-media" allowfullscreen=""></iframe>
</center>
<h1 id="additional-resources">Additional Resources</h1>
<ul>
<li><a href="https://arxiv.org/abs/1705.10904">ArXiv</a></li>
<li><a href="http://text2shape.stanford.edu/">Project page</a></li>
<li><a href="https://github.com/kchen92/text2shape/">Code</a></li>
</ul>Kevin ChenShort Note on Matrix Differentials and Backpropagation2018-01-10T18:55:12-08:002018-01-10T18:55:12-08:00https://chrischoy.github.io/research/Matric-Calculus<p>Mathematical notation is the convention that we all use to denote a concept
in a concise mathematical formulation, yet sometimes there is more than one
way to express the same equation. For example, we can use Leibniz’s notation
$\frac{dy}{dx}$ to denote a derivative, but in Physics, we use $\dot{y},
\ddot{y}$ to simplify the derivatives. Similarly, to solve differential equations,
we use the Laplace transformation $F(s) = \int f(t) e^{-st}dt$, but instead of
using the definition, we can use the frequency domain representations and
simply solve differential equations using basic algebra.
In this post, I’ll cover a matrix differential notation and how to use
differentials to derive backpropagation functions easily.</p>
<h2 id="differentials">Differentials</h2>
<p>Let’s first define the differential. Let a vector function $f(x)$ be
differentiable at $c$ and the first-order Taylor approximation is</p>
$$
f(c + u) = f(c) + f'(c)u + r(u)
$$
<p>where $r$ denotes the remainder. We denote $\mathsf{d}f(c;u) = f'(c)u$, the
differential of $f$ at $c$ with increment $u$. This of course can also be
denoted simply using partial derivatives.<sup id="fnref:1"><a href="#fn:1" class="footnote">1</a></sup></p>
$$
\mathsf{d}f(c; u) = (\mathsf{D}f(c))u
$$
<p>where $\mathsf{D}_j f_i(c)$ denotes the partial derivative of $f_i$ with
respect to the $j$-th coordinate at $c$. The matrix $\mathsf{D}f(c)$ is the
Jacobian matrix and it’s the transpose of the gradient of $f$ at $c$.</p>
$$
\nabla f(c) = (\mathsf{D}f(c))^T
$$
<h3 id="chain-rule">Chain Rule</h3>
<p>Let $h = g \circ f$, the differential of $h$ at $c$ is</p>
\begin{align}
\mathsf{d}h(c; u) & = (\mathsf{D}(h(c))u \\
& = (\mathsf{D}g(f(c)))(\mathsf{D}f(c))u = (\mathsf{D}g(f(c)))\mathsf{d}f(c;u) \\
& = \mathsf{d}g(f(c); \mathsf{d}f(c;u))
\end{align}
<p>We can further simplify the notation by replacing $\mathsf{d}f(c;u)$ with
$\mathsf{d}f$ when it unambiguously represents the differential concerning
the input variable.</p>
<h3 id="matrix-function">Matrix Function</h3>
<p>Now let’s extend this to a matrix function. Let $F: S \rightarrow
\mathbb{R}^{m\times p}$ be a matrix function defined on $S \subseteq \mathbb{R}^{n
\times q}$.</p>
$$
\text{vec}F(C+U) = \text{vec} F(C) + F'(C) \text{vec}U + \text{vec}R(U)
$$
<p>We can denote the differential of $F$ at $C$ as</p>
$$
\text{vec}\; \mathsf{d}F(C;U) = F'(C) \text{vec}U
$$
<h2 id="matrix-backpropagation">Matrix Backpropagation</h2>
<p>Let us now see how differentials give backpropagation rules almost for free.
Let $A : B = \text{Tr}(A^TB) = \text{vec}(A)^T \text{vec}(B)$, the sum of all
elements in $A \circ B$. If we let $F(X) = Y$ and $L$ be the final loss,</p>
\begin{align}
\mathsf{d}(L \circ f) & = \mathsf{D} L : \mathsf{d}f \\
& = \mathsf{D} L : \mathcal{L}(\mathsf{d}X) \\
& = \mathcal{L}^*(\mathsf{D} L) : \mathsf{d}X
\end{align}
<p>where we denote $\mathsf{d}Y = \mathcal{L}(\mathsf{d}X)$ and $\mathsf{D}L =
\frac{\partial L}{\partial Y}$. So given gradients from the upper layer, the
gradient with respect to $X$ can easily be computed by finding the function
$\mathcal{L}^*$, the adjoint of $\mathcal{L}$.</p>
<h3 id="example-1-linear-function">Example 1: Linear Function</h3>
<p>Let $Y = f(X) = AX + b$, then,</p>
\begin{align}
\mathsf{d}(L \circ f) & = \mathsf{D} L : \mathsf{d}(AX + b) \\
& = \mathsf{D} L : A\mathsf{d}X = \text{Tr}(\mathsf{D}L (A \mathsf{d}X)^T) \\
& = \text{Tr}(\mathsf{D}L \mathsf{d}X^T A^T) = \text{Tr}(A^T \mathsf{D}L \mathsf{d}X^T) \\
& = A^T \mathsf{D}L : \mathsf{d}X
\end{align}
<p>Thus $\mathcal{L}^*(Y) = A^TY$.</p>
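<p>We can numerically sanity-check this adjoint with a toy, pure-Python sketch using made-up 2×2 matrices: for $L(Y) = \sum_{ij} Y_{ij}$, so that $\mathsf{D}L$ is the all-ones matrix, the claimed gradient $A^T \mathsf{D}L$ should match finite differences.</p>

```python
# Numerical check that the adjoint L*(G) = A^T G gives dL/dX for Y = A X + b.
# Toy example: L(Y) = sum of all entries of Y, so DL is the all-ones matrix.

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def loss(A, X, b):
    Y = matmul(A, X)
    return sum(Y[i][j] + b[i][j] for i in range(2) for j in range(2))

A = [[1.0, 2.0], [3.0, 4.0]]
b = [[0.5, -1.0], [2.0, 0.0]]
X = [[0.1, 0.2], [0.3, 0.4]]

# Analytic gradient: A^T @ DL with DL = ones.
At = [[A[j][i] for j in range(2)] for i in range(2)]
grad = matmul(At, [[1.0, 1.0], [1.0, 1.0]])

# Finite-difference check of every entry.
eps = 1e-6
for i in range(2):
    for j in range(2):
        Xp = [row[:] for row in X]
        Xp[i][j] += eps
        numerical = (loss(A, Xp, b) - loss(A, X, b)) / eps
        assert abs(numerical - grad[i][j]) < 1e-4
```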
<h3 id="example-2-constrained-optimization">Example 2: Constrained Optimization</h3>
<p>We would like to solve the following constrained optimization problem.</p>
\begin{equation*}
\begin{aligned}
& \underset{x}{\text{minimize}}
& & f(x) \\
& \text{subject to}
& & Ax = b.
\end{aligned}
\end{equation*}
<p>The Lagrangian and the primal and dual feasibility equations are</p>
$$
\mathcal{L}(x, \nu) = f(x) + \nu^T(Ax - b) \\
Ax^* = b, \;\; \nabla f(x^*) + A^T \nu^* = 0
$$
<p>If we take the first order approximation of the primal and dual feasibility
equations,</p>
\begin{align}
Ax + \mathsf{d}(Ax) & = b + \mathsf{d}b\\
Ax + \mathsf{d}Ax + A\mathsf{d}x & = b + \mathsf{d}b\\
\nabla f(x) + A^T \nu + \mathsf{d}(\nabla f(x) + A^T \nu) & = 0 \\
\nabla f(x) + A^T \nu + \nabla^2 f(x) \mathsf{d}x + \mathsf{d}A^T \nu + A^T \mathsf{d}\nu & = 0
\end{align}
<p>Or more concisely,</p>
$$
\begin{bmatrix} \nabla^2 f(x) & A^T \\ A & 0 \end{bmatrix} \begin{bmatrix} \mathsf{d}x \\ \mathsf{d}\nu \end{bmatrix} = - \begin{bmatrix} \nabla f(x) + A^T \nu + \mathsf{d}A^T\nu \\ Ax - b + \mathsf{d}Ax - \mathsf{d}b \end{bmatrix}
$$
<p>This is the same as the infeasible start Newton method <sup id="fnref:2"><a href="#fn:2" class="footnote">2</a></sup>.</p>
<h2 id="conclusion">Conclusion</h2>
<p>In this post, we covered the notation for matrix differentials and matrix
backpropagation. Simple notation can ease the burden of derivation and also
lead to fewer mistakes.</p>
<h2 id="references">References</h2>
<div class="footnotes">
<ol>
<li id="fn:1">
<p><a href="http://www.janmagnus.nl/misc/mdc2007-3rdedition">J. Magnus, Matrix Differential Calculus with Applications in Statistics</a> <a href="#fnref:1" class="reversefootnote">↩</a></p>
</li>
<li id="fn:2">
<p><a href="https://web.stanford.edu/~boyd/cvxbook/bv_cvxbook.pdf">S. Boyd and L. Vandenberghe, Convex Optimization</a> <a href="#fnref:2" class="reversefootnote">↩</a></p>
</li>
</ol>
</div>Chris Choychrischoy@ai.stanford.eduRegression vs. Classification: Distance and Divergence2018-01-05T15:06:57-08:002018-01-05T15:06:57-08:00https://chrischoy.github.io/research/Regression-Classification<p>In Machine Learning, supervised problems can be categorized into regression or
classification problems. The categorization is quite intuitive, as the names
indicate. For instance, if the output, or target value, is continuous, the
model tries to regress on the value; and if it is discrete, we want to
predict a discrete value as well. A well-known example of such a
classification problem is binary classification, such as spam vs. non-spam.
Stock price prediction or temperature prediction would be good examples of
regression.</p>
<p>To solve such problems, we have to use different methods. For regression
problems, the most widely used approach is to minimize the L1 or L2
distance between our prediction and the ground truth target. For classification
problems, 1-vs-all SVMs, multinomial logistic regression, decision forests, and
minimizing the cross entropy are popular choices.</p>
<p>Due to their drastically different treatment, it is sometimes easy to treat
them as completely separate problems. However, we can think of the
classification problem as a regression problem in a non-Euclidean space and
extend this concept to the Wasserstein and Cramer distances.</p>
<h2 id="designing-an-objective-risk-function-for-machine-learning-models">Designing an Objective (Risk) Function for Machine Learning Models</h2>
<p>To make a system that behaves as we expect, we have to design an objective
function that captures the behavior we would like to see and defines the
<strong>Risk</strong> associated with failures: the loss function.</p>
<p>For example, let’s look at a typical image classification problem where we
classify an image into a semantic class such as car, person etc. Most datasets
use a mapping from a string (“Car”) to a numeric value so that we can handle
the dataset in a computer easily. For instance, we can assign 0 to “Bird”; 1 to
“Person”; 2 to “Car” etc.</p>
<p>However, the numbers do not have intrinsic meaning. The annotators use such
numbering since it is easy to process on a computer; not because “Person” +
“Person” gives you “Car”, nor because a person is “greater” (>) than a bird.
So, in this case, making a machine learning model to regress such values that
do not have intrinsic meaning would not make much sense.</p>
<p>On the other hand, if the number that we are trying to predict has actual
physical meaning and ordering makes sense (e.g., price, weight, the intensity of
light (pixel value) etc.), it would be reasonable to use the numbers directly
for prediction.</p>
<p>To state this notion clearly, let $y$ be the target value (label, supervision)
associated with an input $x$, and let $f(\cdot)$ be a (parametric) function or
machine learning model; i.e., when we feed $x$ to the function, $\hat{y}=f(x)$,
we want the output $\hat{y}$ to approximate the target value $y$. So we need a
measure of <strong>how different</strong> the generated values are from the supervision.
Naturally, we use a distance function to measure how close a target is to the
prediction and we use the distance as the loss function (objective function or
a risk function).</p>
$$
L(x, y) = D(\hat{y}, y)
$$
<p>where $D(\cdot, \cdot)$ denotes a distance function.</p>
<h2 id="regression-vs-classification">Regression vs. Classification</h2>
<p>Now let’s look at the simplest regression problem: linear regression using
least-squares fitting. In this setting, we have noisy observations around the
ground truth line, and our task is to estimate the line. In this case, $f(x) =
Ax + b$ and $D(a, b) = \|a - b\|_2^2$, the square of the L2 norm. This can
also be interpreted as maximum likelihood estimation under Gaussian noise.
However, the L2 norm is not the only distance measure used in regression problems.
The L1 norm is sometimes used to enforce sparsity, and the Huber loss is used for
regression problems where outliers do not follow the Gaussian distribution.</p>
$$
D(\hat{y}, y) = \|\hat{y} - y\|_2^2
$$
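<p>As a toy instance of minimizing the squared L2 distance, here is a minimal pure-Python least-squares fit of a line $y \approx ax + b$ to made-up data, using the closed-form normal equations.</p>

```python
# Closed-form least-squares fit of a line to toy 1-D data.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.1, 4.9, 7.2]  # roughly y = 2x + 1 with noise

n = len(xs)
sx, sy = sum(xs), sum(ys)
sxx = sum(x * x for x in xs)
sxy = sum(x * y for x, y in zip(xs, ys))

# Normal equations for slope a and intercept b.
a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
b = (sy - a * sx) / n
print(a, b)
```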
<p>Let’s go back to the previous classification problem. Regressing the arbitrary
numeric values (labels) clearly is not the best way to train a machine learning
model as the numeric values do not have intrinsic meaning. Instead, we can use
the probability distribution rather than the arbitrary numeric values. For
example, for an image of a bird, which was class 0, we assign $P_0 = 1$ and 0
for the others: $P_{bird} = [1, 0, 0]$ where the elements are the probability
of the input being a bird, a person, and a car respectively. Using this
representation, we can train multinomial logistic regression or a multi-class SVM.</p>
<h2 id="cross-entropy-and-f-divergence">Cross Entropy and f-Divergence</h2>
<p>However, how should we measure the “distance” between the ground truth label
distribution and the prediction distribution? Is there even a concept of distance
between two distributions? One family of functions that measures the
difference is known as the <strong>Ali-Silvey distances</strong>, or more widely as the
<strong>f-divergences</strong>. One member of the
f-divergence family is used far more widely than the others: the
Kullback-Leibler divergence. Formally, given two distributions $P_\hat{y}$ and
$P_y$, the KL divergence is defined as</p>
$$
\begin{align}
D(P_y || P_\hat{y}) & = \sum_{i \in \mathcal{Y}} P(y = i) \log \frac{P(y = i)}{P(\hat{y} = i)} \\
& = - H(P_y) + H(P_y, P_\hat{y})
\end{align}
$$
<p>where $H(\cdot)$ is the entropy and $H(\cdot, \cdot)$ is the cross entropy.
In classification problems, where $P_\hat{y}$ and $P_y$ denote the prediction and the
ground truth respectively, the ground truth entropy term is a constant, so we drop it
and train our prediction model with the cross entropy only. That’s where your
cross entropy loss comes from.</p>
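<p>The decomposition of the KL divergence into an entropy term and a cross entropy term is easy to verify numerically; here is a small pure-Python sketch on a made-up pair of distributions.</p>

```python
import math

# Toy distributions (made-up numbers).
P = [0.7, 0.2, 0.1]
Q = [0.5, 0.3, 0.2]

kl = sum(p * math.log(p / q) for p, q in zip(P, Q))
entropy = -sum(p * math.log(p) for p in P)
cross_entropy = -sum(p * math.log(q) for p, q in zip(P, Q))

# D_KL(P || Q) = -H(P) + H(P, Q)
assert abs(kl - (cross_entropy - entropy)) < 1e-12
```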
<p>However, the KL divergence is not the only divergence. In fact, any convex
function $f: (0, \infty) \rightarrow \mathbb{R}$ such that $f(1) = 0$ can
define a divergence function.</p>
$$
D_f(P || Q) = \mathbb{E}_Q \left[ f \left( \frac{dP}{dQ} \right) \right]
$$
<p>For example, if we use $f(x) = \frac{1}{2}|x - 1|$, we have the Total Variation divergence.</p>
$$
\begin{align}
D_{TV}(P || Q) & = \frac{1}{2} \mathbb{E}_Q \left[ \left| \frac{dP}{dQ} - 1 \right| \right] \\
& = \frac{1}{2} \int |P - Q| = \frac{1}{2} ||P - Q||_1
\end{align}
$$
<p>One thing to note is that the KL divergence is not a proper <em>metric</em> as it is
asymmetric and violates the triangle inequality.</p>
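<p>Both properties are easy to check numerically; the sketch below (toy numbers) shows the asymmetry of KL and, in contrast, the symmetry of the Total Variation distance.</p>

```python
import math

P = [0.7, 0.2, 0.1]
Q = [0.5, 0.3, 0.2]

def kl(a, b):
    return sum(x * math.log(x / y) for x, y in zip(a, b))

# Total variation: half the L1 distance between the pmfs.
tv = 0.5 * sum(abs(p - q) for p, q in zip(P, Q))

assert abs(kl(P, Q) - kl(Q, P)) > 1e-6  # KL is asymmetric
assert tv == 0.5 * sum(abs(q - p) for q, p in zip(Q, P))  # TV is symmetric
print(tv)
```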
<h2 id="wasserstein-distance-cramer-distance">Wasserstein Distance, Cramer Distance</h2>
<p>However, f-divergences are not the only way to measure the difference between two
distributions. In <sup id="fnref:1"><a href="#fn:1" class="footnote">1</a></sup>, the authors argue that f-divergences do not capture
our regular notion of distance accurately, propose a different
distance, and spark an interesting discussion in adversarial training.
Let’s first look at other “distance” functions that do not belong to the
f-divergence family.</p>
<p>First, the Wasserstein distance, also known as the probabilistic Earth Mover’s
Distance, measures the minimum amount of mass that we need to move to match one
probability distribution to another.
\begin{align}
W_1(P, Q) = \inf \mathbb{E} [|x - y|]
\end{align}
</p>
<p>The infimum is over all joint distributions whose marginals are $P$ and $Q$; $x$ and
$y$ are defined over the space where $P$ and $Q$ have non-zero support. One
great follow-up work <sup id="fnref:2"><a href="#fn:2" class="footnote">2</a></sup> proposed yet another distance
function, the Cramer
distance, to remove the biased sample gradients of the Wasserstein distance. In one
dimension, the 1-Wasserstein distance equals the L1 distance between the cumulative
distribution functions, $W_1(P, Q) = \int |F_P(x) - F_Q(x)| \, dx$, and the Cramer
distance simply replaces the absolute difference with the squared one</p>
\begin{align}
l_2(P, Q) = \left( \int (F_P(x) - F_Q(x))^2 \, dx \right)^{1/2}
\end{align}
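<p>For one-dimensional distributions on a common grid with unit spacing, the 1-Wasserstein distance can be computed as the L1 distance between the cumulative distribution functions, while the Cramer distance of Bellemare et al. uses the squared CDF difference; here is a small pure-Python sketch with toy numbers.</p>

```python
from itertools import accumulate

P = [0.7, 0.2, 0.1]
Q = [0.5, 0.3, 0.2]

# CDFs of the two pmfs.
Fp = list(accumulate(P))
Fq = list(accumulate(Q))

# W1 = sum |F_P - F_Q|; Cramer = sqrt(sum (F_P - F_Q)^2) on a unit-spaced grid.
w1 = sum(abs(a - b) for a, b in zip(Fp, Fq))
cramer = sum((a - b) ** 2 for a, b in zip(Fp, Fq)) ** 0.5
print(w1, cramer)
```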
<h2 id="conclusion">Conclusion</h2>
<p>Categorizing supervised problems into classification or regression can help us clearly understand the
problem, but sometimes it can limit our imagination and the set of distance
functions that we consider.
In this post, we discussed how classification and regression can be understood
through how we measure differences: classification by measuring difference with an
f-divergence or even a probabilistic distance, and regression with a Euclidean
distance. They are merely distances that measure the difference between a target
and a prediction. Some distance functions are more popular than others, but the
set of usable distance functions is not set in stone. Sometimes, by defining the
distance function in a clever way, we can improve our ML model!</p>
<h1 id="references">References</h1>
<div class="footnotes">
<ol>
<li id="fn:1">
<p>Arjovsky et al., Wasserstein Generative Adversarial Networks, 2017 <a href="#fnref:1" class="reversefootnote">↩</a></p>
</li>
<li id="fn:2">
<p>Bellemare et al., The Cramer Distance as a Solution to Biased Wasserstein Gradients, 2017 <a href="#fnref:2" class="reversefootnote">↩</a></p>
</li>
</ol>
</div>Chris Choychrischoy@ai.stanford.eduData Processing Inequality and Unsurprising Implications2018-01-04T21:11:58-08:002018-01-04T21:11:58-08:00https://chrischoy.github.io/research/data-processing-inequality-and-unsurprising-implications<p>We have heard enough about the great success of neural networks and how they
are used in real problems. Today, I want to talk about how it was so successful
(partially) from an information theoretic perspective and some lessons that we all
should be aware of.</p>
<h2 id="traditional-feature-based-learning">Traditional Feature Based Learning</h2>
<p>Before we figured out how to train large neural networks efficiently,
traditional methods (hand-designed features + shallow models such as
random forests or SVMs) dominated Computer Vision. As you may have guessed,
a traditional method starts by extracting features from an image, such
as Histogram of Oriented Gradients (HOG) or Scale-Invariant
Feature Transform (SIFT) features. Then, we use the supervised method of our choice
to train the second part of the model for prediction. So, what we learn
is only the mapping from the extracted features to the prediction.</p>
$$
\text{Image} \rightarrow \text{Features} \underset{f(\cdot; \theta)}{\rightarrow} \hat{y}
$$
<p>The information from the image is bottlenecked by the quality of the features, and thus
much research went into better, faster features. Here, to illustrate that the
learnable parameters are only in the second stage, I put $\theta$ in a function
below the second arrow.</p>
<h2 id="neural-network-as-an-end-to-end-system">Neural Network as an End-to-End system</h2>
<p>Unlike the traditional approach, a neural-network-based method starts
directly from the original inputs (with some preprocessing like centering
and normalization, but those operations are reversible). We assume that the neural network
is a universal function approximator and optimize the parameters inside it to
approximate a complex function, like mapping the colors of pixels to a semantic class!</p>
$$
\text{Image} \underset{f(\cdot; \theta)}{\rightarrow} \hat{y}
$$
<p>Unlike before, we are building a system that does not involve an intermediate
representation. The natural questions that follow are: why is such a system
strictly better than one that involves an intermediate representation, and is
that always the case?</p>
<h2 id="data-processing-inequality">Data Processing Inequality</h2>
<p>To generalize our discussion, let’s assume $X, Y, Z$ be the random variables
that form a Markov chain.</p>
$$
X \rightarrow Y \rightarrow Z
$$
<p>You can think of each arrow as a complex system that generates the best approximation
of whatever we want for each step. According to the data processing inequality,
the mutual information between $X$ and $Z$, $I(X; Z)$ cannot be greater than
that between $X$ and $Y$, $I(X; Y)$.</p>
$$
I(X;Y) \ge I(X;Z)
$$
<p>In other words, information can only be lost, never gained, as we
process it. For example, in the traditional method, we extract features $Y$ from an image $X$ with
a deterministic function. Given the features, we
estimate the outcome $Z$. So, if we lose some information in the feature
extraction stage, we cannot regain it in the second stage.</p>
<p>However, in an end-to-end system, we do not enforce an intermediate
representation and thus remove $Y$ altogether.</p>
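<p>The inequality is easy to check numerically for a small discrete Markov chain; the channel matrices below are arbitrary toy choices.</p>

```python
import math

# Markov chain X -> Y -> Z over binary alphabets (toy numbers).
px = [0.5, 0.5]                      # P(X)
py_x = [[0.9, 0.1], [0.2, 0.8]]      # P(Y|X): a noisy channel
pz_y = [[0.7, 0.3], [0.4, 0.6]]      # P(Z|Y): further lossy processing

def mutual_info(px, pc_x):
    # I(X;C) for the joint p(x,c) = px[x] * pc_x[x][c]
    joint = [[px[x] * pc_x[x][c] for c in range(2)] for x in range(2)]
    pc = [sum(joint[x][c] for x in range(2)) for c in range(2)]
    return sum(joint[x][c] * math.log(joint[x][c] / (px[x] * pc[c]))
               for x in range(2) for c in range(2) if joint[x][c] > 0)

# Compose the channels: P(Z|X) = sum_y P(Y|X) P(Z|Y)
pz_x = [[sum(py_x[x][y] * pz_y[y][z] for y in range(2)) for z in range(2)]
        for x in range(2)]

# Data processing inequality: I(X;Y) >= I(X;Z)
assert mutual_info(px, py_x) >= mutual_info(px, pz_x)
```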
<h2 id="case-studies">Case Studies</h2>
<p>Now that we are equipped with this knowledge, let’s delve into some scenarios where you
can put it to use. Can you tell your friendly colleague
ML what went wrong, or how to improve the model?</p>
<h3 id="case-1-rgb-rightarrow-thermal-image-rightarrow-pedestrian-detection">Case 1: RGB $\rightarrow$ Thermal Image $\rightarrow$ Pedestrian Detection</h3>
<p>ML wants to localize pedestrians from RGB images.</p>
<p>ML: It is easier to predict pedestrians from thermal images, but thermal
images are difficult to acquire as the thermal cameras are not as common as
regular RGB cameras. So I will first predict thermal images from regular
images; then it will be easier to find the pedestrians.</p>
<h3 id="case-2-monocular-image-rightarrow-3d-shape-prediction-rightarrow-weight">Case 2: Monocular Image $\rightarrow$ 3D shape prediction $\rightarrow$ Weight</h3>
<p>Again, ML is working on weight prediction from a monocular image (just a
regular image).</p>
<p>ML: Weight is a property associated with the shape of the object. If we can
predict the shape of an object first from an image, then predicting weight from
a 3D shape would be easier!</p>
<p>You can probably guess what went wrong. However, if we slightly tweak the
settings, we could improve the models. For example, in Case 1, feeding
RGB + Thermal $\rightarrow$ Pedestrian Detection, instead of the RGB image only,
would easily improve the performance.</p>
<h2 id="conclusion">Conclusion</h2>
<p>We discussed how the data processing inequality can shed light on the success
of neural networks and the importance of end-to-end systems. However, the
problems that you want to solve might not be as clear-cut as those I
illustrated here. There are a lot of hair-splitting details that make the
difference. Still, it is always important to keep in mind what is theoretically
possible, and maybe such a split-second thought could save you a week of
implementation!</p>Chris Choychrischoy@ai.stanford.edu