The automation of mechanical tasks brought the modern world unprecedented prosperity and comfort. However, the majority of automated tasks have been simple mechanical tasks that only require repetitive motion. Tasks that require visual perception and high-level cognition remain the last frontiers of automation. Examples include automated warehouses, where robots package items in disarray, and autonomous driving, where autonomous agents localize themselves and identify and track other dynamic objects in the 3D world. This ability to represent, identify, and interpret visual three-dimensional data to understand the underlying three-dimensional structure of the real world is known as 3D perception. In this dissertation, we propose learning-based approaches to tackle challenges in 3D perception. Specifically, we propose a set of high-dimensional convolutional neural networks for three categories of problems in 3D perception: reconstruction, representation learning, and registration.
Reconstruction is the first step: it generates 3D point clouds or meshes from a set of sensory inputs. We present supervised reconstruction methods using 3D convolutional neural networks that take a set of images as input and generate 3D occupancy patterns in a grid as output. We train the networks on images rendered from various viewpoints of shapes in a large-scale 3D shape dataset, and validate the approach on real image datasets. However, supervised reconstruction requires 3D shapes as labels for all images, which are expensive to generate. Instead, we propose a weaker form of supervision that trains the reconstruction network with a set of foreground masks and unlabeled real 3D shapes. Combined with the learned constraint, we train the reconstruction system with as few as 1 image and show that the proposed model can reconstruct 3D shapes without direct 3D supervision.
In the second part of the dissertation, we present sparse tensor networks, neural networks for spatially sparse tensors. As we increase the spatial dimension, the sparsity of the input data increases drastically because the volume of the space grows exponentially. Sparse tensor networks exploit this inherent sparsity to process the input data efficiently. With the sparse tensor network, we create a 4-dimensional convolutional network for spatio-temporal perception on 3D scans or sequences of 3D scans (3D videos). We show that 4-dimensional convolutional neural networks can effectively make use of temporal consistency and improve the accuracy of segmentation. Next, we use sparse tensor networks for geometric representation learning to capture both local and global 3D structures accurately for correspondences and registration. We propose fully convolutional networks and new types of metric learning losses that allow neurons to capture large context while preserving local spatial geometry. We experimentally validate our approach on both indoor and outdoor datasets and show that the network outperforms the state-of-the-art method while being a few orders of magnitude faster.
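The growth in sparsity with dimension can be illustrated with a minimal NumPy sketch. It is not the thesis implementation: the helper names `voxelize` and `occupancy_fraction`, the synthetic sphere, and the grid resolution of 32 are assumptions chosen for illustration. The sketch quantizes surface points into a voxel grid, keeps only the occupied coordinates (the coordinate-list form a sparse tensor stores), and compares the occupied fraction in 3D with the fraction after adding a time axis.

```python
import numpy as np

def voxelize(points, resolution):
    """Quantize points in [0, 1)^D to unique integer voxel coordinates.

    The returned (M, D) array is a coordinate list: only occupied voxels
    are stored, which is the idea behind a sparse tensor representation.
    """
    coords = np.floor(points * resolution).astype(np.int64)
    coords = np.clip(coords, 0, resolution - 1)
    return np.unique(coords, axis=0)

def occupancy_fraction(points, resolution):
    """Fraction of the dense grid that is actually occupied."""
    occupied = voxelize(points, resolution)
    total_voxels = resolution ** points.shape[1]
    return len(occupied) / total_voxels

rng = np.random.default_rng(0)

# Hypothetical 3D scan: points on a sphere surface inside the unit cube.
directions = rng.normal(size=(20000, 3))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)
surface_3d = 0.5 + 0.4 * directions

# Hypothetical 3D video: the same points with a random time stamp appended.
timestamps = rng.uniform(0.0, 1.0, size=(len(surface_3d), 1))
surface_4d = np.hstack([surface_3d, timestamps])

frac_3d = occupancy_fraction(surface_3d, 32)
frac_4d = occupancy_fraction(surface_4d, 32)
print(f"occupied fraction in 3D: {frac_3d:.4f}")
print(f"occupied fraction in 4D: {frac_4d:.4f}")
```

In this toy setting the dense 4D grid has 32 times as many cells as the 3D grid, while the number of occupied cells is bounded by the number of points, so storing only occupied coordinates becomes more attractive as the dimension grows.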
In the third and last part of the dissertation, we discuss high-dimensional pattern recognition problems in image and 3D registration. We first propose high-dimensional convolutional networks for spaces of 4 to 32 dimensions and analyze the geometric pattern recognition capacity of these networks on linear regression problems. Next, we show that 3D correspondences form a hyper-surface in 6-dimensional space and that 2D correspondences form a 4-dimensional hyper-conic section, both of which we detect using high-dimensional convolutional networks. We extend the proposed high-dimensional convolutional networks to differentiable 3D registration and propose three core modules for this: a 6-dimensional convolutional neural network for correspondence confidence prediction; a differentiable Weighted Procrustes method for closed-form pose estimation; and a robust gradient-based 3D rigid transformation optimizer for pose refinement. Experiments demonstrate that our approach outperforms state-of-the-art learning-based and classical methods on real-world data while maintaining efficiency.
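The closed-form pose estimation step can be sketched with the standard weighted Procrustes (weighted Kabsch) solution. This is a plain NumPy illustration under stated assumptions, not the differentiable module from the thesis: the function name `weighted_procrustes`, the synthetic data, and the hand-set zero weights for outliers are all hypothetical. In the thesis the weights come from the 6-dimensional confidence network; here they are simply given.

```python
import numpy as np

def weighted_procrustes(X, Y, w):
    """Return R, t minimizing sum_i w_i * ||R @ X[i] + t - Y[i]||^2.

    X, Y: (N, 3) arrays of corresponding points; w: (N,) non-negative weights.
    """
    w = np.asarray(w, dtype=np.float64)
    w = w / w.sum()                      # normalize weights to sum to 1
    mu_x = w @ X                         # weighted centroids
    mu_y = w @ Y
    Xc = X - mu_x
    Yc = Y - mu_y
    S = Xc.T @ (w[:, None] * Yc)         # weighted 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(S)
    # Flip the last axis if needed so that R is a proper rotation (det = +1).
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = mu_y - R @ mu_x
    return R, t

rng = np.random.default_rng(0)

# Hypothetical ground-truth rigid transform.
Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
if np.linalg.det(Q) < 0:
    Q[:, 0] *= -1.0                      # make it a proper rotation
R_true = Q
t_true = np.array([0.5, -1.0, 2.0])

# Correspondences: the first 20 are corrupted to act as outliers.
X = rng.normal(size=(100, 3))
Y = X @ R_true.T + t_true
Y[:20] = rng.normal(size=(20, 3))

# Weights stand in for predicted confidences: outliers get zero weight.
w = np.ones(100)
w[:20] = 0.0

R_est, t_est = weighted_procrustes(X, Y, w)
print(np.allclose(R_est, R_true), np.allclose(t_est, t_true))
```

Because every step is a composition of sums, matrix products, and an SVD, the same computation can be written in an automatic differentiation framework so that gradients flow back to the weights, which is what makes a closed-form solver usable inside a learned registration pipeline.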
The thesis is posted on the Stanford Digital Repository: Thesis.
You can access each chapter without downloading the full thesis from the following list.
- Chapter 1: Introduction
- Chapter 2: Supervised Reconstruction
- Chapter 3: Weakly-supervised Reconstruction
- Chapter 4: Sparse Tensor Networks
- Chapter 5: Spatio-Temporal Segmentation
- Chapter 6: Geometric Features
- Chapter 7: Geometric Pattern Recognition
- Chapter 8: Global Registration
- Chapter 9: Conclusion
Thesis Defense Slides
Slides for my PhD oral defense are available at: Slides