Ghost of 3D Perception: Permutation Invariance Matters? Convolutions are Permutation Invariant!

Are you familiar with the python dictionary class? Let me give you a quick test to check your level of knowledge.

a = dict()
a[1.1] = 1
a[2.1] = 2
b = dict()
b[2.1] = 2
b[1.1] = 1

Do you think the dictionaries a and b are practically the same? Congratulations! I guess you are an expert in python! Let me give you a more challenging problem.

Is the OctNet Permutation Invariant?

The OctNet first published on arXiv 2016 1 and O-CNN 2 proposed to do convolution on a sparse voxel grid using an sparse octree. The OctNet used convolution on both surface and interior whereas the O-CNN used convolution only on the surface. In their implementation, not surprisingly, it is guaranteed to get the same results even when you permute your data arbitrarily. Then, the answer is YES. The OctNet is permutation invariant! In addition, as it operates on a sparse representation, we can apply these networks on large-scale scenes.

Discretization $\nsim$ Permutation Invariance

Well, you might claim that these networks operate on a discretized space, not the original continuous representation. You are absolutely right that it operates on a discretized representation. However, think about the difference between permutation invariance and discretization. Do you think they are the same concept?

Let me give you a quick python test again.

a = dict()
a[1] = 1
a[2] = 2
b = dict()
b[2] = 2
b[1] = 1

Unlike the previous test where we put 1.1 and 2.1, we now put 1 and 2. However, the dictionaries a and b remain to be practically the same and permutation invariant! So to reiterate, some data structures can be both discrete and permutation invariant. Others can be continuous and permutation invariant. Discretization and Permutation Invariance are completely different concepts.

Is Convolution Permutation Invariant?

Then let’s think of why these convolutional neural networks are permutation invariant. Doesn’t our good old 2D CNNs require data to be in a specific format? Before we get into this, let’s look at the definiton. What is convolution? If I borrow the definition from Wikipedia,

convolution is a mathematical operation on two functions (f and g) that produces a third function ($ f * g$) that expresses how the shape of one is modified by the other.

This is pretty vague. Let’s look at the discrete 1D definition:

\[(f*g)[n]=\sum_{m=-\infty }^{\infty }f[n - m]g[m]\]

Hmm, this is not the form that I’m familiar with. Let me convert this even more and limit the support of $g$ to be in $[-k, k]$. Also, I’ll mirror $g$ to not deal with the negative sign in the index.

\[(f*g)[n]=\sum_{m=-k }^{k }f[n + m]g[m]\]

Let me apply some tricks to convert this even further. First, we can vectorize the feature and convert the kernel scalar $g[m]$ to a matrix $G[m]$.

\[(\mathbf{f}*\mathcal{G})[n]=\sum_{m=-k }^{k }G[m]\mathbf{f}[n + m]\]

Now, it looks like the regular old convolution that we use in computer vision! $f$ is the input and $g$ is the convolution kernel with kernel size $k$.

Also, let’s define the support of $\mathcal{G}$ to be $\mathcal{N}(n)$ a list of neighbors that we have around $n$. This can be the regular old $[-k, k]$, but in some cases $f[n + k] = 0$ like spatially sparse voxel grid, in which reducing $[-k, k]$ to $\mathcal{N}(n)$ gives you the identical results, yet it reduces the number of matrix-vector multiplications.

\[(\mathbf{f}*\mathcal{G})[n]=\sum_{m \in \mathcal{N}(n)}G[m]\mathbf{f}[n + m]\]

Now, this looks like the generalized convolution!

If we look at this closely, all convolution here does not have a comment like f must be contiguous, or f must be sorted. It only defines the $f[n]$ to be well defined and accessible! In other words, we need to access $f[n]$ arbitrarily, which is random access. All we need is a data structure that supports this random access!

It’s Random Access Stupid!

So what kind of data structures support random access? Clearly, linked lists and arrays are not the ideal data structure for this, but it is possible. We can search for the key or n in this case, but this will be very inefficient. Then, let’s look at other efficient data structures.

The first and the simplest data structure that comes to my mind is array. Let’s use a float pointer float* data for this exercise.

data[n] := access_value_at(data + n)

This simple math operation data + n performs random access!

What are other types of data structures that allow random access?

Another option is obviously the hash table.

data[n] := access_value_at(data + hash(n) % table_size)

If you want it to be slightly fancy, you could do

data[n] := access_value_at(collision_handle(data + hash(n) % table_size))

Trees are also a good option, but I think you probably got the point that I’m trying to say: arrays are not the only data structure! So if we use these data structures, we can create (inherently permutation invariant) convolutions with sparse data!

Why have we thought that convolution is not permutation invariant? Bias from 2D ConvNets

There is a knee-jerk reaction to some people when I say convolutions are permutation invariant. This I suspect is from the implicit assumption that we are accustomed to in 2D image processing: Images use arrays for data structure and we use 2D convnets for images, so to use convolutions data has to be contiguous!. You probably have guessed it; not all data need to use arrays for their data structure and convolution doesn’t require contiguous data. However, as we’ve seen in the previous section, the definition of convolution doesn’t say anything about the data structure it uses as long as it allows random access.

Conclusion

So to simply put,

  • convolution is permutation invariant, it always has been.
  • The familiarity with 2D image processing made us think that convolution and arrays go hand in hand, but it is not. We can decouple convolution from arrays and use data structures that allow random access for convolution.

PS

Q: Why is the title ghost of 3D perception? A: The permutation invariance is something everybody talks about in 3D perception, but it has no practical implications like a metaphysical world.

References

  1. OctNet: Learning Deep 3D Representations at High Resolutions, arXiv’16, CVPR’17 

  2. O-CNN: Octree-based Convolutional Neural Networks for 3D Shape Analysis, Siggraph’17 

Leave a Comment