Computing Neural Network Gradients

Computing neural network gradients requires only very simple calculus, yet it can be tedious.

Affine Transformation (Fully Connected Layer) Gradients

For a simple fully connected layer with batch size $n$, input feature dimension $d_i$, and output feature dimension $d_o$ ($d_i$ neurons as input and $d_o$ neurons as output), the layer has $W \in \mathbb{R}^{d_o \times d_i}$ and $b \in \mathbb{R}^{d_o}$ as parameters. Given an input $X \in \mathbb{R}^{d_i \times n}$, the resulting response $Y \in \mathbb{R}^{d_o \times n}$ is $Y = WX + b$.

The addition of $b$ is a broadcasting sum: $b$ is added to every column of $WX$.

The gradient $\frac{\partial Y_{ij}}{\partial W_{pq}}$ can be computed as

$$\frac{\partial Y_{ij}}{\partial W_{pq}} = \frac{\partial}{\partial W_{pq}}\left(\sum_l W_{il} X_{lj} + b_i\right) = \delta_{ip}\, X_{qj}$$

Similarly, the gradient for the bias $b$ is

$$\frac{\partial Y_{ij}}{\partial b_{p}} = \frac{\partial}{\partial b_{p}}\left(\sum_l W_{il} X_{lj} + b_i\right) = \delta_{ip}$$

where $\delta_{ij}$ is the Kronecker delta: $\delta_{ij} = 1$ if $i = j$, otherwise $0$.

Likewise, the gradient with respect to the input $X$ is

$$\frac{\partial Y_{ij}}{\partial X_{pq}} = \frac{\partial}{\partial X_{pq}}\left(\sum_l W_{il} X_{lj}\right) = W_{ip}\, \delta_{jq}$$

Let’s say that the gradient propagated from the upper layer to the current layer is $\frac{\partial E}{\partial Y} \in \mathbb{R}^{d_o \times n}$. The resulting gradient for the weight $W$ is

$$\frac{\partial E}{\partial W} = \frac{\partial E}{\partial Y}\frac{\partial Y}{\partial W} = \frac{\partial E}{\partial Y} X^T$$

$$\frac{\partial E}{\partial b} = \frac{\partial E}{\partial Y}\frac{\partial Y}{\partial b} = \sum_i \frac{\partial E}{\partial Y_i}$$

where $Y_i$ is the column of $Y$ corresponding to the $i$-th datapoint in the batch. The gradient propagated back to the lower layer is

$$\frac{\partial E}{\partial X} = \frac{\partial E}{\partial Y}\frac{\partial Y}{\partial X} = W^T \frac{\partial E}{\partial Y}$$

The only thing you have to be careful with is matching the dimensions of the matrices; everything works out if the dimensions of the matrices you multiply agree. For instance, $\frac{\partial E}{\partial W}$ must have the same dimensions as $W \in \mathbb{R}^{d_o \times d_i}$, and indeed $\frac{\partial E}{\partial Y} X^T$ has dimensions $(d_o \times n)(n \times d_i) = d_o \times d_i$.
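To double-check the matrix forms above, here is a minimal NumPy sketch (the variable names such as `dY`, and the choice $E = \sum Y$ for the numeric check, are my own illustration choices) that implements the affine forward pass and the three gradients:

```python
import numpy as np

# Shapes follow the text: X is (d_i, n), W is (d_o, d_i), b is (d_o,), Y is (d_o, n).
d_i, d_o, n = 4, 3, 5
rng = np.random.default_rng(0)
X = rng.standard_normal((d_i, n))
W = rng.standard_normal((d_o, d_i))
b = rng.standard_normal(d_o)

def forward(W, X, b):
    # Broadcasting sum: b is added to every column of WX.
    return W @ X + b[:, None]

# Pretend the upstream gradient dE/dY is all ones, i.e. E = sum(Y).
dY = np.ones((d_o, n))
dW = dY @ X.T          # dE/dW = dE/dY X^T, shape (d_o, d_i)
db = dY.sum(axis=1)    # dE/db = sum over the batch, shape (d_o,)
dX = W.T @ dY          # dE/dX = W^T dE/dY, shape (d_i, n)

# Numeric check for a single weight entry.
eps = 1e-6
W_pert = W.copy()
W_pert[0, 0] += eps
num_dW00 = (forward(W_pert, X, b).sum() - forward(W, X, b).sum()) / eps
print(np.allclose(dW[0, 0], num_dW00, atol=1e-4))  # True
```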

ReLU Gradients

The Rectified Linear Unit (ReLU) has been widely used for its simplicity and faster convergence. ReLU units allow stronger gradients to propagate back to the lower layers than sigmoid or hyperbolic tangent units do. Since there is no weight to learn, we only need to compute the gradient $\frac{\partial E}{\partial X}$, where $Y = f(X)$ and $f(\cdot)$ is the element-wise ReLU.

$$\frac{\partial E}{\partial X} = \frac{\partial E}{\partial Y}\frac{\partial Y}{\partial X} = \frac{\partial E}{\partial Y} \odot \Delta(X)$$

where $\odot$ is the Hadamard (element-wise) product and $\Delta(X)$ has elements $\Delta(X)_{ij} = \mathbb{I}(x_{ij} > 0)$.
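The ReLU backward pass is essentially a one-liner; a small sketch, assuming the same upstream-gradient convention as above (the function names are mine):

```python
import numpy as np

def relu_forward(X):
    return np.maximum(X, 0)

def relu_backward(dY, X):
    # dE/dX = dE/dY ⊙ Δ(X), where Δ(X)_ij = 1 if x_ij > 0 else 0.
    return dY * (X > 0)

X = np.array([[-1.0, 2.0], [3.0, -4.0]])
dY = np.ones_like(X)
print(relu_backward(dY, X))  # [[0. 1.]
                             #  [1. 0.]]
```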

Generalized Convolution Gradients

Convolution is a conventional signal filtering technique. In CNN terminology, you can also think of it as a 2D or N-D filter that extracts edges, blobs, or high-frequency changes. In this example, we follow the Caffe blob convention and compute a 2D convolution.

Neural network convolution is a generalization of the traditional signal-processing convolution. For instance, there is a stride that controls the step size; you can think of the standard convolution as a convolution with stride 1. Also, the output size is not $n + k - 1$, where $n$ is the input signal size and $k$ is the kernel size. The neural network convolution kernel always fits within the input, so the output size is $n - k + 1$ (or $\lfloor (n - k)/s \rfloor + 1$ with stride $s$).
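As a quick sanity check on the size bookkeeping, a tiny sketch (the helper name `conv_output_size` is my own) that computes the valid-convolution output size, including the strided case used below:

```python
def conv_output_size(n, k, s=1):
    """Output length of a valid convolution: the kernel of size k must fit
    inside the input of size n, stepping by stride s."""
    return (n - k) // s + 1

print(conv_output_size(7, 3))       # 5, i.e. n - k + 1 when the stride is 1
print(conv_output_size(7, 3, s=2))  # 3
```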

I will use $\ast_s$ to denote convolution with stride $s$. Suppose that the result of the convolution is $Y = W \ast_s X + b$, where $Y \in \mathbb{R}^{n \times h_y \times w_y}$, $W \in \mathbb{R}^{c \times h_f \times w_f}$, $X \in \mathbb{R}^{n \times c \times h_x \times w_x}$, and $b \in \mathbb{R}$ (a single output channel), with $h_y = \lfloor (h_x - h_f)/s \rfloor + 1$ and $w_y = \lfloor (w_x - w_f)/s \rfloor + 1$.

The addition of $b$ is again a broadcasting sum: the scalar bias is added to every element of the output.

If we zero-pad the input along the width and height, the problem becomes simpler since we do not have to account for the border cases where the filter extends beyond the input and sees too little data.

Written element-wise, the convolution is

$$Y_{n,h,w} = \sum_{c}\sum_{k=0}^{h_f-1}\sum_{l=0}^{w_f-1} W_{c,k,l}\, X_{n,c,(hs+k),(ws+l)} + b$$

where $s$ is the stride (this is not a standard convolution parameter; in standard convolution the stride is 1). Differentiating with respect to a single input element gives

$$\frac{\partial Y_{n,h,w}}{\partial X_{n,c,i,j}} = \begin{cases} W_{c,(i-hs),(j-ws)} & \text{if } 0 \le i - hs < h_f \text{ and } 0 \le j - ws < w_f \\ 0 & \text{otherwise} \end{cases}$$

That is, only the outputs $(h, w)$ whose receptive field covers the input location $(i, j)$ have a nonzero derivative.

For parameters,

$$\frac{\partial Y_{n,h,w}}{\partial W_{c,i,j}} = X_{n,c,(hs+i),(ws+j)}, \qquad \frac{\partial Y_{n,h,w}}{\partial b} = 1$$

Thus the gradient propagated through the network will be

$$\frac{\partial E}{\partial X_{n,c,i,j}} = \sum_{h,w}\frac{\partial E}{\partial Y_{n,h,w}}\frac{\partial Y_{n,h,w}}{\partial X_{n,c,i,j}} = \sum_{\substack{h,w:\; 0 \le i - hs < h_f \\ 0 \le j - ws < w_f}} \frac{\partial E}{\partial Y_{n,h,w}}\, W_{c,(i-hs),(j-ws)}$$

Also, for the weight and bias updates,

$$\frac{\partial E}{\partial W_{c,i,j}} = \sum_{n}\sum_{h,w}\frac{\partial E}{\partial Y_{n,h,w}}\frac{\partial Y_{n,h,w}}{\partial W_{c,i,j}} = \sum_{n}\sum_{h,w}\frac{\partial E}{\partial Y_{n,h,w}}\, X_{n,c,(hs+i),(ws+j)}$$

$$\frac{\partial E}{\partial b} = \sum_{n}\sum_{h,w}\frac{\partial E}{\partial Y_{n,h,w}}\frac{\partial Y_{n,h,w}}{\partial b} = \sum_{n}\sum_{h,w}\frac{\partial E}{\partial Y_{n,h,w}}$$

Since the filter and bias are shared across the batch, these gradients are also summed over the batch index $n$.
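Putting the convolution gradients together, here is a deliberately naive, loop-based NumPy sketch following the single-output-channel setup above (the function names and the numeric check with $E = \sum Y$ are my own illustration choices, not a reference implementation):

```python
import numpy as np

def conv_forward(X, W, b, s=1):
    """Valid 2D convolution (cross-correlation) with stride s.
    X: (n, c, h_x, w_x), W: (c, h_f, w_f), b: scalar -> Y: (n, h_y, w_y)."""
    n, c, h_x, w_x = X.shape
    _, h_f, w_f = W.shape
    h_y, w_y = (h_x - h_f) // s + 1, (w_x - w_f) // s + 1
    Y = np.zeros((n, h_y, w_y))
    for i in range(n):
        for h in range(h_y):
            for w in range(w_y):
                Y[i, h, w] = np.sum(W * X[i, :, h*s:h*s+h_f, w*s:w*s+w_f]) + b
    return Y

def conv_backward(dY, X, W, s=1):
    """Gradients dE/dX, dE/dW, dE/db given the upstream gradient dE/dY."""
    _, h_f, w_f = W.shape
    n, h_y, w_y = dY.shape
    dX, dW, db = np.zeros_like(X), np.zeros_like(W), 0.0
    for i in range(n):
        for h in range(h_y):
            for w in range(w_y):
                patch = X[i, :, h*s:h*s+h_f, w*s:w*s+w_f]
                dW += dY[i, h, w] * patch            # sum over batch and output positions
                dX[i, :, h*s:h*s+h_f, w*s:w*s+w_f] += dY[i, h, w] * W
                db += dY[i, h, w]
    return dX, dW, db

# Numeric check of one filter weight with E = sum(Y).
rng = np.random.default_rng(0)
X = rng.standard_normal((2, 3, 7, 7))
W = rng.standard_normal((3, 3, 3))
b, s, eps = 0.5, 2, 1e-6
dY = np.ones_like(conv_forward(X, W, b, s))
_, dW, _ = conv_backward(dY, X, W, s)
W_pert = W.copy()
W_pert[0, 1, 1] += eps
num = (conv_forward(X, W_pert, b, s).sum() - conv_forward(X, W, b, s).sum()) / eps
print(np.allclose(dW[0, 1, 1], num, atol=1e-4))  # True
```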
