Computing Neural Network Gradients
Computing neural network gradients requires only simple calculus, yet it can be tedious.
Affine Transformation (Fully Connected Layer) Gradients
Consider a simple fully connected layer with batch size $n$, input feature dimension $d_i$, and output feature dimension $d_o$ ($d_i$ neurons as input and $d_o$ neurons as output). The layer has $W \in \mathbb{R}^{d_o \times d_i}$ and $b \in \mathbb{R}^{d_o}$ as parameters and takes $X \in \mathbb{R}^{d_i \times n}$ as input. The resulting response $Y \in \mathbb{R}^{d_o \times n}$ is $Y = WX + b$.
The addition of $b$ is a broadcasting sum ($b$ is added to every column of $WX$).
The gradient $\frac{\partial Y_{ij}}{\partial W_{pq}}$ can be computed as

$$\frac{\partial Y_{ij}}{\partial W_{pq}} = \frac{\partial}{\partial W_{pq}}\left(\sum_l W_{il} X_{lj} + b_i\right) = \delta_{ip}\, X_{qj}.$$

Also, the gradient for the bias $b$ is

$$\frac{\partial Y_{ij}}{\partial b_p} = \frac{\partial}{\partial b_p}\left(\sum_l W_{il} X_{lj} + b_i\right) = \delta_{ip},$$

where $\delta_{ij}$ is the Kronecker delta: $\delta_{ij} = 1$ if $i = j$, and $0$ otherwise. Similarly, for the input,

$$\frac{\partial Y_{ij}}{\partial X_{pq}} = \frac{\partial}{\partial X_{pq}}\left(\sum_l W_{il} X_{lj}\right) = W_{ip}\,\delta_{jq}.$$

Let's say that the upper layer propagated the gradient $\frac{\partial E}{\partial Y} \in \mathbb{R}^{d_o \times n}$ back to the current layer. The resulting gradient for the weight $W$ is
$$\frac{\partial E}{\partial W} = \frac{\partial E}{\partial Y}\frac{\partial Y}{\partial W} = \frac{\partial E}{\partial Y}\, X^T,$$

where the $j$-th column of $Y$ (and of $X$) corresponds to the $j$-th datapoint in the batch. The gradient propagated back to the lower layer is

$$\frac{\partial E}{\partial X} = \frac{\partial E}{\partial Y}\frac{\partial Y}{\partial X} = W^T \frac{\partial E}{\partial Y}.$$

The only thing that you have to be careful with is matching the dimensions of the matrices. Everything will work out if you simply match the dimensions of the matrices that you multiply. For instance, $\frac{\partial E}{\partial W}$ must have the same dimensions as $W$, namely $d_o \times d_i$, and multiplying $\frac{\partial E}{\partial Y} \in \mathbb{R}^{d_o \times n}$ by $X^T \in \mathbb{R}^{n \times d_i}$ is the only ordering that produces a $d_o \times d_i$ matrix.
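As a sanity check, here is a minimal NumPy sketch of the affine forward and backward passes under the conventions above ($X \in \mathbb{R}^{d_i \times n}$, $Y = WX + b$). The function names and the gradient-check snippet are illustrative, not taken from any particular framework.

```python
import numpy as np

def affine_forward(W, X, b):
    # W: (d_o, d_i), X: (d_i, n), b: (d_o,)
    # Broadcasting adds b to every column of W @ X.
    return W @ X + b[:, None]

def affine_backward(W, X, dY):
    # dY = dE/dY with shape (d_o, n), matching Y.
    dW = dY @ X.T          # (d_o, n) @ (n, d_i) -> (d_o, d_i), same shape as W
    db = dY.sum(axis=1)    # accumulate over the batch -> (d_o,)
    dX = W.T @ dY          # (d_i, d_o) @ (d_o, n) -> (d_i, n), same shape as X
    return dW, db, dX

# Quick numerical check of dE/dW for E = sum(Y), so dE/dY is all ones.
rng = np.random.default_rng(0)
W, X, b = rng.standard_normal((3, 4)), rng.standard_normal((4, 5)), rng.standard_normal(3)
dY = np.ones((3, 5))
dW, db, dX = affine_backward(W, X, dY)
eps = 1e-6
W2 = W.copy(); W2[0, 0] += eps
num = (affine_forward(W2, X, b).sum() - affine_forward(W, X, b).sum()) / eps
assert np.isclose(num, dW[0, 0], atol=1e-4)
```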
ReLU Gradients
The Rectified Linear Unit (ReLU) has been widely used for its simplicity and faster convergence. ReLU units allow stronger gradients to propagate back to the lower layers, unlike sigmoid or hyperbolic tangent units. Since there is no weight to learn, we only need to compute the gradient $\frac{\partial E}{\partial X}$, where $Y = f(X)$ and $f(\cdot)$ is the element-wise ReLU.
$$\frac{\partial E}{\partial X} = \frac{\partial E}{\partial Y}\frac{\partial Y}{\partial X} = \frac{\partial E}{\partial Y} \circ \Delta(X),$$

where $\circ$ is the Hadamard (element-wise) product and $\Delta(X)$ has elements $\Delta(X)_{ij} = \mathbb{I}(X_{ij} > 0)$.
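A minimal sketch of the same rule in NumPy (the function names are illustrative):

```python
import numpy as np

def relu_forward(X):
    # Element-wise ReLU: f(x) = max(x, 0).
    return np.maximum(X, 0)

def relu_backward(X, dY):
    # dE/dX is the upstream gradient masked by the indicator I(X > 0)
    # (a Hadamard product with the mask).
    return dY * (X > 0)

X = np.array([[-1.0, 2.0], [3.0, -4.0]])
dY = np.ones_like(X)
print(relu_backward(X, dY))   # [[0. 1.], [1. 0.]]
```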
Generalized Convolution Gradients
Convolution is a conventional signal filtering technique. In CNN terminology, you can also think of it as a 2D or ND filter that extracts edges, blobs, or high-frequency changes. In this example, we follow the Caffe Blob convention and compute a 2D convolution.
Unlike traditional signal-processing convolution, neural network convolution is a generalization of the convolution $*$. For instance, there is the stride, which controls the step size; you can think of the standard convolution as a convolution with stride 1. Also, the output size is not $n + k - 1$, where $n$ is the input signal size and $k$ is the kernel size. The neural network convolution kernel always fits within the input, so (with stride 1 and no padding) the output size is $n - k + 1$.
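As a small worked example, assuming the usual generalization of this rule to stride $s$ (the text above only states the stride-1 case), the output size along one dimension can be computed as:

```python
def conv_output_size(n, k, s=1):
    # Valid placements of a size-k kernel on a size-n input with stride s:
    # n - k + 1 when s == 1, and floor((n - k) / s) + 1 in general.
    return (n - k) // s + 1

assert conv_output_size(7, 3) == 5         # 7 - 3 + 1
assert conv_output_size(7, 3, s=2) == 3    # placements at offsets 0, 2, 4
```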
I will use $*_s$ to denote convolution with stride $s$. Suppose that the result after convolution is $Y = W *_s X + b$, where $Y \in \mathbb{R}^{n \times h_y \times w_y}$, the filter $W \in \mathbb{R}^{c \times h_f \times w_f}$, the input $X \in \mathbb{R}^{n \times c \times h_x \times w_x}$, and $b \in \mathbb{R}$ (a single filter, so one output channel per datapoint).
The addition of $b$ is a broadcasting sum, so the dimensions stay consistent.
If we add padding along the width and height, the problem becomes simpler since we do not have to handle the border cases where the filter does not fully overlap the input.
With zero-based indices and stride $s$, the output is

$$Y_{nhw} = \sum_{c}\sum_{k=0}^{h_f-1}\sum_{l=0}^{w_f-1} W_{ckl}\, X_{n,c,(hs+k),(ws+l)} + b.$$

Differentiating with respect to the input,

$$\frac{\partial Y_{nhw}}{\partial X_{ncij}} =
\begin{cases}
W_{c,(i-hs),(j-ws)} & \text{if } 0 \le i-hs < h_f \text{ and } 0 \le j-ws < w_f,\\
0 & \text{otherwise},
\end{cases}$$

where $s$ is the stride (this is not a standard convolution parameter; in the standard convolution, the stride is 1). In other words, $X_{ncij}$ only affects the outputs whose receptive field covers position $(i, j)$, and the derivative is the filter weight that multiplies $X_{ncij}$ in the forward pass.
For parameters,
$$\frac{\partial Y_{nhw}}{\partial W_{cij}} = X_{n,c,(hs+i),(ws+j)}, \qquad \frac{\partial Y_{nhw}}{\partial b} = 1.$$

Thus the gradient propagated back through the network will be

$$\frac{\partial E}{\partial X_{ncij}} = \sum_{h}\sum_{w} \frac{\partial E}{\partial Y_{nhw}}\frac{\partial Y_{nhw}}{\partial X_{ncij}} = \sum_{\substack{h,\,w\,:\; 0 \le i-hs < h_f,\\ \;\; 0 \le j-ws < w_f}} \frac{\partial E}{\partial Y_{nhw}}\, W_{c,(i-hs),(j-ws)}.$$

Also, for the weight update,

$$\frac{\partial E}{\partial W_{cij}} = \sum_{n}\sum_{h}\sum_{w} \frac{\partial E}{\partial Y_{nhw}}\frac{\partial Y_{nhw}}{\partial W_{cij}} = \sum_{n}\sum_{h}\sum_{w} \frac{\partial E}{\partial Y_{nhw}}\, X_{n,c,(hs+i),(ws+j)},$$

$$\frac{\partial E}{\partial b} = \sum_{n}\sum_{h}\sum_{w} \frac{\partial E}{\partial Y_{nhw}}\frac{\partial Y_{nhw}}{\partial b} = \sum_{n}\sum_{h}\sum_{w} \frac{\partial E}{\partial Y_{nhw}}.$$
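To tie the index formulas together, here is a deliberately naive, loop-based NumPy sketch for a single filter. The function names and shapes follow the conventions above and are my own, not Caffe's API, and the loops mirror the sums rather than aiming for efficiency.

```python
import numpy as np

def conv_forward(W, X, b, s=1):
    # X: (n, c, h_x, w_x), W: (c, h_f, w_f) -- a single filter -- b: scalar.
    n, c, h_x, w_x = X.shape
    _, h_f, w_f = W.shape
    h_y, w_y = (h_x - h_f) // s + 1, (w_x - w_f) // s + 1
    Y = np.zeros((n, h_y, w_y))
    for h in range(h_y):
        for w in range(w_y):
            # Y[n,h,w] = sum_{c,k,l} W[c,k,l] * X[n,c,h*s+k,w*s+l] + b
            patch = X[:, :, h*s:h*s+h_f, w*s:w*s+w_f]
            Y[:, h, w] = np.tensordot(patch, W, axes=([1, 2, 3], [0, 1, 2])) + b
    return Y

def conv_backward(W, X, dY, s=1):
    # dY = dE/dY with shape (n, h_y, w_y), matching Y.
    _, h_f, w_f = W.shape
    _, h_y, w_y = dY.shape
    dW, db, dX = np.zeros_like(W), 0.0, np.zeros_like(X)
    for h in range(h_y):
        for w in range(w_y):
            patch = X[:, :, h*s:h*s+h_f, w*s:w*s+w_f]
            # dE/dW[c,i,j] += sum_n dE/dY[n,h,w] * X[n,c,h*s+i,w*s+j]
            dW += np.tensordot(dY[:, h, w], patch, axes=([0], [0]))
            # dE/dX accumulates the filter weights scaled by the upstream gradient.
            dX[:, :, h*s:h*s+h_f, w*s:w*s+w_f] += dY[:, h, w][:, None, None, None] * W
            # dE/db accumulates the upstream gradient over the batch and all outputs.
            db += dY[:, h, w].sum()
    return dW, db, dX

# Quick numerical check of dW[0, 0, 0] for E = sum(Y), so dE/dY is all ones.
rng = np.random.default_rng(0)
X = rng.standard_normal((2, 3, 5, 5))   # n=2, c=3, 5x5 inputs
W = rng.standard_normal((3, 3, 3))      # one 3x3 filter over 3 channels
b, s = 0.1, 2
dW, db, dX = conv_backward(W, X, np.ones((2, 2, 2)), s=s)
eps = 1e-6
W2 = W.copy(); W2[0, 0, 0] += eps
num = (conv_forward(W2, X, b, s).sum() - conv_forward(W, X, b, s).sum()) / eps
assert np.isclose(num, dW[0, 0, 0], atol=1e-4)
```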