Before Reading

CS231n Convolutional Neural Networks for Visual Recognition

Convolution

Extract local features by sliding a convolution kernel/filter of shape (h, w, d) over the input, where the depth d is determined by the number of input channels (the depth of the input image).
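A minimal NumPy sketch of this idea (function name and shapes are illustrative, not from the source): one filter whose depth matches the input's channel count produces a single 2-D activation map.

```python
import numpy as np

def conv2d_single_filter(x, w, b=0.0):
    """Stride-1, no-padding convolution of one filter over a multi-channel
    input. x has shape (H, W, D); the kernel w has shape (h, w, D), so its
    depth must equal the number of input channels."""
    H, W, D = x.shape
    kh, kw, kd = w.shape
    assert kd == D, "kernel depth must match the input's channel count"
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # dot product between the kernel and one local patch, across all channels
            out[i, j] = np.sum(x[i:i + kh, j:j + kw, :] * w) + b
    return out

x = np.random.randn(32, 32, 3)  # e.g. a 32x32 RGB image (D = 3)
w = np.random.randn(3, 3, 3)    # 3x3 kernel whose depth matches D
print(conv2d_single_filter(x, w).shape)  # -> (30, 30)
```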

Stride: the number of pixels the filter shifts at each step. A larger stride downsamples the output, which reduces computation and the number of downstream parameters (and can thereby help avoid overfitting).

stride = 1, convolution kernel (3×3×1 cube)
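The spatial output size follows CS231n's formula (W − F + 2P)/S + 1 for input width W, filter width F, padding P, and stride S. A small sketch (helper name is assumed, not from the source):

```python
def conv_output_size(w_in, f, stride=1, pad=0):
    """Spatial output size for input width w_in and filter width f,
    i.e. CS231n's (W - F + 2P) / S + 1."""
    assert (w_in - f + 2 * pad) % stride == 0, "filter does not fit cleanly"
    return (w_in - f + 2 * pad) // stride + 1

print(conv_output_size(7, 3, stride=1))  # -> 5
print(conv_output_size(7, 3, stride=2))  # -> 3 (larger stride, smaller output)
```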

Convolutional Neural Network (CNN)

Motivation

In image processing and speech recognition, a fully connected neural network requires such a large number of parameters that the model is usually difficult to train to convergence, resulting in poor performance.

CNN

Convolutional Neural Networks are very similar to ordinary Neural Networks from the previous chapter: they are made up of neurons that have learnable weights and biases. Each neuron receives some inputs, performs a dot product and optionally follows it with a non-linearity. The whole network still expresses a single differentiable score function: from the raw image pixels on one end to class scores at the other. And they still have a loss function (e.g. SVM/Softmax) on the last (fully-connected) layer and all the tips/tricks we developed for learning regular Neural Networks still apply.

So what changes? ConvNet architectures make the explicit assumption that the inputs are images, which allows us to encode certain properties into the architecture. These then make the forward function more efficient to implement and vastly reduce the amount of parameters in the network.
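To make the parameter reduction concrete, here is a back-of-the-envelope comparison (layer sizes are hypothetical, chosen only for illustration):

```python
# Hypothetical sizes: a 32x32x3 input (e.g. CIFAR-10) mapped to 12 feature maps.
H, W, D = 32, 32, 3   # input volume
K, F = 12, 5          # 12 filters of spatial size 5x5

# Fully connected: every input value is wired to every output unit.
fc_params = (H * W * D) * (H * W * K)   # 37,748,736 weights

# Convolutional: K filters of F*F*D weights plus one bias each,
# shared across all spatial positions.
conv_params = K * (F * F * D + 1)       # 912 parameters

print(fc_params, conv_params)
```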

Core Idea of CNN – Local Perception

Local features are the key: instead of connecting each neuron to every pixel of the input, each neuron is connected only to a small local region (its receptive field), because meaningful image features such as edges and textures are local.
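A quick per-neuron count (sizes are hypothetical) showing how much local connectivity shrinks what a single neuron must look at:

```python
H, W, D, F = 200, 200, 3, 3   # hypothetical image size and filter size

# Full connectivity: one hidden neuron sees the entire input volume.
print(H * W * D)   # -> 120000 weights for a single neuron

# Local connectivity: one hidden neuron sees only an F x F x D patch.
print(F * F * D)   # -> 27 weights for a single neuron
```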

Core Idea of CNN – Parameter Sharing

Assumption: if one feature is useful to compute at some spatial position (x, y), then it should also be useful to compute at a different position (x2, y2), due to the translationally-invariant structure of images.
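A small single-channel sketch of why sharing one set of weights makes sense (shapes and seed are arbitrary): the same filter responds identically to a feature no matter where it appears, so its response pattern simply translates with the input.

```python
import numpy as np

np.random.seed(0)
w = np.random.randn(3, 3)  # one shared 3x3 filter (single channel for brevity)

def conv2d(x, w):
    """Stride-1, no-padding correlation of a single-channel input with w."""
    kh, kw = w.shape
    out = np.zeros((x.shape[0] - kh + 1, x.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * w)
    return out

# The same impulse ("feature") placed at two different positions...
a = np.zeros((8, 8)); a[2, 2] = 1.0
b = np.zeros((8, 8)); b[4, 4] = 1.0

# ...yields the same response pattern, just shifted: one set of weights
# detects the feature wherever it appears.
print(np.allclose(conv2d(a, w)[0:3, 0:3], conv2d(b, w)[2:5, 2:5]))  # -> True
```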