Lecture 5: Convolutional Neural Networks

Convolution Layer

Formal Definition

Convolution of two signals:

\[ f[x, y]*g[x, y] = \sum_{n_1 = - \infty}^{\infty}\sum_{n_2 = -\infty}^{\infty}f[n_1, n_2]g[x-n_1, y-n_2] \]

This is an elementwise multiplication between the filter and the signal (image), followed by a sum.
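To make this concrete, here is a minimal sketch using SciPy (the array values are arbitrary, chosen only for illustration). `scipy.signal.convolve2d` implements the flipped-filter definition above, while `correlate2d` slides the filter without flipping:

```python
import numpy as np
from scipy.signal import convolve2d, correlate2d

# A toy 4x4 "image" and a 2x2 filter (arbitrary values).
f = np.arange(16, dtype=float).reshape(4, 4)
g = np.array([[1.0, 0.0],
              [0.0, -1.0]])

# True convolution flips the filter before sliding it over the signal.
print(convolve2d(f, g, mode='valid'))

# Cross-correlation slides the unflipped filter; the two results differ
# unless the filter is symmetric under a 180-degree rotation.
print(correlate2d(f, g, mode='valid'))
```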

Our Usage

Suppose we have a \(32 \times 32 \times 3\) image and a \(5 \times 5 \times 3\) filter. We can perform convolution as follows:

  • Convolve the filter with the image, i.e. "slide over the image spatially, computing dot products". The convolution used here is a looser one than the formal definition: when convolving \(f[x, y]\) with \(g[x, y]\) we simply use matching indices directly (i.e. cross-correlation, with no flipping of the filter).
  • Filters always extend the full depth of the input volume, i.e. the filter's depth must match the input volume's depth.

We can apply multiple filters to the image; each filter produces one slice of the output, so the depth of the output volume equals the number of filters.

The result produced by each filter is called an activation map.
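A naive forward pass for one conv layer makes this concrete. The following is a minimal numpy sketch (stride 1, no padding); the function name and shapes are illustrative, not from the lecture:

```python
import numpy as np

def conv_forward_naive(x, w, b):
    """x: input volume (H, W, D); w: K filters (K, F, F, D); b: biases (K,).
    Returns the output volume (H-F+1, W-F+1, K) for stride 1, no padding."""
    H, W, D = x.shape
    K, F, _, _ = w.shape
    out = np.zeros((H - F + 1, W - F + 1, K))
    for k in range(K):                      # one activation map per filter
        for i in range(H - F + 1):
            for j in range(W - F + 1):
                # Dot product of the filter with a local patch, over the full depth.
                out[i, j, k] = np.sum(x[i:i+F, j:j+F, :] * w[k]) + b[k]
    return out

# A 32x32x3 image convolved with six 5x5x3 filters -> six 28x28 activation maps.
x = np.random.randn(32, 32, 3)
w = np.random.randn(6, 5, 5, 3)
b = np.zeros(6)
print(conv_forward_naive(x, w, b).shape)    # (28, 28, 6)
```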

Calculating the size of the output volume:

Suppose our Conv layer accepts an input volume of size \(W_1 \times H_1 \times D_1\), uses \(K\) filters of size \(F \times F \times D_1\) with stride \(S\) and padding \(P\), and produces an output volume of size \(W_2 \times H_2 \times D_2\). Then:

\[ W_2 = \frac{W_1 - F + 2P}{S} + 1 \]
\[ H_2 = \frac{H_1 - F + 2P}{S} + 1 \]
\[ D_2 = K \]
  • \(P\) is the padding: to better capture information at the edges and corners of the image, some extra pixels are added around the border; the most common scheme is zero padding.

  • \(S\) is the stride, i.e. the step size with which the filter slides over the image; see the helper sketched below.
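Putting the formulas together, a small helper function (the name and interface are mine, for illustration) computes the output shape:

```python
def conv_output_shape(W1, H1, D1, K, F, S, P):
    """Output (W2, H2, D2) of a conv layer, per the formulas above.
    D1 is the input depth; each filter must also have depth D1, and D2 = K."""
    W2 = (W1 - F + 2 * P) // S + 1
    H2 = (H1 - F + 2 * P) // S + 1
    return W2, H2, K

# 32x32x3 input, ten 5x5 filters, stride 1, pad 2 -> spatial size is preserved.
print(conv_output_shape(32, 32, 3, K=10, F=5, S=1, P=2))   # (32, 32, 10)
```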

Example

When counting parameters, do not forget that each filter also has a bias term.
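For instance, suppose (as an illustrative example) a conv layer uses ten \(5 \times 5 \times 3\) filters. Each filter has \(5 \cdot 5 \cdot 3 = 75\) weights plus 1 bias, so the layer has

\[ 10 \times (5 \cdot 5 \cdot 3 + 1) = 10 \times 76 = 760 \]

parameters in total.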

Fully Connected Layer

A convolutional neural network does not consist of Conv layers alone; it also contains one or more FC layers at the end. For example, in an image classification problem, the FC layer after the conv layers produces a prediction score for each class.
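Concretely, an FC layer flattens the final conv/pool volume into a vector and applies an affine map. A minimal numpy sketch (all shapes and names are illustrative, not from the lecture):

```python
import numpy as np

# Final pooled volume, e.g. 4x4x64, flattened into a single vector.
volume = np.random.randn(4, 4, 64)
x = volume.reshape(-1)                      # shape (1024,)

num_classes = 10
W = np.random.randn(num_classes, x.size)    # one row of weights per class
b = np.zeros(num_classes)

scores = W @ x + b                          # one prediction score per class
print(scores.shape)                         # (10,)
```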

Pooling Layer

  • makes the representation smaller and more manageable.
  • operates over each activation map independently.

The most common pooling method is max pooling: take the maximum value within each window of each activation map. Typically the windows do not overlap (stride = window size).
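A minimal sketch of 2×2 max pooling with stride 2 (non-overlapping windows), assuming the spatial size divides evenly; the function name is mine:

```python
import numpy as np

def max_pool_2x2(x):
    """x: volume (H, W, D) with H, W even. Returns (H//2, W//2, D)."""
    H, W, D = x.shape
    # Split each spatial axis into (blocks, 2) and take the max per 2x2 window,
    # independently for every depth slice.
    return x.reshape(H // 2, 2, W // 2, 2, D).max(axis=(1, 3))

x = np.random.randn(8, 8, 16)
print(max_pool_2x2(x).shape)   # (4, 4, 16) -- depth is unchanged
```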

  • The volume produced by pooling has the same depth as the input volume.
  • Introduces zero parameters, since it computes a fixed function of the input.
  • Note that it is not common to use zero-padding for pooling layers.

Convolutional Neural Networks Hierarchy

  • ConvNets stack CONV, POOL and FC layers.
  • Trend towards smaller filters and deeper architectures.
  • Trend towards getting rid of POOL/FC layers (just CONV).
  • Typical architectures look like:
\[ \text{[(CONV-RELU)*N -> POOL]*M -> (FC-RELU)*K -> softmax} \]

where \(N\) is usually up to ~5, \(M\) is large, and \(0 \le K \le 2\).
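As a sketch of this pattern with \(N = 2\), \(M = 2\), \(K = 1\) (written in PyTorch purely for illustration; the layer sizes are arbitrary and not from the lecture):

```python
import torch.nn as nn

# [(CONV-RELU)*2 -> POOL]*2 -> (FC-RELU)*1 -> class scores for softmax
model = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),                       # 32x32 -> 16x16
    nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 64, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),                       # 16x16 -> 8x8
    nn.Flatten(),
    nn.Linear(64 * 8 * 8, 256), nn.ReLU(),
    nn.Linear(256, 10),                    # class scores, fed to a softmax loss
)
```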

Intuition of Activation Maps

Throughout a CNN, from the input image to the final FC layer, the network produces a series of activation maps, which can be viewed as the features the CNN extracts from the image. Generally, earlier activation maps capture relatively simple low-level features, while later activation maps capture more complex high-level features.

In the visualized example, 32 filters are applied to the input image; in the resulting activation maps we can see some features of the image (edges, curves, etc.), and each filter extracts a somewhat different feature.