
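In this post, we're going to take a deep dive into Convolutional Neural Networks (CNNs): how to train a CNN (including deriving the gradients), implementing backprop from scratch (using only numpy), and ultimately building a full training pipeline!

This post assumes a basic knowledge of CNNs. If you don't have that, see CNN01 first.

Parts of this post also assume a basic knowledge of multivariable calculus. You can skip those sections if you want, but I recommend reading them anyway, since they help when writing the code.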

Getting Started

We'll pick up where the previous post left off. We were using a CNN to tackle the MNIST handwritten digit classification problem:

Sample images from the MNIST dataset

Our (simple) CNN consisted of a Conv layer, a Max Pooling layer, and a Softmax layer. Here's that diagram of our CNN again:

We wrote 3 classes, one for each layer: Conv3x3, MaxPool2, and Softmax. Each class implemented a forward() method that we used to build the forward pass of the CNN:

# Header: cnn.py
conv = Conv3x3(8)                  # 28x28x1 -> 26x26x8
pool = MaxPool2()                  # 26x26x8 -> 13x13x8
softmax = Softmax(13 * 13 * 8, 10) # 13x13x8 -> 10

def forward(image, label):
  '''
  Completes a forward pass of the CNN and calculates the accuracy and
  cross-entropy loss.
  - image is a 2d numpy array
  - label is a digit
  '''
  # We transform the image from [0, 255] to [-0.5, 0.5] to make it easier
  # to work with. This is standard practice.
  out = conv.forward((image / 255) - 0.5)
  out = pool.forward(out)
  out = softmax.forward(out)

  # Calculate cross-entropy loss and accuracy. np.log() is the natural log.
  loss = -np.log(out[label])
  acc = 1 if np.argmax(out) == label else 0

  return out, loss, acc

You can view the code or run the CNN in your browser. It's also available on Github.

Here's what the output of our CNN looks like right now:

MNIST CNN initialized!
[Step 100] Past 100 steps: Average Loss 2.302 | Accuracy: 11%
[Step 200] Past 100 steps: Average Loss 2.302 | Accuracy: 8%
[Step 300] Past 100 steps: Average Loss 2.302 | Accuracy: 3%
[Step 400] Past 100 steps: Average Loss 2.302 | Accuracy: 12%

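Clearly, we'd like to do better than 10% accuracy... so let's train the CNN.

Training Overview

Training a neural network typically consists of two phases:

  1. A forward phase, where the input is passed completely through the network.
  2. A backward phase, where gradients are backpropagated (backprop) and weights are updated.

We'll follow this pattern to train our CNN. There are also two major implementation-specific ideas we'll use:

  • During the forward phase, each layer will cache any data (like inputs, intermediate values, etc.) it'll need for the backward phase. This means that any backward phase must be preceded by a corresponding forward phase.
  • During the backward phase, each layer will receive a gradient and also return a gradient. It will receive the gradient of loss with respect to its outputs ($\frac{\partial L}{\partial \text{out}}$) and return the gradient of loss with respect to its inputs ($\frac{\partial L}{\partial \text{in}}$).

These two ideas will help keep our training implementation clean and organized. The best way to see why is probably by looking at code: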

# Feed forward
out = conv.forward((image / 255) - 0.5)
out = pool.forward(out)
out = softmax.forward(out)

# Calculate initial gradient
gradient = np.zeros(10)
# ...

# Backprop
gradient = softmax.backprop(gradient)
gradient = pool.backprop(gradient)
gradient = conv.backprop(gradient)

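See how nice and clean that looks? Now imagine building a network with 50 layers instead of 3: having a good system in place becomes even more valuable then.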
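The two ideas above can be sketched with a toy layer. The `Scale` class below is hypothetical and not part of this CNN: it multiplies its input by a single learnable weight, caches the input during forward(), and in backprop() receives the loss gradient for its output and returns the loss gradient for its input.

```python
import numpy as np

class Scale:
  '''Hypothetical toy layer (not part of the CNN): out = w * input.'''

  def __init__(self, w):
    self.w = w

  def forward(self, input):
    # Cache what the backward phase will need.
    self.last_input = input
    return self.w * input

  def backprop(self, d_L_d_out, learn_rate):
    # dL/dw = sum(dL/dout * dout/dw), where dout/dw = input
    d_L_d_w = np.sum(d_L_d_out * self.last_input)
    # dL/din = dL/dout * dout/din, where dout/din = w (before the update)
    d_L_d_input = d_L_d_out * self.w
    # SGD update of the layer's only weight
    self.w -= learn_rate * d_L_d_w
    return d_L_d_input

layer = Scale(2.0)
out = layer.forward(np.array([1.0, 3.0]))
grad_in = layer.backprop(np.array([1.0, 1.0]), learn_rate=0.1)
print(out, grad_in, layer.w)
```

Because every layer exposes the same forward()/backprop() pair, chaining many layers only requires calling them in order and then in reverse.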

Backprop: Softmax

First, recall the cross-entropy loss:

$$ L = -\ln(p_c) $$

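where $p_c$ is the predicted probability for the correct class $c$ (in other words, the digit our current image actually is).

Want a longer explanation? Read the Cross-Entropy Loss section of the previous post.

The first thing we need to calculate is the input to the Softmax layer's backward phase, $\frac{\partial L}{\partial out_s}$, where $out_s$ is the output from the Softmax layer: a vector of 10 probabilities. This is pretty easy, since only $p_c$ shows up in the loss equation: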

$$ \frac{\partial L}{\partial out_s(i)} = \begin{cases} 0 & \text{if $i \neq c$} \\ -\frac{1}{p_i} & \text{if $i = c$} \\ \end{cases} $$

Reminder: $c$ is the correct class.

That's the initial gradient you saw referenced above:

# Calculate initial gradient
gradient = np.zeros(10)
gradient[label] = -1 / out[label]
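As a quick sanity check, the single nonzero entry can be compared against a numerical derivative of $-\ln(p_c)$. This is only a sketch: the 3-class output vector and the label below are made-up values, not real CNN output.

```python
import numpy as np

out = np.array([0.1, 0.2, 0.7])  # assumed softmax output for a toy 3-class problem
label = 2                        # assumed correct class

gradient = np.zeros(3)
gradient[label] = -1 / out[label]

# Central finite difference of L = -ln(p_c) with respect to p_c
eps = 1e-6
numeric = (-np.log(out[label] + eps) + np.log(out[label] - eps)) / (2 * eps)
print(gradient[label], numeric)  # both are close to -1 / 0.7
```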

We're almost ready to implement our first backward phase. We just need to first perform the forward phase caching we discussed earlier:

# Header: softmax.py
class Softmax:
  # ...

  def forward(self, input):
    '''
    Performs a forward pass of the softmax layer using the given input.
    Returns a 1d numpy array containing the respective probability values.
    - input can be any array with any dimensions.
    '''
    self.last_input_shape = input.shape # highlight-line

    input = input.flatten()
    self.last_input = input # highlight-line

    input_len, nodes = self.weights.shape

    totals = np.dot(input, self.weights) + self.biases
    self.last_totals = totals # highlight-line

    exp = np.exp(totals)
    return exp / np.sum(exp, axis=0)

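We cache 3 things here that will be useful for implementing the backward phase:

  • The input's shape before we flatten it.
  • The input after we flatten it.
  • The totals, which are the values passed in to the softmax activation.

With that out of the way, we can start deriving the gradients for the backprop phase. We've already derived the input to the Softmax backward phase: $\frac{\partial L}{\partial out_s}$. One fact we can use about $\frac{\partial L}{\partial out_s}$ is that it's only nonzero for $c$, the correct class. That means that we can ignore everything but $out_s(c)$!

First, let's calculate the gradient of $out_s(c)$ with respect to the totals (the values passed in to the softmax activation). Let $t_i$ be the total for class $i$. Then we can write $out_s(c)$ as: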

$$ out_s(c) = \frac{e^{t_c}}{\sum_i e^{t_i}} = \frac{e^{t_c}}{S} $$

where $S = \sum_i e^{t_i}$.

Need a refresher on Softmax? Read my simple introduction to Softmax.

Now, consider some class $k$ such that $k \neq c$. We can rewrite $out_s(c)$ as:

$$ out_s(c) = e^{t_c} S^{-1} $$

and use the chain rule to derive:

$$ \begin{aligned} \frac{\partial out_s(c)}{\partial t_k} &= \frac{\partial out_s(c)}{\partial S} (\frac{\partial S}{\partial t_k}) \\ &= -e^{t_c} S^{-2} (\frac{\partial S}{\partial t_k}) \\ &= -e^{t_c} S^{-2} (e^{t_k}) \\ &= \boxed{\frac{-e^{t_c} e^{t_k}}{S^2}} \\ \end{aligned} $$

Remember, that was assuming $k \neq c$. Now let's do the derivation for $c$, this time using the quotient rule (because we have an $e^{t_c}$ in the numerator of $out_s(c)$):

$$ \begin{aligned} \frac{\partial out_s(c)}{\partial t_c} &= \frac{S e^{t_c} - e^{t_c} \frac{\partial S}{\partial t_c}}{S^2} \\ &= \frac{Se^{t_c} - e^{t_c}e^{t_c}}{S^2} \\ &= \boxed{\frac{e^{t_c} (S - e^{t_c})}{S^2}} \\ \end{aligned} $$

Phew. That was the hardest part of this post; it only gets easier from here!
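If you'd like to convince yourself that the two boxed results are right, one option is a finite-difference check. The sketch below uses made-up totals for a toy 3-class problem and a standalone `softmax` helper (an assumption for illustration, separate from the Softmax class).

```python
import numpy as np

def softmax(t):
  exp = np.exp(t)
  return exp / np.sum(exp)

t = np.array([1.0, 2.0, 0.5])  # assumed totals for a toy 3-class problem
c = 1                          # assumed correct class

t_exp = np.exp(t)
S = np.sum(t_exp)

# Analytic gradients of out_s(c) against every total, from the boxed results
analytic = -t_exp[c] * t_exp / (S ** 2)              # case k != c
analytic[c] = t_exp[c] * (S - t_exp[c]) / (S ** 2)   # case k == c

# Central finite differences of out_s(c) with respect to each t_k
eps = 1e-6
numeric = np.zeros_like(t)
for k in range(len(t)):
  step = np.zeros_like(t)
  step[k] = eps
  numeric[k] = (softmax(t + step)[c] - softmax(t - step)[c]) / (2 * eps)

print(np.allclose(analytic, numeric))
```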

Let's start implementing this:

# Header: softmax.py
class Softmax:
  # ...

  def backprop(self, d_L_d_out):
    '''
    Performs a backward pass of the softmax layer.
    Returns the loss gradient for this layer's inputs.
    - d_L_d_out is the loss gradient for this layer's outputs.
    '''
    # We know only 1 element of d_L_d_out will be nonzero
    for i, gradient in enumerate(d_L_d_out):
      if gradient == 0:
        continue

      # e^totals
      t_exp = np.exp(self.last_totals)

      # Sum of all e^totals
      S = np.sum(t_exp)

      # Gradients of out[i] against totals
      d_out_d_t = -t_exp[i] * t_exp / (S ** 2)
      d_out_d_t[i] = t_exp[i] * (S - t_exp[i]) / (S ** 2)

      # ... to be continued

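Remember how $\frac{\partial L}{\partial out_s}$ is only nonzero for the correct class, $c$? We start by looking for $c$ by looking for a nonzero gradient in d_L_d_out. Once we find that, we calculate the gradient $\frac{\partial out_s(c)}{\partial t}$ (d_out_d_t) using the results we derived above:

$$ \frac{\partial out_s(c)}{\partial t_k} = \begin{cases} \frac{-e^{t_c} e^{t_k}}{S^2} & \text{if $k \neq c$} \\ \frac{e^{t_c} (S - e^{t_c})}{S^2} & \text{if $k = c$} \\ \end{cases} $$

Let's keep going. We ultimately want the gradients of loss against weights, biases, and input:

  • We'll use the weights gradient, $\frac{\partial L}{\partial w}$, to update our layer's weights.
  • We'll use the biases gradient, $\frac{\partial L}{\partial b}$, to update our layer's biases.
  • We'll return the input gradient, $\frac{\partial L}{\partial input}$, from our backprop() method so the next layer in the backward pass can use it. This is the return gradient we talked about in the Training Overview section!

To calculate those 3 loss gradients, we first need to derive 3 more results: the gradients of totals against weights, biases, and input. The relevant equation here is: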

$$ t = w * input + b $$

$$ \frac{\partial t}{\partial w} = input $$

$$ \frac{\partial t}{\partial b} = 1 $$

$$ \frac{\partial t}{\partial input} = w $$

Putting everything together:

$$ \frac{\partial L}{\partial w} = \frac{\partial L}{\partial out} * \frac{\partial out}{\partial t} * \frac{\partial t}{\partial w} $$

$$ \frac{\partial L}{\partial b} = \frac{\partial L}{\partial out} * \frac{\partial out}{\partial t} * \frac{\partial t}{\partial b} $$

$$ \frac{\partial L}{\partial input} = \frac{\partial L}{\partial out} * \frac{\partial out}{\partial t} * \frac{\partial t}{\partial input} $$
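A side observation that follows from the results so far: since $\frac{\partial L}{\partial out_s(c)} = -\frac{1}{p_c}$, the product $\frac{\partial L}{\partial out} * \frac{\partial out}{\partial t}$ simplifies to $p_k$ for $k \neq c$ and $p_c - 1$ for $k = c$, that is, the softmax output minus a one-hot vector. The sketch below checks this numerically with made-up totals; the variable names are chosen for illustration.

```python
import numpy as np

totals = np.array([0.3, -1.2, 2.0, 0.1])  # assumed totals for a toy 4-class problem
c = 2                                     # assumed correct class

t_exp = np.exp(totals)
S = np.sum(t_exp)
probs = t_exp / S

# dL/dout_s(c): the only nonzero entry of the incoming gradient
gradient = -1 / probs[c]

# dout_s(c)/dt, using the two cases derived earlier
d_out_d_t = -t_exp[c] * t_exp / (S ** 2)
d_out_d_t[c] = t_exp[c] * (S - t_exp[c]) / (S ** 2)

# dL/dt = dL/dout * dout/dt
d_L_d_t = gradient * d_out_d_t

one_hot = np.zeros_like(probs)
one_hot[c] = 1
print(np.allclose(d_L_d_t, probs - one_hot))
```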

In code:

# Header: softmax.py
class Softmax:
  # ...

  def backprop(self, d_L_d_out):
    '''
    Performs a backward pass of the softmax layer.
    Returns the loss gradient for this layer's inputs.
    - d_L_d_out is the loss gradient for this layer's outputs.
    '''
    # We know only 1 element of d_L_d_out will be nonzero
    for i, gradient in enumerate(d_L_d_out):
      if gradient == 0:
        continue

      # e^totals
      t_exp = np.exp(self.last_totals)

      # Sum of all e^totals
      S = np.sum(t_exp)

      # Gradients of out[i] against totals
      d_out_d_t = -t_exp[i] * t_exp / (S ** 2)
      d_out_d_t[i] = t_exp[i] * (S - t_exp[i]) / (S ** 2)

      # highlight-start
      # Gradients of totals against weights/biases/input
      d_t_d_w = self.last_input
      d_t_d_b = 1
      d_t_d_inputs = self.weights
 
      # Gradients of loss against totals
      d_L_d_t = gradient * d_out_d_t
 
      # Gradients of loss against weights/biases/input
      d_L_d_w = d_t_d_w[np.newaxis].T @ d_L_d_t[np.newaxis]
      d_L_d_b = d_L_d_t * d_t_d_b
      d_L_d_inputs = d_t_d_inputs @ d_L_d_t
      # highlight-end

      # ... to be continued

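First, we pre-calculate d_L_d_t since we'll use it several times. Then, we calculate each gradient:

  • d_L_d_w: We need 2d arrays to do matrix multiplication (@), but d_t_d_w and d_L_d_t are 1d arrays. np.newaxis lets us easily create a new axis of length one, so we end up multiplying matrices with dimensions (input_len, 1) and (1, nodes). Thus, the final result for d_L_d_w will have shape (input_len, nodes), which is the same as self.weights!
  • d_L_d_b: This one is straightforward, since d_t_d_b is 1.
  • d_L_d_inputs: We multiply matrices with dimensions (input_len, nodes) and (nodes, 1) to get a result with length input_len.

Working through small examples of the calculations above is the best way to understand why the code looks the way it does.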
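Here is one such small example, a sketch with made-up sizes (input_len = 4, nodes = 3) and stand-in arrays in place of self.last_input and self.weights. It shows that the np.newaxis product is an outer product with the same shape as the weights.

```python
import numpy as np

input_len, nodes = 4, 3                         # small assumed sizes
last_input = np.arange(input_len, dtype=float)  # stands in for self.last_input
weights = np.ones((input_len, nodes))           # stands in for self.weights
d_L_d_t = np.array([0.1, -0.2, 0.3])            # assumed loss gradient against totals

# (4, 1) @ (1, 3) -> (4, 3), the same shape as the weights
d_L_d_w = last_input[np.newaxis].T @ d_L_d_t[np.newaxis]
# (4, 3) @ (3,) -> (4,), one value per flattened input element
d_L_d_inputs = weights @ d_L_d_t

print(d_L_d_w.shape)       # (4, 3)
print(d_L_d_inputs.shape)  # (4,)
print(np.allclose(d_L_d_w, np.outer(last_input, d_L_d_t)))
```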

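With all the gradients computed, all that's left is to actually train the Softmax layer! Just as we did in the introduction to neural networks, we'll update the weights and biases using Stochastic Gradient Descent (SGD) and then return d_L_d_inputs: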

# Header: softmax.py
class Softmax:
  # ...

  def backprop(self, d_L_d_out, learn_rate): # highlight-line
    '''
    Performs a backward pass of the softmax layer.
    Returns the loss gradient for this layer's inputs.
    - d_L_d_out is the loss gradient for this layer's outputs.
    - learn_rate is a float # highlight-line
    '''
    # We know only 1 element of d_L_d_out will be nonzero
    for i, gradient in enumerate(d_L_d_out):
      if gradient == 0:
        continue

      # e^totals
      t_exp = np.exp(self.last_totals)

      # Sum of all e^totals
      S = np.sum(t_exp)

      # Gradients of out[i] against totals
      d_out_d_t = -t_exp[i] * t_exp / (S ** 2)
      d_out_d_t[i] = t_exp[i] * (S - t_exp[i]) / (S ** 2)

      # Gradients of totals against weights/biases/input
      d_t_d_w = self.last_input
      d_t_d_b = 1
      d_t_d_inputs = self.weights

      # Gradients of loss against totals
      d_L_d_t = gradient * d_out_d_t

      # Gradients of loss against weights/biases/input
      d_L_d_w = d_t_d_w[np.newaxis].T @ d_L_d_t[np.newaxis]
      d_L_d_b = d_L_d_t * d_t_d_b
      d_L_d_inputs = d_t_d_inputs @ d_L_d_t

      # highlight-start
      # Update weights / biases
      self.weights -= learn_rate * d_L_d_w
      self.biases -= learn_rate * d_L_d_b
 
      return d_L_d_inputs.reshape(self.last_input_shape)
      # highlight-end

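Notice that we added a learn_rate parameter that controls how fast we update our weights. Also, we have to reshape() before returning d_L_d_inputs because we flattened the input during our forward pass: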

# Header: softmax.py
class Softmax:
  # ...

  def forward(self, input):
    '''
    Performs a forward pass of the softmax layer using the given input.
    Returns a 1d numpy array containing the respective probability values.
    - input can be any array with any dimensions.
    '''
    self.last_input_shape = input.shape

    input = input.flatten() # highlight-line
    self.last_input = input

    # ...

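Reshaping to last_input_shape ensures that this layer returns gradients for its input in the same format that the input was originally given to it.

Test Drive: Softmax Backprop

We've finished our first backprop implementation! Let's quickly test it to see if it's any good. We'll start by adding a train() function to the cnn.py code from the previous post: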

# Header: cnn.py
# Imports and setup here
# ...

def forward(image, label):
  # Implementation excluded
  # ...

def train(im, label, lr=.005):
  '''
  Completes a full training step on the given image and label.
  Returns the cross-entropy loss and accuracy.
  - image is a 2d numpy array
  - label is a digit
  - lr is the learning rate
  '''
  # Forward
  out, loss, acc = forward(im, label)

  # Calculate initial gradient
  gradient = np.zeros(10)
  gradient[label] = -1 / out[label]

  # Backprop
  gradient = softmax.backprop(gradient, lr)
  # TODO: backprop MaxPool2 layer
  # TODO: backprop Conv3x3 layer

  return loss, acc

print('MNIST CNN initialized!')

# Train!
loss = 0
num_correct = 0
for i, (im, label) in enumerate(zip(train_images, train_labels)):
  # Train on this example first so that each report covers exactly 100 steps
  l, acc = train(im, label)
  loss += l
  num_correct += acc

  if i % 100 == 99:
    print(
      '[Step %d] Past 100 steps: Average Loss %.3f | Accuracy: %d%%' %
      (i + 1, loss / 100, num_correct)
    )
    loss = 0
    num_correct = 0

Running this gives results similar to:

MNIST CNN initialized!
[Step 100] Past 100 steps: Average Loss 2.239 | Accuracy: 18%
[Step 200] Past 100 steps: Average Loss 2.140 | Accuracy: 32%
[Step 300] Past 100 steps: Average Loss 1.998 | Accuracy: 48%
[Step 400] Past 100 steps: Average Loss 1.861 | Accuracy: 59%
[Step 500] Past 100 steps: Average Loss 1.789 | Accuracy: 56%
[Step 600] Past 100 steps: Average Loss 1.809 | Accuracy: 48%
[Step 700] Past 100 steps: Average Loss 1.718 | Accuracy: 63%
[Step 800] Past 100 steps: Average Loss 1.588 | Accuracy: 69%
[Step 900] Past 100 steps: Average Loss 1.509 | Accuracy: 71%
[Step 1000] Past 100 steps: Average Loss 1.481 | Accuracy: 70%

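The loss is going down and the accuracy is going up: our CNN is already learning!

Backprop: Max Pooling

A Max Pooling layer can't be trained because it doesn't actually have any weights, but we still need to implement a backprop() method for it to calculate gradients. We'll start by adding forward phase caching again. All we need to cache this time is the input: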

# Header: maxpool.py
class MaxPool2:
  # ...

  def forward(self, input):
    '''
    Performs a forward pass of the maxpool layer using the given input.
    Returns a 3d numpy array with dimensions (h / 2, w / 2, num_filters).
    - input is a 3d numpy array with dimensions (h, w, num_filters)
    '''
    self.last_input = input #highlight-line

    # More implementation
    # ...

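During the forward pass, the Max Pooling layer takes an input volume and halves its width and height dimensions by picking the max values over 2x2 blocks. The backward pass does the opposite: we'll double the width and height of the loss gradient by assigning each gradient value to where the original max value was in its corresponding 2x2 block.

Here's an example. Consider this forward phase for a Max Pooling layer:

An example forward phase that transforms a 4x4 input to a 2x2 output

The backward phase of that same layer would look like this:

An example backward phase that transforms a 2x2 gradient to a 4x4 gradient

Each gradient value is assigned to where the original max value was, and every other value is zero.

Why does the backward phase for a Max Pooling layer work like this? Think about what $\frac{\partial L}{\partial inputs}$ intuitively should be. An input pixel that isn't the max value in its 2x2 block would have zero marginal effect on the loss, because changing that value slightly wouldn't change the output at all! In other words, $\frac{\partial L}{\partial input} = 0$ for non-max pixels. On the other hand, an input pixel that is the max value would have its value passed through to the output, so $\frac{\partial output}{\partial input} = 1$, meaning $\frac{\partial L}{\partial input} = \frac{\partial L}{\partial output}$.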
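To make the routing concrete, here is a small single-channel sketch. The 4x4 input and the 2x2 upstream gradient are made-up values. Note one simplification: np.argmax picks only the first maximum when a block contains ties, which is an assumption of this sketch rather than a property of max pooling itself.

```python
import numpy as np

# Assumed 4x4 single-channel input and 2x2 upstream gradient
x = np.array([[1., 5., 2., 0.],
              [3., 4., 7., 6.],
              [9., 8., 1., 2.],
              [0., 2., 3., 4.]])
d_L_d_out = np.array([[10., 20.],
                      [30., 40.]])

d_L_d_input = np.zeros_like(x)
for i in range(2):
  for j in range(2):
    region = x[i * 2:i * 2 + 2, j * 2:j * 2 + 2]
    # Position of the max inside this 2x2 block
    i2, j2 = np.unravel_index(np.argmax(region), region.shape)
    # Route the block's gradient to that position; everything else stays 0
    d_L_d_input[i * 2 + i2, j * 2 + j2] = d_L_d_out[i, j]

print(d_L_d_input)
# [[ 0. 10.  0.  0.]
#  [ 0.  0. 20.  0.]
#  [30.  0.  0.  0.]
#  [ 0.  0.  0. 40.]]
```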

We can implement this pretty quickly using the iterate_regions() helper method we wrote in Part 1:

# Header: maxpool.py
class MaxPool2:
  # ...

  def iterate_regions(self, image):
    '''
    Generates non-overlapping 2x2 image regions to pool over.
    - image is a 3d numpy array with dimensions (h, w, num_filters)
    '''
    h, w, _ = image.shape
    new_h = h // 2
    new_w = w // 2

    for i in range(new_h):
      for j in range(new_w):
        im_region = image[(i * 2):(i * 2 + 2), (j * 2):(j * 2 + 2)]
        yield im_region, i, j

  def backprop(self, d_L_d_out):
    '''
    Performs a backward pass of the maxpool layer.
    Returns the loss gradient for this layer's inputs.
    - d_L_d_out is the loss gradient for this layer's outputs.
    '''
    d_L_d_input = np.zeros(self.last_input.shape)

    for im_region, i, j in self.iterate_regions(self.last_input):
      h, w, f = im_region.shape
      amax = np.amax(im_region, axis=(0, 1))

      for i2 in range(h):
        for j2 in range(w):
          for f2 in range(f):
            # If this pixel was the max value, copy the gradient to it.
            if im_region[i2, j2, f2] == amax[f2]:
              d_L_d_input[i * 2 + i2, j * 2 + j2, f2] = d_L_d_out[i, j, f2]

    return d_L_d_input

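For each pixel in each 2x2 image region in each filter, we copy the gradient from d_L_d_out to d_L_d_input if it was the max value during the forward pass.

That's it! On to our final layer.

Backprop: Conv

Backpropagating through a Conv layer is the core of training a CNN. The forward phase caching is simple: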

# Header: conv.py
class Conv3x3:
  # ...

  def forward(self, input):
    '''
    Performs a forward pass of the conv layer using the given input.
    Returns a 3d numpy array with dimensions (h, w, num_filters).
    - input is a 2d numpy array
    '''
    self.last_input = input # highlight-line

    # More implementation
    # ...

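A reminder about our implementation: for simplicity, we assume the input to our conv layer is a 2d array. This only works for us because we use it as the first layer in our network. If we were building a bigger network that needed to use Conv3x3 multiple times, we'd have to make the input a 3d array.

We're primarily interested in the loss gradient for the filters in our conv layer, since we need that to update our filter weights. We already have $\frac{\partial L}{\partial out}$ for the conv layer, so we just need $\frac{\partial out}{\partial filters}$. To calculate that, we ask ourselves this: how would changing a filter's weight affect the conv layer's output?

The reality is that changing any filter weight would affect the entire output image for that filter, since every output pixel uses every filter weight during convolution. To make this easier to think about, let's consider just one output pixel at a time: how would modifying a filter change the output of one specific output pixel?

Here's a super simple example to help think about this question:

A 3x3 image (left) convolved with a 3x3 filter (middle) to produce a 1x1 output (right)

What if we increased the center filter weight by 1? The output would increase by the value of the center image pixel, 80.

Similarly, increasing any other filter weight by 1 would increase the output by the value of the corresponding image pixel! This suggests that the derivative of a specific output pixel with respect to a specific filter weight is just the corresponding image pixel value. Doing the math confirms this: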

$$ \begin{aligned} \text{out(i, j)} &= \text{convolve(image, filter)} \\ &= \sum_{x=0}^2 \sum_{y=0}^2 \text{image}(i + x, j + y) * \text{filter}(x, y) \\ \end{aligned} $$

$$ \frac{\partial \text{out}(i, j)}{\partial \text{filter}(x, y)} = \text{image}(i + x, j + y) $$
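The same result can be seen numerically. In the sketch below, the 3x3 image is made up (only its center value of 80 comes from the example), the starting filter of all zeros is an assumption, and `convolve` is a hypothetical helper for the single-output case. Bumping each filter weight by 1 changes the output by exactly the matching image pixel.

```python
import numpy as np

# Assumed 3x3 image whose center pixel is 80; the other values are made up
image = np.array([[ 0., 50.,  0.],
                  [30., 80., 31.],
                  [20., 60., 10.]])
filt = np.zeros((3, 3))  # assumed starting filter: all zeros

def convolve(image, filt):
  # A 3x3 image with a 3x3 filter gives a single output value
  return np.sum(image * filt)

base = convolve(image, filt)

# Increase each filter weight by 1 in turn and record the change in output
change = np.zeros((3, 3))
for x in range(3):
  for y in range(3):
    bumped = filt.copy()
    bumped[x, y] += 1
    change[x, y] = convolve(image, bumped) - base

print(np.array_equal(change, image))
```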

We can put it all together to find the loss gradient for specific filter weights:

$$ \begin{aligned} \frac{\partial L}{\partial \text{filter}(x, y)} &= \sum_i \sum_j \frac{\partial L}{\partial \text{out}(i, j)} * \frac{\partial \text{out}(i, j)}{\partial \text{filter}(x, y)} \end{aligned} $$

We're ready to implement backprop for our conv layer!

# Header: conv.py
class Conv3x3:
  # ...

  def backprop(self, d_L_d_out, learn_rate):
    '''
    Performs a backward pass of the conv layer.
    - d_L_d_out is the loss gradient for this layer's outputs.
    - learn_rate is a float.
    '''
    d_L_d_filters = np.zeros(self.filters.shape)

    for im_region, i, j in self.iterate_regions(self.last_input):
      for f in range(self.num_filters):
        d_L_d_filters[f] += d_L_d_out[i, j, f] * im_region

    # Update filters
    self.filters -= learn_rate * d_L_d_filters

    # We aren't returning anything here since we use Conv3x3 as
    # the first layer in our CNN. Otherwise, we'd need to return
    # the loss gradient for this layer's inputs, just like every
    # other layer in our CNN.
    return None

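We apply our derived equation by iterating over every image region / filter and incrementally building the loss gradients. Once we've covered everything, we update self.filters using SGD just as before. Note the comment explaining why we return None: the derivation of the loss gradient for the inputs is very similar to what we just did and is left as an exercise to the reader :).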
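For readers who want to check their answer to that exercise, here is one possible sketch, not the author's implementation. The standalone functions `conv_forward` and `conv_input_gradient` are hypothetical names, the sizes are made up, and the result is verified against finite differences using the toy loss L = sum(out * g), for which the gradient of L against out is g.

```python
import numpy as np

def conv_forward(image, filters):
  # image: (h, w); filters: (num_filters, 3, 3); output: (h - 2, w - 2, num_filters)
  h, w = image.shape
  out = np.zeros((h - 2, w - 2, filters.shape[0]))
  for i in range(h - 2):
    for j in range(w - 2):
      out[i, j] = np.sum(image[i:i + 3, j:j + 3] * filters, axis=(1, 2))
  return out

def conv_input_gradient(d_L_d_out, filters, input_shape):
  # Each pixel in region (i, j) contributed pixel * filters[f] to out[i, j, f],
  # so it receives d_L_d_out[i, j, f] * filters[f] back.
  d_L_d_input = np.zeros(input_shape)
  h, w = input_shape
  for i in range(h - 2):
    for j in range(w - 2):
      for f in range(filters.shape[0]):
        d_L_d_input[i:i + 3, j:j + 3] += d_L_d_out[i, j, f] * filters[f]
  return d_L_d_input

# Small random problem (assumed sizes: 5x5 image, 2 filters)
rng = np.random.default_rng(0)
image = rng.normal(size=(5, 5))
filters = rng.normal(size=(2, 3, 3))
g = rng.normal(size=(3, 3, 2))  # plays the role of d_L_d_out

analytic = conv_input_gradient(g, filters, image.shape)

# Central finite differences of L = sum(out * g) against each input pixel
eps = 1e-4
numeric = np.zeros_like(image)
for a in range(5):
  for b in range(5):
    step = np.zeros_like(image)
    step[a, b] = eps
    plus = np.sum(conv_forward(image + step, filters) * g)
    minus = np.sum(conv_forward(image - step, filters) * g)
    numeric[a, b] = (plus - minus) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-6))
```

A layer that is not the first in the network would return this array from backprop() instead of None.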

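With that, we're done! Time to test it out...

Training a CNN

We'll train our CNN for a few epochs, track its progress during training, and then test it on a separate test set. Here's the full code: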

# Header: cnn.py
import mnist
import numpy as np
from conv import Conv3x3
from maxpool import MaxPool2
from softmax import Softmax

# We only use the first 1k examples of each set in the interest of time.
# Feel free to change this if you want.
train_images = mnist.train_images()[:1000]
train_labels = mnist.train_labels()[:1000]
test_images = mnist.test_images()[:1000]
test_labels = mnist.test_labels()[:1000]

conv = Conv3x3(8)                  # 28x28x1 -> 26x26x8
pool = MaxPool2()                  # 26x26x8 -> 13x13x8
softmax = Softmax(13 * 13 * 8, 10) # 13x13x8 -> 10

def forward(image, label):
  '''
  Completes a forward pass of the CNN and calculates the accuracy and
  cross-entropy loss.
  - image is a 2d numpy array
  - label is a digit
  '''
  # We transform the image from [0, 255] to [-0.5, 0.5] to make it easier
  # to work with. This is standard practice.
  out = conv.forward((image / 255) - 0.5)
  out = pool.forward(out)
  out = softmax.forward(out)

  # Calculate cross-entropy loss and accuracy. np.log() is the natural log.
  loss = -np.log(out[label])
  acc = 1 if np.argmax(out) == label else 0

  return out, loss, acc

def train(im, label, lr=.005):
  '''
  Completes a full training step on the given image and label.
  Returns the cross-entropy loss and accuracy.
  - image is a 2d numpy array
  - label is a digit
  - lr is the learning rate
  '''
  # Forward
  out, loss, acc = forward(im, label)

  # Calculate initial gradient
  gradient = np.zeros(10)
  gradient[label] = -1 / out[label]

  # Backprop
  gradient = softmax.backprop(gradient, lr)
  gradient = pool.backprop(gradient)
  gradient = conv.backprop(gradient, lr)

  return loss, acc

print('MNIST CNN initialized!')

# Train the CNN for 3 epochs
for epoch in range(3):
  print('--- Epoch %d ---' % (epoch + 1))

  # Shuffle the training data
  permutation = np.random.permutation(len(train_images))
  train_images = train_images[permutation]
  train_labels = train_labels[permutation]

  # Train!
  loss = 0
  num_correct = 0
  for i, (im, label) in enumerate(zip(train_images, train_labels)):
    # Train on this example first so that each report covers exactly 100 steps
    l, acc = train(im, label)
    loss += l
    num_correct += acc

    if i % 100 == 99:
      print(
        '[Step %d] Past 100 steps: Average Loss %.3f | Accuracy: %d%%' %
        (i + 1, loss / 100, num_correct)
      )
      loss = 0
      num_correct = 0

# Test the CNN
print('\n--- Testing the CNN ---')
loss = 0
num_correct = 0
for im, label in zip(test_images, test_labels):
  _, l, acc = forward(im, label)
  loss += l
  num_correct += acc

num_tests = len(test_images)
print('Test Loss:', loss / num_tests)
print('Test Accuracy:', num_correct / num_tests)

Example output from running the code:

MNIST CNN initialized!
--- Epoch 1 ---
[Step 100] Past 100 steps: Average Loss 2.254 | Accuracy: 18%
[Step 200] Past 100 steps: Average Loss 2.167 | Accuracy: 30%
[Step 300] Past 100 steps: Average Loss 1.676 | Accuracy: 52%
[Step 400] Past 100 steps: Average Loss 1.212 | Accuracy: 63%
[Step 500] Past 100 steps: Average Loss 0.949 | Accuracy: 72%
[Step 600] Past 100 steps: Average Loss 0.848 | Accuracy: 74%
[Step 700] Past 100 steps: Average Loss 0.954 | Accuracy: 68%
[Step 800] Past 100 steps: Average Loss 0.671 | Accuracy: 81%
[Step 900] Past 100 steps: Average Loss 0.923 | Accuracy: 67%
[Step 1000] Past 100 steps: Average Loss 0.571 | Accuracy: 83%
--- Epoch 2 ---
[Step 100] Past 100 steps: Average Loss 0.447 | Accuracy: 89%
[Step 200] Past 100 steps: Average Loss 0.401 | Accuracy: 86%
[Step 300] Past 100 steps: Average Loss 0.608 | Accuracy: 81%
[Step 400] Past 100 steps: Average Loss 0.511 | Accuracy: 83%
[Step 500] Past 100 steps: Average Loss 0.584 | Accuracy: 89%
[Step 600] Past 100 steps: Average Loss 0.782 | Accuracy: 72%
[Step 700] Past 100 steps: Average Loss 0.397 | Accuracy: 84%
[Step 800] Past 100 steps: Average Loss 0.560 | Accuracy: 80%
[Step 900] Past 100 steps: Average Loss 0.356 | Accuracy: 92%
[Step 1000] Past 100 steps: Average Loss 0.576 | Accuracy: 85%
--- Epoch 3 ---
[Step 100] Past 100 steps: Average Loss 0.367 | Accuracy: 89%
[Step 200] Past 100 steps: Average Loss 0.370 | Accuracy: 89%
[Step 300] Past 100 steps: Average Loss 0.464 | Accuracy: 84%
[Step 400] Past 100 steps: Average Loss 0.254 | Accuracy: 95%
[Step 500] Past 100 steps: Average Loss 0.366 | Accuracy: 89%
[Step 600] Past 100 steps: Average Loss 0.493 | Accuracy: 89%
[Step 700] Past 100 steps: Average Loss 0.390 | Accuracy: 91%
[Step 800] Past 100 steps: Average Loss 0.459 | Accuracy: 87%
[Step 900] Past 100 steps: Average Loss 0.316 | Accuracy: 92%
[Step 1000] Past 100 steps: Average Loss 0.460 | Accuracy: 87%

--- Testing the CNN ---
Test Loss: 0.5979384893783474
Test Accuracy: 0.78

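Our code works! In only 3000 training steps, we went from a model with 2.3 loss and 10% accuracy to 0.6 loss and 78% accuracy.

Want to try or tinker with this code yourself? Run this CNN in your browser. It's also available on Github.

We only used a subset of the entire MNIST dataset for this example in the interest of time, since our CNN implementation isn't particularly fast. If we wanted to train a MNIST CNN for real, we'd use an ML library like Keras. To illustrate the power of our CNN, I used Keras to implement and train the exact same CNN we just built from scratch: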

# Header: cnn_keras.py
import numpy as np
import mnist
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Dense, Flatten
from keras.utils import to_categorical
from keras.optimizers import SGD

train_images = mnist.train_images()
train_labels = mnist.train_labels()
test_images = mnist.test_images()
test_labels = mnist.test_labels()

train_images = (train_images / 255) - 0.5
test_images = (test_images / 255) - 0.5

train_images = np.expand_dims(train_images, axis=3)
test_images = np.expand_dims(test_images, axis=3)

model = Sequential([
  Conv2D(8, 3, input_shape=(28, 28, 1), use_bias=False),
  MaxPooling2D(pool_size=2),
  Flatten(),
  Dense(10, activation='softmax'),
])

model.compile(SGD(lr=.005), loss='categorical_crossentropy', metrics=['accuracy'])

model.fit(
  train_images,
  to_categorical(train_labels),
  batch_size=1,
  epochs=3,
  validation_data=(test_images, to_categorical(test_labels)),
)

Running that code on the full MNIST dataset (60k training images) gives us results like this:

Epoch 1
loss: 0.2433 - acc: 0.9276 - val_loss: 0.1176 - val_acc: 0.9634
Epoch 2
loss: 0.1184 - acc: 0.9648 - val_loss: 0.0936 - val_acc: 0.9721
Epoch 3
loss: 0.0930 - acc: 0.9721 - val_loss: 0.0778 - val_acc: 0.9744

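We achieve 97.4% test accuracy with this simple CNN! With a better CNN architecture, we could improve that even more: in this official Keras MNIST CNN example, they achieve 99.25% accuracy after 12 epochs.

Not familiar with Keras? Read the tutorials on building your first neural network with Keras and on implementing and training a CNN with Keras.

All code from this post is available on Github.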

Last modification: October 24th, 2019 at 01:19 am