[Series, Part 15] Residual Networks, Maxout Networks, and Network in Network / 四六文摘

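Residual Networks

Residual networks were first proposed in "Deep Residual Learning for Image Recognition". With them the authors took first place in the ILSVRC 2015 ImageNet classification, detection, and localization tasks and in the COCO 2015 detection and image segmentation tasks, which also shows that ResNet is a fairly general framework.

Motivation for ResNet

I have always said that deep learning research is to a large extent an experimental science, and the work on ResNet illustrates this well. The question is: can we learn a better model simply by adding more layers? Experiments say no. As the number of layers grows, accuracy first saturates and then drops quickly. This phenomenon is called degradation.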

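The figure shows that on CIFAR-10 a 20-layer network clearly outperforms a 56-layer network on both the training set and the test set, so overfitting is clearly not the cause. The result also contradicts intuition: a model with extra layers should do better than the model without them, or at the very least no worse. The authors therefore proposed the original residual learning framework (which can also be seen as a special case of Highway Networks with T=0.5):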
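The formula that followed here did not survive in this copy. Assuming it matches the building block defined in the cited ResNet paper, it can be written as below, with the Highway Networks layer next to it for the comparison. The symbols (F, H, T, W_i) follow those two papers and are an assumption about what the original image showed.

```latex
% Residual building block: the layers learn the residual F, and the input x is added back.
y = \mathcal{F}(x, \{W_i\}) + x

% Highway layer with transform gate T and carry gate 1 - T:
y = H(x)\,T(x) + x\,\bigl(1 - T(x)\bigr)

% Fixing T(x) = 0.5 removes the gating and leaves a plain sum of the two branches,
% up to a constant factor:
y = \tfrac{1}{2}\,H(x) + \tfrac{1}{2}\,x
```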

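The assumption behind this framework is that a stack of nonlinear layers is poor at learning an identity mapping, and adding the identity mapping directly sidesteps the problem.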
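A minimal sketch of that assumption, using plain Python lists rather than a real framework. The helper names (`plain_block`, `residual_block`) and the all-zero weights are made up for illustration: with the shortcut in place, a block whose weights are all zero already computes the identity, while the same layers without a shortcut lose the input.

```python
def relu(v):
    """Element-wise ReLU on a list of floats."""
    return [max(0.0, a) for a in v]

def linear(W, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(w * a for w, a in zip(row, v)) for row in W]

def plain_block(W1, W2, x):
    """Two stacked layers without a shortcut: F(x) = W2 relu(W1 x)."""
    return linear(W2, relu(linear(W1, x)))

def residual_block(W1, W2, x):
    """The same two layers plus the identity shortcut: y = F(x) + x."""
    return [f + a for f, a in zip(plain_block(W1, W2, x), x)]

zeros = [[0.0, 0.0, 0.0] for _ in range(3)]  # hypothetical all-zero weights
x = [1.5, -2.0, 0.25]

print(plain_block(zeros, zeros, x))     # the input is lost
print(residual_block(zeros, zeros, x))  # the input passes through unchanged
```

In other words, the residual branch only has to learn a correction to the identity, and "do nothing" corresponds to weights near zero.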

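Compared with Highway Networks: - HN's transform gate and carry

Identity Mappings

What role does the identity mapping actually play in deep residual networks? The authors analyze this in "Identity Mappings in Deep Residual Networks", where (a) is the original block structure and (b) is the new one.

Original structure:

New structure:

where BN denotes Batch Normalization.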
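The two structure formulas are missing from this copy. As a sketch based on the block diagrams in the cited paper, they can be written as follows. The notation is an assumption: W_{l,1} and W_{l,2} stand for the two weight layers of block l, and * for convolution.

```latex
% (a) original block: weight -> BN -> ReLU -> weight -> BN, add the shortcut, then ReLU
x_{l+1} = \mathrm{ReLU}\Bigl(x_l + \mathrm{BN}\bigl(W_{l,2} * \mathrm{ReLU}(\mathrm{BN}(W_{l,1} * x_l))\bigr)\Bigr)

% (b) proposed block: (BN -> ReLU -> weight) twice, then add the shortcut, nothing after the addition
x_{l+1} = x_l + W_{l,2} * \mathrm{ReLU}\Bigl(\mathrm{BN}\bigl(W_{l,1} * \mathrm{ReLU}(\mathrm{BN}(x_l))\bigr)\Bigr)
```

The difference that matters is what sits on the shortcut path: in (a) a ReLU follows the addition, while in (b) the signal from x_l to x_{l+1} is a pure identity.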

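Results of a test with a 1001-layer residual network on CIFAR-10:

The proposed structure clearly outperforms the original:

When both mappings are identities (the shortcut and the function applied after the addition), any residual block has the form:

Unrolling this recursively, the relation between any deep block and all the shallower blocks before it is: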
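The two formulas referenced above are missing from this copy. Assuming they follow the cited paper, where F is the residual function of a block and W_l its weights, they read:

```latex
% One residual block when both the shortcut and the after-addition function are identities:
x_{l+1} = x_l + \mathcal{F}(x_l, W_l)

% Applying this repeatedly from a shallow block l up to any deeper block L:
x_L = x_l + \sum_{i=l}^{L-1} \mathcal{F}(x_i, W_i)
```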

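This form has convenient computational properties. Recall GBDT: does it look familiar? Each block adds a correction on top of a running sum, much as each tree in boosting adds to the current prediction. Backpropagation has equally good properties: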
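The gradient formula that followed here is missing from this copy. Assuming it is the one from the cited paper, it comes from applying the chain rule to the unrolled relation between a shallow block l and a deep block L. The symbol E for the loss is taken from the paper, not from this text.

```latex
% Unrolled forward relation: x_L = x_l + \sum_{i=l}^{L-1} \mathcal{F}(x_i, W_i)
\frac{\partial \mathcal{E}}{\partial x_l}
  = \frac{\partial \mathcal{E}}{\partial x_L}\,\frac{\partial x_L}{\partial x_l}
  = \frac{\partial \mathcal{E}}{\partial x_L}
    \left(1 + \frac{\partial}{\partial x_l}\sum_{i=l}^{L-1}\mathcal{F}(x_i, W_i)\right)
```

The constant 1 means that the gradient at the deep block reaches every shallower block directly, without passing through any weight layer, so it is unlikely to vanish just because the residual terms are small.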

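Residual Networks from the Ensemble Perspective

"Residual Networks Behave Like Ensembles of Relatively Shallow Networks" unrolls a residual network and finds the following relation:

With n residual blocks, unrolling yields 2^n paths, so a residual network can be viewed as an ensemble of that many models. Do these paths depend on one another?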
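The path count can be checked with a small sketch. It assumes purely linear blocks f_i(y) = a_i * y with made-up gains, because only then do the contributions of the individual paths add up exactly to the forward pass. For real nonlinear blocks the 2^n paths describe the structure, not an exact sum.

```python
from itertools import product

def forward(x, gains):
    """Residual forward pass with linear blocks: y_i = y_(i-1) + a_i * y_(i-1)."""
    for a in gains:
        x = x + a * x
    return x

def sum_over_paths(x, gains):
    """Add up every path; in each block a path either skips it (0) or goes through it (1)."""
    total = 0.0
    for choice in product([0, 1], repeat=len(gains)):
        contribution = x
        for taken, a in zip(choice, gains):
            if taken:
                contribution *= a
        total += contribution
    return total

gains = [0.5, -0.25, 2.0]  # hypothetical gains for n = 3 blocks
n_paths = len(list(product([0, 1], repeat=len(gains))))

print(n_paths)                     # 8 paths for 3 blocks
print(forward(1.0, gains))         # 3.375
print(sum_over_paths(1.0, gains))  # 3.375
```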

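Deleting any single layer from VGG makes accuracy collapse, on CIFAR-10 and on ImageNet alike, whereas deleting any single block from a residual network has almost no effect. Deleting a downsampling layer does hurt noticeably (downsampling layers do not unroll into multiple paths). These experiments show that although the paths of a residual network are trained jointly, they do not depend strongly on one another. An intuitive explanation is shown in the figure:

Even if node f2 is deleted, other paths still exist, whereas in a non-residual structure the only path is cut.

Viewing a residual network as an ensemble is supported by the following experimental results:

Performance at run time rises with the number of valid paths, and the relation is smooth. The left plot shows that a residual network behaves like an ensemble; the right plot shows that in practice the structure of a residual network can be modified at run time.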

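Short Paths in Residual Networks

From the structure of a residual block, the lengths of the 2^n unrolled paths follow a binomial distribution X ~ B(n, 1/2) (at each block the weight layers are skipped or taken with probability 0.5), so the expected path length is n/2. The three plots below come from an experiment with 54 residual blocks. The first is the path-length distribution, where 95% of the paths have a length between 19 and 35: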
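The numbers above can be reproduced directly from the binomial distribution. This is a small sketch using only the standard library (`math.comb` needs Python 3.8 or later); the variable names are arbitrary.

```python
from math import comb

n = 54  # number of residual blocks, as in the experiment described above

# There are C(n, k) paths of length k out of 2**n paths in total, i.e. B(n, 1/2).
pmf = [comb(n, k) / 2 ** n for k in range(n + 1)]

mean_length = sum(k * p for k, p in enumerate(pmf))
share_19_to_35 = sum(pmf[19:36])  # lengths 19 through 35 inclusive

print(mean_length)               # expected path length, n / 2
print(round(share_19_to_35, 3))  # fraction of paths with length in [19, 35]
```

The computed share comes out above 95%, consistent with the figure quoted in the text.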

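Because the paths differ in length, they also carry different amounts of gradient information during backpropagation: the longer the path, the less gradient it carries. The experimental results are shown below:

Almost all of the paths that actually matter in a residual network are shallow ones. In the experiment the effective path lengths were between 5 and 17, so in practice model compression can start with the long paths.

Residual networks do not solve the vanishing gradient problem. They merely route around it and leave the fundamental problem of deep neural networks unsolved. In applications, though, we care more about practical results.

Code Practice

Below we implement the ResNet-34 described in "Deep Residual Learning for Image Recognition" and show how it trains on CIFAR-10.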
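As a quick sanity check on the name, the "34" can be recovered from the block configuration. The sketch assumes the usual counting convention: one stem convolution, two convolutions per basic block, and one final fully connected layer, with projection shortcuts not counted. The function name is made up for illustration.

```python
def weighted_layers(blocks_per_stage, convs_per_block=2):
    """Count weighted layers: stem conv + convs inside all blocks + final dense layer."""
    return 1 + convs_per_block * sum(blocks_per_stage) + 1

print(sum([3, 4, 6, 3]))              # 16 residual blocks
print(weighted_layers([3, 4, 6, 3]))  # 34
print(weighted_layers([2, 2, 2, 2]))  # 18
```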

resnet.py

# -*- coding: utf-8 -*-
from keras import backend as K
from keras.layers.merge import add
from keras.layers import Input, Activation, Dense, Flatten
from keras.layers.convolutional import Conv2D, MaxPooling2D, AveragePooling2D
from keras.layers.normalization import BatchNormalization
from keras.regularizers import l1_l2
from keras.models import Model

class ResNet(object):
    '''Basic building blocks of a residual network.'''
    name = 'resnet'

    def __init__(self, n):
        self.name = n

    def bn_relu(self, input):
        '''BN -> ReLU sub-structure of the proposed residual block (TensorFlow channels-last layout).'''
        normalize = BatchNormalization(axis=3)(input)
        return Activation("relu")(normalize)

    def bn_relu_weight(self, filters, kernel_size, strides):
        '''BN -> ReLU -> weight sub-structure of the proposed residual block.'''
        def inner_func(input):
            act = self.bn_relu(input)
            conv = Conv2D(filters=filters,
                          kernel_size=kernel_size,
                          strides=strides,
                          padding='same',
                          kernel_initializer='he_normal',
                          kernel_regularizer=l1_l2(0.0001))(act)
            return conv
        return inner_func

    def weight_bn_relu(self, filters, kernel_size, strides):
        '''Weight -> BN -> ReLU sub-structure, used for the first convolution of the network.'''
        def inner_func(input):
            return self.bn_relu(Conv2D(filters=filters,
                                       kernel_size=kernel_size,
                                       strides=strides,
                                       padding='same',
                                       kernel_initializer='he_normal',
                                       kernel_regularizer=l1_l2(0.0001))(input))
        return inner_func

    def shortcut(self, left, right):
        '''Shortcut (identity mapping) of the proposed residual block.
        Two cases: input and output shapes match, or they differ.'''
        left_shape = K.int_shape(left)
        right_shape = K.int_shape(right)
        # Channels-last layout: axis 1 is height, axis 2 is width, axis 3 is channels
        stride_height = int(round(left_shape[1] / right_shape[1]))
        stride_width = int(round(left_shape[2] / right_shape[2]))
        x_l = left
        # If the shapes differ, project the input with a 1x1 convolution so that they match
        # (the dashed shortcuts in the paper, between blocks of different dimensions);
        # otherwise the input is passed through unchanged.
        if left_shape != right_shape:
            x_l = Conv2D(filters=right_shape[3],
                         kernel_size=(1, 1),
                         strides=(stride_height, stride_width),
                         padding="valid",
                         kernel_initializer="he_normal",
                         kernel_regularizer=l1_l2(0.01, 0.0001))(left)
        return add([x_l, right])

    def basic_block(self, filters, strides=(1, 1), is_first_block=False):
        """Block used by residual networks of up to 34 layers; each shortcut spans 2 layers."""
        def inner_func(input):
            # First weight layer. The input of the very first block has already been
            # through BN and ReLU in the stem, so a plain convolution is used there.
            if not is_first_block:
                conv1 = self.bn_relu_weight(filters=filters,
                                            kernel_size=(3, 3),
                                            strides=strides)(input)
            else:
                conv1 = Conv2D(filters=filters, kernel_size=(3, 3),
                               strides=strides,
                               padding="same",
                               kernel_initializer="he_normal",
                               kernel_regularizer=l1_l2(0.01, 0.0001))(input)
            # Second weight layer; its output is the residual
            residual = self.bn_relu_weight(filters=filters,
                                           kernel_size=(3, 3), strides=(1, 1))(conv1)
            # Add the shortcut to form a two-layer residual block
            return self.shortcut(input, residual)
        return inner_func

    def residual_block(self, block_func, filters, repeat_times, is_first_block):
        '''Stack repeat_times residual blocks into one stage.'''
        def inner_func(input):
            for i in range(repeat_times):
                if is_first_block:
                    # The first stage takes the pooling layer as input and never downsamples
                    strides = (1, 1)
                else:
                    if i == 0:  # first block of a later stage: downsample
                        strides = (2, 2)
                    else:  # remaining blocks of the stage
                        strides = (1, 1)
                flag = i == 0 and is_first_block
                input = block_func(filters=filters,
                                   strides=strides,
                                   is_first_block=flag)(input)
            return input
        return inner_func

    def residual_builder(self, input_shape, softmax_num, func_type, repeat_times):
        '''Build a residual network from the input shape, the number of output classes,
        the block type, and the number of blocks per stage.'''
        input = Input(shape=input_shape)
        # First layer: convolution
        conv1 = self.weight_bn_relu(filters=64, kernel_size=(7, 7), strides=(2, 2))(input)
        # Second layer: max pooling
        pool1 = MaxPooling2D(pool_size=(3, 3), strides=(2, 2), padding="same")(conv1)
        residual_block = pool1
        filters = 64
        # Then the residual blocks (16 of them for ResNet-34); filters double after each stage
        for i, r in enumerate(repeat_times):
            residual_block = self.residual_block(func_type,
                                                 filters=filters,
                                                 repeat_times=r,
                                                 is_first_block=(i == 0))(residual_block)
            filters *= 2
        residual_block = self.bn_relu(residual_block)
        shape = K.int_shape(residual_block)
        # Average pooling over the whole remaining feature map
        pool2 = AveragePooling2D(pool_size=(shape[1], shape[2]),
                                 strides=(1, 1))(residual_block)
        flatten1 = Flatten()(pool2)
        # Fully connected softmax layer
        dense1 = Dense(units=softmax_num,
                       kernel_initializer="he_normal",
                       activation="softmax")(flatten1)
        return Model(inputs=input, outputs=dense1)
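To see what the builder does to a small CIFAR-10 image, the spatial size can be traced by hand. The sketch below is plain Python, not Keras. It assumes `padding='same'` everywhere, so each stride-2 layer halves the size (rounded up), and it mirrors the stem and the stage layout of the code above. The function names are made up for illustration.

```python
from math import ceil

def same_padding_size(size, stride):
    """Output length of a conv or pooling layer with padding='same'."""
    return ceil(size / stride)

def trace_shapes(input_size, blocks_per_stage):
    """Return (name, spatial size, filters) after the stem and after each stage."""
    size = same_padding_size(input_size, 2)  # 7x7 convolution, stride 2
    shapes = [("conv1", size, 64)]
    size = same_padding_size(size, 2)        # 3x3 max pooling, stride 2
    shapes.append(("pool1", size, 64))
    filters = 64
    for stage, _ in enumerate(blocks_per_stage):
        if stage > 0:                        # every later stage starts with stride 2
            size = same_padding_size(size, 2)
        shapes.append(("stage%d" % (stage + 1), size, filters))
        filters *= 2
    return shapes

for row in trace_shapes(32, [3, 4, 6, 3]):
    print(row)
```

With 32x32 inputs the last stage is already down to 1x1, so the final average pooling has nothing left to average. The layout was designed for 224x224 ImageNet images, where the same trace ends at 7x7.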

resnet-cifar-10.py

# -*- coding: utf-8 -*-
import os

import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; must be selected before importing pyplot
import matplotlib.pyplot as plt
import tensorflow as tf
from keras import backend as K
from keras.callbacks import ModelCheckpoint, ReduceLROnPlateau, CSVLogger, EarlyStopping
from keras.datasets import cifar10
from keras.preprocessing.image import ImageDataGenerator
from keras.utils import np_utils

import resnet

# Legacy workaround for old Keras/TensorFlow combinations; probably unnecessary on newer versions
tf.python.control_flow_ops = tf

lr_reducer = ReduceLROnPlateau(monitor='val_loss', factor=np.sqrt(0.5), cooldown=0, patience=3, min_lr=1e-6)
early_stopper = EarlyStopping(monitor='val_acc', min_delta=0.0005, patience=15)
csv_logger = CSVLogger('resnet34_cifar10.csv')


def data_visualize(x, y, num):
    '''Save a num x num grid of sample images with their labels to sample.jpg.'''
    plt.figure()
    for i in range(0, num * num):
        axes = plt.subplot(num, num, i + 1)
        axes.set_title("label=" + str(y[i]))
        axes.set_xticks([0, 10, 20, 30])
        axes.set_yticks([0, 10, 20, 30])
        # CIFAR-10 images are uint8 RGB arrays, which imshow can draw directly
        # (scipy.misc.toimage, used originally, was removed from later SciPy releases)
        plt.imshow(x[i])
    plt.tight_layout()
    plt.savefig('sample.jpg')

if __name__ == "__main__":
    from keras.utils.vis_utils import plot_model

    # Expose only physical GPU 3 to TensorFlow; set this before the session is created
    os.environ["CUDA_VISIBLE_DEVICES"] = "3"
    gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=1, allow_growth=True)
    session = tf.Session(config=tf.ConfigProto(allow_soft_placement=True,
                                               log_device_placement=True,
                                               gpu_options=gpu_options))
    # Register the session with Keras so that the options above take effect
    K.set_session(session)

    (X_train, y_train), (X_test, y_test) = cifar10.load_data()
    data_visualize(X_train, y_train, 4)

    # Define the input dimensions and scale pixel values to [0, 1]
    dim = 32
    channel = 3
    class_num = 10
    X_train = X_train.reshape(X_train.shape[0], dim, dim, channel).astype('float32') / 255
    X_test = X_test.reshape(X_test.shape[0], dim, dim, channel).astype('float32') / 255
    Y_train = np_utils.to_categorical(y_train, class_num)
    Y_test = np_utils.to_categorical(y_test, class_num)

    # this will do preprocessing and realtime data augmentation
    datagen = ImageDataGenerator(
        featurewise_center=False,  # set input mean to 0 over the dataset
        samplewise_center=False,  # set each sample mean to 0
        featurewise_std_normalization=False,  # divide inputs by std of the dataset
        samplewise_std_normalization=False,  # divide each input by its std
        zca_whitening=False,  # apply ZCA whitening
        rotation_range=25,  # randomly rotate images in the range (degrees, 0 to 180)
        width_shift_range=0.1,  # randomly shift images horizontally (fraction of total width)
        height_shift_range=0.1,  # randomly shift images vertically (fraction of total height)
        horizontal_flip=True,  # randomly flip images horizontally
        vertical_flip=False)  # do not flip images vertically
    datagen.fit(X_train)

    s = X_train.shape[1:]
    print(s)
    builder = resnet.ResNet("ResNet-test")
    # [3, 4, 6, 3] basic blocks per stage is the ResNet-34 configuration
    resnet_34 = builder.residual_builder(s, class_num, builder.basic_block, [3, 4, 6, 3])
    model = resnet_34
    model.summary()
    plot_model(model, to_file="ResNet.jpg", show_shapes=True)
    model.compile(loss='categorical_crossentropy',
                  optimizer='adadelta',
                  metrics=['accuracy'])

    batch_size = 32
    nb_epoch = 100
    # Save the model after every epoch; the callback only runs if it is passed to fit_generator
    checkpointer = ModelCheckpoint("weights-improvement-{epoch:02d}-{val_acc:.2f}.hdf5", monitor='val_loss', verbose=0,
                                   save_best_only=False, save_weights_only=False, mode='auto')
    model.fit_generator(datagen.flow(X_train, Y_train, batch_size=batch_size),
                        # steps_per_epoch counts batches, not samples
                        steps_per_epoch=int(np.ceil(X_train.shape[0] / float(batch_size))),
                        validation_data=(X_test, Y_test),
                        epochs=nb_epoch,
                        verbose=1,
                        max_q_size=100,
                        callbacks=[lr_reducer, early_stopper, csv_logger, checkpointer])
    score = model.evaluate(X_test, Y_test, verbose=0)
    print('Test score:', score[0])
    print('Test accuracy:', score[1])
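One detail of the training call is easy to get wrong. In Keras 2, `fit_generator` measures `steps_per_epoch` in batches drawn from the generator, not in samples. A short sketch of the arithmetic for the 50,000 CIFAR-10 training images (the helper name is made up):

```python
from math import ceil

def steps_per_epoch(num_samples, batch_size):
    """Number of batches needed to see every sample once."""
    return ceil(num_samples / batch_size)

print(steps_per_epoch(50000, 32))  # 1563
```

Passing the sample count itself would make every "epoch" run through the training set about 32 times at this batch size.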

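Training on CIFAR-10

After 100 epochs, accuracy is 0.8367 on the training set and 0.8346 on the test set.

Maxout Networks

Maxout was proposed by Goodfellow et al. in "Maxout Networks"; the paper is worth reading.

The Maxout Activation Function

A Maxout structure can be added to any layer of a neural network. The formula is: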
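The formula itself is missing from this copy. Assuming it is the definition from the cited paper, for an input x with d features, a layer with m Maxout units, and k linear pieces per unit, it reads:

```latex
h_i(x) = \max_{j \in [1, k]} z_{ij},
\qquad
z_{ij} = x^{\top} W_{\cdot ij} + b_{ij}
% learned parameters: W \in \mathbb{R}^{d \times m \times k}, \; b \in \mathbb{R}^{m \times k}
```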

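W and b above are the parameters to be learned, and they are trained by backpropagation. k is a hyperparameter fixed in advance, and x is the input node. Suppose we have the following 3-layer network structure:

Maxout activation can be thought of as inserting a hidden layer of k nodes between input node x and output node h. Taking node i in the figure above as an example, the red part of that figure expands into the following structure under Maxout:

A single Maxout unit as shown is essentially a piecewise linear function, and any convex function can be approximated by piecewise linear functions. This is easy to see intuitively with a parabola as an example: each z node is a linear function, and the outputs of nodes z1~z4 in the figure above correspond to line segments k1~k4 in the figure below: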
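The parabola example can be sketched in a few lines. This is an illustration, not a trained network: the pieces are set by hand to tangent lines of f(x) = x**2 instead of being learned, and the tangent points are arbitrary choices.

```python
def maxout(x, pieces):
    """One Maxout unit on a scalar input: the max over k affine pieces (w, b)."""
    return max(w * x + b for w, b in pieces)

def tangent_pieces(points):
    """Tangent lines of f(x) = x**2 at the given points: slope 2t, intercept -t**2."""
    return [(2.0 * t, -t * t) for t in points]

pieces4 = tangent_pieces([-1.5, -0.5, 0.5, 1.5])                   # k = 4
pieces16 = tangent_pieces([-1.875 + 0.25 * i for i in range(16)])  # k = 16

grid = [-2.0 + 0.01 * i for i in range(401)]  # sample points on [-2, 2]
err4 = max(x * x - maxout(x, pieces4) for x in grid)
err16 = max(x * x - maxout(x, pieces16) for x in grid)

print(err4, err16)  # the worst-case gap shrinks as k grows
```

Because every tangent line lies below the parabola, the max of the pieces approaches it from below, and more pieces give a tighter fit.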

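Taken as a whole, ReLU can be seen as a special case of Maxout, and Maxout learns the activation function itself through the network (from this angle Maxout can also be seen as a kind of Network-in-Network structure). With no restriction on k, just two Maxout units are enough to approximate any continuous function; the paper gives a detailed proof, which is not repeated here. In practice Maxout works even better in combination with Dropout. It is worth recalling kernel methods at this point: a nonlinear kernel (such as the Gaussian kernel) likewise models nonlinear behavior through local linear fits, but traditional kernel methods fix the kernel function in advance (a Gaussian, for example) instead of deriving it from the data. There is research on combining kernels as well, but in my view it ends up in the same place as neural networks, and all of it can be thought about within the larger neural network framework (recall the relationship between SVMs and neural networks discussed earlier).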
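Both claims can be illustrated with hand-picked pieces. In the sketch below all weights are chosen manually for illustration, nothing is learned: ReLU and the absolute value are Maxout units with k = 2, and the difference of two Maxout units gives a function that is no longer convex.

```python
def maxout(x, pieces):
    """One Maxout unit on a scalar input: the max over affine pieces (w, b)."""
    return max(w * x + b for w, b in pieces)

relu_pieces = [(0.0, 0.0), (1.0, 0.0)]  # max(0, x)
abs_pieces = [(-1.0, 0.0), (1.0, 0.0)]  # max(-x, x) = |x|

def two_unit(x):
    """Difference of two Maxout units: |x| - max(0, 2|x| - 2), a non-convex bump pair."""
    second = [(-2.0, -2.0), (0.0, 0.0), (2.0, -2.0)]
    return maxout(x, abs_pieces) - maxout(x, second)

samples = [-3.0, -0.5, 0.0, 0.5, 3.0]
print([maxout(x, relu_pieces) == max(0.0, x) for x in samples])
print([maxout(x, abs_pieces) == abs(x) for x in samples])
print([two_unit(x) for x in [-2.0, -1.0, 0.0, 1.0, 2.0]])  # 0, 1, 0, 1, 0
```

A single unit can only produce convex shapes, so the second unit is what makes the two peaks at x = -1 and x = 1 possible.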

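Everything has two sides, and the drawbacks of Maxout are just as clear: it needs k times as many parameters as an ordinary layer (twice as many for k=2), k has to be chosen by hand, and each unit carries the prior assumption that the activation function being learned is convex.

Network in Network

The idea of NIN comes from "Network In Network". It has two highlights: replacing the traditional convolution layer with a nonlinear convolution layer to improve feature abstraction, and replacing the traditional fully connected layers with a new pooling layer (global average pooling). The later versions of GoogLeNet borrowed heavily from this idea.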
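The pooling idea is simple enough to sketch with nested lists. The example below assumes the last layer produces one feature map per class and averages each map down to a single score. The data and the function name are made up for illustration.

```python
def global_average_pooling(feature_maps):
    """Average each H x W feature map down to one number per channel."""
    pooled = []
    for fmap in feature_maps:
        values = [v for row in fmap for v in row]
        pooled.append(sum(values) / len(values))
    return pooled

# Hypothetical 3-class example: one 2x2 feature map per class
maps = [
    [[1.0, 3.0], [5.0, 7.0]],
    [[0.0, 0.0], [0.0, 4.0]],
    [[2.0, 2.0], [2.0, 2.0]],
]
print(global_average_pooling(maps))  # [4.0, 1.0, 2.0]
```

The pooled vector can go straight into a softmax, and the pooling step itself has no parameters to learn, unlike a fully connected layer.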

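NIN Convolution Layer (MLP Convolution)

The reasons for choosing an MLP are:

· An MLP can approximate any function and needs no prior assumptions (such as linear separability or convexity);

· An MLP is naturally compatible with the structure of a convolutional network and can be trained conveniently with backpropagation;

· An MLP can itself be made fairly deep, and its features can be reused;

· Convolving with an MLP acts as a cascaded, cross-channel weighted combination of feature maps, which improves feature abstraction: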
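The last point can be made concrete with a small sketch. It assumes the common reading of an MLP convolution as a stack of 1x1 convolutions: the same small dense layer, followed by ReLU, is applied to the channel vector at every pixel. The shapes and weights below are made up for illustration, and the ordinary spatial convolution that precedes the 1x1 layers in NIN is left out.

```python
def relu(v):
    """Element-wise ReLU on a list of floats."""
    return [max(0.0, a) for a in v]

def dense(W, b, v):
    """Dense layer: one output per row of W."""
    return [sum(w * a for w, a in zip(row, v)) + bias for row, bias in zip(W, b)]

def conv1x1(feature, W, b):
    """Apply the same dense layer + ReLU to the channel vector at every pixel.
    feature[h][w] is the list of channel values at that position."""
    return [[relu(dense(W, b, pixel)) for pixel in row] for row in feature]

# Hypothetical 1x2 feature map with 2 channels per pixel
feature = [[[1.0, 2.0], [3.0, -1.0]]]
W1, b1 = [[1.0, 1.0], [1.0, -1.0], [0.5, 0.0]], [0.0, 0.0, 0.0]  # 2 -> 3 channels
W2, b2 = [[1.0, 0.0, -2.0]], [0.0]                               # 3 -> 1 channel

out = conv1x1(conv1x1(feature, W1, b1), W2, b2)
print(out)  # [[[2.0], [0.0]]]
```

Each output channel is a weighted combination of all input channels at the same position, and stacking two such layers gives the cascaded cross-channel combination described above.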