TensorFlow Notes (12): Analyzing the CIFAR-10 CNN Example Code (Part 2)

This post continues from the previous one, working through TensorFlow's CIFAR-10 tutorial. The code consists of the following five files:

File                         Purpose
cifar10_input.py             Reads the raw CIFAR-10 binary format files
cifar10.py                   Builds the CIFAR-10 network model
cifar10_train.py             Trains the CIFAR-10 model on a single CPU or GPU
cifar10_multi_gpu_train.py   Trains the CIFAR-10 model on multiple GPUs
cifar10_eval.py              Evaluates the CIFAR-10 model on the test set

This post covers the two remaining files, cifar10_train.py and cifar10_eval.py, which train and evaluate the model respectively, and closes with the results of my own run.

Tutorial: https://www.tensorflow.org/tutorials/deep_cnn

Code: https://github.com/tensorflow/models/tree/master/tutorials/image/cifar10


Training the Model

This code lives in cifar10_train.py and trains the model on a single GPU. The training procedure is configured as follows (several of these settings are exposed as command-line flags; see the sketch after this list):

  • 1,000,000 steps in total (I changed this to 100,000 for my own run)
  • a batch_size of 128
  • training statistics (loss, examples/sec, sec/batch) printed every 10 steps
  • a checkpoint file saved every 600 s
  • the latest checkpoint evaluated every 300 s (by cifar10_eval.py)
  • a summary saved every 100 steps
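
These knobs can be changed from the command line without editing the source, thanks to tf.app.flags. Here is a minimal sketch of the mechanism, using the real max_steps flag from the training script as the example:

import tensorflow as tf

FLAGS = tf.app.flags.FLAGS
# The default lives in the DEFINE_* call; passing --max_steps=100000 on the
# command line overrides it, much like argparse.
tf.app.flags.DEFINE_integer('max_steps', 1000000,
                            """Number of batches to run.""")

def main(argv=None):
  print('Running %d steps' % FLAGS.max_steps)

if __name__ == '__main__':
  tf.app.run()  # parses the flags, then calls main()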


The code:

"""用单块GPU训练CIFAR-10"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from datetime import datetime
import time
import tensorflow as tf
import cifar10
#作用类似于argparse,通过命令行传参改变训练参数
FLAGS = tf.app.flags.FLAGS
tf.app.flags.DEFINE_string('train_dir', '/tmp/cifar10_train',
"""Directory where to write event logs """
"""and checkpoint.""")
tf.app.flags.DEFINE_integer('max_steps', 1000000,
"""Number of batches to run.""")
tf.app.flags.DEFINE_boolean('log_device_placement', False,
"""Whether to log device placement.""")
tf.app.flags.DEFINE_integer('log_frequency', 10,
"""How often to log results to the console.""")
def train():
"""训练 CIFAR-10 数据集."""
with tf.Graph().as_default():
# 返回或创建全局迭代张量(是一个不会被训练的变量)
global_step = tf.contrib.framework.get_or_create_global_step()
# 获得CIFAR-10的训练数据和标签
# 强迫输入管道在 CPU:0 上操作避免有时候操作在GPU上会停止并导致运行变慢
with tf.device('/cpu:0'):
images, labels = cifar10.distorted_inputs()
# 用模型的接口函数inference()建立Graph并且计算logits
logits = cifar10.inference(images)
# 计算损失
loss = cifar10.loss(logits, labels)
# 建立 Graph 并用一个batch的数据来训练模型并更新参数
train_op = cifar10.train(loss, global_step)
class _LoggerHook(tf.train.SessionRunHook):
"""打印损失和运行时间"""
def begin(self):
self._step = -1
self._start_time = time.time()
def before_run(self, run_context):
self._step += 1
return tf.train.SessionRunArgs(loss) # 计算损失
def after_run(self, run_context, run_values):
if self._step % FLAGS.log_frequency == 0:
current_time = time.time()
duration = current_time - self._start_time
self._start_time = current_time
loss_value = run_values.results
# 计算每秒钟训练了多少个样本
examples_per_sec = FLAGS.log_frequency * FLAGS.batch_size / duration
# 计算每次迭代用了多长时间
sec_per_batch = float(duration / FLAGS.log_frequency)
format_str = ('%s: step %d, loss = %.2f (%.1f examples/sec; %.3f '
'sec/batch)')
print (format_str % (datetime.now(), self._step, loss_value,
examples_per_sec, sec_per_batch))
# 开启一个会话执行训练过程
# tf.train.NanTensorHook(loss):监控loss,如果loss为NaN则停止训练
with tf.train.MonitoredTrainingSession(
checkpoint_dir=FLAGS.train_dir,
hooks=[tf.train.StopAtStepHook(last_step=FLAGS.max_steps),
tf.train.NanTensorHook(loss),
_LoggerHook()],
config=tf.ConfigProto(
log_device_placement=FLAGS.log_device_placement)) as mon_sess:
while not mon_sess.should_stop(): # 如果没有到最大迭代次数
mon_sess.run(train_op) # 执行训练过程
def main(argv=None):
cifar10.maybe_download_and_extract() # 下载数据并解压缩
if tf.gfile.Exists(FLAGS.train_dir):
tf.gfile.DeleteRecursively(FLAGS.train_dir)
tf.gfile.MakeDirs(FLAGS.train_dir)
train()
if __name__ == '__main__':
tf.app.run()


Supplement

tf.train.MonitoredTrainingSession(
    master='',
    is_chief=True,
    checkpoint_dir=None,
    scaffold=None,
    hooks=None,
    chief_only_hooks=None,
    save_checkpoint_secs=600,
    save_summaries_steps=100,
    save_summaries_secs=None,
    config=None,
    stop_grace_period_secs=120,
    log_step_count_steps=100
)

By default, a checkpoint is saved every 600 s and a summary every 100 steps.


Evaluating the Model

This code lives in cifar10_eval.py and by default runs an evaluation every 300 s. The flow is:

  • evaluate() creates and drives the whole evaluation process:
  1. fetch the test data
  2. build the network model (the same one used for training)
  3. create a saver; the saver is responsible for restoring the shadow-variable values and assigning them to the model variables (see the sketch after this list)
  4. run eval_once() at a fixed interval (300 s)
  • eval_once() performs a single evaluation:
  1. locate the latest model in the checkpoint directory
  2. run saver.restore to load the shadow-variable values from the checkpoint into the model variables
  3. run the network, predicting on the test set batch by batch
  4. compute the prediction precision over the whole test set
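
To make the shadow-variable step concrete, here is a minimal sketch of how tf.train.ExponentialMovingAverage maps shadow names back onto the model variables (the variable name 'weights' is made up for illustration):

import tensorflow as tf

weights = tf.Variable(tf.zeros([10]), name='weights')
ema = tf.train.ExponentialMovingAverage(decay=0.9999)
maintain_averages_op = ema.apply([weights])  # training runs this op

# variables_to_restore() returns a dict mapping each shadow name to its
# model variable, e.g. {'weights/ExponentialMovingAverage': weights}, so a
# Saver built from it loads the averaged values into the variables.
saver = tf.train.Saver(ema.variables_to_restore())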


The code:

"""Evaluation for CIFAR-10"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from datetime import datetime
import math
import time
import numpy as np
import tensorflow as tf
import cifar10
FLAGS = tf.app.flags.FLAGS
tf.app.flags.DEFINE_string('eval_dir', '/tmp/cifar10_eval',
"""Directory where to write event logs.""")
tf.app.flags.DEFINE_string('eval_data', 'test',
"""Either 'test' or 'train_eval'.""")
tf.app.flags.DEFINE_string('checkpoint_dir', '/tmp/cifar10_train',
"""Directory where to read model checkpoints.""")
tf.app.flags.DEFINE_integer('eval_interval_secs', 60 * 5,
"""How often to run the eval.""")
tf.app.flags.DEFINE_integer('num_examples', 10000,
"""Number of examples to run.""")
tf.app.flags.DEFINE_boolean('run_once', False,
"""Whether to run eval only once.""")
def eval_once(saver, summary_writer, top_k_op, summary_op):
"""运行一次评估
输入参数:
saver: Saver.
summary_writer: Summary writer.
top_k_op: Top K op.
summary_op: Summary op.
"""
with tf.Session() as sess:
ckpt = tf.train.get_checkpoint_state(FLAGS.checkpoint_dir)
if ckpt and ckpt.model_checkpoint_path:
# 从checkpoint恢复变量的值
saver.restore(sess, ckpt.model_checkpoint_path)
# model_checkpoint_path提取最新的checkpoint文件名,看起来如下:
# /my-favorite-path/cifar10_train/model.ckpt-0
# 从中提取出global_step
global_step = ckpt.model_checkpoint_path.split('/')[-1].split('-')[-1]
else:
print('No checkpoint file found')
return
# 开始队列
coord = tf.train.Coordinator()
try:
threads = []
for qr in tf.get_collection(tf.GraphKeys.QUEUE_RUNNERS):
threads.extend(qr.create_threads(sess, coord=coord, daemon=True,
start=True))
num_iter = int(math.ceil(FLAGS.num_examples / FLAGS.batch_size)) # 总的迭代数目
true_count = 0 # 统计预测正确的数目
total_sample_count = num_iter * FLAGS.batch_size # 总的样本数目
step = 0
while step < num_iter and not coord.should_stop():
predictions = sess.run([top_k_op])
true_count += np.sum(predictions)
step += 1
# 计算准确率 @ 1.
precision = true_count / total_sample_count
print('%s: precision @ 1 = %.3f' % (datetime.now(), precision))
summary = tf.Summary()
summary.ParseFromString(sess.run(summary_op))
summary.value.add(tag='Precision @ 1', simple_value=precision)
summary_writer.add_summary(summary, global_step)
except Exception as e:
coord.request_stop(e)
coord.request_stop()
coord.join(threads, stop_grace_period_secs=10)
def evaluate():
"""Eval CIFAR-10 for a number of steps."""
with tf.Graph().as_default() as g:
# 从CIFAR-10中获取图像数据和标签数据
eval_data = FLAGS.eval_data == 'test'
images, labels = cifar10.inputs(eval_data=eval_data)
# 建立一个Graph来计算logits
logits = cifar10.inference(images)
# 计算预测值,输出一个batch_size大小的bool数组
top_k_op = tf.nn.in_top_k(logits, labels, 1)
# 恢复训练变量的滑动平均值来评估模型
variable_averages = tf.train.ExponentialMovingAverage(
cifar10.MOVING_AVERAGE_DECAY)
variables_to_restore = variable_averages.variables_to_restore()
saver = tf.train.Saver(variables_to_restore)
summary_op = tf.summary.merge_all()
summary_writer = tf.summary.FileWriter(FLAGS.eval_dir, g)
while True:
eval_once(saver, summary_writer, top_k_op, summary_op)
if FLAGS.run_once:
break
time.sleep(FLAGS.eval_interval_secs)
def main(argv=None):
cifar10.maybe_download_and_extract()
if tf.gfile.Exists(FLAGS.eval_dir):
tf.gfile.DeleteRecursively(FLAGS.eval_dir)
tf.gfile.MakeDirs(FLAGS.eval_dir)
evaluate()
if __name__ == '__main__':
tf.app.run()


Supplement

tf.nn.in_top_k(
    predictions,
    targets,
    k,
    name=None
)

Checks whether targets are among the top k predictions. Returns a bool array of batch_size elements: out[i] is True if the prediction for the target class of example i is among its top k predictions.
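
A small sketch of its behavior (the logits and labels are made up for illustration):

import tensorflow as tf

logits = tf.constant([[0.1, 0.8, 0.1],   # top-1 prediction: class 1
                      [0.6, 0.3, 0.1]])  # top-1 prediction: class 0
labels = tf.constant([1, 2])             # true classes

in_top1 = tf.nn.in_top_k(logits, labels, 1)
with tf.Session() as sess:
  print(sess.run(in_top1))  # [ True False]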


Experiment

The tutorial's author trained for 100,000 steps on a single Tesla K40 in about 8 hours (350-600 images/sec); on a single Quadro M5000 the same run took me only 46 minutes (4800-5000 images/sec). Here is the training log:

2017-07-07 15:30:08.459355: step 0, loss = 4.68 (317.5 examples/sec; 0.403 sec/batch)
2017-07-07 15:30:08.794469: step 10, loss = 4.62 (3819.4 examples/sec; 0.034 sec/batch)
2017-07-07 15:30:09.067413: step 20, loss = 4.49 (4689.6 examples/sec; 0.027 sec/batch)
2017-07-07 15:30:09.343335: step 30, loss = 4.45 (4638.9 examples/sec; 0.028 sec/batch)
2017-07-07 15:30:09.618720: step 40, loss = 4.31 (4648.0 examples/sec; 0.028 sec/batch)
2017-07-07 15:30:09.889763: step 50, loss = 4.32 (4722.5 examples/sec; 0.027 sec/batch)
2017-07-07 15:30:10.162925: step 60, loss = 4.26 (4685.9 examples/sec; 0.027 sec/batch)
2017-07-07 15:30:10.436191: step 70, loss = 4.07 (4684.1 examples/sec; 0.027 sec/batch)
2017-07-07 15:30:10.702081: step 80, loss = 4.20 (4814.0 examples/sec; 0.027 sec/batch)
2017-07-07 15:30:10.963494: step 90, loss = 4.26 (4896.6 examples/sec; 0.026 sec/batch)
2017-07-07 15:30:11.442152: step 100, loss = 4.08 (2674.1 examples/sec; 0.048 sec/batch)
...
2017-07-07 16:16:08.694992: step 99900, loss = 0.67 (3468.2 examples/sec; 0.037 sec/batch)
2017-07-07 16:16:08.952094: step 99910, loss = 0.71 (4978.6 examples/sec; 0.026 sec/batch)
2017-07-07 16:16:09.211538: step 99920, loss = 0.65 (4933.6 examples/sec; 0.026 sec/batch)
2017-07-07 16:16:09.472046: step 99930, loss = 0.76 (4913.5 examples/sec; 0.026 sec/batch)
2017-07-07 16:16:09.728118: step 99940, loss = 0.81 (4998.6 examples/sec; 0.026 sec/batch)
2017-07-07 16:16:09.986376: step 99950, loss = 0.77 (4956.3 examples/sec; 0.026 sec/batch)
2017-07-07 16:16:10.241033: step 99960, loss = 0.56 (5026.4 examples/sec; 0.025 sec/batch)
2017-07-07 16:16:10.496853: step 99970, loss = 0.71 (5003.5 examples/sec; 0.026 sec/batch)
2017-07-07 16:16:10.760321: step 99980, loss = 0.64 (4858.3 examples/sec; 0.026 sec/batch)
2017-07-07 16:16:11.018312: step 99990, loss = 0.76 (4961.4 examples/sec; 0.026 sec/batch)
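
As a quick back-of-the-envelope check on those throughput numbers (taking ~4900 images/sec for the M5000, and ~475 images/sec as a representative rate within the K40's 350-600 range):

steps, batch_size = 100000, 128      # 100k steps at 128 images per batch
total_images = steps * batch_size    # 12.8M images in total
print(total_images / 4900.0 / 60)    # ~43.5 minutes on the M5000
print(total_images / 475.0 / 3600)   # ~7.5 hours on the K40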

Training and evaluation run as two separate programs. Concretely: during training, an exponential-moving-average shadow variable is maintained for every trainable variable, and every 600 s the current values are written to a checkpoint. When the evaluation program runs, it reads the shadow variables from the latest checkpoint, assigns them to the corresponding model variables, and then evaluates.

Both programs need to run at the same time to evaluate the training as it progresses; otherwise you only ever get the result for the latest checkpoint file. In practice, start python cifar10_train.py first, then open another terminal and run python cifar10_eval.py.

The official code sets the maximum number of steps to 1,000,000; I changed it to 100,000 for my run.

Because my training iterated so quickly, the first checkpoint at the 600 s mark already corresponded to more than twenty thousand steps. The checkpoint interval can be shortened through the save_checkpoint_secs argument of tf.train.MonitoredTrainingSession() (default: 600 s).
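
A self-contained sketch of that change (the checkpoint directory, step count, and 60 s interval are arbitrary choices for illustration):

import tensorflow as tf

global_step = tf.train.get_or_create_global_step()
increment = tf.assign_add(global_step, 1)

# Same session type as in cifar10_train.py, but checkpointing every 60 s
# instead of the default 600 s.
with tf.train.MonitoredTrainingSession(
    checkpoint_dir='/tmp/demo',
    save_checkpoint_secs=60,
    hooks=[tf.train.StopAtStepHook(last_step=100)]) as mon_sess:
  while not mon_sess.should_stop():
    mon_sess.run(increment)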

The final evaluation precision after 100,000 steps was 86.2%, in line with the figure given in the official tutorial.

Finally, a TensorBoard screenshot:

[Figure: Total Loss curve from TensorBoard]


References