Learning Rate Optimization
Background: tuning the learning rate can improve a model by roughly 1–2%. Our current approach is rather crude: we simply train with a single hand-picked learning rate. To improve the model, this note surveys learning-rate strategies, in the hope of lifting our metric results.
I. Learning-rate schedules commonly used in industry
1. Fixed learning rate
A single fixed value, e.g. 0.15, is used as the learning rate for the entire training run.
2. Fixed learning rate + decay
As the name suggests, a fixed learning rate (e.g. 0.15) is used until some condition is met (e.g. a number of steps), after which the learning rate is decayed.
Code (fixed learning rate + exponential decay):
initial_learning_rate = 0.15
lr_schedule = tf.keras.optimizers.schedules.ExponentialDecay(
initial_learning_rate,
decay_steps=100000,
decay_rate=0.8,
staircase=True)
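To actually apply such a schedule, it can be passed to a Keras optimizer in place of a fixed learning rate; the optimizer advances its own step counter, so the decay happens automatically. A minimal sketch (the model and loss here are placeholders, not our actual pipeline):

import tensorflow as tf

initial_learning_rate = 0.15
lr_schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate,
    decay_steps=100000,
    decay_rate=0.8,
    staircase=True)

# the schedule object is accepted wherever a learning rate is expected
optimizer = tf.keras.optimizers.SGD(learning_rate=lr_schedule)
model = tf.keras.Sequential([tf.keras.layers.Dense(1)])  # placeholder model
model.compile(optimizer=optimizer, loss='mse')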
3. Warm-up
Before getting to warm-up, here is a summary of the common forms of learning-rate decay.
3.1 Decay functions
- Exponential decay
tf.train.exponential_decay()  (TF 1.x)
tf.keras.optimizers.schedules.ExponentialDecay  (TF 2.x)
Code for exponential decay:
lr_schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate,
    decay_steps=100000,  # decay period; when staircase=True the learning rate is held constant within each decay_steps window, giving a discrete (staircase) schedule
    decay_rate=0.8,      # decay rate
    staircase=True)      # whether to use a discrete (staircase) schedule, default False
# Equivalent computation:
def decayed_learning_rate(step):
    return initial_learning_rate * decay_rate ** (step / decay_steps)
    # when staircase=True, step / decay_steps is an integer (floor) division
- Inverse time decay
tf.train.inverse_time_decay()  (TF 1.x)
tf.keras.optimizers.schedules.InverseTimeDecay  (TF 2.x)
Code for inverse time decay:
tf.keras.optimizers.schedules.InverseTimeDecay(
    initial_learning_rate, decay_steps, decay_rate, staircase=False, name=None
)
# staircase=False
def decayed_learning_rate(step):
    return initial_learning_rate / (1 + decay_rate * step / decay_steps)
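As a quick sanity check of the formula above, the TF 2.x schedule object can be called directly with a step number (a small illustrative sketch; the numbers are examples only):

import tensorflow as tf

lr_schedule = tf.keras.optimizers.schedules.InverseTimeDecay(
    initial_learning_rate=1.0, decay_steps=100, decay_rate=0.5)

print(float(lr_schedule(0)))    # 1.0
print(float(lr_schedule(100)))  # 1 / (1 + 0.5 * 100 / 100) = 0.666...
print(float(lr_schedule(300)))  # 1 / (1 + 0.5 * 300 / 100) = 0.4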
- Piecewise constant decay:
tf.train.piecewise_constant  (TF 1.x)
tf.keras.optimizers.schedules.PiecewiseConstantDecay  (TF 2.x)
Code for piecewise constant decay:
# TF 1.x
tf.train.piecewise_constant(
    x,
    boundaries,  # list of step boundaries between segments
    values,      # list of learning-rate values, one per segment
    name=None):
# parameters
global_step = tf.Variable(0, trainable=False)
boundaries = [100, 200]
values = [1.0, 0.5, 0.1]
# learning_rate
learning_rate = tf.train.piecewise_constant(global_step, boundaries, values)
# Explanation:
# for global_step in [0, 100],   learning_rate = 1.0;
# for global_step in [101, 200], learning_rate = 0.5;
# for global_step in [201, ∞),   learning_rate = 0.1;
# For TF 2.x, use one of the following instead:
tf.compat.v1.train.piecewise_constant()  # or
tf.keras.optimizers.schedules.PiecewiseConstantDecay(
    boundaries, values, name=None
)
# use a learning rate that's 1.0 for the first 100001 steps, 0.5 for the next 10000 steps, and 0.1 for any additional steps.
step = tf.Variable(0, trainable=False)
boundaries = [100000, 110000]
values = [1.0, 0.5, 0.1]
learning_rate_fn = tf.keras.optimizers.schedules.PiecewiseConstantDecay(
boundaries, values)
# Later, whenever we perform an optimization step, we pass in the step.
learning_rate = learning_rate_fn(step)
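For completeness, a small self-contained sketch showing that the TF 2.x schedule reproduces the boundary behaviour described above, and that the schedule object can also be handed straight to a Keras optimizer (illustrative only):

import tensorflow as tf

boundaries = [100000, 110000]
values = [1.0, 0.5, 0.1]
learning_rate_fn = tf.keras.optimizers.schedules.PiecewiseConstantDecay(boundaries, values)

print(float(learning_rate_fn(0)))       # 1.0  (step <= 100000)
print(float(learning_rate_fn(100001)))  # 0.5
print(float(learning_rate_fn(200000)))  # 0.1  (past the last boundary)

# the schedule object can also be passed to an optimizer directly,
# which then advances the step counter itself
optimizer = tf.keras.optimizers.SGD(learning_rate=learning_rate_fn)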
- Natural exponential decay
tf.train.natural_exp_decay()  (TF 1.x)
Similar to exponential decay and likewise a function of the current step, but with base e; natural exponential decay shrinks the learning rate much more aggressively than ordinary exponential decay.
Code for natural exponential decay:
tf.train.natural_exp_decay(
    learning_rate,
    global_step,   # current step
    decay_steps,   # decay period; when staircase=True the learning rate is held constant within each decay_steps window, giving a discrete schedule
    decay_rate,
    staircase=False,
    name=None
)
decayed_learning_rate = learning_rate * exp(-decay_rate * global_step / decay_steps)
# if staircase=True, the learning rate takes discrete values and is updated once every decay_steps steps (global_step / decay_steps is floored)
- Staircase (discrete) decay
learing_rate1 = tf.train.natural_exp_decay(learning_rate=0.5, global_step=global_step, decay_steps=10, decay_rate=0.9, staircase=True)
- Standard (continuous) exponential decay
learing_rate2 = tf.train.natural_exp_decay(learning_rate=0.5, global_step=global_step, decay_steps=10, decay_rate=0.9, staircase=False)
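TF 2.x does not ship a dedicated Keras schedule for natural exponential decay, but the continuous (staircase=False) form above is easy to express as a custom schedule. A minimal sketch (the class name NaturalExpDecay is our own helper, not a TensorFlow API):

import tensorflow as tf

class NaturalExpDecay(tf.keras.optimizers.schedules.LearningRateSchedule):
    """learning_rate * exp(-decay_rate * step / decay_steps), continuous form."""
    def __init__(self, initial_learning_rate, decay_steps, decay_rate):
        self.initial_learning_rate = initial_learning_rate
        self.decay_steps = decay_steps
        self.decay_rate = decay_rate

    def __call__(self, step):
        step = tf.cast(step, tf.float32)
        return self.initial_learning_rate * tf.exp(
            -self.decay_rate * step / self.decay_steps)

    def get_config(self):
        return {"initial_learning_rate": self.initial_learning_rate,
                "decay_steps": self.decay_steps,
                "decay_rate": self.decay_rate}

# e.g. the continuous example above: initial LR 0.5, decay_steps 10, decay_rate 0.9
lr_schedule = NaturalExpDecay(initial_learning_rate=0.5, decay_steps=10, decay_rate=0.9)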
- Polynomial decay: tf.train.polynomial_decay()
Polynomial decay can let the learning rate rise and fall repeatedly (cycle=True). The purpose is to avoid the network getting stuck in a local optimum late in training, when the learning rate has become very small; raising the learning rate again may let the parameters jump out of the local optimum.
Code for polynomial decay:
tf.train.polynomial_decay(
    learning_rate,
    global_step,               # current step
    decay_steps,               # decay period
    end_learning_rate=0.0001,  # minimum learning rate, default 0.0001
    power=1.0,                 # power of the polynomial, default 1
    cycle=False,               # bool, whether to raise the learning rate again after reaching the minimum and decay once more, default False
    name=None):
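The parameters are easiest to read together with the underlying computation; per the TensorFlow documentation, in the default case (cycle=False) it is:
global_step = min(global_step, decay_steps)
decayed_learning_rate = (learning_rate - end_learning_rate) * (1 - global_step / decay_steps) ** power + end_learning_rate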
- Cosine decay
Standard cosine decay tf.train.cosine_decay(), cosine decay with restarts, linear cosine decay, and noisy linear cosine decay.
-------------- Standard cosine decay -------------------
tf.train.cosine_decay(
    learning_rate,  # initial learning rate
    global_step,    # current step
    decay_steps,    # decay period
    alpha=0.0,      # minimum learning rate, default 0
    name=None):
Computation:
global_step = min(global_step, decay_steps)
cosine_decay = 0.5 * (1 + cos(pi * global_step / decay_steps))
decayed = (1 - alpha) * cosine_decay + alpha
decayed_learning_rate = learning_rate * decayed
-------------- Cosine decay with restarts -------------------
tf.train.cosine_decay_restarts(
    learning_rate,      # initial learning rate
    global_step,        # current step
    first_decay_steps,  # number of steps in the first decay period
    t_mul=2.0,          # used to derive the number of iterations in the i-th period
    m_mul=1.0,          # used to derive the initial learning rate of the i-th period
    alpha=0.0,          # minimum learning rate, default 0
    name=None):
-------------- Linear cosine decay -------------------
tf.train.linear_cosine_decay(
    learning_rate,    # initial learning rate
    global_step,      # current step
    decay_steps,      # decay period
    num_periods=0.5,  # number of periods in the cosine part of the decay
    alpha=0.0,        # minimum learning rate
    beta=0.001,       # constant offset added after the cosine term (see formula below)
    name=None):
Computation:
global_step = min(global_step, decay_steps)
linear_decay = (decay_steps - global_step) / decay_steps
cosine_decay = 0.5 * (1 + cos(pi * 2 * num_periods * global_step / decay_steps))
decayed = (alpha + linear_decay) * cosine_decay + beta
decayed_learning_rate = learning_rate * decayed
-------------- Noisy linear cosine decay -------------------
tf.train.noisy_linear_cosine_decay(
    learning_rate,         # initial learning rate
    global_step,           # current step
    decay_steps,           # decay period
    initial_variance=1.0,  # initial variance of the noise
    variance_decay=0.55,   # decay rate of the noise variance
    num_periods=0.5,       # number of periods in the cosine part of the decay
    alpha=0.0,             # minimum learning rate
    beta=0.001,            # constant offset added after the cosine term (see formula below)
    name=None):
Computation:
global_step = min(global_step, decay_steps)
linear_decay = (decay_steps - global_step) / decay_steps
cosine_decay = 0.5 * (1 + cos(pi * 2 * num_periods * global_step / decay_steps))
decayed = (alpha + linear_decay + eps_t) * cosine_decay + beta
decayed_learning_rate = learning_rate * decayed
# eps_t is per-step random noise whose magnitude is controlled by initial_variance and variance_decay
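The cosine functions above are TF 1.x APIs. In recent TF 2.x versions the standard and restart variants are available as Keras schedules (older releases expose them under tf.keras.experimental); a minimal sketch with illustrative values:

import tensorflow as tf

# standard cosine decay
cosine_lr = tf.keras.optimizers.schedules.CosineDecay(
    initial_learning_rate=0.15, decay_steps=100000, alpha=0.0)

# cosine decay with restarts (SGDR-style)
restart_lr = tf.keras.optimizers.schedules.CosineDecayRestarts(
    initial_learning_rate=0.15, first_decay_steps=10000, t_mul=2.0, m_mul=1.0, alpha=0.0)

optimizer = tf.keras.optimizers.SGD(learning_rate=cosine_lr)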
3.2 Warm-up
Learning-rate warm-up means starting training with a small learning rate for some epochs or iterations, and switching to the preset learning rate only once the model has stabilised.
The recipe is simple. For example, in the original ResNet paper [1], the learning rate is 0.1 with a batch size of 256; when the batch size is increased to some larger value b, the learning rate should be scaled to 0.1 × b/256.
After warm-up, continuing to decay the learning rate over the rest of training is a good way to improve accuracy. Common choices include step decay and cosine decay: the former drops the learning rate in discrete steps every few epochs, while the latter lets it fall smoothly along a cosine curve over the course of training.
Code for warm-up (lr here is the post-warm-up learning rate):
def learning_rate_with_warmup(global_step, lr, initial_learning_rate, batches_per_epoch, warmup=True):
    if warmup:
        warmup_steps = int(batches_per_epoch * 5)  # warm up over the first 5 epochs
        warmup_lr = (initial_learning_rate * tf.cast(global_step, tf.float32)
                     / tf.cast(warmup_steps, tf.float32))  # linear ramp from 0 to initial_learning_rate
        return tf.cond(global_step < warmup_steps, lambda: warmup_lr, lambda: lr)
    return lr
A PyTorch example of selecting the post-warm-up decay scheduler:
from torch.optim import lr_scheduler

def get_scheduler(optimizer, opt):
    """Return a learning rate scheduler
    Parameters:
        optimizer     -- the network optimizer
        opt.lr_policy -- name of the learning-rate policy: linear | step | plateau | cosine
    """
    if opt.lr_policy == 'linear':
        def lambda_rule(epoch):
            # keep the initial LR for the first opt.niter epochs, then decay linearly to zero over opt.niter_decay epochs
            lr_l = 1.0 - max(0, epoch + opt.epoch_count - opt.niter) / float(opt.niter_decay + 1)
            return lr_l
        scheduler = lr_scheduler.LambdaLR(optimizer, lr_lambda=lambda_rule)
    elif opt.lr_policy == 'step':
        scheduler = lr_scheduler.StepLR(optimizer, step_size=opt.lr_decay_iters, gamma=0.1)
    elif opt.lr_policy == 'plateau':
        scheduler = lr_scheduler.ReduceLROnPlateau(optimizer, mode='min', factor=0.2, threshold=0.01, patience=5)
    elif opt.lr_policy == 'cosine':
        scheduler = lr_scheduler.CosineAnnealingLR(optimizer, T_max=opt.niter, eta_min=0)
    else:
        raise NotImplementedError('learning rate policy [%s] is not implemented' % opt.lr_policy)
    return scheduler
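A sketch of how such a scheduler is typically driven in the training loop; the model and the opt fields below are placeholders that follow the names used in get_scheduler above, not our actual training script:

import torch
from torch import nn, optim
from types import SimpleNamespace

model = nn.Linear(10, 1)  # placeholder model
optimizer = optim.SGD(model.parameters(), lr=0.15)
opt = SimpleNamespace(lr_policy='cosine', niter=100, niter_decay=100,
                      epoch_count=1, lr_decay_iters=50)
scheduler = get_scheduler(optimizer, opt)

for epoch in range(opt.niter + opt.niter_decay):
    # ... run one epoch of training (forward, backward, optimizer.step()) ...
    scheduler.step()  # for 'plateau' this would instead be scheduler.step(val_loss)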
4. Cyclical learning rates
CLR (Cyclical Learning Rates) was proposed by Leslie Smith in 2015. It is a way of adjusting the LR in which an upper and a lower bound are set and the LR varies periodically within that range. On the surface CLR may look like a competitor to adaptive-LR techniques and SGD, but in fact it can be combined with the improved optimizers mentioned above when updating parameters.
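A minimal sketch of the triangular form of CLR as a function of the global step (our own helper following the formulation in Smith's paper; the bounds and step size are illustrative):

import numpy as np

def triangular_clr(step, base_lr=0.001, max_lr=0.006, step_size=2000):
    """Triangular cyclical learning rate: the LR bounces linearly between base_lr and max_lr."""
    cycle = np.floor(1 + step / (2 * step_size))  # index of the current cycle
    x = np.abs(step / step_size - 2 * cycle + 1)  # position within the cycle, in [0, 1]
    return base_lr + (max_lr - base_lr) * max(0.0, 1.0 - x)

# the LR rises from base_lr to max_lr over the first 2000 steps, then falls back
print(triangular_clr(0), triangular_clr(2000), triangular_clr(4000))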
References:
1. https://www.cnblogs.com/chenzhen0530/p/10632937.html
2. https://blog.csdn.net/weixin_43896398/article/details/84762886
3. https://zhuanlan.zhihu.com/p/66080948