Learning Rate Optimization Methods

Background: tuning the learning rate can improve a model by roughly 1~2%. Our current approach is rather crude: we simply train with a single, hand-picked learning rate. To improve the model, this note surveys learning-rate schedules in the hope of lifting our metrics.

I. Learning-rate schemes commonly used in industry

1. Fixed learning rate

A fixed learning rate means that a single constant value, e.g. 0.15, is used for the entire training run.
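
For concreteness, a minimal sketch of this baseline, assuming a tf.keras model and SGD (the value 0.15 matches the example above; the tiny model is only a placeholder, not our actual network):

import tensorflow as tf

# Fixed learning rate: one constant value (0.15) for the entire training run
optimizer = tf.keras.optimizers.SGD(learning_rate=0.15)

# Placeholder model, only to show where the optimizer is plugged in
model = tf.keras.Sequential([tf.keras.layers.Dense(1, activation='sigmoid')])
model.compile(optimizer=optimizer, loss='binary_crossentropy')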

2. Fixed learning rate + decay

Fixed learning rate + decay, as the name suggests, uses a fixed learning rate such as 0.15 while some condition holds (for example, during the first N steps), and switches to a decaying schedule once that condition is reached.

The code is as follows:

Fixed learning rate + decay

initial_learning_rate = 0.15

lr_schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate,
    decay_steps=100000,
    decay_rate=0.8,
    staircase=True)
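
The schedule object then replaces the constant learning rate when building the optimizer; a one-line sketch (SGD is just an example here, any tf.keras optimizer accepts a schedule):

optimizer = tf.keras.optimizers.SGD(learning_rate=lr_schedule)  # lr_schedule from the block above; the optimizer queries it with the current step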

3. Warm up

Before getting to warm up, let us first summarize a few common learning-rate decay schemes.

3.1 Decay functions

  • Exponential decay

tf.train.exponential_decay() (TF 1.x)
tf.keras.optimizers.schedules.ExponentialDecay (TF 2.x)

Exponential decay code

lr_schedule = tf.keras.optimizers.schedules.ExponentialDecay(
        initial_learning_rate,
        decay_steps=100000,  # decay period; when staircase=True the learning rate is held constant within each decay_steps window, giving a discrete (stepwise) schedule
        decay_rate=0.8,      # decay rate
        staircase=True)      # whether the schedule is discrete (stepwise); default False

# Equivalent computation; when staircase=True, step / decay_steps is an integer (floor) division
def decayed_learning_rate(step):
  return initial_learning_rate * decay_rate ** (step / decay_steps)
  
  • Inverse time decay

tf.train.inverse_time_decay() (TF 1.x)
tf.keras.optimizers.schedules.InverseTimeDecay (TF 2.x)

Inverse time decay code

tf.keras.optimizers.schedules.InverseTimeDecay(
    initial_learning_rate, decay_steps, decay_rate, staircase=False, name=None
)

# Equivalent computation when staircase=False
def decayed_learning_rate(step):
  return initial_learning_rate / (1 + decay_rate * step / decay_steps)

  • Piecewise constant decay

tf.train.piecewise_constant (TF 1.x)
tf.keras.optimizers.schedules.PiecewiseConstantDecay (TF 2.x)

Piecewise constant decay code

# TF 1.x
tf.train.piecewise_constant(
    x,
    boundaries, # list of step boundaries that split training into segments
    values,     # list of learning-rate values, one per segment
    name=None)

# parameters
global_step = tf.Variable(0, trainable=False)
boundaries = [100, 200]
values = [1.0, 0.5, 0.1]

# learning rate
learning_rate = tf.train.piecewise_constant(global_step, boundaries, values)

# Explanation:
# while global_step is in [0, 100],   learning_rate = 1.0;
# while global_step is in (100, 200], learning_rate = 0.5;
# once global_step > 200,             learning_rate = 0.1.

 

# Available in TF 2.x via
tf.compat.v1.train.piecewise_constant()  # or
tf.keras.optimizers.schedules.PiecewiseConstantDecay(
    boundaries, values, name=None
)

# Use a learning rate that's 1.0 for the first 100001 steps, 0.5 for the next
# 10000 steps, and 0.1 for any additional steps.
step = tf.Variable(0, trainable=False)
boundaries = [100000, 110000]
values = [1.0, 0.5, 0.1]
learning_rate_fn = tf.keras.optimizers.schedules.PiecewiseConstantDecay(
    boundaries, values)

# Later, whenever we perform an optimization step, we pass in the step.
learning_rate = learning_rate_fn(step)

  • Natural exponential decay
tf.train.natural_exp_decay()

Similar to exponential decay and likewise a function of the current step, but with base e; natural exponential decay shrinks the learning rate much faster than ordinary exponential decay.

Natural exponential decay code

tf.train.natural_exp_decay(
    learning_rate,
    global_step,  # current iteration count
    decay_steps,  # decay period; when staircase=True the learning rate is held constant within each decay_steps window, giving a discrete schedule
    decay_rate,
    staircase=False,
    name=None
)

# Computation (staircase=False)
decayed_learning_rate = learning_rate * exp(-decay_rate * global_step / decay_steps)
# If staircase=True, the learning rate becomes discrete and is only updated once every decay_steps iterations.

  • Staircase (discrete) decay
learning_rate1 = tf.train.natural_exp_decay(learning_rate=0.5, global_step=global_step, decay_steps=10, decay_rate=0.9, staircase=True)

  • Standard (continuous) exponential decay
learning_rate2 = tf.train.natural_exp_decay(learning_rate=0.5, global_step=global_step, decay_steps=10, decay_rate=0.9, staircase=False)

  • Polynomial decay: tf.train.polynomial_decay()

The point of letting the learning rate rise and fall again in polynomial decay (cycle=True): late in training, a learning rate that is too small can leave the network parameters stuck in a local optimum; raising the learning rate again may let them escape it. See the formula sketch after the signature below.

Polynomial decay code

tf.train.polynomial_decay(
    learning_rate,
    global_step,              # current iteration count
    decay_steps,              # decay period
    end_learning_rate=0.0001, # minimum learning rate, default 0.0001
    power=1.0,                # power of the polynomial, default 1
    cycle=False,              # bool: whether to rise again after reaching the minimum learning rate, default False
    name=None)
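
For reference, a sketch of how the decayed value is computed for cycle=False, based on my reading of the TF 1.x documentation (illustrative pseudocode rather than library source):

# Sketch of tf.train.polynomial_decay with cycle=False
global_step = min(global_step, decay_steps)
decayed_learning_rate = (learning_rate - end_learning_rate) * \
                        (1 - global_step / decay_steps) ** power + end_learning_rate
# With cycle=True, decay_steps is first scaled to decay_steps * ceil(global_step / decay_steps),
# which is what lets the learning rate climb back up after each period.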

  • Cosine decay

Standard cosine decay tf.train.cosine_decay(), cosine decay with restarts, linear cosine decay, and noisy linear cosine decay

-------------- Standard cosine decay -------------------
tf.train.cosine_decay(
    learning_rate, # initial learning rate
    global_step,   # current iteration count
    decay_steps,   # decay period
    alpha=0.0,     # minimum learning rate, as a fraction of learning_rate; default 0
    name=None)

Computation

global_step = min(global_step, decay_steps)
cosine_decay = 0.5 * (1 + cos(pi * global_step / decay_steps))
decayed = (1 - alpha) * cosine_decay + alpha
decayed_learning_rate = learning_rate * decayed

-------------- Cosine decay with restarts -------------------
tf.train.cosine_decay_restarts(
    learning_rate,     # initial learning rate
    global_step,       # current iteration count
    first_decay_steps, # number of steps in the first decay period
    t_mul=2.0,         # used to derive the number of iterations in the i-th period
    m_mul=1.0,         # used to derive the initial learning rate of the i-th period
    alpha=0.0,         # minimum learning rate, as a fraction of learning_rate; default 0
    name=None)
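
In TF 2.x the same schedule is exposed as tf.keras.optimizers.schedules.CosineDecayRestarts; a minimal usage sketch (the hyperparameter values below are placeholders, not tuned settings):

# SGDR-style schedule in TF 2.x (placeholder hyperparameters)
lr_schedule = tf.keras.optimizers.schedules.CosineDecayRestarts(
    initial_learning_rate=0.15,
    first_decay_steps=10000,  # length of the first cosine period
    t_mul=2.0,                # each period is twice as long as the previous one
    m_mul=1.0,                # restart at the same peak learning rate
    alpha=0.0)
optimizer = tf.keras.optimizers.SGD(learning_rate=lr_schedule)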

-------------- Linear cosine decay -------------------
tf.train.linear_cosine_decay(
    learning_rate,   # initial learning rate
    global_step,     # current iteration count
    decay_steps,     # decay period
    num_periods=0.5, # number of periods in the cosine part of the decay
    alpha=0.0,       # see the computation below
    beta=0.001,      # see the computation below
    name=None)

Computation

global_step = min(global_step, decay_steps)
linear_decay = (decay_steps - global_step) / decay_steps
cosine_decay = 0.5 * (1 + cos(pi * 2 * num_periods * global_step / decay_steps))
decayed = (alpha + linear_decay) * cosine_decay + beta
decayed_learning_rate = learning_rate * decayed

-------------- Noisy linear cosine decay -------------------
tf.train.noisy_linear_cosine_decay(
    learning_rate,        # initial learning rate
    global_step,          # current iteration count
    decay_steps,          # decay period
    initial_variance=1.0, # initial variance for the noise
    variance_decay=0.55,  # decay for the noise's variance
    num_periods=0.5,      # number of periods in the cosine part of the decay
    alpha=0.0,            # see the computation below
    beta=0.001,           # see the computation below
    name=None)

Computation

global_step = min(global_step, decay_steps)
linear_decay = (decay_steps - global_step) / decay_steps
cosine_decay = 0.5 * (1 + cos(pi * 2 * num_periods * global_step / decay_steps))
decayed = (alpha + linear_decay + eps_t) * cosine_decay + beta  # eps_t is the decaying noise term
decayed_learning_rate = learning_rate * decayed

3.2 Warm up

Learning-rate warmup means training with a small learning rate for the first few epochs or iterations, and only switching to the preset learning rate once the model has stabilized.
The accompanying rule of thumb is simple: in the original ResNet paper [1], for example, the learning rate is 0.1 at a batch size of 256; if we increase the batch size to some larger value b, the learning rate should become 0.1 × b/256.
Continuing to decay the learning rate after warmup is a good way to improve accuracy. Common choices include step decay, which repeatedly subtracts a small amount from the learning rate as the epoch count grows, and cosine decay, which lets the learning rate follow a cosine curve over the course of training.

Warm up code

def learning_rate_with_warmup(global_step, initial_learning_rate, batches_per_epoch, lr, warmup=True):
    # Ramp the learning rate linearly from 0 to initial_learning_rate over the first 5 epochs,
    # then fall back to the regular schedule value lr.
    if warmup:
        warmup_steps = int(batches_per_epoch * 5)
        warmup_lr = (initial_learning_rate * tf.cast(global_step, tf.float32) / tf.cast(warmup_steps, tf.float32))
        return tf.cond(global_step < warmup_steps, lambda: warmup_lr, lambda: lr)
    return lr
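
Building on the fragment above, a minimal TF 2.x sketch that chains linear warmup into one of the decay schedules from 3.1 (the WarmupThenDecay class name and all hyperparameters are placeholders of mine, not our production settings):

import tensorflow as tf

class WarmupThenDecay(tf.keras.optimizers.schedules.LearningRateSchedule):
    """Linear warmup for warmup_steps steps, then hand over to an inner decay schedule."""

    def __init__(self, initial_learning_rate, warmup_steps, decay_schedule):
        super().__init__()
        self.initial_learning_rate = initial_learning_rate
        self.warmup_steps = warmup_steps
        self.decay_schedule = decay_schedule

    def __call__(self, step):
        step = tf.cast(step, tf.float32)
        warmup_steps = tf.cast(self.warmup_steps, tf.float32)
        warmup_lr = self.initial_learning_rate * step / warmup_steps
        return tf.cond(step < warmup_steps,
                       lambda: warmup_lr,
                       lambda: self.decay_schedule(step - warmup_steps))

# Example: 5000 warmup steps followed by the exponential decay used earlier (placeholder numbers)
lr_schedule = WarmupThenDecay(
    initial_learning_rate=0.15,
    warmup_steps=5000,
    decay_schedule=tf.keras.optimizers.schedules.ExponentialDecay(
        0.15, decay_steps=100000, decay_rate=0.8, staircase=True))
optimizer = tf.keras.optimizers.SGD(learning_rate=lr_schedule)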

For comparison, a PyTorch helper that selects one of several decay schedulers by name (linear, step, plateau, cosine):

from torch.optim import lr_scheduler

def get_scheduler(optimizer, opt):
    """Return a learning rate scheduler

    Parameters:
        optimizer     -- the network optimizer
        opt.lr_policy -- name of the learning-rate scheduler: linear | step | plateau | cosine
    """
    if opt.lr_policy == 'linear':
        def lambda_rule(epoch):
            # keep the initial lr for the first opt.niter epochs, then decay it linearly to 0 over opt.niter_decay epochs
            lr_l = 1.0 - max(0, epoch + opt.epoch_count - opt.niter) / float(opt.niter_decay + 1)
            return lr_l
        scheduler = lr_scheduler.LambdaLR(optimizer, lr_lambda=lambda_rule)
    elif opt.lr_policy == 'step':
        scheduler = lr_scheduler.StepLR(optimizer, step_size=opt.lr_decay_iters, gamma=0.1)
    elif opt.lr_policy == 'plateau':
        scheduler = lr_scheduler.ReduceLROnPlateau(optimizer, mode='min', factor=0.2, threshold=0.01, patience=5)
    elif opt.lr_policy == 'cosine':
        scheduler = lr_scheduler.CosineAnnealingLR(optimizer, T_max=opt.niter, eta_min=0)
    else:
        raise NotImplementedError('learning rate policy [%s] is not implemented' % opt.lr_policy)
    return scheduler
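
A minimal, hypothetical usage sketch of the helper above (the opt namespace, model, and numbers are placeholders):

import torch
from types import SimpleNamespace

# Hypothetical options object; the field names follow the helper above
opt = SimpleNamespace(lr_policy='cosine', niter=100, niter_decay=100,
                      epoch_count=1, lr_decay_iters=50)

model = torch.nn.Linear(10, 1)                       # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.15)
scheduler = get_scheduler(optimizer, opt)

for epoch in range(opt.niter):
    # ... train for one epoch ...
    scheduler.step()                                 # update the learning rate once per epoch
    print(epoch, optimizer.param_groups[0]['lr'])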


4. Cyclical learning rates (CLR)

CLR was proposed by Leslie Smith in 2015. It is a way of adjusting the learning rate in which an upper and a lower bound are set and the learning rate varies periodically between them. At first glance CLR may look like a competitor to adaptive-learning-rate techniques and plain SGD, but in practice it can be combined with the improved optimizers mentioned above for the parameter updates. A sketch follows below.
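
A minimal sketch using PyTorch's built-in torch.optim.lr_scheduler.CyclicLR (the bounds, step sizes, and model here are placeholders, not tuned values):

import torch

model = torch.nn.Linear(10, 1)   # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

# Triangular CLR: the learning rate oscillates between base_lr and max_lr,
# taking step_size_up batches to climb and the same number to come back down.
scheduler = torch.optim.lr_scheduler.CyclicLR(
    optimizer, base_lr=0.001, max_lr=0.1,
    step_size_up=2000, mode='triangular')

for batch in range(10000):
    # ... forward / backward / optimizer.step() go here ...
    scheduler.step()             # CLR is stepped once per batch, not per epoch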

References:
1. https://www.cnblogs.com/chenzhen0530/p/10632937.html
2. https://blog.csdn.net/weixin_43896398/article/details/84762886
3. https://zhuanlan.zhihu.com/p/66080948