Learning Rate Optimization Methods

Background: tuning the learning rate can improve a model by roughly 1~2%. Our current approach is rather crude: we simply train with a single, hand-picked learning rate. To improve the model, this note surveys learning-rate schedules in the hope of lifting our metrics.

I. Learning-rate schemes commonly used in industry

1. Fixed learning rate

A fixed learning rate means that a single constant value, e.g. 0.15, is used for the entire training run.
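
For concreteness, a minimal sketch of this baseline, assuming a tf.keras model and SGD (the value 0.15 matches the example above; the tiny model is only a placeholder, not our actual network):

import tensorflow as tf

# Fixed learning rate: one constant value (0.15) for the entire training run
optimizer = tf.keras.optimizers.SGD(learning_rate=0.15)

# Placeholder model, only to show where the optimizer is plugged in
model = tf.keras.Sequential([tf.keras.layers.Dense(1, activation='sigmoid')])
model.compile(optimizer=optimizer, loss='binary_crossentropy')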

2. Fixed learning rate + decay

Fixed learning rate + decay, as the name suggests, uses a fixed learning rate such as 0.15 while some condition holds (for example, during the first N steps), and switches to a decaying schedule once that condition is reached.

The code is as follows:

Fixed learning rate + decay

initial_learning_rate = 0.15

lr_schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate,
    decay_steps=100000,
    decay_rate=0.8,
    staircase=True)
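
The schedule object then replaces the constant learning rate when building the optimizer; a one-line sketch (SGD is just an example here, any tf.keras optimizer accepts a schedule):

optimizer = tf.keras.optimizers.SGD(learning_rate=lr_schedule)  # lr_schedule from the block above; the optimizer queries it with the current step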

3. Warm up

Before getting to warm up, let us first summarize a few common learning-rate decay schemes.

3.1 Decay functions

  • Exponential decay

tf.train.exponential_decay() (TF 1.x)
tf.keras.optimizers.schedules.ExponentialDecay (TF 2.x)

Exponential decay code

lr_schedule = tf.keras.optimizers.schedules.ExponentialDecay(
        initial_learning_rate,
        decay_steps=100000,  # decay period; when staircase=True the learning rate is held constant within each decay_steps window, giving a discrete (stepwise) schedule
        decay_rate=0.8,      # decay rate
        staircase=True)      # whether the schedule is discrete (stepwise); default False

# Equivalent computation; when staircase=True, step / decay_steps is an integer (floor) division
def decayed_learning_rate(step):
  return initial_learning_rate * decay_rate ** (step / decay_steps)
  
  • Inverse time decay

tf.train.inverse_time_decay() (TF 1.x)
tf.keras.optimizers.schedules.InverseTimeDecay (TF 2.x)

Inverse time decay code

tf.keras.optimizers.schedules.InverseTimeDecay(
    initial_learning_rate, decay_steps, decay_rate, staircase=False, name=None
)

# Equivalent computation when staircase=False
def decayed_learning_rate(step):
  return initial_learning_rate / (1 + decay_rate * step / decay_steps)

  • Piecewise constant decay

tf.train.piecewise_constant (TF 1.x)
tf.keras.optimizers.schedules.PiecewiseConstantDecay (TF 2.x)

Piecewise constant decay code

# TF 1.x
tf.train.piecewise_constant(
    x,
    boundaries, # list of step boundaries that split training into segments
    values,     # list of learning-rate values, one per segment
    name=None)

# parameters
global_step = tf.Variable(0, trainable=False)
boundaries = [100, 200]
values = [1.0, 0.5, 0.1]

# learning rate
learning_rate = tf.train.piecewise_constant(global_step, boundaries, values)

# Explanation:
# while global_step is in [0, 100],   learning_rate = 1.0;
# while global_step is in (100, 200], learning_rate = 0.5;
# once global_step > 200,             learning_rate = 0.1.

 

# Available in TF 2.x via
tf.compat.v1.train.piecewise_constant()  # or
tf.keras.optimizers.schedules.PiecewiseConstantDecay(
    boundaries, values, name=None
)

# Use a learning rate that's 1.0 for the first 100001 steps, 0.5 for the next
# 10000 steps, and 0.1 for any additional steps.
step = tf.Variable(0, trainable=False)
boundaries = [100000, 110000]
values = [1.0, 0.5, 0.1]
learning_rate_fn = tf.keras.optimizers.schedules.PiecewiseConstantDecay(
    boundaries, values)

# Later, whenever we perform an optimization step, we pass in the step.
learning_rate = learning_rate_fn(step)

  • Natural exponential decay
tf.train.natural_exp_decay()

Similar to exponential decay and likewise a function of the current step, but with base e; natural exponential decay shrinks the learning rate much faster than ordinary exponential decay.

Natural exponential decay code

tf.train.natural_exp_decay(
    learning_rate,
    global_step,  # current iteration count
    decay_steps,  # decay period; when staircase=True the learning rate is held constant within each decay_steps window, giving a discrete schedule
    decay_rate,
    staircase=False,
    name=None
)

# Computation (staircase=False)
decayed_learning_rate = learning_rate * exp(-decay_rate * global_step / decay_steps)
# If staircase=True, the learning rate becomes discrete and is only updated once every decay_steps iterations.

  • Staircase (discrete) decay
learning_rate1 = tf.train.natural_exp_decay(learning_rate=0.5, global_step=global_step, decay_steps=10, decay_rate=0.9, staircase=True)

  • Standard (continuous) exponential decay
learning_rate2 = tf.train.natural_exp_decay(learning_rate=0.5, global_step=global_step, decay_steps=10, decay_rate=0.9, staircase=False)

  • Polynomial decay: tf.train.polynomial_decay()

The point of letting the learning rate rise and fall again in polynomial decay (cycle=True): late in training, a learning rate that is too small can leave the network parameters stuck in a local optimum; raising the learning rate again may let them escape it. See the formula sketch after the signature below.

Polynomial decay code

tf.train.polynomial_decay(
    learning_rate,
    global_step,              # current iteration count
    decay_steps,              # decay period
    end_learning_rate=0.0001, # minimum learning rate, default 0.0001
    power=1.0,                # power of the polynomial, default 1
    cycle=False,              # bool: whether to rise again after reaching the minimum learning rate, default False
    name=None)
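
For reference, a sketch of how the decayed value is computed for cycle=False, based on my reading of the TF 1.x documentation (illustrative pseudocode rather than library source):

# Sketch of tf.train.polynomial_decay with cycle=False
global_step = min(global_step, decay_steps)
decayed_learning_rate = (learning_rate - end_learning_rate) * \
                        (1 - global_step / decay_steps) ** power + end_learning_rate
# With cycle=True, decay_steps is first scaled to decay_steps * ceil(global_step / decay_steps),
# which is what lets the learning rate climb back up after each period.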

  • Cosine decay

Standard cosine decay tf.train.cosine_decay(), cosine decay with restarts, linear cosine decay, and noisy linear cosine decay

-------------- Standard cosine decay -------------------
tf.train.cosine_decay(
    learning_rate, # initial learning rate
    global_step,   # current iteration count
    decay_steps,   # decay period
    alpha=0.0,     # minimum learning rate, as a fraction of learning_rate; default 0
    name=None)

Computation

global_step = min(global_step, decay_steps)
cosine_decay = 0.5 * (1 + cos(pi * global_step / decay_steps))
decayed = (1 - alpha) * cosine_decay + alpha
decayed_learning_rate = learning_rate * decayed

-------------- Cosine decay with restarts -------------------
tf.train.cosine_decay_restarts(
    learning_rate,     # initial learning rate
    global_step,       # current iteration count
    first_decay_steps, # number of steps in the first decay period
    t_mul=2.0,         # used to derive the number of iterations in the i-th period
    m_mul=1.0,         # used to derive the initial learning rate of the i-th period
    alpha=0.0,         # minimum learning rate, as a fraction of learning_rate; default 0
    name=None)
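
In TF 2.x the same schedule is exposed as tf.keras.optimizers.schedules.CosineDecayRestarts; a minimal usage sketch (the hyperparameter values below are placeholders, not tuned settings):

# SGDR-style schedule in TF 2.x (placeholder hyperparameters)
lr_schedule = tf.keras.optimizers.schedules.CosineDecayRestarts(
    initial_learning_rate=0.15,
    first_decay_steps=10000,  # length of the first cosine period
    t_mul=2.0,                # each period is twice as long as the previous one
    m_mul=1.0,                # restart at the same peak learning rate
    alpha=0.0)
optimizer = tf.keras.optimizers.SGD(learning_rate=lr_schedule)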

-------------- Linear cosine decay -------------------
tf.train.linear_cosine_decay(
    learning_rate,   # initial learning rate
    global_step,     # current iteration count
    decay_steps,     # decay period
    num_periods=0.5, # number of periods in the cosine part of the decay
    alpha=0.0,       # see the computation below
    beta=0.001,      # see the computation below
    name=None)

Computation

global_step = min(global_step, decay_steps)
linear_decay = (decay_steps - global_step) / decay_steps
cosine_decay = 0.5 * (1 + cos(pi * 2 * num_periods * global_step / decay_steps))
decayed = (alpha + linear_decay) * cosine_decay + beta
decayed_learning_rate = learning_rate * decayed

-------------- Noisy linear cosine decay -------------------
tf.train.noisy_linear_cosine_decay(
    learning_rate,        # initial learning rate
    global_step,          # current iteration count
    decay_steps,          # decay period
    initial_variance=1.0, # initial variance for the noise
    variance_decay=0.55,  # decay for the noise's variance
    num_periods=0.5,      # number of periods in the cosine part of the decay
    alpha=0.0,            # see the computation below
    beta=0.001,           # see the computation below
    name=None)

Computation

global_step = min(global_step, decay_steps)
linear_decay = (decay_steps - global_step) / decay_steps
cosine_decay = 0.5 * (1 + cos(pi * 2 * num_periods * global_step / decay_steps))
decayed = (alpha + linear_decay + eps_t) * cosine_decay + beta  # eps_t is the decaying noise term
decayed_learning_rate = learning_rate * decayed

3.2 Warm up

Learning-rate warmup means training with a small learning rate for the first few epochs or iterations, and only switching to the preset learning rate once the model has stabilized.
The accompanying rule of thumb is simple: in the original ResNet paper [1], for example, the learning rate is 0.1 at a batch size of 256; if we increase the batch size to some larger value b, the learning rate should become 0.1 × b/256.
Continuing to decay the learning rate after warmup is a good way to improve accuracy. Common choices include step decay, which repeatedly subtracts a small amount from the learning rate as the epoch count grows, and cosine decay, which lets the learning rate follow a cosine curve over the course of training.

Warm up code

def learning_rate_with_warmup(global_step, initial_learning_rate, batches_per_epoch, lr, warmup=True):
    # Ramp the learning rate linearly from 0 to initial_learning_rate over the first 5 epochs,
    # then fall back to the regular schedule value lr.
    if warmup:
        warmup_steps = int(batches_per_epoch * 5)
        warmup_lr = (initial_learning_rate * tf.cast(global_step, tf.float32) / tf.cast(warmup_steps, tf.float32))
        return tf.cond(global_step < warmup_steps, lambda: warmup_lr, lambda: lr)
    return lr
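
Building on the fragment above, a minimal TF 2.x sketch that chains linear warmup into one of the decay schedules from 3.1 (the WarmupThenDecay class name and all hyperparameters are placeholders of mine, not our production settings):

import tensorflow as tf

class WarmupThenDecay(tf.keras.optimizers.schedules.LearningRateSchedule):
    """Linear warmup for warmup_steps steps, then hand over to an inner decay schedule."""

    def __init__(self, initial_learning_rate, warmup_steps, decay_schedule):
        super().__init__()
        self.initial_learning_rate = initial_learning_rate
        self.warmup_steps = warmup_steps
        self.decay_schedule = decay_schedule

    def __call__(self, step):
        step = tf.cast(step, tf.float32)
        warmup_steps = tf.cast(self.warmup_steps, tf.float32)
        warmup_lr = self.initial_learning_rate * step / warmup_steps
        return tf.cond(step < warmup_steps,
                       lambda: warmup_lr,
                       lambda: self.decay_schedule(step - warmup_steps))

# Example: 5000 warmup steps followed by the exponential decay used earlier (placeholder numbers)
lr_schedule = WarmupThenDecay(
    initial_learning_rate=0.15,
    warmup_steps=5000,
    decay_schedule=tf.keras.optimizers.schedules.ExponentialDecay(
        0.15, decay_steps=100000, decay_rate=0.8, staircase=True))
optimizer = tf.keras.optimizers.SGD(learning_rate=lr_schedule)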

For comparison, a PyTorch helper that selects one of several decay schedulers by name (linear, step, plateau, cosine):

from torch.optim import lr_scheduler

def get_scheduler(optimizer, opt):
    """Return a learning rate scheduler

    Parameters:
        optimizer     -- the network optimizer
        opt.lr_policy -- name of the learning-rate scheduler: linear | step | plateau | cosine
    """
    if opt.lr_policy == 'linear':
        def lambda_rule(epoch):
            # keep the initial lr for the first opt.niter epochs, then decay it linearly to 0 over opt.niter_decay epochs
            lr_l = 1.0 - max(0, epoch + opt.epoch_count - opt.niter) / float(opt.niter_decay + 1)
            return lr_l
        scheduler = lr_scheduler.LambdaLR(optimizer, lr_lambda=lambda_rule)
    elif opt.lr_policy == 'step':
        scheduler = lr_scheduler.StepLR(optimizer, step_size=opt.lr_decay_iters, gamma=0.1)
    elif opt.lr_policy == 'plateau':
        scheduler = lr_scheduler.ReduceLROnPlateau(optimizer, mode='min', factor=0.2, threshold=0.01, patience=5)
    elif opt.lr_policy == 'cosine':
        scheduler = lr_scheduler.CosineAnnealingLR(optimizer, T_max=opt.niter, eta_min=0)
    else:
        raise NotImplementedError('learning rate policy [%s] is not implemented' % opt.lr_policy)
    return scheduler
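
A minimal, hypothetical usage sketch of the helper above (the opt namespace, model, and numbers are placeholders):

import torch
from types import SimpleNamespace

# Hypothetical options object; the field names follow the helper above
opt = SimpleNamespace(lr_policy='cosine', niter=100, niter_decay=100,
                      epoch_count=1, lr_decay_iters=50)

model = torch.nn.Linear(10, 1)                       # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.15)
scheduler = get_scheduler(optimizer, opt)

for epoch in range(opt.niter):
    # ... train for one epoch ...
    scheduler.step()                                 # update the learning rate once per epoch
    print(epoch, optimizer.param_groups[0]['lr'])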


4. Cyclical learning rates (CLR)

CLR was proposed by Leslie Smith in 2015. It is a way of adjusting the learning rate in which an upper and a lower bound are set and the learning rate varies periodically between them. At first glance CLR may look like a competitor to adaptive-learning-rate techniques and plain SGD, but in practice it can be combined with the improved optimizers mentioned above for the parameter updates. A sketch follows below.
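
A minimal sketch using PyTorch's built-in torch.optim.lr_scheduler.CyclicLR (the bounds, step sizes, and model here are placeholders, not tuned values):

import torch

model = torch.nn.Linear(10, 1)   # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

# Triangular CLR: the learning rate oscillates between base_lr and max_lr,
# taking step_size_up batches to climb and the same number to come back down.
scheduler = torch.optim.lr_scheduler.CyclicLR(
    optimizer, base_lr=0.001, max_lr=0.1,
    step_size_up=2000, mode='triangular')

for batch in range(10000):
    # ... forward / backward / optimizer.step() go here ...
    scheduler.step()             # CLR is stepped once per batch, not per epoch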

References:
1. https://www.cnblogs.com/chenzhen0530/p/10632937.html
2. https://blog.csdn.net/weixin_43896398/article/details/84762886
3. https://zhuanlan.zhihu.com/p/66080948