Fixing the error: RuntimeError: Found more than one stateful callback of type `ModelCheckpoint`.

While using pytorch-lightning, I suddenly ran into the following error:

RuntimeError: Found more than one stateful callback of type `ModelCheckpoint`. In the current configuration, this callback does not support being saved alongside other instances of the same type. Please consult the documentation of `ModelCheckpoint` regarding valid settings for the callback state to be checkpointable. HINT: The `callback.state_key` must be unique among all callbacks in the Trainer.

Why does the error occur?

In the config file, both modelcheckpoint and metrics_over_trainsteps_checkpoint have an every_n_train_steps attribute. If the two values are the same, this error is raised.

lightning:
  modelcheckpoint:
    params:
      every_n_train_steps: XXX

  callbacks:
    metrics_over_trainsteps_checkpoint:
      params:
        every_n_train_steps: XXX
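
A quick pre-flight check for this condition can be sketched in Python. This is only a minimal sketch: it assumes the config above has already been loaded into a plain dict, the helper names `checkpoint_steps` and `has_step_collision` are hypothetical (they are not part of pytorch-lightning), and the step values are made up for illustration.

```python
from typing import Any, Dict, Optional


def checkpoint_steps(config: Dict[str, Any]) -> Dict[str, Optional[int]]:
    """Collect every_n_train_steps for the two checkpoint entries in the config."""
    lightning = config.get("lightning", {})
    main = lightning.get("modelcheckpoint", {}).get("params", {})
    extra = (
        lightning.get("callbacks", {})
        .get("metrics_over_trainsteps_checkpoint", {})
        .get("params", {})
    )
    return {
        "modelcheckpoint": main.get("every_n_train_steps"),
        "metrics_over_trainsteps_checkpoint": extra.get("every_n_train_steps"),
    }


def has_step_collision(config: Dict[str, Any]) -> bool:
    """Return True when both entries set every_n_train_steps to the same value."""
    steps = checkpoint_steps(config)
    a = steps["modelcheckpoint"]
    b = steps["metrics_over_trainsteps_checkpoint"]
    # A missing value is not treated as a collision in this sketch.
    return a is not None and a == b


# Illustrative config; 5000 is an invented value standing in for XXX.
colliding = {
    "lightning": {
        "modelcheckpoint": {"params": {"every_n_train_steps": 5000}},
        "callbacks": {
            "metrics_over_trainsteps_checkpoint": {
                "params": {"every_n_train_steps": 5000}
            }
        },
    }
}

if has_step_collision(colliding):
    print("every_n_train_steps is identical for both checkpoint callbacks")
```

Running a check like this before building the Trainer would flag the problem at config-loading time rather than when the first checkpoint is written.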

Cause and fix

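Checkpoints are saved at the interval given by every_n_train_steps of modelcheckpoint. If metrics_over_trainsteps_checkpoint uses the same every_n_train_steps, both callbacks save at exactly the same steps, so there is no separate set of checkpoints to score and pick from when deciding which ones to keep. With identical settings the two callbacks also cannot be told apart, which is what the HINT in the error message points at: each callback's state_key must be unique.
So every_n_train_steps of metrics_over_trainsteps_checkpoint should be set larger than every_n_train_steps of modelcheckpoint, which also makes the two values differ as the HINT requires.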
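
To make the HINT about `callback.state_key` concrete, here is a simplified model of the uniqueness requirement. It does not import pytorch-lightning. It assumes that a callback's key is built from its class name plus a few of its settings (here `monitor` and `every_n_train_steps`). The exact fields used by the library may differ between versions, and `sketch_state_key` and `duplicate_keys` are hypothetical names used only for illustration.

```python
from typing import Any, List


def sketch_state_key(class_name: str, **settings: Any) -> str:
    """Simplified stand-in for a callback state key: class name plus settings."""
    return f"{class_name}{settings!r}"


def duplicate_keys(keys: List[str]) -> List[str]:
    """Return the keys that occur more than once, in first-seen order."""
    seen = set()
    dupes: List[str] = []
    for key in keys:
        if key in seen and key not in dupes:
            dupes.append(key)
        seen.add(key)
    return dupes


# Two callbacks with identical settings produce identical keys (values are made up).
same = [
    sketch_state_key("ModelCheckpoint", monitor=None, every_n_train_steps=5000),
    sketch_state_key("ModelCheckpoint", monitor=None, every_n_train_steps=5000),
]

# Changing every_n_train_steps on one of them makes the keys distinct.
different = [
    sketch_state_key("ModelCheckpoint", monitor=None, every_n_train_steps=5000),
    sketch_state_key("ModelCheckpoint", monitor=None, every_n_train_steps=10000),
]

print(duplicate_keys(same))       # a list containing the one repeated key
print(duplicate_keys(different))  # []
```

Under this model, any setting that feeds into the key would resolve the clash if it differed. Changing every_n_train_steps is simply the setting this post focuses on.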

If anything here is wrong, corrections are welcome.