【Paper】Reproducing VideoMAE

Paper Information

Full title: VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training

Published at NeurIPS 2022

Paper link: https://arxiv.org/abs/2203.12602v3

Official code: https://github.com/MCG-NJU/VideoMAE

Papers with Code link: https://paperswithcode.com/paper/videomae-masked-autoencoders-are-data-1

The authors also published VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking at CVPR 2023; I will reproduce that one as well if time permits.

Environment Setup

Python 3.6 or higher

PyTorch and torchvision.
We can successfully reproduce the main results under two settings below:
Tesla A100 (40G): CUDA 11.1 + PyTorch 1.8.0 + torchvision 0.9.0
Tesla V100 (32G): CUDA 10.1 + PyTorch 1.6.0 + torchvision 0.7.0

timm==0.4.8/0.4.12

deepspeed==0.5.8

DS_BUILD_OPS=1 pip install deepspeed

TensorboardX

decord

einops

conda create -n mae python=3.9

conda activate mae

conda install pytorch==1.12.1 torchvision==0.13.1 torchaudio==0.12.1 cudatoolkit=11.3 -c pytorch

pip install timm==0.4.8

pip install deepspeed==0.5.8

pip install TensorboardX

pip install decord 

pip install einops

Most of the later packages can only be installed with pip, not with conda.

Dataset Preparation

Kinetics-400 (152 GB, not recommended)

A preprocessed K400 (videos with a short edge of 320 px) is available for download, but it requires registering and logging in; follow the instructions at https://opendatalab.com/Kinetics-400/cli

Official site: https://www.deepmind.com/open-source/kinetics

Something-Something V2 (19.4 GB)

Official site: https://developer.qualcomm.com/software/ai-datasets/something-something

Alternatively, Baidu Pan: https://pan.baidu.com/s/1c1AQn29jLJkJt4CzbrVmsQ (extraction code: 6666)

For extracting the downloaded archives, see https://blog.csdn.net/weixin_43759637/article/details/131351983

Preprocess the dataset by changing the video extension from .webm to .mp4 (the original height is 240 px). For the preprocessing, see https://github.com/MCG-NJU/VideoMAE/issues/62; it failed when I ran it, and the preprocessing script mentioned in https://github.com/MCG-NJU/VideoMAE/issues?page=3&q=is%3Aissue+is%3Aclosed is gone as well, so I wrote my own preprocessing script.

Generate the annotations required by the data loader (each annotation line has the form "<path_to_video> <video_class>"). The annotations usually include train.csv, val.csv, and test.csv (here test.csv is identical to val.csv).
Download train.csv, val.csv, and test.csv:
https://drive.google.com/drive/folders/1cfA-SrPhDB9B8ZckPvnh8D5ysCjD-S_I?usp=share_link
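As an illustration of this annotation format, here is a sketch of my own (not the official tooling; the per-class subfolder layout and the helper names `make_annotation_lines` / `scan_class_folders` are assumptions, and note that for Something-Something V2 the real labels come from the official label JSON files) that writes "<path_to_video> <video_class>" lines:

```python
import os


def make_annotation_lines(video_label_pairs):
    # Format (video_path, class_id) pairs as '<path_to_video> <video_class>' lines
    return [f'{path} {label}' for path, label in video_label_pairs]


def scan_class_folders(root_dir):
    # Assume root_dir/<class_name>/<video>.mp4; assign class ids alphabetically
    classes = sorted(d for d in os.listdir(root_dir)
                     if os.path.isdir(os.path.join(root_dir, d)))
    class_to_id = {name: i for i, name in enumerate(classes)}
    pairs = []
    for name in classes:
        folder = os.path.join(root_dir, name)
        for f in sorted(os.listdir(folder)):
            if f.endswith('.mp4'):
                pairs.append((os.path.join(folder, f), class_to_id[name]))
    return pairs


if __name__ == '__main__' and os.path.isdir('videos'):  # 'videos' is a hypothetical root
    pairs = scan_class_folders('videos')
    with open('train.csv', 'w') as fh:
        fh.write('\n'.join(make_annotation_lines(pairs)) + '\n')
```

Class ids are assigned alphabetically here, so make sure all splits are generated from the same class list.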
My own preprocessing script:

import os
import argparse
import moviepy.editor as mp


def resize_and_convert_video(input_file, output_file, target_short_edge=320, output_format='mp4'):
    # Read the input video
    video = mp.VideoFileClip(input_file)
    
    # Scale so that the SHORT edge ends up at target_short_edge,
    # preserving the aspect ratio
    width, height = video.size
    if width < height:
        new_width = target_short_edge
        new_height = int(height * (target_short_edge / width))
    else:
        new_height = target_short_edge
        new_width = int(width * (target_short_edge / height))
    
    # Resize the video (newsize is (width, height))
    resized_video = video.resize(newsize=(new_width, new_height))
    
    # Convert the container format and save
    output_file_path, _ = os.path.splitext(output_file)
    output_file_path += f'.{output_format}'
    resized_video.write_videofile(output_file_path, codec='libx264', audio_codec='aac')
    video.close()


if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='Resize and convert videos in a directory')
    parser.add_argument('input_dir', type=str, help='input directory containing webm videos')
    parser.add_argument('output_dir', type=str, help='output directory for resized and converted videos')
    args = parser.parse_args()

    # Create the output directory
    os.makedirs(args.output_dir, exist_ok=True)

    # Iterate over every .webm file in the input directory
    for file_name in os.listdir(args.input_dir):
        if file_name.lower().endswith('.webm'):
            # Build the input and output file paths
            input_file = os.path.join(args.input_dir, file_name)
            output_file = os.path.join(args.output_dir, os.path.splitext(file_name)[0] + '.mp4')

            # Resize and convert
            resize_and_convert_video(input_file, output_file, target_short_edge=320, output_format='mp4')

python resize_convert_videos.py input_dir output_dir
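To sanity-check the converted videos, one can read a frame back with decord (installed in the environment above) and confirm the short edge; `is_short_edge` and `check_video` are my own helpers, not part of decord:

```python
import sys


def is_short_edge(height, width, target=320):
    # True if the shorter of the two dimensions equals the target short edge
    return min(height, width) == target


def check_video(path, target=320):
    # decord returns frames as HxWxC arrays
    from decord import VideoReader
    vr = VideoReader(path)
    h, w, _ = vr[0].asnumpy().shape
    return is_short_edge(h, w, target)


if __name__ == '__main__' and len(sys.argv) > 1:
    print(check_video(sys.argv[1]))
```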

UCF101 (about 6.5 GB)

Official site: https://www.crcv.ucf.edu/data/UCF101.php

For details, see the (Chinese) posts 《UCF101动作识别数据集简介绍及数据预处理》

and 《视频数据集UCF101的处理与加载》

No further preprocessing is required for UCF101: https://github.com/MCG-NJU/VideoMAE/issues/35 https://github.com/MCG-NJU/VideoMAE/issues/69

However, the dataset still has to be split, and the authors do not say how they split it: https://github.com/MCG-NJU/VideoMAE/issues/35
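Since the authors do not specify a split, a reasonable default is the official UCF101 split 1 that ships with the dataset (in ucfTrainTestlist/: classInd.txt with 1-indexed lines like "1 ApplyEyeMakeup", trainlist01.txt with lines like "ApplyEyeMakeup/v_ApplyEyeMakeup_g08_c01.avi 1", and testlist01.txt with paths only). The sketch below is my own; it assumes the loader wants the same 0-indexed "<path> <label>" format as above:

```python
import os


def parse_class_index(lines):
    # classInd.txt uses 1-indexed class ids; convert them to 0-indexed
    mapping = {}
    for line in lines:
        line = line.strip()
        if line:
            idx, name = line.split()
            mapping[name] = int(idx) - 1
    return mapping


def split_to_csv_lines(split_lines, class_to_id):
    # trainlist01.txt lines: 'Class/video.avi 1'; testlist01.txt: 'Class/video.avi'.
    # The label is re-derived from the class-folder prefix in both cases.
    out = []
    for line in split_lines:
        line = line.strip()
        if not line:
            continue
        rel_path = line.split()[0]
        class_name = rel_path.split('/')[0]
        out.append(f'{rel_path} {class_to_id[class_name]}')
    return out


if __name__ == '__main__' and os.path.exists('ucfTrainTestlist/classInd.txt'):
    with open('ucfTrainTestlist/classInd.txt') as f:
        class_to_id = parse_class_index(f)
    # Use split 1; test.csv is kept identical to val.csv, as for SSv2 above
    for split_file, csv_file in [('trainlist01.txt', 'train.csv'), ('testlist01.txt', 'val.csv')]:
        with open(f'ucfTrainTestlist/{split_file}') as f:
            lines = split_to_csv_lines(f, class_to_id)
        with open(csv_file, 'w') as f:
            f.write('\n'.join(lines) + '\n')
```

Prepend the path to your extracted video directory to each line if the loader expects absolute paths.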

Checkpoint download: https://drive.google.com/file/d/1MSyon6fPpKz7oqD6WDGPFK4k_Rbyb6fw/view