【Paper】Reproducing VideoMAE
Paper Information
Full title: VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training
Published at NeurIPS 2022
Paper link: https://arxiv.org/abs/2203.12602v3
Official code: https://github.com/MCG-NJU/VideoMAE
Papers with Code: https://paperswithcode.com/paper/videomae-masked-autoencoders-are-data-1
The authors also published a follow-up at CVPR 2023, VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking, which I may also reproduce if time permits.
Environment Setup
The official repository lists the following requirements:
Python 3.6 or higher
PyTorch and torchvision.
We can successfully reproduce the main results under two settings below:
Tesla A100 (40G): CUDA 11.1 + PyTorch 1.8.0 + torchvision 0.9.0
Tesla V100 (32G): CUDA 10.1 + PyTorch 1.6.0 + torchvision 0.7.0
timm==0.4.8/0.4.12
deepspeed==0.5.8
DS_BUILD_OPS=1 pip install deepspeed
TensorboardX
decord
einops
conda create -n mae python=3.9
conda activate mae
conda install pytorch==1.12.1 torchvision==0.13.1 torchaudio==0.12.1 cudatoolkit=11.3 -c pytorch
pip install timm==0.4.8
pip install deepspeed==0.5.8
pip install TensorboardX
pip install decord
pip install einops
Most of the last few packages can only be installed with pip; they are generally not available via conda install.
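Before moving on, a quick sanity check (my own convenience snippet, not from the official repo) to confirm that the key packages import and that CUDA is visible:

import torch
import torchvision
import timm
import deepspeed
import decord
import einops

print('torch:', torch.__version__, '| CUDA available:', torch.cuda.is_available())
print('torchvision:', torchvision.__version__)
print('timm:', timm.__version__)
print('deepspeed:', deepspeed.__version__)
print('decord:', decord.__version__)
print('einops:', einops.__version__)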
Dataset Preparation
Kinetics-400 (152 GB, not recommended)
A preprocessed copy of K400 (videos resized to a short side of 320 px) can be downloaded here, but it requires registering and logging in; follow the instructions at https://opendatalab.com/Kinetics-400/cli
Official site: https://www.deepmind.com/open-source/kinetics
Something-Something V2 (19.4 GB)
Official site: https://developer.qualcomm.com/software/ai-datasets/something-something
Alternatively, Baidu Netdisk: https://pan.baidu.com/s/1c1AQn29jLJkJt4CzbrVmsQ (extraction code: 6666)
For extracting the downloaded archives, see https://blog.csdn.net/weixin_43759637/article/details/131351983
Preprocess the dataset by converting the videos from webm to .mp4 (the original height is 240 px). The preprocessing is discussed in https://github.com/MCG-NJU/VideoMAE/issues/62, but it failed for me, and the preprocessing script mentioned in https://github.com/MCG-NJU/VideoMAE/issues?page=3&q=is%3Aissue+is%3Aclosed is no longer available, so I wrote my own preprocessing script (shown below).
Generate the annotations needed by the data loader (each annotation line is "<path_to_video> <video_class>"). The annotations usually include train.csv, val.csv, and test.csv (here test.csv is identical to val.csv).
Download train.csv, val.csv, and test.csv:
https://drive.google.com/drive/folders/1cfA-SrPhDB9B8ZckPvnh8D5ysCjD-S_I?usp=share_link
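For reference, each line of these CSV files is simply the video path and the integer class label separated by a space. If you need to regenerate them yourself, a minimal sketch follows; the paths and the sample list are purely illustrative, and for SSv2 the (video path, label) pairs would come from the official label JSON files.

import csv

def write_annotation(csv_path, samples):
    # samples: iterable of (video_path, integer_class_label)
    with open(csv_path, 'w', newline='') as f:
        writer = csv.writer(f, delimiter=' ')
        for video_path, label in samples:
            writer.writerow([video_path, label])

# Hypothetical example; replace with your real paths and labels.
write_annotation('train.csv', [
    ('/data/ssv2/videos_mp4/1.mp4', 0),
    ('/data/ssv2/videos_mp4/2.mp4', 57),
])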
The preprocessing script I wrote:
import os
import argparse

import moviepy.editor as mp


def resize_and_convert_video(input_file, output_file, target_short_edge=320, output_format='mp4'):
    # Load the input video
    video = mp.VideoFileClip(input_file)
    # Compute the new size so that the short edge equals target_short_edge
    width, height = video.size
    if width > height:
        new_height = target_short_edge
        new_width = int(width * (target_short_edge / height))
    else:
        new_width = target_short_edge
        new_height = int(height * (target_short_edge / width))
    # Resize the video
    resized_video = video.resize(height=new_height, width=new_width)
    # Re-encode and save in the target format
    output_file_path, _ = os.path.splitext(output_file)
    output_file_path += f'.{output_format}'
    resized_video.write_videofile(output_file_path, codec='libx264', audio_codec='aac')
    video.close()


if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='Resize and convert videos in a directory')
    parser.add_argument('input_dir', type=str, help='input directory containing webm videos')
    parser.add_argument('output_dir', type=str, help='output directory for resized and converted videos')
    args = parser.parse_args()
    # Create the output directory if it does not exist
    os.makedirs(args.output_dir, exist_ok=True)
    # Iterate over all webm files in the input directory
    for file_name in os.listdir(args.input_dir):
        if file_name.lower().endswith('.webm'):
            # Build the input and output file paths
            input_file = os.path.join(args.input_dir, file_name)
            output_file = os.path.join(args.output_dir, os.path.splitext(file_name)[0] + '.mp4')
            # Resize and convert
            resize_and_convert_video(input_file, output_file, target_short_edge=320, output_format='mp4')
python resize_convert_videos.py input_dir output_dir
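SSv2 contains on the order of 200k clips, so converting them one at a time is slow. As a rough sketch (assuming the script above is saved as resize_convert_videos.py so the function can be imported), the loop can be parallelized with a process pool:

import os
from functools import partial
from multiprocessing import Pool

from resize_convert_videos import resize_and_convert_video

def convert_one(file_name, input_dir, output_dir):
    # Convert a single webm file to mp4 using the function from the script above.
    input_file = os.path.join(input_dir, file_name)
    output_file = os.path.join(output_dir, os.path.splitext(file_name)[0] + '.mp4')
    resize_and_convert_video(input_file, output_file)

if __name__ == '__main__':
    input_dir, output_dir = 'input_dir', 'output_dir'  # adjust to your paths
    os.makedirs(output_dir, exist_ok=True)
    files = [f for f in os.listdir(input_dir) if f.lower().endswith('.webm')]
    with Pool(processes=8) as pool:  # tune the worker count to your CPU
        pool.map(partial(convert_one, input_dir=input_dir, output_dir=output_dir), files)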
UCF101 (about 6.5 GB)
Official site: https://www.crcv.ucf.edu/data/UCF101.php
For an introduction to the dataset and its preprocessing, see the blog post "UCF101动作识别数据集简介及数据预处理" (an introduction to the UCF101 action recognition dataset and its preprocessing).
No further preprocessing of UCF101 is needed: https://github.com/MCG-NJU/VideoMAE/issues/35 and https://github.com/MCG-NJU/VideoMAE/issues/69
However, the dataset still has to be split into train/val/test, and the authors did not say which split they used (https://github.com/MCG-NJU/VideoMAE/issues/35); one option is sketched below.
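A reasonable choice is to build train.csv/val.csv from the official UCF101 split files (trainlist01.txt, testlist01.txt, classInd.txt) that ship with the dataset. The sketch below assumes the usual format of those files (1-based class indices in classInd.txt, "ClassName/video.avi" relative paths in the list files) and uses hypothetical directory paths, so verify both against your copy:

import os

video_root = '/data/ucf101/videos'           # hypothetical path to the extracted .avi files
split_root = '/data/ucf101/ucfTrainTestlist'  # hypothetical path to the official split files

# Map class name -> 0-based label, assuming classInd.txt lines like "1 ApplyEyeMakeup"
class_to_idx = {}
with open(os.path.join(split_root, 'classInd.txt')) as f:
    for line in f:
        if not line.strip():
            continue
        idx, name = line.strip().split()
        class_to_idx[name] = int(idx) - 1

def write_csv(list_file, out_csv):
    # List files contain "ClassName/video.avi" (trainlist also has a trailing label we ignore).
    with open(os.path.join(split_root, list_file)) as f, open(out_csv, 'w') as out:
        for line in f:
            if not line.strip():
                continue
            rel_path = line.strip().split()[0]
            label = class_to_idx[rel_path.split('/')[0]]
            out.write(f'{os.path.join(video_root, rel_path)} {label}\n')

write_csv('trainlist01.txt', 'train.csv')
write_csv('testlist01.txt', 'val.csv')  # test.csv can be a copy of val.csv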
Checkpoint download: https://drive.google.com/file/d/1MSyon6fPpKz7oqD6WDGPFK4k_Rbyb6fw/view