Deep Learning Debug Notes

1. Error when reading .npy data with numpy

ValueError: Object arrays cannot be loaded when allow_pickle=False

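Cause:
A numpy version issue: starting with version 1.16.3, allow_pickle defaults to False in numpy.load().
Solutions:
1. Downgrade numpy to a version earlier than 1.16.3.
2. Set allow_pickle to True where numpy.load() is called: np.load(src, allow_pickle=True)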
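
A minimal sketch of the second fix, assuming the file holds an object array (the file name demo.npy and its contents are made up for illustration). Because allow_pickle=True unpickles arbitrary objects, it should only be used on files from a trusted source.

```python
import os
import tempfile

import numpy as np

# Hypothetical data: an object array, the kind that must be pickled to be saved.
data = np.array([{"label": 1}, {"label": 2}], dtype=object)

path = os.path.join(tempfile.mkdtemp(), "demo.npy")
np.save(path, data)  # np.save allows pickling by default

try:
    np.load(path)  # allow_pickle defaults to False since numpy 1.16.3
except ValueError as err:
    print("load failed:", err)

# Explicitly opting in makes the same file load again.
loaded = np.load(path, allow_pickle=True)
print(loaded[0]["label"])
```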

2. Problem when running unzip

Archive:  GoPro_large9G.zip
  End-of-central-directory signature not found.  Either this file is not
  a zipfile, or it constitutes one disk of a multi-part archive.  In the
  latter case the central directory and zipfile comment will be found on
  the last disk(s) of this archive.

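Cause (based on other blog posts):
The archive extracts normally on a Windows host, so the source file itself is fine. A Baidu search turned up this explanation: "On Linux, zip files are usually extracted with the system's default 'Extract Here', which uses unzip.
If the .zip archive is larger than 2 GB, unzip can no longer handle it, because on a 32-bit machine the file offset that the C library's long type can represent only reaches 2 GB."
The exact cause is unclear; it is also possible that the archive was corrupted.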
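
If Python is available, its standard zipfile module (which supports ZIP64 archives) offers one way to tell the two explanations apart: a damaged or incomplete file fails to open or fails the CRC check, while an intact one extracts normally. A sketch, where check_and_extract is a helper name introduced here for illustration:

```python
import zipfile


def check_and_extract(zip_path, out_dir):
    """Return True if the archive is intact and was extracted, else False."""
    try:
        with zipfile.ZipFile(zip_path) as zf:
            bad_member = zf.testzip()  # first member with a bad CRC, or None
            if bad_member is not None:
                print("corrupted member:", bad_member)
                return False
            zf.extractall(out_dir)
            return True
    except zipfile.BadZipFile as err:
        # Raised, for example, when the end-of-central-directory record is missing
        print("not a valid zip file:", err)
        return False
```

It could be called as check_and_extract("GoPro_large9G.zip", "GoPro_large"), where the output directory name is only an example.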

3. PyTorch version issue

anaconda3/envs/pytorchEnv/lib/python3.7/site-packages/torch/functional.py:478: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at  ../aten/src/ATen/native/TensorShape.cpp:2895.)
return _VF.meshgrid(tensors, **kwargs)  # type: ignore[attr-defined]

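Solution:
Locate the functional.py file named in the warning, find the return statement quoted in the warning (line 504 of functional.py in this case), and add indexing='ij' to it: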

return _VF.meshgrid(tensors, **kwargs, indexing='ij')  # type: ignore[attr-defined]
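
A less invasive alternative, assuming the torch.meshgrid call is in your own code and your PyTorch version already accepts the indexing argument (1.10 or later), is to pass it at the call site instead of editing the installed library. 'ij' matches the behaviour the function had before the argument existed, so results do not change. The tensor names below are illustrative:

```python
import torch

ys = torch.arange(2)
xs = torch.arange(3)

# Passing indexing explicitly avoids the warning; 'ij' keeps the old layout,
# so both grids have shape (len(ys), len(xs)).
grid_y, grid_x = torch.meshgrid(ys, xs, indexing="ij")
print(grid_y.shape, grid_x.shape)
```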

4. Error when loading model weights

RuntimeError: Error(s) in loading state_dict for ResNet:
	Unexpected key(s) in state_dict: "module.conv1.weight", "module.bn1.weight", "module.bn1.bias", "module.bn1.running_mean", "module.bn1.running_var", "module.conv2.weight", "module.bn2.weight", "module.bn2.bias", "module.bn2.running_mean", "module.bn2.running_var", "module.conv3.weight", "module.bn3.weight", "module.bn3.bias", "module.bn3.running_mean", "module.bn3.running_var", "module.layer1.0.conv1.weight", "module.layer1.0.bn1.weight", "module.layer1.0.bn1.bias", "module.layer1.0.bn1.running_mean", "module.layer1.0.bn1.running_var", "module.layer1.0.conv2.weight", "module.layer1.0.bn2.weight", "module.layer1.0.bn2.bias",

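Cause and solution:
The problem lies in the weight file. Every unexpected key starts with "module.". nn.DataParallel and DistributedDataParallel add this prefix when the state_dict of a wrapped model is saved, while the plain ResNet being loaded here expects keys without it. Either load the weights into a model wrapped the same way, or remove the "module." prefix from the checkpoint keys before calling load_state_dict().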
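
One common way to handle checkpoint keys that carry a "module." prefix is to rename them before loading. The sketch below uses a helper name, strip_module_prefix, introduced here for illustration, and a small dictionary standing in for a real checkpoint:

```python
from collections import OrderedDict


def strip_module_prefix(state_dict, prefix="module."):
    """Return a copy of state_dict with a leading prefix removed from each key."""
    cleaned = OrderedDict()
    for key, value in state_dict.items():
        new_key = key[len(prefix):] if key.startswith(prefix) else key
        cleaned[new_key] = value
    return cleaned


# Stand-in for a checkpoint returned by torch.load(...); real values are tensors.
checkpoint = OrderedDict([("module.conv1.weight", 0), ("module.bn1.bias", 1)])
print(list(strip_module_prefix(checkpoint)))
```

With a real checkpoint the result would then be passed to model.load_state_dict(...).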

5. Expanding dimensions at test time

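During training the data usually has shape (batch_size, c, h, w), but at test time only a single image of shape (c, h, w) is fed in, so a batch dimension has to be added.
Expanding dimensions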

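import cv2
import torch

img_path = "test.jpg"  # placeholder path, replace with your own image

image = cv2.imread(img_path)  # numpy array, (h, w, c), BGR channel order
# image = torch.tensor(image)
image = torch.from_numpy(image).permute(2, 0, 1)  # (h, w, c) -> (c, h, w)
print(image.size())

img = image.unsqueeze(dim=0)  # add the batch dimension
print(img.size())

img = img.squeeze(dim=0)  # remove it again
print(img.size())

# output:
# torch.Size([c, h, w])
# torch.Size([1, c, h, w])
# torch.Size([c, h, w])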

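Reducing dimensions
torch.squeeze removes all dimensions of size 1 from a tensor, which reduces its number of dimensions. If the input has shape $(A \times 1 \times B \times C \times 1 \times D)$, the output has shape $(A \times B \times C \times D)$. If the dim argument is given, only that dimension is processed, and it is removed only when its size is 1.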

>>> x = torch.zeros(2, 1, 2, 1, 2)
>>> x.size()
torch.Size([2, 1, 2, 1, 2])

>>> y = torch.squeeze(x)
>>> y.size()
torch.Size([2, 2, 2])

>>> y = torch.squeeze(x, 0)
>>> y.size()
torch.Size([2, 1, 2, 1, 2])

>>> y = torch.squeeze(x, 1)
>>> y.size()
torch.Size([2, 2, 1, 2])
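
Putting the two operations together, single-image inference might look like the sketch below. The one-layer model and the random image are stand-ins chosen for illustration, not the network used in these notes:

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for a trained network: one conv layer, 3 -> 8 channels.
model = nn.Conv2d(3, 8, kernel_size=3, padding=1)
model.eval()

image = torch.rand(3, 32, 32)  # a single image, (c, h, w)
batch = image.unsqueeze(0)     # add the batch dimension: (1, c, h, w)

with torch.no_grad():
    out = model(batch)         # (1, 8, h, w)

result = out.squeeze(0)        # drop only the batch dimension
print(batch.shape, result.shape)
```

Using squeeze(0) rather than squeeze() keeps any other size-1 dimension (for example a single-channel output) from being removed by accident.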

6. Port conflict

[W socket.cpp:401] [c10d] The server socket has failed to bind to [::]:29500 (errno: 98 - Address already in use).
[W socket.cpp:401] [c10d] The server socket has failed to bind to 0.0.0.0:29500 (errno: 98 - Address already in use).
[E socket.cpp:435] [c10d] The server socket has failed to listen on any local network address.

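Cause:
The previous DDP run was shut down midway, so its port and GPU memory were never released. The next DDP run again uses the default DDP port, 29500, which causes the conflict.
Solutions:
1. Free the GPU memory manually: run kill -9 pid on the processes holding GPU memory and close every terminal opened on this server. This releases the GPU memory and the port held by the previous DDP run.
2. Add "--master_port=xxxxx" to the DDP launch command on the command line, before xx.py (note that the GPU memory held by the previous DDP run still needs to be released, otherwise you may run out of GPU memory).
3. Use nvidia-smi to find one of the related processes and kill it. This forces the program to stop DDP, which then releases the port and the GPU resources automatically. Alternatively, interrupt the program with Ctrl+C on the command line. Ctrl+Z also stops the program, but it only suspends the process, so the DDP port is not released and you have to change the port DDP uses by hand.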
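
When choosing a replacement port for option 2, a short standard-library script can check whether 29500 is still occupied and suggest a free one. This is only a sketch: the helper names are made up here, and a port reported as free could still be taken by another process before DDP starts.

```python
import socket


def find_free_port():
    """Ask the OS for a currently unused TCP port and return its number."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind(("", 0))  # port 0 lets the OS choose a free port
        return sock.getsockname()[1]


def port_in_use(port, host="127.0.0.1"):
    """Return True if something is already listening on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        return sock.connect_ex((host, port)) == 0


print("29500 in use:", port_in_use(29500))
print("free port:", find_free_port())
```

The printed port number can then be used as the value of --master_port.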