I am running the code on a single machine with an A100 GPU (80 GB of memory), and I get the following error:
Traceback (most recent call last):
File "main_fft_pretrain.py", line 302, in <module>
main(args)
File "main_fft_pretrain.py", line 270, in main
train_stats = train_one_epoch(
File "/data0/zhiyong/code/github/mae/engine_pretrain.py", line 48, in train_one_epoch
loss, _, _ = model(samples, mask_ratio=args.mask_ratio)
File "/home/zhiyongzhang/anaconda3/envs/mae/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/data0/zhiyong/code/github/mae/models_fft_2.py", line 641, in forward
latent, mask, ids_restore = self.forward_encoder(imgs, mask_ratio)
File "/data0/zhiyong/code/github/mae/models_fft_2.py", line 545, in forward_encoder
x_combined = blk(x_combined)
File "/home/zhiyongzhang/anaconda3/envs/mae/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/home/zhiyongzhang/anaconda3/envs/mae/lib/python3.8/site-packages/timm/models/vision_transformer.py", line 165, in forward
x = x + self.drop_path1(self.ls1(self.attn(self.norm1(x))))
File "/home/zhiyongzhang/anaconda3/envs/mae/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/home/zhiyongzhang/anaconda3/envs/mae/lib/python3.8/site-packages/timm/models/vision_transformer.py", line 99, in forward
attn = attn.softmax(dim=-1)
RuntimeError: CUDA out of memory. Tried to allocate 8.90 GiB (GPU 0; 79.21 GiB total capacity; 60.10 GiB already allocated; 7.09 GiB free; 60.19 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
My args:
python main_fft_pretrain.py --world_size 2 --batch_size 4 --model mae_vit_fft_base_patch16 --norm_pix_loss --mask_ratio 0.75 --epochs 800 --warmup_epochs 40 --blr 1.5e-4 --weight_decay 0.05 --data_path /data0/zhiyong/data/imagenetResize
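For reference, here is a rough sanity check of the 8.90 GiB allocation at `attn.softmax`. It is only a sketch under assumptions that are not confirmed by the traceback: the attention tensor has shape (batch, heads, tokens, tokens), the model uses 12 heads as in a standard ViT-Base, and activations are float32. The helper names `attention_matrix_gib` and `seq_len_for_allocation` are made up for this illustration.

```python
import math

GIB = 1024 ** 3  # bytes per GiB


def attention_matrix_gib(batch_size, num_heads, seq_len, bytes_per_element=4):
    """Size in GiB of one (batch, heads, seq_len, seq_len) attention tensor."""
    return batch_size * num_heads * seq_len * seq_len * bytes_per_element / GIB


def seq_len_for_allocation(alloc_gib, batch_size, num_heads, bytes_per_element=4):
    """Sequence length whose attention tensor would occupy alloc_gib GiB."""
    elements = alloc_gib * GIB / bytes_per_element
    return math.sqrt(elements / (batch_size * num_heads))


# Assumptions: batch size 4 (from the command), 12 heads, float32.
# A standard MAE encoder on 224x224 input (assumed) sees 49 visible patches + 1 cls token.
print("standard MAE encoder attention (GiB):", attention_matrix_gib(4, 12, 50))

# Sequence length implied by the 8.90 GiB allocation in the error message.
print("implied sequence length:", seq_len_for_allocation(8.90, 4, 12))
```

Under these assumptions the failed allocation corresponds to a sequence of roughly 7,000 tokens per sample, far more than the 50 tokens of a standard MAE encoder at mask ratio 0.75. If that is right, the length of `x_combined` in `forward_encoder`, rather than the batch size, would be the main driver of memory use.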