CUDA out of memory #205

I am running the code on a single machine with an A100 GPU (80 GB of memory) and encountered the following error:
Traceback (most recent call last):
  File "main_fft_pretrain.py", line 302, in <module>
    main(args)
  File "main_fft_pretrain.py", line 270, in main
    train_stats = train_one_epoch(
  File "/data0/zhiyong/code/github/mae/engine_pretrain.py", line 48, in train_one_epoch
    loss, _, _ = model(samples, mask_ratio=args.mask_ratio)
  File "/home/zhiyongzhang/anaconda3/envs/mae/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/data0/zhiyong/code/github/mae/models_fft_2.py", line 641, in forward
    latent, mask, ids_restore = self.forward_encoder(imgs, mask_ratio)
  File "/data0/zhiyong/code/github/mae/models_fft_2.py", line 545, in forward_encoder
    x_combined = blk(x_combined)
  File "/home/zhiyongzhang/anaconda3/envs/mae/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/zhiyongzhang/anaconda3/envs/mae/lib/python3.8/site-packages/timm/models/vision_transformer.py", line 165, in forward
    x = x + self.drop_path1(self.ls1(self.attn(self.norm1(x))))
  File "/home/zhiyongzhang/anaconda3/envs/mae/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/zhiyongzhang/anaconda3/envs/mae/lib/python3.8/site-packages/timm/models/vision_transformer.py", line 99, in forward
    attn = attn.softmax(dim=-1)
RuntimeError: CUDA out of memory. Tried to allocate 8.90 GiB (GPU 0; 79.21 GiB total capacity; 60.10 GiB already allocated; 7.09 GiB free; 60.19 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
My args:
python main_fft_pretrain.py --world_size 2 --batch_size 4 --model mae_vit_fft_base_patch16 --norm_pix_loss --mask_ratio 0.75 --epochs 800 --warmup_epochs 40 --blr 1.5e-4 --weight_decay 0.05 --data_path /data0/zhiyong/data/imagenetResize
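
As a rough sanity check on the numbers in the error message, the sketch below estimates how many tokens per sample the failing attn.softmax call would have to be processing. It is only back-of-the-envelope arithmetic, not part of the training script. It assumes float32 attention scores, 12 attention heads (as in a standard ViT-Base; the custom mae_vit_fft_base_patch16 model may differ), and that --batch_size 4 is the per-GPU batch. The helper names are made up for illustration.

```python
import math

GIB = 1024 ** 3

# Figures copied from the error message above.
requested_gib = 8.90
allocated_gib = 60.10
reserved_gib = 60.19

# Assumptions (not confirmed by the traceback): float32 attention scores,
# 12 heads as in a standard ViT-Base, and --batch_size 4 samples per GPU.
batch_size = 4
num_heads = 12
bytes_per_element = 4


def attention_scores_bytes(batch, heads, tokens, elem_bytes):
    """Size of one (batch, heads, tokens, tokens) attention score tensor."""
    return batch * heads * tokens * tokens * elem_bytes


def tokens_for_allocation(alloc_bytes, batch, heads, elem_bytes):
    """Invert the formula above: sequence length that yields alloc_bytes."""
    return math.sqrt(alloc_bytes / (batch * heads * elem_bytes))


# Gap between reserved and allocated memory; a large gap would hint at
# fragmentation, which is what max_split_size_mb is meant to address.
fragmentation_gap_gib = reserved_gib - allocated_gib

approx_tokens = tokens_for_allocation(
    requested_gib * GIB, batch_size, num_heads, bytes_per_element
)

print(f"reserved - allocated: {fragmentation_gap_gib:.2f} GiB")
print(f"approx. tokens per sample: {approx_tokens:.0f}")
```

Under these assumptions the single 8.90 GiB allocation corresponds to roughly 7,000 tokens per sample. For comparison, a plain ViT on a 224x224 image with 16x16 patches has 196 patch tokens, so the sequence length of x_combined in forward_encoder may be worth checking. Reserved memory exceeds allocated memory by only about 0.09 GiB, so the max_split_size_mb hint about fragmentation probably does not apply here.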
 
