torch.cuda.OutOfMemoryError: CUDA out of memory. real_basicvsr/basicvsr++ errors at runtime #2117

Open · 3 tasks done
txy00001 opened this issue Jan 25, 2024 · 2 comments
Labels: kind/bug (something isn't working)

@txy00001

Prerequisite

Task

I'm using the official example scripts/configs for the officially supported tasks/models/datasets.

Branch

main branch https://github.com/open-mmlab/mmagic

Environment

(environment screenshot)
CUDA 11.7, cuDNN 8.9, GPU: NVIDIA RTX 4090

Reproduces the problem - code sample

import os
import time
from mmagic.apis import MMagicInferencer
from mmengine import mkdir_or_exist

# Create a MMagicInferencer instance and run inference.
video = '/home/txy/code/blur/video/6.mp4'
result_out_dir = '/home/txy/code/blur/output/6.mp4'
mkdir_or_exist(os.path.dirname(result_out_dir))

beg = time.time()
editor = MMagicInferencer('real_basicvsr', device='cuda:1')
results = editor.infer(video=video, result_out_dir=result_out_dir)
print(time.time() - beg)  # elapsed seconds

Reproduces the problem - command or script

Same script as above, run directly with Python.

Reproduces the problem - error message

/home/txy/anaconda3/envs/mmpose/lib/python3.10/site-packages/torchvision/models/_utils.py:208: UserWarning: The parameter 'pretrained' is deprecated since 0.13 and may be removed in the future, please use 'weights' instead.
warnings.warn(
/home/txy/anaconda3/envs/mmpose/lib/python3.10/site-packages/torchvision/models/_utils.py:223: UserWarning: Arguments other than a weight enum or None for 'weights' are deprecated since 0.13 and may be removed in the future. The current behavior is equivalent to passing weights=VGG19_Weights.IMAGENET1K_V1. You can also use weights=VGG19_Weights.DEFAULT to get the most up-to-date weights.
warnings.warn(msg)
Loads checkpoint by http backend from path: https://download.openmmlab.com/mmediting/restorers/real_basicvsr/realbasicvsr_c64b20_1x30x8_lr5e-5_150k_reds_20211104-52f77c2c.pth
The model and loaded state dict do not match exactly

unexpected key in source state_dict: step_counter

01/25 17:15:56 - mmengine - WARNING - Failed to search registry with scope "mmagic" in the "function" registry tree. As a workaround, the current "function" registry in "mmengine" is used to build instance. This may cause unexpected failure when running the built modules. Please check whether "mmagic" is a correct scope, or whether the registry is initialized.
/home/txy/anaconda3/envs/mmpose/lib/python3.10/site-packages/mmengine/visualization/visualizer.py:196: UserWarning: Failed to add <class 'mmengine.visualization.vis_backend.LocalVisBackend'>, please provide the save_dir argument.
warnings.warn(f'Failed to add {vis_backend.__class__}, '
Traceback (most recent call last):
File "/home/txy/code/blur/demo/bas_real/real_infer_video.py", line 12, in
results = editor.infer(video=video, result_out_dir=result_out_dir)
File "/home/txy/anaconda3/envs/mmpose/lib/python3.10/site-packages/mmagic/apis/mmagic_inferencer.py", line 231, in infer
return self.inferencer(
File "/home/txy/anaconda3/envs/mmpose/lib/python3.10/site-packages/mmagic/apis/inferencers/init.py", line 110, in call
return self.inferencer(**kwargs)
File "/home/txy/anaconda3/envs/mmpose/lib/python3.10/site-packages/mmagic/apis/inferencers/base_mmagic_inferencer.py", line 139, in call
results = self.base_call(**kwargs)
File "/home/txy/anaconda3/envs/mmpose/lib/python3.10/site-packages/mmagic/apis/inferencers/base_mmagic_inferencer.py", line 165, in base_call
preds = self.forward(data, **forward_kwargs)
File "/home/txy/anaconda3/envs/mmpose/lib/python3.10/site-packages/mmagic/apis/inferencers/video_restoration_inferencer.py", line 127, in forward
result = self.model(
File "/home/txy/anaconda3/envs/mmpose/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/txy/anaconda3/envs/mmpose/lib/python3.10/site-packages/mmagic/models/base_models/base_edit_model.py", line 109, in forward
return self.forward_tensor(inputs, data_samples, **kwargs)
File "/home/txy/anaconda3/envs/mmpose/lib/python3.10/site-packages/mmagic/models/editors/real_esrgan/real_esrgan.py", line 112, in forward_tensor
feats = self.generator_ema(inputs)
File "/home/txy/anaconda3/envs/mmpose/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/txy/anaconda3/envs/mmpose/lib/python3.10/site-packages/mmagic/models/editors/real_basicvsr/real_basicvsr_net.py", line 88, in forward
residues = self.image_cleaning(lqs)
File "/home/txy/anaconda3/envs/mmpose/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/txy/anaconda3/envs/mmpose/lib/python3.10/site-packages/torch/nn/modules/container.py", line 217, in forward
input = module(input)
File "/home/txy/anaconda3/envs/mmpose/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/txy/anaconda3/envs/mmpose/lib/python3.10/site-packages/mmagic/models/editors/basicvsr/basicvsr_net.py", line 214, in forward
return self.main(feat)
File "/home/txy/anaconda3/envs/mmpose/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/txy/anaconda3/envs/mmpose/lib/python3.10/site-packages/torch/nn/modules/container.py", line 217, in forward
input = module(input)
File "/home/txy/anaconda3/envs/mmpose/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/txy/anaconda3/envs/mmpose/lib/python3.10/site-packages/torch/nn/modules/conv.py", line 463, in forward
return self._conv_forward(input, self.weight, self.bias)
File "/home/txy/anaconda3/envs/mmpose/lib/python3.10/site-packages/torch/nn/modules/conv.py", line 459, in _conv_forward
return F.conv2d(input, weight, bias, self.stride,
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 59.33 GiB (GPU 1; 23.65 GiB total capacity; 2.92 GiB already allocated; 19.91 GiB free; 2.93 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
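
Note on the allocator hint at the end of this message: max_split_size_mb / PYTORCH_CUDA_ALLOC_CONF addresses fragmentation, where reserved memory far exceeds allocated memory. That is not the case here, since the single failed allocation (59.33 GiB) is more than twice the card's total capacity (23.65 GiB), so no allocator setting can help; the per-call input itself has to be made smaller. For reference, this is how the hint would be applied (it must be set before the first CUDA allocation in the process):

import os

# Caps the size of splittable cached blocks to reduce fragmentation.
# Note: this does not increase total GPU memory, so it cannot satisfy
# an allocation request that exceeds the card's capacity.
os.environ['PYTORCH_CUDA_ALLOC_CONF'] = 'max_split_size_mb:128'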

Additional information

I want to super-resolve some videos, 1 s and 5 s long, at 1280×720 and 3160×2160. With both real_basicvsr and basicvsr++ I get the error above (screenshot attached showing the same message). I switched servers and videos and the problem persists. How can I solve it?

@txy00001 added the kind/bug (something isn't working) label on Jan 25, 2024
@Feynman1999

Feynman1999 commented Jan 26, 2024

Just passing by (not an official reply). My guess is that too many frames are being fed to the model in one call. Take the 1 s video you mention: at 1280×720 and 30 frames, the required GPU memory is already substantial, on the order of tens of GB (a rough estimate follows after this list). So you need to check:

  1. When editor.infer receives a video, does it decode all frames and run the model on all of them at once, or does it use the max_seq_len parameter to process them window by window? The logic is here: https://github.com/open-mmlab/mmagic/blob/main/mmagic/apis/inferencers/video_restoration_inferencer.py#L126
  2. If 1. passes all frames at once, the OOM comes from the frame count; if it already passes only a subset, then at your resolution you need to make that subset smaller (the max_seq_len parameter) to avoid the OOM (a sketch follows below).
     To change max_seq_len, you need to find out how to pass arguments through to the inferencer; the overall model-argument logic is at https://github.com/open-mmlab/mmagic/blob/main/mmagic/apis/mmagic_inferencer.py#L150, which appears to read a default config file.
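
A back-of-envelope check of the "tens of GB" estimate above. The numbers are illustrative assumptions: 30 decoded frames, fp32 activations, and the 64-channel width implied by the c64 in the checkpoint name:

# Rough activation-memory estimate when a whole clip is fed at once.
frames, channels = 30, 64        # 1 s at 30 fps; RealBasicVSR c64 width
height, width = 720, 1280        # input resolution from the report
bytes_per_el = 4                 # fp32

one_feature_map = frames * channels * height * width * bytes_per_el
print(f'{one_feature_map / 1024**3:.1f} GiB per feature tensor')  # ~6.6 GiB

# The cleaning module and the bidirectional propagation keep several
# such tensors alive at once, and the 4x-upscaled output has 16x the
# pixels, so a peak of tens of GiB is plausible.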

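A minimal sketch of the windowed-inference fix described above. The extra_parameters argument and the max_seq_len key are read off the linked inferencer source; treat both names as assumptions and verify them against the installed mmagic version:

import os
import time
from mmagic.apis import MMagicInferencer
from mmengine import mkdir_or_exist

video = '/home/txy/code/blur/video/6.mp4'
result_out_dir = '/home/txy/code/blur/output/6.mp4'
mkdir_or_exist(os.path.dirname(result_out_dir))

# max_seq_len bounds how many frames enter the model per forward pass;
# the inferencer then processes the clip window by window instead of
# all at once. Smaller values lower peak GPU memory at some speed cost.
# (Key name assumed from video_restoration_inferencer.py.)
editor = MMagicInferencer(
    'real_basicvsr',
    device='cuda:1',
    extra_parameters=dict(max_seq_len=4))

beg = time.time()
results = editor.infer(video=video, result_out_dir=result_out_dir)
print(f'inference took {time.time() - beg:.1f} s')
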
@txy00001
Author

> Just passing by (not an official reply). My guess is that too many frames are being fed to the model in one call. […]

Thanks, I'll try it.
