
CUDA out of memory #2807

Open
zkdfbb opened this issue Mar 5, 2024 · 3 comments

Comments

@zkdfbb

zkdfbb commented Mar 5, 2024

Hello, I want to evaluate my model after wrapping it with QuantizationSimModel, but I encountered a CUDA out of memory error. Normally, evaluating the model requires 7 GB of VRAM, but after quantization even 80 GB is not enough. How can I solve this problem?

My environment is as follows:
python: 3.8.10
pytorch: 2.2.0
aimet: 1.30.0

Another problem: when I export the quantized model to ONNX, there is a warning message:

[W shape_type_inference.cpp:1973] Warning: The shape inference of aimet_torch::CustomMarker type is missing, so it may result in wrong shape inference for the exported graph. Please consider adding it in symbolic function. (function UpdateReliable)

It seems to have no impact on the exported ONNX model, but I would still like to know whether this message can be eliminated.

@quic-klhsieh
Contributor

@zkdfbb , can you share some details about your workflow/pipeline? Are you simply instantiating your model, passing it into QuantizationSimModel, computing encodings, and then using qsim.model(...) to run evaluation?
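
In other words, something along these lines (a schematic sketch only; MyModel, calibration_fn, and dataloader are placeholders for your own objects):

import torch
from aimet_common.defs import QuantScheme
from aimet_torch.quantsim import QuantizationSimModel

model = MyModel().cuda()                      # MyModel is a placeholder
dummy_input = torch.randn([1, 3, 224, 224]).cuda()

qsim = QuantizationSimModel(model=model,
                            quant_scheme=QuantScheme.post_training_tf,
                            dummy_input=dummy_input)

# calibration_fn / dataloader are placeholders for your calibration setup
qsim.compute_encodings(forward_pass_callback=calibration_fn,
                       forward_pass_callback_args=dataloader)

# Evaluation through the sim model, with gradients disabled
with torch.no_grad():
    outputs = qsim.model(dummy_input)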

It would also help if you could provide some other details:

  • the size of the original model
  • the size of the quantized model
  • whether you are using any range-learning quantization scheme in the quantsim init
  • how much memory the original model and the quantized model take for a forward pass when evaluation is done inside a with torch.no_grad() scope (see the measurement sketch below)
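
For reference, a minimal sketch of how that last measurement could be taken with PyTorch's built-in memory counters; model and dataloader are placeholders for your own objects:

import contextlib
import torch

def measure_peak_forward_memory(model, dataloader, use_no_grad=True):
    """Peak GPU memory (MB) for one evaluation pass over the dataloader."""
    model.eval()
    torch.cuda.reset_peak_memory_stats()
    ctx = torch.no_grad() if use_no_grad else contextlib.nullcontext()
    with ctx:
        for inputs in dataloader:
            model(inputs.cuda())
    # Peak GPU memory in MB since the reset above
    return torch.cuda.max_memory_allocated() / 1024 ** 2

Calling this once with use_no_grad=True and once with use_no_grad=False, for both the original and the quantized model, would give us the comparison we are after.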

Regarding the ONNX warning message, we have not yet looked into how to silence these warnings. We can mark it as a to-do item for a better user experience.

@zkdfbb
Author

zkdfbb commented Mar 8, 2024

@quic-klhsieh, you can try to reproduce this with the code below.

I have solved the problem. I was using @torch.no_grad to decorate a function that included both the model forward pass and the post-processing. Normally that works fine, but it failed with AIMET. After I wrapped only the model forward pass in with torch.no_grad(), the problem was solved.
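
For anyone hitting the same thing, a minimal sketch of the two patterns (postprocess is a placeholder for whatever runs after the forward pass):

import torch

# Pattern that ran out of memory for me with AIMET: the decorator
# disables gradients for the whole function, forward and post-processing.
@torch.no_grad()
def evaluate_decorated(model, inputs):
    outputs = model(inputs.cuda())
    return postprocess(outputs)  # postprocess is a placeholder

# Pattern that worked: only the model forward runs under no_grad.
def evaluate_scoped(model, inputs):
    with torch.no_grad():
        outputs = model(inputs.cuda())
    return postprocess(outputs)  # postprocess is a placeholder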

But I think there is still a problem with quantization-aware training: GPU memory grows so fast that training quickly becomes unusable.

import torch
from tqdm import tqdm
from torchvision.models.resnet import resnet152
from aimet_common.defs import QuantScheme
from aimet_torch.quantsim import QuantizationSimModel


# Synthetic dataset of 160 random ImageNet-sized inputs
class Dataset(torch.utils.data.Dataset):
    def __len__(self):
        return 160

    def __getitem__(self, idx):
        return torch.randn([3, 224, 224])


# Forward-pass callback for compute_encodings; runs under no_grad
def compute_forward(model, dataloader):
    model.eval()
    with torch.no_grad():
        for inputs in tqdm(dataloader):
            model(inputs.cuda())
    model.train()

dataset = Dataset()
dataloader = torch.utils.data.DataLoader(dataset, batch_size=16, shuffle=True)
model = resnet152().cuda()

dummy_input = torch.randn([1, 3, 224, 224]).cuda()
sim = QuantizationSimModel(model=model,
                           quant_scheme=QuantScheme.training_range_learning_with_tf_init,
                           dummy_input=dummy_input,
                           rounding_mode='nearest',
                           default_output_bw=8,
                           default_param_bw=8,
                           in_place=False)
dataloader = torch.utils.data.DataLoader(dataset, batch_size=32, shuffle=True)
sim.compute_encodings(forward_pass_callback=compute_forward, forward_pass_callback_args=dataloader)
# GPU: 1587 MB

model = sim.model
model.eval()  # eval() alone does not disable autograd bookkeeping
dataloader = torch.utils.data.DataLoader(dataset, batch_size=32, shuffle=True)
data_iter = iter(dataloader)
inputs = next(data_iter)
model(inputs.cuda())
# GPU: 13337 MB

inputs = next(data_iter)
model(inputs.cuda())
# GPU: 25811 MB

inputs = next(data_iter)
model(inputs.cuda())
# GPU: 38289 MB
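
For comparison, the same evaluation loop with the fix applied; with each forward pass inside torch.no_grad(), GPU memory no longer grew between batches for me:

# Same evaluation, but with each forward pass inside torch.no_grad()
model = sim.model
model.eval()
data_iter = iter(dataloader)

for _ in range(3):
    inputs = next(data_iter)
    with torch.no_grad():
        model(inputs.cuda())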

@quic-klhsieh
Contributor

@zkdfbb Glad to hear torch.no_grad() worked for you. Thank you for the code snippet; we can take a look at why the memory continues to increase with each iteration.
