3* 4090GPU OOM #177

Open
orderer0001 opened this issue May 17, 2024 · 3 comments

Comments

@orderer0001

Why do three 4090 GPUs still run out of memory (24 GB × 3 > 52 GB)? nvidia-smi shows all three cards nearly empty:

+-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 4090        Off | 00000000:31:00.0 Off |                  Off |
| 66%   24C    P8              22W / 450W |    42MiB / 24564MiB  |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce RTX 4090        Off | 00000000:B1:00.0 Off |                  Off |
| 64%   23C    P8              28W / 450W |    11MiB / 24564MiB  |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA GeForce RTX 4090        Off | 00000000:E3:00.0 Off |                  Off |
| 69%   25C    P8              11W / 450W |    11MiB / 24564MiB  |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 192.00 MiB. GPU 0 has a total capacity of 23.65 GiB of which 34.75 MiB is free. Including non-PyTorch memory, this process has 23.57 GiB memory in use. Of the allocated memory 22.89 GiB is allocated by PyTorch, and 307.82 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
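
For reference, the allocator hint in the traceback has to be set before the process makes its first CUDA allocation. A minimal sketch, assuming a transformers AutoModel load with a placeholder checkpoint path (not the exact script that produced the error):

import os
# Must be set before PyTorch makes its first CUDA allocation in this process.
os.environ['PYTORCH_CUDA_ALLOC_CONF'] = 'expandable_segments:True'

import torch
from transformers import AutoModel

path = '/path/to/checkpoint'  # placeholder: local directory or Hugging Face repo id
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
    device_map='auto',  # let accelerate spread the weights across the three GPUs
).eval()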

@czczup
Member

czczup commented May 19, 2024

The automatic placement produced by transformers' device_map='auto' is not always well balanced. In that case you can specify the device map manually so that each GPU's memory is fully used, for example:

import torch
from transformers import AutoModel

# path points to the model checkpoint (local directory or Hugging Face repo id).
# Keep the vision encoder, the projector, the token embeddings, and the first
# few decoder layers on GPU 0; the final norm and output head go to GPU 2.
device_map = {
    'vision_model': 0,
    'mlp1': 0,
    'language_model.model.tok_embeddings': 0,
    'language_model.model.layers.0': 0,
    'language_model.model.layers.1': 0,
    'language_model.model.layers.2': 0,
    'language_model.model.layers.3': 0,
    'language_model.model.layers.4': 0,
    'language_model.model.layers.5': 0,
    'language_model.model.layers.6': 0,
    'language_model.model.norm': 2,
    'language_model.output.weight': 2
}
# Layers 7-27 go to GPU 1, layers 28-47 to GPU 2.
for i in range(7, 28):
    device_map[f'language_model.model.layers.{i}'] = 1
for i in range(28, 48):
    device_map[f'language_model.model.layers.{i}'] = 2
print(device_map)
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
    device_map=device_map
).eval()
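
After loading this way, the resulting placement can be double-checked. A small follow-up sketch, assuming the model above; hf_device_map is populated by accelerate whenever a device_map is passed:

# Show where each module ended up and how much memory each GPU actually holds.
print(model.hf_device_map)
for i in range(torch.cuda.device_count()):
    print(f'GPU {i}: {torch.cuda.memory_allocated(i) / 1024**3:.1f} GiB allocated')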

@orderer0001
Author


Thank you very much!

@nofreewill42

Would two 4090s be enough?
I have a 3090 and am thinking of getting another one.
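
A rough back-of-the-envelope, assuming only the ~52 GB bf16 weight footprint quoted in the issue title (the implied parameter count is an inference, not a documented number):

# bf16 stores 2 bytes per parameter, so ~52 GB of weights implies roughly a 26B-parameter model.
weights_gb = 52          # approximate bf16 weight footprint quoted above
two_cards_gb = 2 * 24    # two 24 GB cards (4090 or 3090)
three_cards_gb = 3 * 24  # three 24 GB cards
print(weights_gb > two_cards_gb)    # True: the weights alone do not fit on two cards
print(three_cards_gb - weights_gb)  # ~20 GB left for activations and the KV cache on three cards

So two 24 GB cards would likely still not hold the full bf16 model; quantization or CPU offload would be needed to make that setup work.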
