-
I am trying to quantize Llama-2-7b-hf with GPTQ at 4 bits on the c4 dataset, but I keep hitting an OOM error. I am using an NVIDIA RTX 4090 with 24GB of VRAM. Inference with the base model only takes about 12GB of VRAM, so I expected quantization to be no problem. Can anyone explain how much VRAM I need to quantize successfully?
-
At 4096 sequence length you will need less than 24GB VRAM to quantise 7B. Quantising 13B at 4096 needs more than 24GB unless `cache_examples_on_gpu=False` is used.
Try my quantising wrapper script: https://github.com/TheBlokeAI/AIScripts/blob/main/quant_autogptq.py
Run it with your chosen params and it should work fine. Adjust the GPTQ params if you want different settings, e.g. group_size 32.
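For reference, here is a minimal sketch of the same approach using the AutoGPTQ Python API directly. This is not the wrapper script's exact code: the c4 calibration loading is stubbed out with a placeholder, and the group_size/desc_act values are just common defaults. The `cache_examples_on_gpu` flag is a real parameter of AutoGPTQ's `quantize()` method.

```python
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

model_id = "meta-llama/Llama-2-7b-hf"

quantize_config = BaseQuantizeConfig(
    bits=4,          # 4-bit, as in the question
    group_size=128,  # swap in 32 for finer-grained quantisation
    desc_act=True,   # act-order: better quality, slightly slower inference
)

tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True)
model = AutoGPTQForCausalLM.from_pretrained(model_id, quantize_config)

# Calibration examples: in practice, tokenise a few hundred c4 samples.
# Each example is a dict with input_ids / attention_mask tensors.
examples = [
    tokenizer("Placeholder calibration text; replace with real c4 samples.",
              return_tensors="pt")
]

# cache_examples_on_gpu=False keeps the calibration activations in CPU RAM
# between layers, lowering peak VRAM usage at the cost of some speed.
model.quantize(examples, cache_examples_on_gpu=False)

model.save_quantized("Llama-2-7b-GPTQ-4bit")
```

With the default `cache_examples_on_gpu=True`, the calibration activations stay in VRAM for the whole run, which is what pushes a 13B quantisation at 4096 sequence length past 24GB.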