How to estimate VRAM requirement for GPTQ Quantization? #356

Answered by TheBloke
attkap asked this question in Q&A

At a sequence length of 4096 you will need less than 24GB of VRAM to quantise a 7B model. Quantising a 13B model at 4096 needs more than 24GB of VRAM unless cache_examples_on_gpu=False is used.
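
For reference, here is a minimal sketch of how that option maps onto the AutoGPTQ Python API. This is not the wrapper script linked below; the single calibration example is a placeholder and the output directory name is only illustrative.

# Minimal sketch (assumptions: AutoGPTQ installed, placeholder calibration data,
# illustrative output path) showing where cache_examples_on_gpu=False is passed.
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

pretrained = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(pretrained, use_fast=True)

quantize_config = BaseQuantizeConfig(
    bits=4,            # --bits 4
    group_size=128,    # --group_size 128
    desc_act=True,     # --desc_act 1
    damp_percent=0.1,  # --damp 0.1
)

model = AutoGPTQForCausalLM.from_pretrained(pretrained, quantize_config)

# Placeholder: a real run tokenises many calibration samples from c4 at seqlen 4096.
examples = [tokenizer("placeholder calibration text", return_tensors="pt")]

# Keeping the calibration examples off the GPU trades some speed for VRAM headroom,
# which is what lets 13B stay under 24GB at 4096 sequence length.
model.quantize(examples, cache_examples_on_gpu=False)

model.save_quantized("llama-2-7b-hf-gptq", use_safetensors=True)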

Try my quantising wrapper script: https://github.com/TheBlokeAI/AIScripts/blob/main/quant_autogptq.py

Run with params:

python3 ./quant_autogptq.py meta-llama/Llama-2-7b-hf llama-2-7b-hf-gptq c4 --bits 4 --group_size 128 --desc_act 1 --damp 0.1 --seqlen 4096

It should work fine. Adjust the GPTQ params if you want different settings, e.g. group_size 32, as in the example below.
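
For example, keeping everything else the same and only changing the group size to 32:

python3 ./quant_autogptq.py meta-llama/Llama-2-7b-hf llama-2-7b-hf-gptq c4 --bits 4 --group_size 32 --desc_act 1 --damp 0.1 --seqlen 4096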

Answer selected by attkap