Why doesn't AutoGPTQ quantize lm_head layer? #647

Open
XeonKHJ opened this issue Apr 25, 2024 · 5 comments · May be fixed by #648


XeonKHJ commented Apr 25, 2024

Is there a paper/article/blog post explaining this decision? Or is it simply a feature that is not supported at the moment?


Qubitium commented Apr 25, 2024

@XeonKHJ Good question. I will test this tomorrow with intel/auto-round, which does offer the ability to quantize lm_head. If there are no inference issues post-quantization, I will add it as an option in a new PR.
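For reference, a rough outline of running auto-round at w4g128 (method names follow auto-round's README; how lm_head is opted in depends on the auto-round version and intel/auto-round#87, so that option is intentionally omitted here):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound

model_name = "meta-llama/Meta-Llama-3-8B-Instruct"
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)

# 4-bit weights, group size 128 ("w4g128")
autoround = AutoRound(model, tokenizer, bits=4, group_size=128)
autoround.quantize()
autoround.save_quantized("./llama3-8b-w4g128")  # output directory is a placeholder
```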


Qubitium commented Apr 25, 2024

Found Intel's results from a quantization test of lm_head. There was minimal accuracy loss:

@wenhuach21 Do you know how much RAM/VRAM the Intel Llama-3-8B lm_head-quantized test saved vs. non-quantized? Here is the untested branch that allows loading of a quantized lm_head, which I plan to test: https://github.com/Qubitium/AutoGPTQ/tree/sym-false-lm-head combined with intel/auto-round#87

https://github.com/intel/auto-round/blob/8a3da144423322dfedb0b3fa702ae35d242496d8/docs/Meta-Llama-3-8B-Instruct-acc.md?plain=1#L3

| Metric | BF16 | w4g128 w/o lm-head | w4g128 with lm-head qdq |
|---|---|---|---|
| Avg. | 0.6352 | 0.6312 | 0.6303 |
| mmlu | 0.6386 | 0.6306 | 0.6318 |
| winogrande | 0.7143 | 0.7238 | 0.7269 |
| truthfulqa_mc1 | 0.3623 | 0.3537 | 0.3525 |
| rte | 0.6751 | 0.6859 | 0.6679 |
| piqa | 0.7867 | 0.7797 | 0.7802 |
| openbookqa | 0.3400 | 0.3300 | 0.3320 |
| lambada_openai | 0.7182 | 0.7200 | 0.7173 |
| hellaswag | 0.5769 | 0.5699 | 0.5701 |
| boolq | 0.8297 | 0.8309 | 0.8284 |
| arc_easy | 0.8152 | 0.8089 | 0.8106 |
| arc_challenge | 0.5299 | 0.5102 | 0.5154 |
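For reference, "qdq" in the last column means the lm_head weight was quantized to 4-bit with group size 128 and then dequantized back before evaluation. A minimal sketch of that round trip in plain PyTorch, using simple asymmetric per-group min/max quantization (auto-round's actual scheme tunes the rounding and may differ):

```python
import torch

def qdq_w4g128(weight: torch.Tensor, bits: int = 4, group_size: int = 128) -> torch.Tensor:
    """Quantize-dequantize a 2-D weight [out_features, in_features] group-wise along the input dim."""
    out_features, in_features = weight.shape
    assert in_features % group_size == 0
    w = weight.reshape(out_features, in_features // group_size, group_size).float()

    qmax = 2 ** bits - 1                      # 15 for 4-bit
    w_min = w.amin(dim=-1, keepdim=True)      # per-group minimum
    w_max = w.amax(dim=-1, keepdim=True)      # per-group maximum
    scale = (w_max - w_min).clamp(min=1e-8) / qmax
    zero = torch.round(-w_min / scale)

    q = torch.clamp(torch.round(w / scale) + zero, 0, qmax)  # integer levels 0..15
    w_dq = (q - zero) * scale                                 # back to float
    return w_dq.reshape(out_features, in_features).to(weight.dtype)

# e.g. apply to lm_head before running lm-eval:
# model.lm_head.weight.data = qdq_w4g128(model.lm_head.weight.data)
```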

Qubitium linked a pull request (#648) on Apr 25, 2024 that will close this issue.

wenhuach21 commented Apr 26, 2024

> Do you know how much RAM/VRAM the Intel Llama-3-8B lm_head-quantized test saved vs. non-quantized?

What I know is the model size at W4G128: without lm_head quantized it is 5.4 GB; with lm_head quantized it is 4.7 GB.
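For context, that ~0.7 GB difference is roughly what the lm_head alone would account for. A back-of-the-envelope check, assuming Llama-3-8B's vocab size of 128256 and hidden size of 4096, and ignoring per-group scale/zero overhead:

```python
# Rough size of Llama-3-8B's lm_head: 128256 (vocab) x 4096 (hidden) parameters
params = 128256 * 4096            # ~525M parameters

fp16_gb = params * 2 / 1e9        # ~1.05 GB at 16-bit
int4_gb = params * 0.5 / 1e9      # ~0.26 GB at 4-bit (group scales/zeros ignored)

print(fp16_gb - int4_gb)          # ~0.79 GB saved, in line with the reported 5.4G -> 4.7G
```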

Additionally, if act-order is not enabled or static grouping is enabled, could AutoGPTQ refrain from dumping the group index into the quantized model, thus conserving some resources?
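For illustration of that point: when act-order (desc_act) is off, the group index in GPTQ-style checkpoints is typically just the regular mapping of each input channel to its group, so it can be recomputed from group_size at load time instead of being serialized. A minimal sketch, assuming the usual g_idx convention:

```python
import torch

in_features, group_size = 4096, 128

# Without act-order, g_idx is simply "channel i belongs to group i // group_size" ...
g_idx = torch.arange(in_features, dtype=torch.int32) // group_size

# ... so it carries no extra information and could be rebuilt on load,
# saving in_features int32 values per quantized linear layer.
print(g_idx[:5], g_idx[-5:])   # tensor([0, 0, 0, 0, 0]) ... tensor([31, 31, 31, 31, 31])
```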


Qubitium commented Apr 26, 2024

#648 can now load a quantized lm_head from intel/auto-round, but AutoGPTQ quantization of lm_head is still in progress.
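For anyone who wants to try it, loading a quantized checkpoint is the usual from_quantized call; whether lm_head actually loads as a quantized layer depends on using the branch/PR above. The model directory below is a placeholder:

```python
from auto_gptq import AutoGPTQForCausalLM
from transformers import AutoTokenizer

model_dir = "path/to/llama3-8b-w4g128-lmhead"   # placeholder checkpoint path
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoGPTQForCausalLM.from_quantized(model_dir, device="cuda:0")

inputs = tokenizer("Hello, my name is", return_tensors="pt").to("cuda:0")
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=16)[0]))
```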

Qubitium commented:

> Additionally, if static grouping is not enabled, could AutoGPTQ refrain from dumping the group index into the quantized model, thus conserving some resources?

This is beyond my abilities right now. @fxmarty @LaaZa
