Why doesn't AutoGPTQ quantize lm_head layer? #647

Open
XeonKHJ opened this issue Apr 25, 2024 · 5 comments · May be fixed by #648


XeonKHJ commented Apr 25, 2024

Is there a paper/article/blog post explaining this decision? Or is it simply a feature that is not supported at the moment?


Qubitium commented Apr 25, 2024

@XeonKHJ Good question. I will test this tomorrow with intel/auto-round, which does offer the ability to quantize lm_head. If there are no inference issues post-quantization, I will add it as an option in a new PR.
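For reference, a rough outline of running auto-round at w4g128 (method names follow auto-round's README; how lm_head is opted in depends on the auto-round version and intel/auto-round#87, so that option is intentionally omitted here):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound

model_name = "meta-llama/Meta-Llama-3-8B-Instruct"
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)

# 4-bit weights, group size 128 ("w4g128")
autoround = AutoRound(model, tokenizer, bits=4, group_size=128)
autoround.quantize()
autoround.save_quantized("./llama3-8b-w4g128")  # output directory is a placeholder
```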


Qubitium commented Apr 25, 2024

Found Intel's results from a quantization test of lm_head. There was minimal accuracy loss:

@wenhuach21 Do you know how much RAM/VRAM the Intel Llama-3-8B lm_head-quantized test saved vs. non-quantized? Here is the untested branch that allows loading of a quantized lm_head, which I plan to test: https://github.com/Qubitium/AutoGPTQ/tree/sym-false-lm-head combined with intel/auto-round#87

https://github.com/intel/auto-round/blob/8a3da144423322dfedb0b3fa702ae35d242496d8/docs/Meta-Llama-3-8B-Instruct-acc.md?plain=1#L3

| Metric | BF16 | w4g128 w/o lm-head | w4g128 with lm-head qdq |
|---|---|---|---|
| Avg. | 0.6352 | 0.6312 | 0.6303 |
| mmlu | 0.6386 | 0.6306 | 0.6318 |
| winogrande | 0.7143 | 0.7238 | 0.7269 |
| truthfulqa_mc1 | 0.3623 | 0.3537 | 0.3525 |
| rte | 0.6751 | 0.6859 | 0.6679 |
| piqa | 0.7867 | 0.7797 | 0.7802 |
| openbookqa | 0.3400 | 0.3300 | 0.3320 |
| lambada_openai | 0.7182 | 0.7200 | 0.7173 |
| hellaswag | 0.5769 | 0.5699 | 0.5701 |
| boolq | 0.8297 | 0.8309 | 0.8284 |
| arc_easy | 0.8152 | 0.8089 | 0.8106 |
| arc_challenge | 0.5299 | 0.5102 | 0.5154 |
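For reference, "qdq" in the last column means the lm_head weight was quantized to 4-bit with group size 128 and then dequantized back before evaluation. A minimal sketch of that round trip in plain PyTorch, using simple asymmetric per-group min/max quantization (auto-round's actual scheme tunes the rounding and may differ):

```python
import torch

def qdq_w4g128(weight: torch.Tensor, bits: int = 4, group_size: int = 128) -> torch.Tensor:
    """Quantize-dequantize a 2-D weight [out_features, in_features] group-wise along the input dim."""
    out_features, in_features = weight.shape
    assert in_features % group_size == 0
    w = weight.reshape(out_features, in_features // group_size, group_size).float()

    qmax = 2 ** bits - 1                      # 15 for 4-bit
    w_min = w.amin(dim=-1, keepdim=True)      # per-group minimum
    w_max = w.amax(dim=-1, keepdim=True)      # per-group maximum
    scale = (w_max - w_min).clamp(min=1e-8) / qmax
    zero = torch.round(-w_min / scale)

    q = torch.clamp(torch.round(w / scale) + zero, 0, qmax)  # integer levels 0..15
    w_dq = (q - zero) * scale                                 # back to float
    return w_dq.reshape(out_features, in_features).to(weight.dtype)

# e.g. apply to lm_head before running lm-eval:
# model.lm_head.weight.data = qdq_w4g128(model.lm_head.weight.data)
```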

Qubitium linked a pull request (#648) on Apr 25, 2024 that will close this issue.

wenhuach21 commented Apr 26, 2024

> Do you know how much RAM/VRAM the Intel Llama-3-8B lm_head-quantized test saved vs. non-quantized?

What I know is the model size at W4G128: without lm_head quantized it is 5.4 GB; with lm_head quantized it is 4.7 GB.
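For context, that ~0.7 GB difference is roughly what the lm_head alone would account for. A back-of-the-envelope check, assuming Llama-3-8B's vocab size of 128256 and hidden size of 4096, and ignoring per-group scale/zero overhead:

```python
# Rough size of Llama-3-8B's lm_head: 128256 (vocab) x 4096 (hidden) parameters
params = 128256 * 4096            # ~525M parameters

fp16_gb = params * 2 / 1e9        # ~1.05 GB at 16-bit
int4_gb = params * 0.5 / 1e9      # ~0.26 GB at 4-bit (group scales/zeros ignored)

print(fp16_gb - int4_gb)          # ~0.79 GB saved, in line with the reported 5.4G -> 4.7G
```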

Additionally, if act-order is not enabled or static grouping is enabled, could AutoGPTQ refrain from dumping the group index into the quantized model, thus conserving some resources?
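For illustration of that point: when act-order (desc_act) is off, the group index in GPTQ-style checkpoints is typically just the regular mapping of each input channel to its group, so it can be recomputed from group_size at load time instead of being serialized. A minimal sketch, assuming the usual g_idx convention:

```python
import torch

in_features, group_size = 4096, 128

# Without act-order, g_idx is simply "channel i belongs to group i // group_size" ...
g_idx = torch.arange(in_features, dtype=torch.int32) // group_size

# ... so it carries no extra information and could be rebuilt on load,
# saving in_features int32 values per quantized linear layer.
print(g_idx[:5], g_idx[-5:])   # tensor([0, 0, 0, 0, 0]) ... tensor([31, 31, 31, 31, 31])
```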


Qubitium commented Apr 26, 2024

#648 can now load a quantized lm_head from intel/auto-round, but AutoGPTQ quantization of lm_head is still in progress.
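For anyone who wants to try it, loading a quantized checkpoint is the usual from_quantized call; whether lm_head actually loads as a quantized layer depends on using the branch/PR above. The model directory below is a placeholder:

```python
from auto_gptq import AutoGPTQForCausalLM
from transformers import AutoTokenizer

model_dir = "path/to/llama3-8b-w4g128-lmhead"   # placeholder checkpoint path
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoGPTQForCausalLM.from_quantized(model_dir, device="cuda:0")

inputs = tokenizer("Hello, my name is", return_tensors="pt").to("cuda:0")
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=16)[0]))
```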

Qubitium commented:

> Additionally, if static grouping is not enabled, could AutoGPTQ refrain from dumping the group index into the quantized model, thus conserving some resources?

This is beyond my abilities right now. @fxmarty @LaaZa
