[FEATURE] Allow loading of quantized lm_head #648
base: main
Conversation
[PASSED] Tested inference of TinyLlama 1.1B quantized with … Manually add/set …
When evaluating my local llama3 quantized checkpoint, I found several issues.
2. With the quantized head and AutoGPTQForCausalLM, issue 1 still occurs, and the log info is not correct: "INFO - The layer lm_head is not quantized."

Sample code:

from transformers import AutoTokenizer, AutoModelForCausalLM
from auto_gptq import AutoGPTQForCausalLM

quantized_model_dir = "<path to local quantized llama3 checkpoint>"  # placeholder

tokenizer = AutoTokenizer.from_pretrained(quantized_model_dir)
model = AutoModelForCausalLM.from_pretrained(quantized_model_dir,
    device_map="auto"
)
model = AutoGPTQForCausalLM.from_quantized(quantized_model_dir, device="cuda:0", use_safetensors=True, use_triton=True, trust_remote_code=True)

text = "There is a girl who likes adventure,"
inputs = tokenizer(text, return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=50, do_sample=True)[0]))
@wenhuach21 main and this PR use custom code to load weights due to a bug in old accelerate, while PR #640 uses accelerate (fixed). For my testing, I used intel/auto-round#87 to quant a TinyLlama 1.1B with lm-head enabled (but you need to manually inject …).
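For illustration only, a minimal sketch of what manually injecting such a flag into quantize_config.json could look like; the key name "lm_head" and the path are assumptions, since the actual property name was still being decided in this PR:

# Hedged sketch: mark lm_head as quantized in an existing quantize_config.json.
# The key name "lm_head" is a placeholder, not an agreed-upon auto_gptq property.
import json

config_path = "TinyLlama-1.1B-quantized/quantize_config.json"  # hypothetical local path

with open(config_path) as f:
    quantize_config = json.load(f)

quantize_config["lm_head"] = True  # placeholder flag saying the head was quantized

with open(config_path, "w") as f:
    json.dump(quantize_config, f, indent=2)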
@wenhuach21 please test using this lm-head quant paired with the sym-false-lm-head branch in the prev msg: https://huggingface.co/LnL-AI/TinyLlama-1.1B-intermediate-step-1341k-3T-autoround-lm_head-symFalse

from auto_gptq import AutoGPTQForCausalLM
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
import numpy
import random
import os
RAND_SEED = 898
torch.manual_seed(RAND_SEED)
numpy.random.seed(RAND_SEED)
random.seed(RAND_SEED)
torch.cuda.manual_seed_all(RAND_SEED)
quantized_model = "LnL-AI/TinyLlama-1.1B-intermediate-step-1341k-3T-autoround-lm_head-symFalse"
device = torch.device("cuda:0")
prompt = "My name is Lewis and I like to"
tokenizer = AutoTokenizer.from_pretrained(quantized_model)
inputs = tokenizer(prompt, return_tensors="pt").to(device)
# quantized model inference
model = AutoGPTQForCausalLM.from_quantized(quantized_model, use_safetensors=True, device=device)
res = model.model.generate(**inputs, num_beams=1, min_new_tokens=1, max_new_tokens=128, repetition_penalty=1.25)
print("------quantized model inference------")
print(tokenizer.decode(res[0]))
print("--------------------------------------")
@fxmarty PR is ready and unit-tested. I chose to add … Actual quantization support of …
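Not the actual diff from this PR, just a hedged sketch of the general load-side idea: only treat lm_head as a quantized layer when the quantize config says it was quantized. The config attribute lm_head and the helper name below are assumptions for illustration, not auto_gptq's API.

# Hedged sketch of the load-side behavior under discussion; not the PR's code.
# Both the "lm_head" config attribute and this helper name are hypothetical.
def select_quantized_layer_names(inside_layer_modules, quantize_config):
    names = list(inside_layer_modules)
    if getattr(quantize_config, "lm_head", False):
        names.append("lm_head")  # also load lm_head as a quantized linear
    return names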
@wenhuach21 Did I resolve both of your issues?
Hello, may I know when this PR will be merged? We are preparing to release the new version of AutoRound, and we would like AutoGPTQ to support lm-head quantization.
@wenhuach21 I don't have the authority to merge this. Waiting for @fxmarty or @PanQiWei.

@fxmarty This is a 3-line PR (excluding comments/tests) which adds … There is a chicken-and-egg problem here. Without an agreed-upon property to store …
I think the only blocking issue here is whether @fxmarty and @PanQiWei can come to a consensus on the naming of …
Thank you for the information. As a workaround, we are modifying transformers in our repo to support this. We chose to add an extra config to support lm-head quantization and mixed-bit quantization in the future.
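Purely as a sketch of what such an extra config might look like (the field names below are assumptions, not AutoRound's or transformers' actual schema):

# Illustrative only: a hypothetical extra-config layout for lm-head and mixed-bit quantization.
extra_config = {
    "lm_head": {"bits": 4, "group_size": 32, "sym": False},  # quantize the output head
    "model.layers.0.mlp.down_proj": {"bits": 8},              # example per-layer override
}
quantization_config = {
    "bits": 4,            # default bits for all other layers
    "group_size": 128,
    "extra_config": extra_config,
}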
Intel/auto-round benchmarked llama3-8B with lm_head quantization, and the eval/accuracy loss is minimal. If VRAM can be saved at such a small loss, this can be great for larger models that are trying to fit into consumer-class 24GB GPUs.
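For a rough sense of the potential savings (back-of-envelope arithmetic, not figures from the benchmark above), llama3-8B has a 128256 x 4096 lm_head, which is about 1.05 GB in fp16 and roughly a quarter of that at 4-bit:

# Back-of-envelope estimate of lm_head memory for llama3-8B (vocab 128256, hidden 4096).
# Scale/zero-point overhead depends on group size and is ignored here for simplicity.
vocab, hidden = 128256, 4096
fp16_bytes = vocab * hidden * 2      # ~1.05e9 bytes
int4_bytes = vocab * hidden // 2     # ~0.26e9 bytes, excluding scales/zeros
print(f"fp16 lm_head: {fp16_bytes / 1e9:.2f} GB, 4-bit lm_head: {int4_bytes / 1e9:.2f} GB")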
Partially resolves: #647
https://github.com/intel/auto-round/blob/8a3da144423322dfedb0b3fa702ae35d242496d8/docs/Meta-Llama-3-8B-Instruct-acc.md?plain=1#L3
Resolves #647 and intel/auto-round#87 (comment)