[FEATURE] Allow loading of quantized lm_head #648

Open · Qubitium wants to merge 4 commits into base: main
Conversation

Qubitium
Copy link
Contributor

@Qubitium Qubitium commented Apr 25, 2024

Intel/auto-round benchmarked a llama3-8B quant with lm_head quantization and the eval/accuracy loss is minimal. If vram can be saved at such a small accuracy cost, this can be a great win for larger models that are trying to fit into consumer-class 24GB gpus (a rough size estimate follows the benchmark table below).

Partially resolves: #647

  • PASSED: Test loading/inference of quantized lm_head from intel/autoround

https://github.com/intel/auto-round/blob/8a3da144423322dfedb0b3fa702ae35d242496d8/docs/Meta-Llama-3-8B-Instruct-acc.md?plain=1#L3

Metric           BF16     w4g128 w/o lm-head   w4g128 with lm-head (qdq)
Avg.             0.6352   0.6312               0.6303
mmlu             0.6386   0.6306               0.6318
winogrande       0.7143   0.7238               0.7269
truthfulqa_mc1   0.3623   0.3537               0.3525
rte              0.6751   0.6859               0.6679
piqa             0.7867   0.7797               0.7802
openbookqa       0.3400   0.3300               0.3320
lambada_openai   0.7182   0.7200               0.7173
hellaswag        0.5769   0.5699               0.5701
boolq            0.8297   0.8309               0.8284
arc_easy         0.8152   0.8089               0.8106
arc_challenge    0.5299   0.5102               0.5154

Resolves #647 and intel/auto-round#87 (comment)
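
For scale, a rough back-of-the-envelope estimate of the VRAM an int4 lm_head saves, assuming Llama-3-8B's untied lm_head of shape 128256 x 4096 and ignoring qzeros/packing overhead (figures are illustrative, not measured):

# Rough lm_head size estimate for Llama-3-8B (assumes an untied 128256 x 4096 head)
vocab_size, hidden_size, group_size = 128_256, 4_096, 128

fp16_bytes = vocab_size * hidden_size * 2                   # 16-bit weights
int4_weight_bytes = vocab_size * hidden_size // 2           # 4-bit packed weights
scale_bytes = (vocab_size * hidden_size // group_size) * 2  # one fp16 scale per group
int4_bytes = int4_weight_bytes + scale_bytes                # qzeros omitted for brevity

print(f"fp16 lm_head: {fp16_bytes / 1e9:.2f} GB")           # ~1.05 GB
print(f"int4 lm_head: {int4_bytes / 1e9:.2f} GB")           # ~0.27 GB
print(f"saved:        {(fp16_bytes - int4_bytes) / 1e9:.2f} GB")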

@Qubitium Qubitium changed the title [WIP] Allow loading/quantization of lm_head [WIP] Allow loading of quantized lm_head Apr 25, 2024
@Qubitium
Contributor Author

Qubitium commented Apr 26, 2024

[PASSED] Tested inference of tinyllama 1.1b quantized with intel/auto-round + --quant_lm_head. lm_head loading is normal and inference is coherent. @wenhuach21

Manually add/set lm_head: True in autoround's quantize_config.json
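
For reference, a minimal sketch of that manual edit (the path is a placeholder for wherever auto-round exported the model):

# Sketch: flip the lm_head flag in an auto-round export's quantize_config.json.
import json

cfg_path = "path/to/autoround-export/quantize_config.json"  # placeholder path

with open(cfg_path) as f:
    cfg = json.load(f)

cfg["lm_head"] = True  # the flag this PR reads at load time

with open(cfg_path, "w") as f:
    json.dump(cfg, f, indent=2)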

@Qubitium Qubitium changed the title [WIP] Allow loading of quantized lm_head Allow loading of quantized lm_head Apr 26, 2024
@Qubitium Qubitium marked this pull request as ready for review April 26, 2024 02:53
@wenhuach21

wenhuach21 commented Apr 26, 2024

When evaluating my local llama3 quantized checkpoint, I found several issues.
1. When using AutoGPTQForCausalLM to load the model without a quantized lm_head, some weights remain unloaded, unlike when using AutoModelForCausalLM from transformers, which doesn't encounter this issue:
[auto_gptq.utils.accelerate_utils] Some weights of the model checkpoint at were not used when initializing LlamaForCausalLM: {'model.layers.16.self_attn.k_proj.bias', 'model.layers.21.self_attn.q_proj.bias', 'model.layers.26.mlp.up_proj.bias', 'model.layers.28.self_attn.k_proj.bias', 'model.layers.23.self_attn.q_proj.bias',

2. With a quantized lm_head loaded via AutoGPTQForCausalLM, issue 1 still occurs and the log info is incorrect: "INFO - The layer lm_head is not quantized."

Sample code:

# quantized_model_dir points to my local llama3 GPTQ checkpoint (placeholder path below)
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

quantized_model_dir = "path/to/local-llama3-gptq"

tokenizer = AutoTokenizer.from_pretrained(quantized_model_dir)

# loading with transformers does not report unused weights...
model = AutoModelForCausalLM.from_pretrained(quantized_model_dir,
                                             device_map="auto"
                                             )
# ...while loading with AutoGPTQ triggers the warning quoted above
model = AutoGPTQForCausalLM.from_quantized(quantized_model_dir, device="cuda:0", use_safetensors=True, use_triton=True, trust_remote_code=True)

text = "There is a girl who likes adventure,"
inputs = tokenizer(text, return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=50, do_sample=True)[0]))

@Qubitium
Contributor Author

Qubitium commented Apr 26, 2024

@wenhuach21 main and this PR are using custom code to load weights due to a bug in an older accelerate release, while PR #640 uses the (fixed) accelerate.

For my testing, I used intel/auto-round#87 to quantize a tinyllama 1.1b with lm-head enabled (you need to manually inject lm_head: True into quantize_config.json) and then ran inference with https://github.com/Qubitium/AutoGPTQ/commits/sym-false-lm-head/ (which combines PR #640 and this PR).

@Qubitium
Contributor Author

Qubitium commented Apr 26, 2024

@wenhuach21 please test using this lm-head quant paired with the sym-false-lm-head branch from the previous message.

https://huggingface.co/LnL-AI/TinyLlama-1.1B-intermediate-step-1341k-3T-autoround-lm_head-symFalse

from auto_gptq import AutoGPTQForCausalLM
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
import numpy
import random
import os


RAND_SEED = 898

torch.manual_seed(RAND_SEED)
numpy.random.seed(RAND_SEED)
random.seed(RAND_SEED)
torch.cuda.manual_seed_all(RAND_SEED)

quantized_model = "LnL-AI/TinyLlama-1.1B-intermediate-step-1341k-3T-autoround-lm_head-symFalse"

device = torch.device("cuda:0")
prompt = "My name is Lewis and I like to"

tokenizer = AutoTokenizer.from_pretrained(quantized_model)

inputs = tokenizer(prompt, return_tensors="pt").to(device)

# quantized model inference
model = AutoGPTQForCausalLM.from_quantized(quantized_model, use_safetensors=True, device=device)
res = model.model.generate(**inputs, num_beams=1, min_new_tokens=1, max_new_tokens=128, repetition_penalty=1.25)
print("------quantized model inference------")
print(tokenizer.decode(res[0]))
print("--------------------------------------")
# inference result:
<s> My name is Lewis and I like to play football.
I'm a 16 year old boy who loves playing football, but also likes to do other things such as go out with friends or watch movies.</s>

@Qubitium
Contributor Author

# prompt
There is a girl who likes adventure,

# result
<s> There is a girl who likes adventure, and she's always looking for new things to do. She loves the outdoors, but also enjoys going on shopping trips with her friends.
Casey: I am 16 years old. My favorite thing about Casey is that he has an amazing sense of humor! He can make me laugh every day.
Mia: Mia is my best friend. We have been friends since we were little kids. Her mom was in our class at school so we grew up together. Now we are both in highschool and it feels like we haven't changed much from when we were

@Qubitium
Contributor Author

  • PASS: Unit test added for loading a quantized llama model with a quantized model.lm_head (commit 7b1d115)

@Qubitium
Contributor Author

Qubitium commented Apr 27, 2024

@fxmarty PR ready and unit-tested. I chose to add lm_head to quantize_config.json instead of an override parameter to from_quantized since this is a new feature and an override via parameter doesn't make much sense for old quants. The model should explicitly describe this new feature/toggle in its config.

Actual quantization support of lm_head by autogptq itself can be worked on in a separate PR. For now, this feature is pairable with 3rd party quantizers such as intel/auto-round.
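
Conceptually, the load path only needs one extra check. A minimal sketch, assuming a hypothetical helper (the name and surrounding plumbing are not AutoGPTQ's actual internals):

import torch

# Sketch only: pick which module names get mapped to quantized linears at load time.
def quantizable_module_names(model: torch.nn.Module, quantize_config) -> list[str]:
    names = [name for name, module in model.named_modules()
             if isinstance(module, torch.nn.Linear) and name != "lm_head"]
    # Treat lm_head as quantized only when the checkpoint's quantize_config says so.
    if getattr(quantize_config, "lm_head", False):
        names.append("lm_head")
    return names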

@Qubitium
Contributor Author

> When evaluating my local llama3 quantized checkpoint, I found several issues. 1. When using AutoGPTQForCausalLM to load the model without a quantized lm_head, some weights remain unloaded, unlike when using AutoModelForCausalLM from transformers, which doesn't encounter this issue: [auto_gptq.utils.accelerate_utils] Some weights of the model checkpoint at were not used when initializing LlamaForCausalLM: {'model.layers.16.self_attn.k_proj.bias', 'model.layers.21.self_attn.q_proj.bias', 'model.layers.26.mlp.up_proj.bias', 'model.layers.28.self_attn.k_proj.bias', 'model.layers.23.self_attn.q_proj.bias',
>
> 2. With a quantized lm_head loaded via AutoGPTQForCausalLM, issue 1 still occurs and the log info is incorrect: "INFO - The layer lm_head is not quantized."

@wenhuach21 Did I resolve both of your issues?

@Qubitium Qubitium changed the title Allow loading of quantized lm_head [FEATURE] Allow loading of quantized lm_head Apr 28, 2024
@wenhuach21

Hello, may I know when this PR will be merged? We are preparing to release the new version of AutoRound, and we would like autogptq to support lm-head quantization.

@Qubitium
Contributor Author

@wenhuach21 I don't have the authority to merge this. Waiting for @fxmarty or @PanQiWei

@fxmarty This is a 3-line PR (excluding comments/tests) which adds an lm_head property to the config so that intel/auto-round can set it during quantization and AutoGPTQ can allow loading of quantized lm_head layers at load time.

There is a chicken-and-egg problem here. Without an agreed-upon property to store lm_head true/false, auto-round can't move forward with a stable release that is runnable with autogptq inference: autoround can produce the quants, but no stable framework can run them if no one agrees on the lm_head property name. There is also a vllm PR I am working on that enables gptq lm_head quantization and requires checking for lm_head (or a similarly named property) in the quant config, vllm-project/vllm#4442:

# vllm code
if quantization_config.get("lm_head"):
    lm_head_quantized = True

I think the only blocking issue here is whether @fxmarty @PanQiWei can come to a consensus on lm_head as the stable property name in quantization_config for a quantizable lm-head.

@wenhuach21

> @wenhuach21 I don't have the authority to merge this. Waiting for @fxmarty or @PanQiWei
>
> @fxmarty This is a 3-line PR (excluding comments/tests) which adds an lm_head property to the config so that intel/auto-round can set it during quantization and AutoGPTQ can allow loading of quantized lm_head layers at load time.
>
> There is a chicken-and-egg problem here. Without an agreed-upon property to store lm_head true/false, auto-round can't move forward with a stable release that is runnable with autogptq inference: autoround can produce the quants, but no stable framework can run them if no one agrees on the lm_head property name. There is also a vllm PR I am working on that enables gptq lm_head quantization and requires checking for lm_head (or a similarly named property) in the quant config, vllm-project/vllm#4442:
>
> # vllm code
> if quantization_config.get("lm_head"):
>     lm_head_quantized = True
>
> I think the only blocking issue here is whether @fxmarty @PanQiWei can come to a consensus on lm_head as the stable property name in quantization_config for a quantizable lm-head.

Thank you for the information. As a workaround, we are modifying transformers in our repo to support this. We chose to add an extra_config entry so that we can support lm_head quantization now and mixed-bit quantization in the future:

"quantization_config": {
  "bits": 4,
  "group_size": 128,
  "data_type": "int",
  "extra_config": {
    "lm_head": {
      "bits": 4,
      "data_type": "int",
      "group_size": 128,
      "sym": false
    }
  }
}
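
A loader that wants to support both shapes could branch on either field. A minimal sketch (the helper name is hypothetical and not from any of the projects above):

# Sketch: detect a quantized lm_head from either the flat flag or an AutoRound-style extra_config.
def lm_head_is_quantized(quantization_config: dict) -> bool:
    if quantization_config.get("lm_head"):          # flat flag, as proposed in this PR / vllm#4442
        return True
    extra = quantization_config.get("extra_config") or {}
    return "lm_head" in extra                       # nested per-layer config, as shown above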
