[FEATURE] Allow loading of quantized lm_head #648
base: main
Conversation
[PASSED] Tested inference of TinyLlama 1.1B quantized with … Manually add/set …
When evaluating my local llama3 quantized checkpoint, I found several issues.
2. With the quantized head and AutoGPTQForCausalLM, issue 1 still occurs, and the log info is not correct: "INFO - The layer lm_head is not quantized."

Sample code:

from transformers import AutoTokenizer, AutoModelForCausalLM
from auto_gptq import AutoGPTQForCausalLM

quantized_model_dir = "<path to local quantized llama3 checkpoint>"  # placeholder

tokenizer = AutoTokenizer.from_pretrained(quantized_model_dir)
model = AutoModelForCausalLM.from_pretrained(quantized_model_dir,
    device_map="auto"
)
model = AutoGPTQForCausalLM.from_quantized(quantized_model_dir, device="cuda:0", use_safetensors=True, use_triton=True, trust_remote_code=True)

text = "There is a girl who likes adventure,"
inputs = tokenizer(text, return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=50, do_sample=True)[0]))
@wenhuach21 main and this PR use custom code to load weights due to a bug in old accelerate, while PR #640 uses accelerate (fixed). For my testing, I used intel/auto-round#87 to quant a TinyLlama 1.1B with lm-head enabled (but you need to manually inject …).
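For illustration only, a minimal sketch of what manually injecting such a flag into quantize_config.json could look like; the key name "lm_head" and the path are assumptions, since the actual property name was still being decided in this PR:

# Hedged sketch: mark lm_head as quantized in an existing quantize_config.json.
# The key name "lm_head" is a placeholder, not an agreed-upon auto_gptq property.
import json

config_path = "TinyLlama-1.1B-quantized/quantize_config.json"  # hypothetical local path

with open(config_path) as f:
    quantize_config = json.load(f)

quantize_config["lm_head"] = True  # placeholder flag saying the head was quantized

with open(config_path, "w") as f:
    json.dump(quantize_config, f, indent=2)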
@wenhuach21 please test using this lm-head quant paired with the sym-false-lm-head branch in the prev msg: https://huggingface.co/LnL-AI/TinyLlama-1.1B-intermediate-step-1341k-3T-autoround-lm_head-symFalse

from auto_gptq import AutoGPTQForCausalLM
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
import numpy
import random
import os
RAND_SEED = 898
torch.manual_seed(RAND_SEED)
numpy.random.seed(RAND_SEED)
random.seed(RAND_SEED)
torch.cuda.manual_seed_all(RAND_SEED)
quantized_model = "LnL-AI/TinyLlama-1.1B-intermediate-step-1341k-3T-autoround-lm_head-symFalse"
device = torch.device("cuda:0")
prompt = "My name is Lewis and I like to"
tokenizer = AutoTokenizer.from_pretrained(quantized_model)
inputs = tokenizer(prompt, return_tensors="pt").to(device)
# quantized model inference
model = AutoGPTQForCausalLM.from_quantized(quantized_model, use_safetensors=True, device=device)
res = model.model.generate(**inputs, num_beams=1, min_new_tokens=1, max_new_tokens=128, repetition_penalty=1.25)
print("------quantized model inference------")
print(tokenizer.decode(res[0]))
print("--------------------------------------")
@fxmarty PR is ready and unit-tested. I chose to add … Actual quantization support of …
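Not the actual diff from this PR, just a hedged sketch of the general load-side idea: only treat lm_head as a quantized layer when the quantize config says it was quantized. The config attribute lm_head and the helper name below are assumptions for illustration, not auto_gptq's API.

# Hedged sketch of the load-side behavior under discussion; not the PR's code.
# Both the "lm_head" config attribute and this helper name are hypothetical.
def select_quantized_layer_names(inside_layer_modules, quantize_config):
    names = list(inside_layer_modules)
    if getattr(quantize_config, "lm_head", False):
        names.append("lm_head")  # also load lm_head as a quantized linear
    return names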
@wenhuach21 Did I resolve both of your issues?
Hello, may I know when this PR will be merged? We are preparing to release the new version of AutoRound, and we would like AutoGPTQ to support lm-head quantization.
@wenhuach21 I don't have the authority to merge this. Waiting for @fxmarty or @PanQiWei.

@fxmarty This is a 3-line PR (excluding comments/tests) which adds … There is a chicken-and-egg problem here. Without an agreed-upon property to store …
I think the only blocking issue here is whether @fxmarty and @PanQiWei can come to a consensus on the naming of …
Thank you for the information. As a workaround, we are modifying transformers in our repo to support this. We chose to add an extra config to support lm-head quantization and mixed-bit quantization in the future.
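Purely as a sketch of what such an extra config might look like (the field names below are assumptions, not AutoRound's or transformers' actual schema):

# Illustrative only: a hypothetical extra-config layout for lm-head and mixed-bit quantization.
extra_config = {
    "lm_head": {"bits": 4, "group_size": 32, "sym": False},  # quantize the output head
    "model.layers.0.mlp.down_proj": {"bits": 8},              # example per-layer override
}
quantization_config = {
    "bits": 4,            # default bits for all other layers
    "group_size": 128,
    "extra_config": extra_config,
}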
Intel/auto-round benchmarked llama3-8B with lm_head quantization, and the eval/accuracy loss is minimal. If VRAM can be saved at such a small loss, this can be great for larger models that are trying to fit into consumer-class 24GB GPUs.
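For a rough sense of the potential savings (back-of-envelope arithmetic, not figures from the benchmark above), llama3-8B has a 128256 x 4096 lm_head, which is about 1.05 GB in fp16 and roughly a quarter of that at 4-bit:

# Back-of-envelope estimate of lm_head memory for llama3-8B (vocab 128256, hidden 4096).
# Scale/zero-point overhead depends on group size and is ignored here for simplicity.
vocab, hidden = 128256, 4096
fp16_bytes = vocab * hidden * 2      # ~1.05e9 bytes
int4_bytes = vocab * hidden // 2     # ~0.26e9 bytes, excluding scales/zeros
print(f"fp16 lm_head: {fp16_bytes / 1e9:.2f} GB, 4-bit lm_head: {int4_bytes / 1e9:.2f} GB")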
Partially resolves: #647
https://github.com/intel/auto-round/blob/8a3da144423322dfedb0b3fa702ae35d242496d8/docs/Meta-Llama-3-8B-Instruct-acc.md?plain=1#L3
Resolves #647 and intel/auto-round#87 (comment)