gemma-2b statically quantized, generated text makes no sense #1853

Open
2 of 4 tasks
CHNtentes opened this issue May 10, 2024 · 0 comments
Labels
bug Something isn't working

Comments


CHNtentes commented May 10, 2024

System Info

optimum 1.19.1
python 3.8.10
ubuntu 20.04

Who can help?

@michaelbenayoun

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction (minimal, reproducible, runnable)

Hi. I tried to convert gemma-2b to ONNX format, then quantize it to 8-bit. However, the quantized model doesn't generate any useful text, just random characters. I'm not sure what's causing this issue.

The procedure is as follows:

  1. Use this command to convert gemma-2b to ONNX, without KV cache:
    optimum-cli export onnx -m ./gemma-2b --task text-generation --opset 14 --device cpu --trust-remote-code --legacy gemma-2b_onnx_without_past
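
As a sanity check before quantizing, the FP32 export can be loaded and run with the same API used in step 3 below (a minimal sketch; the directory name matches the export command above):

# Sanity check: the unquantized ONNX export should already produce coherent text.
from transformers import AutoTokenizer
from optimum.onnxruntime import ORTModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gemma-2b_onnx_without_past")
model = ORTModelForCausalLM.from_pretrained("gemma-2b_onnx_without_past", use_cache=False, use_io_binding=False)

inputs = tokenizer("Introduce yourself.", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs.tolist()[0]))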

  2. Quantize the ONNX model:

from functools import partial
from transformers import AutoTokenizer
from optimum.onnxruntime import ORTQuantizer, ORTModelForCausalLM
from optimum.onnxruntime.configuration import AutoQuantizationConfig, AutoCalibrationConfig

# Load the exported ONNX model (no KV cache) and its tokenizer
onnx_model = ORTModelForCausalLM.from_pretrained("gemma-2b_onnx_without_past", use_cache=False, use_io_binding=False)
tokenizer = AutoTokenizer.from_pretrained("gemma-2b_onnx_without_past")
decoder_quantizer = ORTQuantizer.from_pretrained(onnx_model)

# Static int8 quantization config targeting arm64
qconfig = AutoQuantizationConfig.arm64(is_static=True, per_channel=False)

def preprocess_fn(ex, tokenizer):
    encoded_inputs = tokenizer(ex["instruction"], return_tensors="pt", padding=True)
    return encoded_inputs

# Build a 100-sample calibration set from alpaca-cleaned
calibration_dataset = decoder_quantizer.get_calibration_dataset(
    "yahma/alpaca-cleaned",
    preprocess_function=partial(preprocess_fn, tokenizer=tokenizer),
    num_samples=100,
    dataset_split="train",
)

calibration_config = AutoCalibrationConfig.minmax(calibration_dataset)

# Compute min/max activation ranges over the calibration set
ranges = decoder_quantizer.fit(
    dataset=calibration_dataset,
    calibration_config=calibration_config,
    operators_to_quantize=qconfig.operators_to_quantize,
    use_external_data_format=True,
    batch_size=1,
)

# Apply static quantization using the computed ranges
model_quantized_path = decoder_quantizer.quantize(
    save_dir="quantized_gemma",
    calibration_tensors_range=ranges,
    quantization_config=qconfig,
    use_external_data_format=True,
)
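
For debugging, the calibration ranges returned by fit() can also be inspected before quantizing. This is a minimal sketch, assuming ranges behaves like a plain mapping from tensor name to a (min, max) pair; the exact return type of fit() depends on the optimum/onnxruntime versions:

# Sketch only: `ranges` is assumed to act like a dict of tensor name -> (min, max).
# A layer whose min/max is dominated by outliers leaves few int8 levels for the
# bulk of its activations, which static quantization cannot recover from.
for name in list(ranges)[:10]:
    print(name, ranges[name])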
  3. Run inference using ORTModelForCausalLM:
from transformers import AutoTokenizer
from optimum.onnxruntime import ORTModelForCausalLM

# Load the quantized model and its tokenizer
tokenizer = AutoTokenizer.from_pretrained("quantized_gemma")
model = ORTModelForCausalLM.from_pretrained("quantized_gemma", use_cache=False, use_io_binding=False)

query = 'Introduce yourself.'
encoded_inputs = tokenizer(query, return_tensors='pt')

# Generate up to 64 new tokens
outputs = model.generate(**encoded_inputs, max_new_tokens=64)

response = tokenizer.decode(outputs.tolist()[0])
print(response)

However, the printed output is:

Introduce yourself.. Kids to to to to to loo7777zanie to certitudeBariumBariumToDecimal]]] import.

11ormick de de unintelligiblemiyormiyormiyor Islas of of of of of of of of Animal bourgorm ! XXIV metamor metamorToUpperToUpper CARRAYDOCX

Expected behavior

Here is what I got from the gemma-2b ONNX model (not quantized):

Introduce yourself.
I’m a 20-year-old student from the Netherlands. I’m currently studying at the University of Amsterdam. I’m a student of the Faculty of Social Sciences, and I’m studying International Relations.

What is your current job?

I’m a student.
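
In case it helps with debugging, here is a minimal sketch for quantifying the degradation at the logits level rather than eyeballing generations; it compares the first-step logits of the FP32 export and the quantized model on the same prompt, reusing the two directories from the steps above:

# Sketch: compare first-step logits of the FP32 and int8 models on one prompt.
from transformers import AutoTokenizer
from optimum.onnxruntime import ORTModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gemma-2b_onnx_without_past")
fp32 = ORTModelForCausalLM.from_pretrained("gemma-2b_onnx_without_past", use_cache=False, use_io_binding=False)
int8 = ORTModelForCausalLM.from_pretrained("quantized_gemma", use_cache=False, use_io_binding=False)

inputs = tokenizer("Introduce yourself.", return_tensors="pt")
logits_fp32 = fp32(**inputs).logits[0, -1]
logits_int8 = int8(**inputs).logits[0, -1]

print("max abs logit diff:", (logits_fp32 - logits_int8).abs().max().item())
print("fp32 top-5 token ids:", logits_fp32.topk(5).indices.tolist())
print("int8 top-5 token ids:", logits_int8.topk(5).indices.tolist())

Disagreeing top-5 tokens already at the first decoding step would point at the quantized weights/activations themselves rather than anything in the generation loop.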
