
Is there a document for Accuracy/Perplexity Scores for Llama2 with WOQ? #1628

Answered by yiliu30
VishalX asked this question in Q&A


Hi @VishalX, it is a very interesting question.

Let's take a close look at the generation pipeline:

input_word (str) -> tokenizer.encode(input_word) -> input_ids (token) -> model or q_model -> logits (tensor) -> predicted_ids (token) -> tokenizer.decode -> output_word (str)
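A minimal sketch of this pipeline in Python, assuming a Hugging Face `transformers` causal LM and greedy next-token selection; the checkpoint name is only a placeholder, and the quantized `q_model` would simply be used wherever `model` appears:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"  # placeholder; any causal LM checkpoint works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)  # or the quantized q_model

input_word = "ONNX Runtime is"

# input_word (str) -> input_ids (tokens)
input_ids = tokenizer.encode(input_word, return_tensors="pt")

# input_ids -> logits (tensor)
with torch.no_grad():
    logits = model(input_ids).logits

# logits -> predicted_ids (tokens): greedy pick of the next token
predicted_id = logits[0, -1].argmax().item()

# predicted_ids -> output_word (str)
output_word = tokenizer.decode(predicted_id)
print(output_word)
```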

The quantization process unavoidably introduces errors, causing the distribution of the quantized model's output (logits) to shift slightly from the output of the float model. These shifted logits are then converted to predicted_ids (tokens), and the resulting tokens may decode into words from another language.
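As a toy illustration (not from the original answer), even a small shift in the logits is enough to flip the argmax and therefore the decoded word:

```python
import torch

# Hypothetical next-token logits over a 3-token vocabulary.
float_logits = torch.tensor([2.10, 2.05, 0.30])  # float model: token 0 wins
quant_logits = torch.tensor([2.02, 2.08, 0.31])  # after WOQ error: token 1 wins

print(float_logits.argmax().item())  # 0
print(quant_logits.argmax().item())  # 1 -> decodes to a different word
```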

Let's encode the output back to tokens:

output: ONNX Runtime is prisoner categorieпута Clientública одногоúblicaúblic

t…
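A minimal sketch of this re-encoding step, assuming the same `tokenizer` as in the pipeline sketch above; the garbled string is the quantized model's output shown earlier:

```python
output = "ONNX Runtime is prisoner categorieпута Clientública одногоúblicaúblic"

# output_word (str) -> token ids the quantized model actually produced
token_ids = tokenizer.encode(output)
print(token_ids)

# Inspecting the individual tokens shows where the prediction drifts
# away from English into other-language subwords.
print(tokenizer.convert_ids_to_tokens(token_ids))
```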
