
Is there a document for Accuracy/Perplexity Scores for Llama2 with WOQ? #1628

Answered by yiliu30
VishalX asked this question in Q&A


Hi @VishalX, it is a very interesting question.

Let's take a close look at the generation pipeline:

input_word (str) -> tokenizer.encode(input_word) -> input_ids (token) -> model or q_model -> logits (tensor) -> predicted_ids (token) -> tokenizer.decode -> output_word (str)
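A minimal sketch of this pipeline in Python, assuming a Hugging Face `transformers` causal LM and greedy next-token selection; the checkpoint name is only a placeholder, and the quantized `q_model` would simply be used wherever `model` appears:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"  # placeholder; any causal LM checkpoint works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)  # or the quantized q_model

input_word = "ONNX Runtime is"

# input_word (str) -> input_ids (tokens)
input_ids = tokenizer.encode(input_word, return_tensors="pt")

# input_ids -> logits (tensor)
with torch.no_grad():
    logits = model(input_ids).logits

# logits -> predicted_ids (tokens): greedy pick of the next token
predicted_id = logits[0, -1].argmax().item()

# predicted_ids -> output_word (str)
output_word = tokenizer.decode(predicted_id)
print(output_word)
```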

The quantization process unavoidably introduces errors, causing the distribution of the quantized model's output (logits) to shift slightly from the output of the float model. These shifted logits are then converted to predicted_ids (tokens), and the resulting tokens may decode into words from another language.
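As a toy illustration (not from the original answer), even a small shift in the logits is enough to flip the argmax and therefore the decoded word:

```python
import torch

# Hypothetical next-token logits over a 3-token vocabulary.
float_logits = torch.tensor([2.10, 2.05, 0.30])  # float model: token 0 wins
quant_logits = torch.tensor([2.02, 2.08, 0.31])  # after WOQ error: token 1 wins

print(float_logits.argmax().item())  # 0
print(quant_logits.argmax().item())  # 1 -> decodes to a different word
```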

Let's encode the output back to tokens:

output: ONNX Runtime is prisoner categorieпута Clientública одногоúblicaúblic

t…
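A minimal sketch of this re-encoding step, assuming the same `tokenizer` as in the pipeline sketch above; the garbled string is the quantized model's output shown earlier:

```python
output = "ONNX Runtime is prisoner categorieпута Clientública одногоúblicaúblic"

# output_word (str) -> token ids the quantized model actually produced
token_ids = tokenizer.encode(output)
print(token_ids)

# Inspecting the individual tokens shows where the prediction drifts
# away from English into other-language subwords.
print(tokenizer.convert_ids_to_tokens(token_ids))
```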
