Qwen-7B implementation differs from the original version #1588

Closed

sleepwalker2017 (Contributor) opened this issue May 13, 2024 · 0 comments

Hello, I am testing the Qwen model, which uses logn_scaling.

The Python-level logic I see is: when seq_len < 8k, logn_scaling is not enabled, i.e., scaling = 1; when seq_len > 8k, scaling is enabled and the query is multiplied by a coefficient other than 1.

In lmdeploy's implementation, however, the scaling stays 1 throughout the prefill stage.

If the user's input_ids already exceed 8k from the start, the semantics of the two implementations do not seem to line up; see the sketch below for the logic I am describing.
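For reference, here is a minimal sketch of the logn scaling as I understand it from Qwen's modeling_qwen.py. The function name and tensor handling are my own; only the shape of the logic matters:

import math

import torch

# A minimal sketch of Qwen-style logn attention scaling, assuming
# seq_length comes from config.json (8192 by default). Positions at or
# below seq_length get a factor of 1; longer positions are scaled by
# log base seq_length of the position index.
def logn_factors(num_positions: int, seq_length: int = 8192) -> torch.Tensor:
    factors = [
        math.log(i, seq_length) if i > seq_length else 1.0
        for i in range(1, num_positions + 1)
    ]
    return torch.tensor(factors)

# With seq_length=16, positions 17+ already get a factor > 1, so an
# implementation that fixes the factor to 1 during prefill will diverge:
print(logn_factors(20, seq_length=16)[-4:])  # tensor([1.0219, 1.0425, 1.0620, 1.0805])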

Since my GPU memory is limited, I cannot test an 8k prefill, so I modified seq_length in Qwen's config.json:

  "seq_length": 16,

After this change, I found that the outputs no longer match exactly.

Both sides use exactly the same token ids:

input_ids: tensor([[151644,   8948,    198,   2610,    525,    264,  10950,  17847,     13,
                    151645,    198, 151644,    872,    198,   3838,   8573,    979,    498,
                      2182,   5590,   1119,   3015,     30, 151645,    198, 151644,  77091,
                       198]])

The outputs of the two models are not exactly identical.

Is my understanding wrong? Is there a problem with this way of testing?
Or does lmdeploy's Qwen implementation indeed differ from the original?

Here is my reproduction code; both sides use greedy search.

PyTorch:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Note: The default behavior now has injection attack prevention off.
tokenizer = AutoTokenizer.from_pretrained("/data/Qwen-7B/", trust_remote_code=True)

inputs = tokenizer('What happens when you put oil into water?', return_tensors='pt')

# Override the tokenized prompt with the exact chat-template token ids,
# so both sides see the same input.
input_ids = [[151644, 8948, 198, 2610, 525, 264, 10950, 17847, 13, 151645, 198, 151644, 872, 198, 3838, 8573, 979, 498, 2182, 5590, 1119, 3015, 30, 151645, 198, 151644, 77091, 198]]
input_ids = torch.tensor(input_ids)
print(input_ids.shape)
inputs['input_ids'] = input_ids
inputs['token_type_ids'] = torch.zeros_like(input_ids)
inputs['attention_mask'] = torch.ones_like(input_ids)

model = AutoModelForCausalLM.from_pretrained("/data/Qwen-7B/", device_map="cuda", trust_remote_code=True, fp16=True).eval()
inputs = inputs.to(model.device)
print('inputs is', inputs)

# Greedy decoding: num_beams=1 and do_sample=False.
pred = model.generate(**inputs, max_new_tokens=64, num_beams=1, do_sample=False)
print(pred)
print(tokenizer.decode(pred.cpu()[0], skip_special_tokens=True))

lmdeploy test code:

import lmdeploy
from lmdeploy import GenerationConfig

pipe = lmdeploy.pipeline("/data/Qwen-7B")
# top_k=1 makes sampling effectively greedy, matching the HF run.
gen_config = GenerationConfig(top_p=1.0,
                              top_k=1,
                              temperature=1.0,
                              max_new_tokens=64)
response = pipe(["What happens when you put oil into water?"], gen_config=gen_config)
print(response)
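To make "not exactly identical" concrete, it may help to diff the generated token ids rather than the decoded strings. This is only a sketch: it assumes both results are available in one session, and the `token_ids` field on lmdeploy's Response is an assumption (re-encode `response[0].text` with the tokenizer if the field is absent):

# Sketch: locate the first divergence between the two greedy runs.
# `pred` / `input_ids` come from the PyTorch script above; `response`
# comes from the lmdeploy script. Response.token_ids is assumed here.
hf_tokens = pred[0, input_ids.shape[1]:].tolist()   # generated part only
lmdeploy_tokens = list(response[0].token_ids)

for i, (a, b) in enumerate(zip(hf_tokens, lmdeploy_tokens)):
    if a != b:
        print(f"first divergence at generated token {i}: HF={a} vs lmdeploy={b}")
        break
else:
    print("token ids match over the compared length")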

The model files come from https://huggingface.co/Qwen/Qwen-7B, modified as follows:

diff --git a/config.json b/config.json
index a7c2261..f1adbdb 100644
--- a/config.json
+++ b/config.json
@@ -25,7 +25,7 @@
   "rotary_emb_base": 10000,
   "rotary_pct": 1.0,
   "scale_attn_weights": true,
-  "seq_length": 8192,
+  "seq_length": 16,
   "tie_word_embeddings": false,
   "tokenizer_class": "QWenTokenizer",
   "transformers_version": "4.32.0",
@@ -34,4 +34,4 @@
   "use_flash_attn": "auto",
   "use_logn_attn": true,
   "vocab_size": 151936
-}
\ No newline at end of file
+}