Qwen-7B implementation differs from the original version #1588

Closed

sleepwalker2017 (Contributor) opened this issue May 13, 2024 · 0 comments

Hello, I am testing the Qwen model, which uses logn_scaling.

The Python-level logic I see is: when seq_len < 8k, logn_scaling is not enabled, i.e., scaling = 1; when seq_len > 8k, scaling is enabled and the query is multiplied by a coefficient other than 1.

In lmdeploy's implementation, however, the scaling stays 1 throughout the prefill stage.

If the user's input_ids already exceed 8k from the start, the semantics of the two implementations do not seem to line up; see the sketch below for the logic I am describing.
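For reference, here is a minimal sketch of the logn scaling as I understand it from Qwen's modeling_qwen.py. The function name and tensor handling are my own; only the shape of the logic matters:

import math

import torch

# A minimal sketch of Qwen-style logn attention scaling, assuming
# seq_length comes from config.json (8192 by default). Positions at or
# below seq_length get a factor of 1; longer positions are scaled by
# log base seq_length of the position index.
def logn_factors(num_positions: int, seq_length: int = 8192) -> torch.Tensor:
    factors = [
        math.log(i, seq_length) if i > seq_length else 1.0
        for i in range(1, num_positions + 1)
    ]
    return torch.tensor(factors)

# With seq_length=16, positions 17+ already get a factor > 1, so an
# implementation that fixes the factor to 1 during prefill will diverge:
print(logn_factors(20, seq_length=16)[-4:])  # tensor([1.0219, 1.0425, 1.0620, 1.0805])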

Since my GPU memory is limited, I cannot test an 8k prefill, so I modified seq_length in Qwen's config.json:

  "seq_length": 16,

After this change, I found that the outputs no longer match exactly.

Both sides use exactly the same token ids:

input_ids: tensor([[151644,   8948,    198,   2610,    525,    264,  10950,  17847,     13,
                    151645,    198, 151644,    872,    198,   3838,   8573,    979,    498,
                      2182,   5590,   1119,   3015,     30, 151645,    198, 151644,  77091,
                       198]])

The outputs of the two models are not exactly identical.

Is my understanding wrong? Is there a problem with this way of testing?
Or does lmdeploy's Qwen implementation indeed differ from the original?

Here is my reproduction code; both sides use greedy search.

PyTorch:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Note: The default behavior now has injection attack prevention off.
tokenizer = AutoTokenizer.from_pretrained("/data/Qwen-7B/", trust_remote_code=True)

inputs = tokenizer('What happens when you put oil into water?', return_tensors='pt')

# Override the tokenized prompt with the exact chat-template token ids,
# so both sides see the same input.
input_ids = [[151644, 8948, 198, 2610, 525, 264, 10950, 17847, 13, 151645, 198, 151644, 872, 198, 3838, 8573, 979, 498, 2182, 5590, 1119, 3015, 30, 151645, 198, 151644, 77091, 198]]
input_ids = torch.tensor(input_ids)
print(input_ids.shape)
inputs['input_ids'] = input_ids
inputs['token_type_ids'] = torch.zeros_like(input_ids)
inputs['attention_mask'] = torch.ones_like(input_ids)

model = AutoModelForCausalLM.from_pretrained("/data/Qwen-7B/", device_map="cuda", trust_remote_code=True, fp16=True).eval()
inputs = inputs.to(model.device)
print('inputs is', inputs)

# Greedy decoding: num_beams=1 and do_sample=False.
pred = model.generate(**inputs, max_new_tokens=64, num_beams=1, do_sample=False)
print(pred)
print(tokenizer.decode(pred.cpu()[0], skip_special_tokens=True))

lmdeploy test code:

import lmdeploy
from lmdeploy import GenerationConfig

pipe = lmdeploy.pipeline("/data/Qwen-7B")
# top_k=1 makes sampling effectively greedy, matching the HF run.
gen_config = GenerationConfig(top_p=1.0,
                              top_k=1,
                              temperature=1.0,
                              max_new_tokens=64)
response = pipe(["What happens when you put oil into water?"], gen_config=gen_config)
print(response)
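To make "not exactly identical" concrete, it may help to diff the generated token ids rather than the decoded strings. This is only a sketch: it assumes both results are available in one session, and the `token_ids` field on lmdeploy's Response is an assumption (re-encode `response[0].text` with the tokenizer if the field is absent):

# Sketch: locate the first divergence between the two greedy runs.
# `pred` / `input_ids` come from the PyTorch script above; `response`
# comes from the lmdeploy script. Response.token_ids is assumed here.
hf_tokens = pred[0, input_ids.shape[1]:].tolist()   # generated part only
lmdeploy_tokens = list(response[0].token_ids)

for i, (a, b) in enumerate(zip(hf_tokens, lmdeploy_tokens)):
    if a != b:
        print(f"first divergence at generated token {i}: HF={a} vs lmdeploy={b}")
        break
else:
    print("token ids match over the compared length")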

The model files come from https://huggingface.co/Qwen/Qwen-7B, modified as follows:

diff --git a/config.json b/config.json
index a7c2261..f1adbdb 100644
--- a/config.json
+++ b/config.json
@@ -25,7 +25,7 @@
   "rotary_emb_base": 10000,
   "rotary_pct": 1.0,
   "scale_attn_weights": true,
-  "seq_length": 8192,
+  "seq_length": 16,
   "tie_word_embeddings": false,
   "tokenizer_class": "QWenTokenizer",
   "transformers_version": "4.32.0",
@@ -34,4 +34,4 @@
   "use_flash_attn": "auto",
   "use_logn_attn": true,
   "vocab_size": 151936
-}
\ No newline at end of file
+}