When I run the llama3 MNN model:
(py_llama) st@server03:~/mnn-llm$ ./build/cli_demo ./models/llama3/
model path is ./models/llama3/
### model name : Llama3_8b
The device support i8sdot:0, support fp16:0, support i8mm: 0
load tokenizer
load tokenizer Done
### disk embedding is 1
[ 10% ] load ./models/llama3//lm.mnn model ... Done!
[ 15% ] load ./models/llama3//block_0.mnn model ... Done!
[ 18% ] load ./models/llama3//block_1.mnn model ... Done!
[ 21% ] load ./models/llama3//block_2.mnn model ... Done!
[ 23% ] load ./models/llama3//block_3.mnn model ... Done!
[ 26% ] load ./models/llama3//block_4.mnn model ... Done!
[ 29% ] load ./models/llama3//block_5.mnn model ... Done!
[ 31% ] load ./models/llama3//block_6.mnn model ... Done!
[ 34% ] load ./models/llama3//block_7.mnn model ... Done!
[ 36% ] load ./models/llama3//block_8.mnn model ... Done!
[ 39% ] load ./models/llama3//block_9.mnn model ... Done!
[ 42% ] load ./models/llama3//block_10.mnn model ... Done!
[ 44% ] load ./models/llama3//block_11.mnn model ... Done!
[ 47% ] load ./models/llama3//block_12.mnn model ... Done!
[ 50% ] load ./models/llama3//block_13.mnn model ... Done!
[ 52% ] load ./models/llama3//block_14.mnn model ... Done!
[ 55% ] load ./models/llama3//block_15.mnn model ... Done!
[ 58% ] load ./models/llama3//block_16.mnn model ... Done!
[ 60% ] load ./models/llama3//block_17.mnn model ... Done!
[ 63% ] load ./models/llama3//block_18.mnn model ... Done!
[ 66% ] load ./models/llama3//block_19.mnn model ... Done!
[ 68% ] load ./models/llama3//block_20.mnn model ... Done!
[ 71% ] load ./models/llama3//block_21.mnn model ... Done!
[ 74% ] load ./models/llama3//block_22.mnn model ... Done!
[ 76% ] load ./models/llama3//block_23.mnn model ... Done!
[ 79% ] load ./models/llama3//block_24.mnn model ... Done!
[ 81% ] load ./models/llama3//block_25.mnn model ... Done!
[ 84% ] load ./models/llama3//block_26.mnn model ... Done!
[ 87% ] load ./models/llama3//block_27.mnn model ... Done!
[ 89% ] load ./models/llama3//block_28.mnn model ... Done!
[ 92% ] load ./models/llama3//block_29.mnn model ... Done!
[ 95% ] load ./models/llama3//block_30.mnn model ... Done!
[ 97% ] load ./models/llama3//block_31.mnn model ... Done!
Then, when I ask it a question, it returns:
Q: who are you
A: You're asking "who"?
#################################
total tokens num = 20
prompt tokens num = 13
output tokens num = 7
total time = 2.59 s
prefill time = 1.31 s
decode time = 1.28 s
total speed = 7.73 tok/s
prefill speed = 9.92 tok/s
decode speed = 5.48 tok/s
chat speed = 2.71 tok/s
##################################
Q:
A: You're asking "are"?
#################################
total tokens num = 39
prompt tokens num = 32
output tokens num = 7
total time = 4.21 s
prefill time = 2.81 s
decode time = 1.41 s
total speed = 9.26 tok/s
prefill speed = 11.40 tok/s
decode speed = 4.98 tok/s
chat speed = 1.66 tok/s
##################################
Q:
A: You're asking "you"?
#################################
total tokens num = 58
prompt tokens num = 51
output tokens num = 7
total time = 4.82 s
prefill time = 3.48 s
decode time = 1.34 s
total speed = 12.04 tok/s
prefill speed = 14.64 tok/s
decode speed = 5.24 tok/s
chat speed = 1.45 tok/s
##################################
Q: introduce Beijing
A: You're asking "introduce"?
#################################
total tokens num = 84
prompt tokens num = 76
output tokens num = 8
total time = 6.32 s
prefill time = 5.19 s
decode time = 1.14 s
total speed = 13.29 tok/s
prefill speed = 14.66 tok/s
decode speed = 7.04 tok/s
chat speed = 1.27 tok/s
##################################
Q:
A: You're asking "Beijing"?
#################################
total tokens num = 108
prompt tokens num = 100
output tokens num = 8
total time = 7.68 s
prefill time = 6.51 s
decode time = 1.17 s
total speed = 14.06 tok/s
prefill speed = 15.37 tok/s
decode speed = 6.81 tok/s
chat speed = 1.04 tok/s
##################################
Any solution? Thanks!!
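Two things stand out in the log above. First, the reported metrics appear to be simple ratios, up to rounding of the displayed times: total speed = total tokens / total time (20 / 2.59 s ≈ 7.7 tok/s), prefill speed = prompt tokens / prefill time (13 / 1.31 s ≈ 9.9 tok/s), decode speed = output tokens / decode time (7 / 1.28 s ≈ 5.5 tok/s), and chat speed = output tokens / total time (7 / 2.59 s ≈ 2.7 tok/s). Second, the replies echo the words of the question one per turn ("who", "are", "you", then "introduce", "Beijing"), and the prompt token count grows every turn (13 → 32 → 51 → 76 → 100) even when Q: is empty. That pattern suggests the chat loop is re-feeding the accumulated history while consuming the user input one word at a time, rather than a generation-quality problem with the model itself.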
However, when I run the same model with a prompt file instead, the responses are correct:
[ 92% ] load ./models/llama3//block_29.mnn model ... Done!
[ 95% ] load ./models/llama3//block_30.mnn model ... Done!
[ 97% ] load ./models/llama3//block_31.mnn model ... Done!
prompt file is ./resource/prompt.txt
### warmup ... Done
It's great to chat with you! How are you doing today?
哈哈!我是 ChatGPT,一个人工智能语言模型! (Haha! I am ChatGPT, an AI language model!)
I'm just an AI, I don't have access to real-time weather information. However, you can check the weather forecast online or on your local weather app to get an idea of the current weather conditions.
#################################
prompt tokens num = 54
decode tokens num = 77
prefill time = 3.85 s
decode time = 12.91 s
prefill speed = 14.02 tok/s
decode speed = 5.96 tok/s
##################################
It looks like llama3 can only respond via llm->response(prompts[i]), not chat via llm->chat()? @wangzhaode Do you have any suggestions, please?
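Note that the decode speed in the prompt-file run (5.96 tok/s) is in the same range as in the chat run (5-7 tok/s), so the backend itself seems fine and the problem looks confined to how llm->chat() handles input. In case it helps until chat() is fixed, below is a minimal workaround sketch that drives a multi-turn conversation through llm->response() alone, accumulating the history by hand. Only llm->response() and llm->chat() come from this thread; Llm::createLLM(), load(), the header name, and the assumption that response() returns the reply as a std::string are unverified guesses from how cli_demo is launched above, so adjust them to the actual mnn-llm API.

```cpp
// Workaround sketch: multi-turn chat via llm->response() only.
// Assumptions (not verified against mnn-llm): Llm::createLLM(path) builds
// the model, load() reads the weights, and response() returns the reply
// as a std::string. Only response()/chat() are confirmed by this thread.
#include <iostream>
#include <memory>
#include <string>
#include "llm.hpp"  // hypothetical header name; use the real mnn-llm header

int main() {
    const std::string model_dir = "./models/llama3/";
    std::unique_ptr<Llm> llm(Llm::createLLM(model_dir));
    llm->load(model_dir);

    std::string history;  // hand-maintained conversation context
    std::string user;
    std::cout << "Q: " << std::flush;
    while (std::getline(std::cin, user) && user != "/exit") {
        history += "Q: " + user + "\n";
        // Re-send the whole history each turn so the model sees the full
        // conversation, instead of relying on chat()'s input handling.
        std::string answer = llm->response(history);
        history += "A: " + answer + "\n";
        std::cout << "A: " << answer << "\nQ: " << std::flush;
    }
    return 0;
}
```

The obvious cost is that prefill grows with the history every turn, which matches the rising prompt token counts in the first log, but it at least keeps the full context per question.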