performance difference based on number of tokens to process #6777
-
Good evening. I was trying to learn a bit about the llama.cpp library and stumbled upon this. Here is a small code change based on examples/simple: okuvshynov@63cd5b5 adds N mock tokens on every llama_decode call. All tests were run on an M2 Ultra with the mistral-7b-v0.1.Q8_0 model. My question is: why is the difference between N=0 and N=1 so dramatic on the GPU? Is there an optimization done specifically for this scenario, since I assume it is quite common? Or have I just misconfigured something?
-
The Metal backend uses 2 types of kernels to perform matrix multiplication:

- mat-vec kernels, which are very efficient for batch size 1 (BS=1) and get worse as BS increases;
- mat-mat kernels, which are inefficient for small BS but become very efficient for large BS.

There is a break-even point at a certain BS where one kernel becomes more efficient than the other:

llama.cpp/ggml-metal.m
Lines 1422 to 1447 in e8d35f4

I don't know how to determine that break-even point, so currently we always use mat-vec for BS=1 and mat-mat for all other BS. This is certainly not optimal, but I don't know how to improve it.