performance difference based on number of tokens to process #6777

Answered by ggerganov
okuvshynov asked this question in Q&A

The Metal backend uses 2 types of kernels to perform matrix multiplication:

  • mat-vec
  • mat-mat

The former is very efficient at batch size 1 (BS=1) and becomes less efficient as the batch size grows.
The latter is inefficient for small batch sizes, but becomes very efficient for large ones.

There is a break-even point at a certain BS where one kernel becomes more efficient than the other:

llama.cpp/ggml-metal.m, lines 1422 to 1447 at commit e8d35f4:

// find the break-even point where the matrix-matrix kernel becomes more efficient compared
// to the matrix-vector kernel
int ne11_mm_min = 1;
#if 0
// the numbers below are measured on M2 Ultra for 7B and 13B models
// these numbers do…
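
To make the dispatch idea concrete, below is a minimal C sketch of the selection logic described above: compare the batch dimension (ne11) against the break-even threshold (ne11_mm_min) and pick the kernel accordingly. Only ne11 and ne11_mm_min come from the snippet; the function and enum names, and the threshold value of 4, are illustrative assumptions rather than the actual ggml-metal.m values.

#include <stdio.h>

/* Illustrative sketch only: kernel selection by batch size.
 * ne11 stands for the batch dimension of the multiplication and
 * ne11_mm_min for the break-even point; the other names are
 * hypothetical and not taken from ggml-metal.m. */
typedef enum {
    KERNEL_MAT_VEC, /* one dot product per row: best for tiny batches      */
    KERNEL_MAT_MAT, /* tiled kernel: amortizes weight loads across the batch */
} kernel_kind;

static kernel_kind choose_mul_mat_kernel(int ne11, int ne11_mm_min) {
    /* below the break-even batch size the mat-vec kernel wins,
       above it the mat-mat kernel becomes more efficient */
    return (ne11 > ne11_mm_min) ? KERNEL_MAT_MAT : KERNEL_MAT_VEC;
}

int main(void) {
    const int ne11_mm_min = 4; /* placeholder threshold, not a measured value */
    for (int bs = 1; bs <= 16; bs *= 2) {
        printf("BS=%2d -> %s\n", bs,
               choose_mul_mat_kernel(bs, ne11_mm_min) == KERNEL_MAT_MAT
                   ? "mat-mat" : "mat-vec");
    }
    return 0;
}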

Answer selected by okuvshynov