performance difference based on number of tokens to process #6777
-
Good evening. I was trying to learn a bit about the llama.cpp library and stumbled upon this. Here is a small code change based on examples/simple: okuvshynov@63cd5b5 adds N mock tokens on every llama_decode call. All tests were run on an M2 Ultra with the mistral-7b-v0.1.Q8_0 model. My question is: why is the difference between N=0 and N=1 so dramatic on the GPU? Is there an optimization done specifically for this scenario, since I assume it is quite common? Or have I just misconfigured something?
-
The Metal backend uses 2 types of kernels to perform matrix multiplication:

- mat-vec kernels, which are very efficient for batch size 1 (BS=1) and get worse as BS increases;
- mat-mat kernels, which are inefficient for small BS but become very efficient for large BS.

There is a break-even point at a certain BS where one kernel becomes more efficient than the other:

llama.cpp/ggml-metal.m
Lines 1422 to 1447 in e8d35f4

I don't know how to determine that break-even point, so currently we always use mat-vec for BS=1 and mat-mat for all other BS. This is certainly not optimal, but I don't know how to improve it.