
cuda : ggml_mul_mat assert for padded src1 #673

Draft · wants to merge 1 commit into base: master

Conversation

ggerganov
Owner

Currently, the padded matrix multiplications in whisper.cpp are silently failing with CUDA:

```c
// TODO: check if other platforms can benefit from this optimization
// TODO: CUDA is currently broken - seems ggml_mul_mat does not handle views correctly
#if defined(GGML_USE_METAL)
#define ggml_mul_mat ggml_mul_mat_pad
#endif
```

The reason is that the to_fp16_cuda and to_fp32_cuda calls assume no padding of the data. We can either assert that the data is not padded, or over-allocate a buffer accounting for the padding. The latter produces correct results, but is sub-optimal.
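For illustration, a minimal sketch of the two options (the helper name is hypothetical and this is not the actual ggml-cuda code; it assumes an fp32 src1 that is only padded along dim 0):

```c
#include "ggml.h"

// Hypothetical helper, not the actual ggml-cuda code: size the fp16 staging buffer
// from the row stride nb[1] (which includes any padding) instead of ne[0], so that a
// padded view of src1 still fits. The simpler alternative is to reject padded inputs:
//
//     GGML_ASSERT(ggml_is_contiguous(src1));
//
static size_t src1_fp16_buf_size(const struct ggml_tensor * src1) {
    const size_t n_rows        = (size_t) (src1->ne[1]*src1->ne[2]*src1->ne[3]);
    const size_t elems_per_row = src1->nb[1]/ggml_element_size(src1); // >= ne[0] when padded

    return n_rows*elems_per_row*sizeof(ggml_fp16_t);
}
```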

Drafting this PR to brainstorm some potential solutions

@slaren
Collaborator

slaren commented Dec 29, 2023

Is there any benefit to using ggml_mul_mat_pad with CUDA (and especially with cuBLAS)? This seems to me like it should be an implementation detail of the matrix multiplication algorithm.

@ggerganov
Owner Author

Can't test ggml_mul_mat_pad with CUDA because it would need some extra changes to to_fp16_cuda. There will likely be no benefit in this case, because cuBLAS probably makes the necessary optimizations internally. For Metal it helps because that mat-mat kernel requires dim 0 to be a multiple of 32.
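To make the idea concrete, here is a rough sketch of the padding trick (not the exact whisper.cpp ggml_mul_mat_pad; it assumes ggml_pad is available and only handles the simple case): zero-padding the shared dim 0 of both operands adds only zero terms to each dot product, so the result is unchanged while the kernel sees an aligned size.

```c
#include "ggml.h"

// Rough sketch of the padding trick, not the exact whisper.cpp implementation:
// zero-pad dim 0 (the shared inner dimension) of both operands up to a multiple
// of `pad`; the extra zero terms in each dot product leave the result unchanged.
static struct ggml_tensor * mul_mat_pad_sketch(
        struct ggml_context * ctx,
        struct ggml_tensor  * x,
        struct ggml_tensor  * y,
        int                   pad) {
    const int64_t rem = x->ne[0] % pad;

    if (rem == 0) {
        return ggml_mul_mat(ctx, x, y);
    }

    const int p0 = (int) (pad - rem);

    // ggml_pad appends zeros along each dimension
    struct ggml_tensor * x_pad = ggml_pad(ctx, x, p0, 0, 0, 0);
    struct ggml_tensor * y_pad = ggml_pad(ctx, y, p0, 0, 0, 0);

    return ggml_mul_mat(ctx, x_pad, y_pad);
}
```

Since the result has the same shape as the unpadded multiplication, a wrapper like this can be swapped in behind the ggml_mul_mat macro, as in the snippet above.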

But regardless, the issue remains if users try to perform matrix multiplications with padded views

@slaren
Collaborator

slaren commented Dec 29, 2023

I think that in general there is very poor support for any kind of non-contiguous tensors in ggml-cuda. We should definitely add more checks and improve support, but I am afraid that this is just the tip of the iceberg. We could extend test-backend-ops to automatically run tests with non-contiguous tensors for all the operations, so that we could have a better overall picture of what needs to be fixed.
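Until test-backend-ops grows such coverage, a hypothetical standalone check along these lines (CPU backend only, using the public ggml C API; not actual repository code) shows the kind of case that needs to be exercised: a mul_mat where src1 is a padded, non-contiguous view, compared against the same multiplication on a contiguous copy.

```c
#include "ggml.h"

#include <math.h>
#include <stdio.h>
#include <stdlib.h>

// Hypothetical check: multiply with a padded, non-contiguous view of src1 and
// compare against the same multiplication on a contiguous copy of that view.
// On the CPU backend the two results should match.
int main(void) {
    struct ggml_init_params params = {
        /*.mem_size   =*/ 64*1024*1024,
        /*.mem_buffer =*/ NULL,
        /*.no_alloc   =*/ false,
    };

    struct ggml_context * ctx = ggml_init(params);

    struct ggml_tensor * a = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 64, 16); // [K, N]
    struct ggml_tensor * b = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 96, 32); // padded source

    for (int i = 0; i < ggml_nelements(a); ++i) { ((float *) a->data)[i] = (float) rand()/RAND_MAX; }
    for (int i = 0; i < ggml_nelements(b); ++i) { ((float *) b->data)[i] = (float) rand()/RAND_MAX; }

    // non-contiguous view of b: 64 of the 96 elements per row -> row stride != ne[0]*sizeof(float)
    struct ggml_tensor * b_view = ggml_view_2d(ctx, b, 64, 32, b->nb[1], 0);

    struct ggml_tensor * out_view = ggml_mul_mat(ctx, a, b_view);                 // padded path
    struct ggml_tensor * out_cont = ggml_mul_mat(ctx, a, ggml_cont(ctx, b_view)); // reference

    struct ggml_cgraph * gf = ggml_new_graph(ctx);
    ggml_build_forward_expand(gf, out_view);
    ggml_build_forward_expand(gf, out_cont);
    ggml_graph_compute_with_ctx(ctx, gf, 1);

    double max_err = 0.0;
    for (int i = 0; i < ggml_nelements(out_cont); ++i) {
        max_err = fmax(max_err, fabs(((float *) out_view->data)[i] - ((float *) out_cont->data)[i]));
    }

    printf("max err: %g\n", max_err);
    ggml_free(ctx);

    return max_err < 1e-5 ? 0 : 1;
}
```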

@slaren
Collaborator

slaren commented Dec 29, 2023

At the moment I am working on moving llama.cpp fully to ggml-backend with ggml_backend_sched for partial offloading, and bringing everything up to feature parity with the old CUDA API. After that, we can remove a lot of code, globals, and generally do a lot of refactoring of the CUDA backend.

@ggerganov
Owner Author

Sounds great - no rush on this padding problem

@FSSRepo
Collaborator

FSSRepo commented Dec 29, 2023

I think the Metal backend could benefit from expanded matrix multiplication support, significantly improving performance in some cases such as the one in this comment.
