
Faster AVX2 matrix multiplications for legacy quants #405

Merged: 4 commits into Mozilla-Ocho:main on May 10, 2024

Conversation

@ikawrakow (Contributor):

It seems some people still use the ggml legacy quants Q4_0, Q4_1, Q5_0, and Q5_1, so here is a PR that improves matrix multiplication performance for these quants on AVX2. The gains for Q4_1, Q5_0, and Q5_1, which do not have a tinyBLAS implementation, are very significant, but even Q4_0 is faster than tinyBLAS (see table below).

I have gone for a templated implementation. This costs 2-3% in performance but reduces the code size by at least a factor of 2.
The implementation requires at least a C++14 compiler because I have used auto return type deduction for two functions. Is this a problem?
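For context, here is a minimal sketch of the C++14 feature in question; the function and its body are illustrative of a Q4-style nibble unpack, not the PR's actual code. Under C++11 the return type would have to be spelled out (or written as a trailing return type); C++14 deduces it from the return expression:

```cpp
// Sketch of C++14 auto return type deduction (illustrative, not the PR's code).
#include <immintrin.h>
#include <cstdint>

// C++14 deduces the return type (__m256i) from the return statement.
static auto load_q4_nibbles(const uint8_t* x) {
    __m128i bits = _mm_loadu_si128(reinterpret_cast<const __m128i*>(x));
    // Put low nibbles in the lower lane, high nibbles in the upper lane.
    __m256i both = _mm256_set_m128i(_mm_srli_epi16(bits, 4), bits);
    return _mm256_and_si256(both, _mm256_set1_epi8(0xf));
}
```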

Prompt processing speed (tokens per second) for a 512-token prompt (PP-512) for a 7B LLaMA model:

| CPU | Quants | PP-512 (Master) | PP-512 (PR) | Speedup |
|---|---|---|---|---|
| Ryzen-7950X | Q4_0 | 114.5 | 130.6 | 1.141 |
| Ryzen-7950X | Q4_1 | 66.0 | 138.0 | 2.091 |
| Ryzen-7950X | Q5_0 | 55.8 | 126.4 | 2.265 |
| Ryzen-7950X | Q5_1 | 54.0 | 126.4 | 2.341 |
| Ryzen-5975WX | Q4_0 | 120.2 | 161.0 | 1.339 |
| Ryzen-5975WX | Q4_1 | 91.3 | 166.8 | 1.827 |
| Ryzen-5975WX | Q5_0 | 83.4 | 155.6 | 1.866 |
| Ryzen-5975WX | Q5_1 | 77.8 | 162.0 | 2.083 |

The PR can also help with token generation (TG) speed. On my system TG is fully memory-bound at more than 4-8 threads (depending on quantization type). So, to better illustrate the performance differences, here are the TG-128 results (tokens per second) with just 2 threads on a Ryzen-7950X for a 7B LLaMA model:

| CPU | Quants | TG-128 (Master) | TG-128 (PR) | Speedup |
|---|---|---|---|---|
| Ryzen-7950X | Q4_0 | 4.39 | 10.86 | 2.474 |
| Ryzen-7950X | Q4_1 | 5.69 | 11.49 | 2.019 |
| Ryzen-7950X | Q5_0 | 6.00 | 9.00 | 1.500 |
| Ryzen-7950X | Q5_1 | 4.67 | 8.79 | 1.882 |

@jart (Collaborator) left a review:

Looks good to me. Approved. Could you sync to head, please? I needed to change how your earlier contribution is compiled, in an effort to make room in the binary for flash attention. I basically just renamed a file and added an if statement that uses X86_HAVE(AVX2) to do dispatching at runtime. That helped me get your first iteration into a release, and I can cut another once this is merged too. Thanks!
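For readers unfamiliar with that pattern, here is a minimal sketch of runtime CPU-feature dispatch. The kernel names are hypothetical, and the portable GCC/Clang builtin __builtin_cpu_supports() stands in for Cosmopolitan's X86_HAVE(AVX2) macro:

```cpp
// Sketch of runtime CPU-feature dispatch; function names are hypothetical.
// llamafile uses Cosmopolitan's X86_HAVE(AVX2) where this sketch uses the
// portable GCC/Clang builtin __builtin_cpu_supports().
#include <cstdio>

static void mul_mat_avx2()    { std::puts("using AVX2 path"); }
static void mul_mat_generic() { std::puts("using generic path"); }

int main() {
    if (__builtin_cpu_supports("avx2"))
        mul_mat_avx2();      // fast path, selected once at runtime
    else
        mul_mat_generic();   // portable fallback for older CPUs
}
```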

Commit messages from the pushed commits:

> Somehow memcpy is kind of slow, so for getting 4 bytes from 2-byte-aligned data it is faster to just OR together two consecutive 16-bit entries.

> However, as it currently stands, we have lost the zen4-tuned version.
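As a hedged illustration of that trick (not the PR's exact code): a 4-byte value stored at an address that is only 2-byte aligned can be assembled from two aligned 16-bit loads instead of a memcpy call, which compilers sometimes fail to optimize well in this situation:

```cpp
// Sketch of the trick described above; illustrative, not the PR's code.
// Reads 4 bytes from a 2-byte-aligned pointer by OR-ing two aligned
// 16-bit loads, avoiding a memcpy call.
#include <cstdint>

static inline uint32_t load32_from_aligned16(const uint16_t* p) {
    return (uint32_t)p[0] | ((uint32_t)p[1] << 16);  // little-endian x86 layout
}
```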
@ikawrakow (Contributor, Author):

@jart I synced to head. But to get back the Ryzen-7950X performance I had to make two separate iqk_mul_mat versions (one for generic AVX2 and one with AVX512F+AVX512VNNI+AVX512VL enabled).
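For illustration, here is one hypothetical way to get two such variants from a single kernel source by compiling it twice with different flags; the names, macro scheme, and build arrangement are assumptions, not llamafile's actual setup:

```cpp
// Hypothetical sketch: one translation unit compiled twice, once with
// -mavx2 -mfma and once with -mavx512f -mavx512vnni -mavx512vl, yielding
// differently-named variants. Not llamafile's actual build arrangement.
#include <cstdint>

#if defined(__AVX512VNNI__)
#define SUFFIX _zen4     // zen4 build: compiler may emit VNNI instructions
#else
#define SUFFIX _avx2     // generic AVX2 build
#endif
#define NAME_(f, s) f##s
#define NAME(f, s) NAME_(f, s)

// The same kernel body is compiled once per target; auto-vectorization
// can use the wider/faster instructions only in the zen4 build.
int32_t NAME(iqk_mul_mat, SUFFIX)(int n, const int8_t* x, const int8_t* y) {
    int32_t sum = 0;
    for (int i = 0; i < n; ++i) sum += x[i] * y[i];
    return sum;
}
```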

@jart (Collaborator) left a review:

You can have all the microarchitecture targets you need. LGTM. Thanks!

@jart (Collaborator) commented on May 10, 2024:

> Somehow memcpy is kind of slow, so for getting 4 bytes from 2-byte-aligned data it is faster to just OR together two consecutive 16-bit entries.

I'd encourage you to work your magic on Cosmopolitan's memcpy() function: https://github.com/jart/cosmopolitan/blob/master/libc/intrin/memmove.c. You can run the tests with either make -j32 or make -j32 o//test/libc/intrin.

@jart merged commit eaa756d into Mozilla-Ocho:main on May 10, 2024.
@jart (Collaborator) commented on May 10, 2024:

Also, did you notice this? https://www.phoronix.com/news/Llamafile-0.8.2-More-AVX2 Congrats!
