
Faster AVX2 matrix multiplications for legacy quants #405

Merged: 4 commits into Mozilla-Ocho:main on May 10, 2024

Conversation

@ikawrakow (Contributor):

It seems some people still use the ggml legacy quants Q4_0, Q4_1, Q5_0, and Q5_1, so here is a PR that improves matrix multiplication performance for these quants on AVX2. The gains for Q4_1, Q5_0, and Q5_1, which do not have a tinyBLAS implementation, are very significant, but even Q4_0 is faster than tinyBLAS (see table below).

I have gone for a templated implementation. This costs 2-3% in performance but reduces the code size by at least a factor of 2.
The implementation requires at least a C++14 compiler because I have used auto return type deduction for two functions. Is this a problem?
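For context, here is a minimal sketch of the C++14 feature in question; the function and its body are illustrative of a Q4-style nibble unpack, not the PR's actual code. Under C++11 the return type would have to be spelled out (or written as a trailing return type); C++14 deduces it from the return expression:

```cpp
// Sketch of C++14 auto return type deduction (illustrative, not the PR's code).
#include <immintrin.h>
#include <cstdint>

// C++14 deduces the return type (__m256i) from the return statement.
static auto load_q4_nibbles(const uint8_t* x) {
    __m128i bits = _mm_loadu_si128(reinterpret_cast<const __m128i*>(x));
    // Put low nibbles in the lower lane, high nibbles in the upper lane.
    __m256i both = _mm256_set_m128i(_mm_srli_epi16(bits, 4), bits);
    return _mm256_and_si256(both, _mm256_set1_epi8(0xf));
}
```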

Prompt processing speed (tokens per second) for a 512-token prompt (PP-512) for a 7B LLaMA model:

| CPU | Quants | PP-512 (Master) | PP-512 (PR) | Speedup |
|---|---|---|---|---|
| Ryzen-7950X | Q4_0 | 114.5 | 130.6 | 1.141 |
| Ryzen-7950X | Q4_1 | 66.0 | 138.0 | 2.091 |
| Ryzen-7950X | Q5_0 | 55.8 | 126.4 | 2.265 |
| Ryzen-7950X | Q5_1 | 54.0 | 126.4 | 2.341 |
| Ryzen-5975WX | Q4_0 | 120.2 | 161.0 | 1.339 |
| Ryzen-5975WX | Q4_1 | 91.3 | 166.8 | 1.827 |
| Ryzen-5975WX | Q5_0 | 83.4 | 155.6 | 1.866 |
| Ryzen-5975WX | Q5_1 | 77.8 | 162.0 | 2.083 |

The PR can also help with token generation (TG) speed. On my system TG is fully memory-bound at more than 4-8 threads (depending on quantization type). So, to better illustrate the performance differences, here are the TG-128 results (tokens per second) with just 2 threads on a Ryzen-7950X for a 7B LLaMA model:

| CPU | Quants | TG-128 (Master) | TG-128 (PR) | Speedup |
|---|---|---|---|---|
| Ryzen-7950X | Q4_0 | 4.39 | 10.86 | 2.474 |
| Ryzen-7950X | Q4_1 | 5.69 | 11.49 | 2.019 |
| Ryzen-7950X | Q5_0 | 6.00 | 9.00 | 1.500 |
| Ryzen-7950X | Q5_1 | 4.67 | 8.79 | 1.882 |

@jart (Collaborator) left a review:

Looks good to me. Approved. Could you sync to head, please? I needed to change how your earlier contribution is compiled, in an effort to make room in the binary for flash attention. I basically just renamed a file and added an if statement that uses X86_HAVE(AVX2) to do dispatching at runtime. That helped me get your first iteration into a release, and I can cut another once this is merged too. Thanks!
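For readers unfamiliar with that pattern, here is a minimal sketch of runtime CPU-feature dispatch. The kernel names are hypothetical, and the portable GCC/Clang builtin __builtin_cpu_supports() stands in for Cosmopolitan's X86_HAVE(AVX2) macro:

```cpp
// Sketch of runtime CPU-feature dispatch; function names are hypothetical.
// llamafile uses Cosmopolitan's X86_HAVE(AVX2) where this sketch uses the
// portable GCC/Clang builtin __builtin_cpu_supports().
#include <cstdio>

static void mul_mat_avx2()    { std::puts("using AVX2 path"); }
static void mul_mat_generic() { std::puts("using generic path"); }

int main() {
    if (__builtin_cpu_supports("avx2"))
        mul_mat_avx2();      // fast path, selected once at runtime
    else
        mul_mat_generic();   // portable fallback for older CPUs
}
```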

Commit messages from the pushed commits:

> Somehow memcpy is kind of slow, so for getting 4 bytes from 2-byte-aligned data it is faster to just OR together two consecutive 16-bit entries.

> However, as it currently stands, we have lost the zen4-tuned version.
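As a hedged illustration of that trick (not the PR's exact code): a 4-byte value stored at an address that is only 2-byte aligned can be assembled from two aligned 16-bit loads instead of a memcpy call, which compilers sometimes fail to optimize well in this situation:

```cpp
// Sketch of the trick described above; illustrative, not the PR's code.
// Reads 4 bytes from a 2-byte-aligned pointer by OR-ing two aligned
// 16-bit loads, avoiding a memcpy call.
#include <cstdint>

static inline uint32_t load32_from_aligned16(const uint16_t* p) {
    return (uint32_t)p[0] | ((uint32_t)p[1] << 16);  // little-endian x86 layout
}
```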
@ikawrakow (Contributor, Author):

@jart I synced to head. But to get back the Ryzen-7950X performance I had to make two separate iqk_mul_mat versions (one for generic AVX2 and one with AVX512F+AVX512VNNI+AVX512VL enabled).
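For illustration, here is one hypothetical way to get two such variants from a single kernel source by compiling it twice with different flags; the names, macro scheme, and build arrangement are assumptions, not llamafile's actual setup:

```cpp
// Hypothetical sketch: one translation unit compiled twice, once with
// -mavx2 -mfma and once with -mavx512f -mavx512vnni -mavx512vl, yielding
// differently-named variants. Not llamafile's actual build arrangement.
#include <cstdint>

#if defined(__AVX512VNNI__)
#define SUFFIX _zen4     // zen4 build: compiler may emit VNNI instructions
#else
#define SUFFIX _avx2     // generic AVX2 build
#endif
#define NAME_(f, s) f##s
#define NAME(f, s) NAME_(f, s)

// The same kernel body is compiled once per target; auto-vectorization
// can use the wider/faster instructions only in the zen4 build.
int32_t NAME(iqk_mul_mat, SUFFIX)(int n, const int8_t* x, const int8_t* y) {
    int32_t sum = 0;
    for (int i = 0; i < n; ++i) sum += x[i] * y[i];
    return sum;
}
```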

@jart (Collaborator) left a review:

You can have all the microarchitecture targets you need. LGTM. Thanks!

@jart (Collaborator) commented on May 10, 2024:

> Somehow memcpy is kind of slow, so for getting 4 bytes from 2-byte-aligned data it is faster to just OR together two consecutive 16-bit entries.

I'd encourage you to work your magic on Cosmopolitan's memcpy() function: https://github.com/jart/cosmopolitan/blob/master/libc/intrin/memmove.c. You can run the tests with either make -j32 or make -j32 o//test/libc/intrin.

@jart merged commit eaa756d into Mozilla-Ocho:main on May 10, 2024.
@jart (Collaborator) commented on May 10, 2024:

Also, did you notice this? https://www.phoronix.com/news/Llamafile-0.8.2-More-AVX2 Congrats!
