BENCHmarking new flash attention! #1676
Completed graphs below. A few highlights:
Enjoy!
Hello, what is the average seq_length in your benchmark? Flash attention only gives better performance for long prompts.
Great question! I'm not sure exactly what you mean by a "long prompt," so first I need to clarify my testing methodology. My specific use case is RAG-related. For my test, I'm feeding the model the following question: QUESTION HERE

The perfect answer to this question must mention two things (and two things only). I consider all responses that address these (and only these) two points as perfect in quality, even if different models answer slightly differently; i.e., I'm not judging on commas or syntax, just substance. However, if a model's response includes extra information that does not directly answer my question, even if the extra information is accurate based on the contexts, there's a slight deduction for "quality." These kinds of responses are not "incorrect," in that they don't give false information, nor are they "incomplete" (for example, by mentioning 72 hours but not what happens if it falls on a weekend or legal holiday). Obviously, there's a serious deduction if factually false info is included. If any model produces factually inaccurate information, I remove it from my testing altogether. For example, if a GGUF Q_2 produces false information I'll remove it but still keep the Q8_0 quant as long as it doesn't...

With that being said, and to ACTUALLY answer your question: "it depends." I'll give you some sample responses and you tell me if they're long form or short form (and hence not fair to test FA with). NOTE: All of these were produced by GGUF models since I haven't updated my ctranslate2 script to also save everything to a single .txt file, but ctranslate2's responses are basically the same. It's just easier to paste some GGUF responses for you right now...

EXAMPLE RESPONSES FROM QWEN_1_8B
COMPARE TO QWEN_.5B
COMPARE TO THE BADASS NEURAL-CHAT 7B
Again, you tell me if these are long form or short form...or if that actually refers to my question to the LLM...

Apart from RAG applications, it's virtually impossible to test for "quality" because it's so diverse and subjective. One short story written by neural-chat might be great but the next one crap because the sampling took a wrong turn somewhere...or you just like the writing style of one or the other. With RAG and definitive facts, you can assess whether the answer is complete or not. I'm not interested in "long-form" testing if you're referring to "write me a short story" and then calculating the tokens per second. I apologize if I wasn't as clear about this up front, but my testing is specifically geared towards RAG APPLICATIONS.

In conclusion, I will say that there are always runtime variations and it's impossible to control all variables. Even if I shut down and restart my computer from scratch, one reboot will show VRAM usage as being .1 GB higher/lower...stuff like that. Please, anyone seeing my tests, take them with a grain of salt or as a rule of thumb regarding general trends, and MAKE SURE TO TEST YOURSELF! :-)

Sincerely,

P.S. I'd love to hear your expert opinion on why - long form or short form - I'm still seeing a huge benefit from using a higher beam size. That, to me, was the most interesting finding!
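To make the rubric above concrete, here is a rough sketch of an automated stand-in for this kind of grading; the key-point strings and the length threshold are hypothetical placeholders, and the actual judging described above was done by hand:

```python
# Rough stand-in for the manual rubric above: "perfect" means both required
# points appear; a missing point fails outright, and extra verbiage gets a
# slight deduction. The key points and threshold here are placeholders.

REQUIRED_POINTS = ["72 hours", "weekend or legal holiday"]

def grade(response: str) -> float:
    text = response.lower()
    if not all(point.lower() in text for point in REQUIRED_POINTS):
        return 0.0  # incomplete: a required point is missing
    # Crude proxy for "extra info that doesn't answer the question":
    # small deduction if the answer runs much longer than it needs to.
    return 0.9 if len(text.split()) > 120 else 1.0

print(grade("You get 72 hours, extended if it falls on a weekend or legal holiday."))
```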
I tested all of the models yet again this morning, one right after the other, without opening/closing other programs, etc. Here are the results... They seem to confirm yet again the unique behavior of llama2-based models with the new flash attention. EVEN LLAMA3 did not exhibit this behavior...
To further illustrate...here is a chart for
I mean the number of input tokens. It would be great to compare with and without FA2 at prompt sizes from 1000 to 3000 tokens. I think the prompt that you used is quite good, but I'm not sure it is long enough.
Good to know. If one is efficient at RAG, you shouldn't need more than 3 contexts in MOST use cases. However, in a substantial minority of cases it's sometimes beneficial to give the model much larger inputs, e.g. 1-3k tokens, and leave 1k for the model to respond. I just haven't had a chance to test that yet, but it's on my "to do" list. ;-)
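For anyone wanting to run that comparison, here is a minimal sketch of the sweep using CTranslate2's Python API. The model paths are placeholders, and `flash_attention` is assumed to be the loader flag added in CTranslate2 4.0; treat this as a sketch, not the benchmark actually used here:

```python
import time
import ctranslate2
from transformers import AutoTokenizer

MODEL_DIR = "path/to/ct2-model"                                # hypothetical
tokenizer = AutoTokenizer.from_pretrained("path/to/hf-model")  # hypothetical

def tokens_per_second(flash: bool, prompt: str) -> float:
    # Fresh load per run so nothing carries over between configurations.
    generator = ctranslate2.Generator(MODEL_DIR, device="cuda",
                                      flash_attention=flash)
    tokens = tokenizer.convert_ids_to_tokens(tokenizer.encode(prompt))
    start = time.perf_counter()
    result = generator.generate_batch([tokens], max_length=512,
                                      include_prompt_in_result=False)[0]
    return len(result.sequences_ids[0]) / (time.perf_counter() - start)

# Compare FA on/off as the prompt grows from roughly 1000 to 3000 tokens.
filler = "Some filler sentence for padding purposes. "
for approx_tokens in (1000, 2000, 3000):
    prompt = filler * (approx_tokens // 8)  # rough size, not an exact count
    for flash in (False, True):
        print(approx_tokens, "FA" if flash else "no-FA",
              round(tokens_per_second(flash, prompt), 1))
```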
Running llama2-13b with flash attention on and off led to the same result as with llama2-7b... only a minuscule advantage from using flash attention, nowhere near the advantage seen with mistral-based models and/or SOLAR (and Llama3 for that matter):

Flash Attention True:
Flash Attention False:
I benched https://huggingface.co/cognitivecomputations/dolphin-llama2-7b:

With Flash Attention:
Without Flash Attention:
Once again, a llama2-based model does not reap the benefits of flash attention the way a Mistral/SOLAR/Llama3 model does... To be clear, there is some benefit (more than with vanilla llama2), but nothing like the gains seen with those other model architectures...
More benchmarks... I wanted to see if flash attention was better utilized when running in bfloat16, the model's original format. Even then, it still doesn't benefit as much as mistral/solar/llama3...

With Flash Attention:
No Flash Attention:
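For reference, running in bfloat16 with CTranslate2 is a loader option; a minimal sketch, assuming the `compute_type` argument as in recent CTranslate2 releases and a placeholder model path:

```python
import ctranslate2

# Hypothetical model directory; compute_type selects the runtime precision.
generator = ctranslate2.Generator(
    "path/to/ct2-model",
    device="cuda",
    compute_type="bfloat16",  # the model's original format
    flash_attention=True,     # set to False for the comparison run
)
```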
Congrats on Flash Attention in the latest version, or, to be precise, on having your storage limit increased on PyPI.org so you could finally upload the release that had been ready for weeks. Here are some benchmarks for your edification!
Testing Setup
I used the `generate_batch` method with various beam sizes. The model was unloaded and reloaded each time, i.e. no caching of the static prompt. Tokens per second, however, is calculated on compute time only, not model loading time. (A rough sketch of this setup follows.)
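A hedged sketch of that setup; `generate_batch`, `beam_size`, and `flash_attention` are the actual CTranslate2 knobs described here, while the paths and prompt are placeholders:

```python
import time
import ctranslate2
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("path/to/hf-model")  # hypothetical
prompt = tokenizer.convert_ids_to_tokens(tokenizer.encode("QUESTION HERE"))

for beam_size in (1, 2, 3, 4, 5):
    # Reload from scratch each run: no caching of the static prompt.
    generator = ctranslate2.Generator("path/to/ct2-model", device="cuda",
                                      flash_attention=True)
    start = time.perf_counter()  # clock starts after loading ...
    result = generator.generate_batch([prompt], beam_size=beam_size,
                                      max_length=512,
                                      include_prompt_in_result=False)[0]
    elapsed = time.perf_counter() - start  # ... so only compute time counts
    print(f"beam={beam_size}: {len(result.sequences_ids[0]) / elapsed:.1f} tok/s")
    del generator  # unload before the next beam size
```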
Observations:

Implications
For example:
In my RAG application I love the Zephyr 3b model. If I set beams to 5, it religiously produces the same quality of results as my other favorite for RAG: neural-chat 7b (using a single beam). However, using 5 beams with the Zephyr 3b model used to push its VRAM usage to MORE than the neural-chat model's (again, with 1 beam). Therefore, there was no point in using Zephyr 3b since I could simply use neural-chat 7b.
Now, however, I can use Zephyr 3b with 5 beams to achieve the same quality as neural-chat 7b (with 1 beam), but with less VRAM.
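To attach numbers to a comparison like that, one could snapshot VRAM with pynvml around generation; a minimal sketch with hypothetical paths and tokens (note it reads a point-in-time value, not a true peak):

```python
import ctranslate2
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

def used_gb() -> float:
    # Point-in-time device memory usage in GiB (not a true peak).
    return pynvml.nvmlDeviceGetMemoryInfo(handle).used / 1024**3

# Hypothetical converted model; repeat with neural-chat 7b at beam_size=1
# to reproduce the comparison described above.
generator = ctranslate2.Generator("path/to/ct2-zephyr-3b", device="cuda",
                                  flash_attention=True)
print(f"after load: {used_gb():.2f} GB")
generator.generate_batch([["<s>", "▁Hello"]], beam_size=5, max_length=256)
print(f"after generation: {used_gb():.2f} GB")
```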
In conclusion, GGUF is "sort of" faster: it's only faster if you disregard CTranslate2's ability to boost the quality of smaller models up a tier via beam size + flash attention, which in the process dramatically reduces VRAM usage and will inevitably increase tokens per second as well.
Thus, it's not appropriate to do a direct comparison of `int8` and `GGUF Q8_0` as of late. Here's a graph for you visual people. Feel free to contact me with any questions! Also, this is only a hobby so I'm always open to critiques.