
Flash Attention #844

Closed
ss4elby opened this issue May 12, 2024 · 6 comments

Comments

@ss4elby

ss4elby commented May 12, 2024

So I noticed it runs WAY slow, then realized my card isn't set up for that: I'm running ye oldie P40, so no tensor cores. But this fellow over at flash attention apparently made it possible to work without them (ggerganov#7188). I assume this isn't implemented here yet; any chance?

@LostRuins
Owner

No, it's not implemented yet. I will merge it for the next version.

@ss4elby
Author

ss4elby commented May 13, 2024

Appreciated, your work is amazing!

@Spacellary

Spacellary commented May 14, 2024

Truly a joyous occasion! This looks very promising!

@LostRuins
Owner

Hi, can you see if this works fine for you on the latest version?

@gustrd

gustrd commented May 24, 2024

I checked with my old MX150 and now it works.

The llama.cpp update adding flash attention support for CUDA cards without tensor cores must have solved it. Prompt processing is faster now (around 2x), but generation is a bit slower (around 20%). Still, it's a good tradeoff in the end.
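For anyone wanting to reproduce this kind of comparison, here is a minimal sketch using upstream llama.cpp's llama-bench tool. The flag names (-p, -n, -fa) and the comma-separated sweep are assumptions based on llama.cpp's CLI at the time, and the model path is a placeholder; koboldcpp exposes its own flash-attention toggle separately.

```sh
# Hedged sketch: benchmark prompt processing (-p tokens) and generation (-n tokens)
# with flash attention off and on (-fa 0,1); llama-bench runs one row per combination.
./llama-bench -m ./model.gguf -p 512 -n 128 -fa 0,1
```

Comparing the two rows should show the same pattern reported above: faster prompt processing with flash attention enabled, possibly at some cost to generation speed on older cards.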

@ss4elby
Author

ss4elby commented May 24, 2024

It seems to work fine, and holy hell it's quick too. Thank you!

ss4elby closed this as completed May 24, 2024