GPU Usage dropping before completion ends #669
Comments
How long is that prompt? Do you mind copying it here in text form so I can try it directly?
I have tried it with long (many thousands of tokens) and short (~300 tokens) prompts. It produces the same issue. If you want to try my exact prompt here it is:
Or,
(sorry for all the edits)
Can confirm: it's really slow on the longer prompt.
@jeanromainroy if it's possible, can you try rebooting your machine? That seems to resolve the speed issue on my end. I can generate quite quickly with the prompt you provided.
Rebooting sometimes works, but not always. I tried rebooting approximately 10 times, and it worked about half of the time. I serve the
Ok let me see if I can reproduce the bad state. Just starting and killing the flask server is enough to make it slow down? That's pretty wild. |
Here's my code if it can save time:
I ran the Flask server you posted, then ctrl+c'd it, then ran the model regularly, and it was the same speed (generating reasonably fast, about 7.5 tps). I'm not sure how to reproduce this yet. Maybe you could share the exact sequence of commands you use, or some non-sensitive version?
I'm running into a similar slow-down on an M2 Ultra Mac Studio (60-core GPU, 192 GB RAM) with Llama-2 70B Q8. This usually happens after loading and running different larger models. For this model, the tps dropped from ~8 to under 1. Rebooting seems to resolve the problem in my case.
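For reference on the numbers quoted in this thread: the tps figures are just tokens generated divided by wall-clock generation time. A minimal sketch of that calculation (the helper name and the token/second counts are illustrative, not taken from any tool output):

```python
def tokens_per_second(n_tokens: int, elapsed_s: float) -> float:
    """Tokens generated divided by wall-clock seconds."""
    return n_tokens / elapsed_s

# A healthy run as reported above: ~8 tps, e.g. 240 tokens in 30 s.
print(tokens_per_second(240, 30.0))   # -> 8.0
# The degraded state: under 1 tps, e.g. the same 240 tokens taking over 4 minutes.
print(tokens_per_second(240, 250.0))  # -> 0.96
```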
Thanks for the data point. Still looking into a better solution for that. |
I have been using the new Command-R+ model in 4-bit mode and consistently observe a drop in GPU utilization immediately after prompt evaluation, as it begins generation/prediction. This leads to significantly reduced performance.
During evaluation:
During generation, the drop occurs right before the first token is predicted (i.e. "<PAD>"):
Here's my setup:
Machine: Apple M2 Ultra (8E + 16P CPU cores, 60-core GPU), 192 GB RAM
ProductName: macOS
ProductVersion: 14.3
BuildVersion: 23D56
I have tried with and without setting my memory limit:
sudo sysctl iogpu.wired_lwm_mb=150000
I have tried with and without disabling the cache:
mx.metal.set_cache_limit(0)
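For context, the cache-limit call above is a one-line runtime configuration that has to run before any generation. A minimal sketch of how it slots in (this assumes the `mlx` package is installed and uses only the call already quoted in this report; it is a config fragment, not a full script):

```python
import mlx.core as mx

# Disable MLX's Metal buffer cache, as tried above: with a limit of 0,
# freed buffers are returned to the OS instead of being kept for reuse.
# Trades allocation speed for a smaller resident memory footprint.
mx.metal.set_cache_limit(0)
```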
Any help would be welcome, because at the moment I am only able to use the llama.cpp implementation of Command-R+, which works without any issues.