Running 2 x A770 with Ollama, inference responses slow down dramatically #10847
Comments
I've been checking this section of the Ollama logs, just to make sure it's using both GPUs and getting past the 8GB limit mentioned in tickets from a few months back:
Is there anything else I should be looking at to figure out what's going on, or is that enough info? |
Here's a quick screenshot of the GPU stats mid-generation. Interestingly, it stayed pretty much constant like that to the end of the response, with the blitter on card1 busy at up to 45% and neither card clocking much above 300MHz. This is similar behaviour to what I saw when trying to run vLLM in Docker with old drivers on the host machine. Could this be a driver issue? |
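(The per-card busy/frequency figures above are the kind of thing intel_gpu_top reports; a minimal sketch of watching each card separately - the card numbering is an assumption for this machine:)

```bash
# Watch each card in its own terminal while a response is generating.
# Requires intel-gpu-tools; card0/card1 numbering is an assumption for this box.
sudo intel_gpu_top -d drm:/dev/dri/card0
sudo intel_gpu_top -d drm:/dev/dri/card1
```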
Hi @digitalscream, I could not reproduce this problem. I think it is possible that this is a driver issue. Would you mind running the scripts at https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/scripts to check your system environment and replying with the output? |
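(For reference, running that environment check is roughly the following - a sketch only; the script name env-check.sh is an assumption, so check the linked directory for the actual entry point:)

```bash
# Clone the repo and run the environment check script from the linked directory
# (script name assumed; see the directory listing if it differs).
git clone --depth 1 https://github.com/intel-analytics/ipex-llm.git
cd ipex-llm/python/llm/scripts
bash env-check.sh
```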
Hi @digitalscream,
{"function":"print_timings","level":"INFO","line":272,"msg":"prompt eval time = 197.77 ms / 32 tokens ( 6.18 ms per token, 161.80 tokens per second)","n_prompt_tokens_processed":32,"n_tokens_second":161.80329775346232,"slot_id":0,"t_prompt_processing":197.77100000000002,"t_token":6.1803437500000005,"task_id":10,"tid":"140241139329024","timestamp":1713827080}
{"function":"print_timings","level":"INFO","line":286,"msg":"generation eval time = 1998.17 ms / 102 runs ( 19.59 ms per token, 51.05 tokens per second)","n_decoded":102,"n_tokens_second":51.04668219086354,"slot_id":0,"t_token":19.58991176470588,"t_token_generation":1998.171,"task_id":10,"tid":"140241139329024","timestamp":1713827080}
|
Run on host:
Run inside Docker container:
Notably, the host driver version is 23.52.28202.51, but the container's version is 23.35.27191.42 (I'm using the provided intelanalytics/ipex-llm-xpu:latest image). |
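(A quick way to make that host-vs-container comparison - not from the thread, just a sketch assuming clinfo is available in both places; the container name is a placeholder:)

```bash
# On the host:
clinfo | grep -i "driver version"

# Inside the running container (container name is a placeholder):
docker exec -it ipex-llm-container bash -c 'clinfo | grep -i "driver version"'
```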
1 - Here you go:
You'll note that the second request to
2 - Yes, the slowdown occurs within a single generation round. Initial speed is (guessing) ~35t/s, then it quickly slows down after the first 20 or so tokens. EDIT: For comparison, this is the same query running on the same system, with a single A770:
|
Hi @digitalscream, we have tested the performance of llama3-8b on our dual-A770 machine and found that it is somewhat slower compared to a single A770 card, which is expected for now. However, we could not replicate the large performance gap that you experienced (115.67ms on dual cards vs 25.82ms on a single card). Based on our investigation, this might be due to your GPU driver not being properly installed - for example, you can compare the driver component versions against what our machine reports:
CLI:
Version: 1.2.13.20230704
Build ID: 00000000
Service:
Version: 1.2.13.20230704
Build ID: 00000000
Level Zero Version: 1.14.0

So please reinstall the GPU driver following our guide: https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Quickstart/install_linux_gpu.html#for-linux-kernel-6-5. |
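(The gist of that guide on Ubuntu 22.04 with kernel 6.5 is to pull the compute runtime from Intel's GPU apt repository rather than relying on whatever is already installed. The sketch below is from memory - the repository line and package set may have changed, so treat the linked guide as authoritative:)

```bash
# Add Intel's GPU package repository (key URL and suite from memory - verify against the guide).
wget -qO - https://repositories.intel.com/gpu/intel-graphics.key | \
  sudo gpg --dearmor --output /usr/share/keyrings/intel-graphics.gpg
echo "deb [arch=amd64 signed-by=/usr/share/keyrings/intel-graphics.gpg] https://repositories.intel.com/gpu/ubuntu jammy unified" | \
  sudo tee /etc/apt/sources.list.d/intel-gpu-jammy.list

# Install/upgrade the user-space driver stack, then reboot.
sudo apt-get update
sudo apt-get install -y intel-opencl-icd intel-level-zero-gpu level-zero
```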
@sgwhat - fair enough, I'll give that a go. I must admit, I installed the latest drivers on the host and then just used the provided Docker image. Do you know how often that image gets updated? I'm happy to build it from an Ubuntu base image, of course, but it'd be much nicer if the provided image were already up to date :) |
hi @digitalscream , |
OK, I've updated the host drivers - both are showing the same updated version numbers now. However, the same behaviour remains (it's slightly faster, at ~10t/s average over the course of the "Tell me a story in 1000 words" prompt). I've tried running it with both the provided image and one I built myself. This is the startup script:
Am I doing anything obviously wrong there (aside from the unused variables - I'm just experimenting to get it working)? |
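(One generic thing worth confirming in any dual-GPU startup script - a sketch, not a claim about the script above - is that both cards' DRM nodes are actually passed into the container:)

```bash
# Quick check that both cards' DRM nodes are visible inside the container
# (image tag from this thread; render node numbers below are assumptions for this machine).
docker run --rm --device=/dev/dri intelanalytics/ipex-llm-xpu:latest ls -l /dev/dri

# Alternatively, pass the nodes explicitly instead of the whole directory:
docker run --rm \
  --device=/dev/dri/card0 --device=/dev/dri/renderD128 \
  --device=/dev/dri/card1 --device=/dev/dri/renderD129 \
  intelanalytics/ipex-llm-xpu:latest ls -l /dev/dri
```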
Your startup script is fine. Could you please double-check the image's system environment by running the environment check script mentioned above? |
Yup, here it is for host and container respectively:
|
Hi @digitalscream, you don't need to set up the GPU driver in the Docker image. Note: you only need to install the GPU driver on the host - please don't install any driver in your container. |
I can't really run it on the host system bare at the moment, because of weird Python issues this machine's had for ages (hence running it in Docker). I've just tried running it without the driver in the container - Ollama fails to run any models because it can't find the SYCL devices:
With your dual-A770 machine, are you running it bare or with Docker? Would it help if I gave you the files for my image to see if you can replicate it there? |
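(A quick way to see whether the container enumerates the GPUs through the host driver - assuming oneAPI is present in the image; the container name below is a placeholder:)

```bash
# List the SYCL devices the container can see; both A770s should show up as
# Level Zero GPU devices if the host driver is being picked up correctly.
docker exec -it ipex-ollama bash -c 'source /opt/intel/oneapi/setvars.sh && sycl-ls'
```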
Hi @digitalscream, it seems your |
I believe it is, yes - remember, this works perfectly under the same Docker images with a single GPU. I think we might be going down a bit of a blind alley here - that was all with a bare image installed from scratch, and I've just gone back to the provided image.

Could the answer simply be that the second GPU is running on a PCIE 3.0 x4 interface? Weirdly, it does seem to be the one doing most of the work (even though it's listed as /dev/dri/card1, so I'd expect /dev/dri/card0 to be the primary). I can't find any way of monitoring the PCIE bandwidth in use, because this machine has an AMD CPU (sorry...). The only other machine I have with multiple x8 slots is an old Xeon E5 machine, which doesn't support AVX, so the Arc drivers won't install.

If that's a genuine possibility, I'm OK with sending the second card back (I can't use it on my desktop because the Arc drivers have no fan control under Linux). |
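(On the PCIE question: even on an AMD platform the negotiated link speed/width per card can be read from lspci, which would at least confirm whether the second card is really running at x4. The bus address below is a placeholder:)

```bash
# Find the bus addresses of the two Arc cards, then read the negotiated link state;
# LnkSta shows the actual speed/width (e.g. "Speed 8GT/s, Width x4").
lspci | grep -i 'vga\|display'
sudo lspci -vvs 03:00.0 | grep -i 'lnkcap:\|lnksta:'   # replace 03:00.0 with each card's address
```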
OK, some useful information - if I use:
...to limit Ollama to using either card (0 or 1) in the above setup. I tried forcing it to use OpenCL using the same environment variable, but Ollama fails to load the model - I suspect that's an Ollama problem more than anything. |
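(For anyone trying the same thing: the usual way to pin SYCL/Level Zero workloads to one card is the ONEAPI_DEVICE_SELECTOR environment variable. Whether that's the exact variable used above isn't shown here, so treat this as a general sketch:)

```bash
# Pin the server to one Level Zero GPU at a time (run one or the other, not both).
ONEAPI_DEVICE_SELECTOR=level_zero:0 ./ollama serve   # first card only
ONEAPI_DEVICE_SELECTOR=level_zero:1 ./ollama serve   # second card only
```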
Hi @digitalscream, could you please try running a single-card task on each of the two GPUs simultaneously? We want to compare their performance to rule out any hardware issues. For your information, we have tested on our dual-card machine: Ollama could find the SYCL devices in our container without installing anything, and you can see the performance below:
|
@sgwhat - results for card0 (x16):
card1 (x4):
Incidentally, it does seem that even when running just one task on one card, it's slower with another GPU present than if there's just one card physically attached. As I said right up at the top, I normally get over 50t/s, but with two GPUs present it's closer to 40 (I have no idea if that's useful information or not). In any case, running two tasks simultaneously does seem to rule out any immediate hardware issue, although there's still the possibility that the x4 link on the second card is holding it back if the cards need to talk to each other directly. For what it's worth, I do have ReBar and >4G decoding enabled, everything else is just stock settings in the BIOS. EDIT: I've just removed the second GPU, and it's still showing the ~40t/s performance, so that's a red herring - I've probably caused that myself somewhere along the line. |
Additional info: I've managed to get a conda environment running on the host machine, and it shows exactly the same behaviour: 37-38t/s for a single card, or starting at that level for the first 20-30 tokens on dual GPUs and then quickly dropping to ~8t/s. |
Hi @digitalscream, sorry, we can't reproduce your issue. It could possibly be related to your hardware or GPU driver installation. You may refer to our documentation https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Quickstart/install_linux_gpu.html#for-linux-kernel-6-5 for more details. |
@sgwhat - for what it's worth, I've kinda solved the problem. It turns out that you were on the right track with the drivers - I had some older driver packages still installed. Removing those and updating means that I now see 30t/s when running llama3 on dual GPUs. Weirdly, though, I also see 30t/s when running on a single card (either specifying a single card with two in the machine, or with just one physically installed) - exactly the same performance no matter how many cards are installed, but it's definitely using both (going by the GPU stats). That's the same behaviour I saw when I had a pair of Tesla P100s installed. I think I need to just wipe this server at some point... it's had far too much installed and removed over the last year or so. Thanks for your help! |
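(For anyone hitting the same thing: the offending packages aren't named above, but listing the installed Intel GPU user-space components is a quick way to spot stale or duplicate versions. The package-name patterns below are common ones, not a definitive list:)

```bash
# List installed Intel compute-runtime / Level Zero packages and their versions
# to spot leftovers from earlier driver installs.
dpkg -l | grep -Ei 'intel-opencl|intel-level-zero|level-zero|intel-igc|intel-compute'
```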
Happy to close this one now; I've ended up sending the other card back, because running 20GB+ models (the whole point of the second card) is just too slow to be usable. I'll wait for Battlemage... |
System specs: 2 x A770, one on PCIE 3.0 x16, one on PCIE 3.0 x4, Ryzen 3600, 64GB RAM (EDIT: Host is Ubuntu 22.04, Docker base image is intelanalytics/ipex-llm-xpu:latest).
When running Ollama in Docker, using the setup described here:
https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Quickstart/ollama_quickstart.html
...it loads large models just fine (eg Codellama 34b, with approx. 10GB on each GPU), but the response generation starts around 30t/s and gradually slows down to 5t/s within the same response.
At the same time, the GPU frequency slows from 1GHz+ down to 100-300MHz.
The GPUs aren't power-starved, and they're running in open air (so no cooling problems, ambient temp is ~18C).
Worth noting that I'm running with source ipex-llm-init -c -g in the container startup and --no-mmap on llama.cpp (it segfaults without the former, and hangs or outright crashes without the latter).
I know it's not ideal running the second GPU on PCIE x4, but I'd have thought that'd cause general slowness rather than a gradual slowdown.
It shows the exact same behaviour with smaller models, too - eg Llama 3 8b and Mistral 7b. With those smaller models, running on a single GPU returns 50-60t/s.