Llama.cpp not working with intel ARC 770? #7042

Closed
SergioVargasRamirez opened this issue May 2, 2024 · 17 comments
@SergioVargasRamirez

Hi,

I am trying to get llama.cpp to work on a workstation with one Intel Arc A770 GPU, but whenever I try to use the GPU, llama.cpp does something (I see the GPU being used for computation via intel_gpu_top) for 30 seconds or so and then just hangs there, using 100% CPU (but only one core), as if it were waiting for something to happen...

I am using the following command:

ZES_ENABLE_SYSMAN=0 ./main -m ~/LLModels/Mistral-7B-Instruct-v0.1-GGUF/mistral-7b-instruct-v0.1.Q5_K_S.gguf -p "Alice lived in Wonderland and her favorite food was:" -n 512 -e -ngl 33 -sm none -mg 0 -t 32

It doesn't matter if I drop the ZES_ENABLE_SYSMAN part.

Now, the same build does work with -ngl 0. There I see the 32 cores being used and the model produces output.

If I run clinfo, I get the following output.

Platform #0: Intel(R) FPGA Emulation Platform for OpenCL(TM)
 `-- Device #0: Intel(R) FPGA Emulation Device
Platform #1: Intel(R) OpenCL
 `-- Device #0: AMD Ryzen 9 5950X 16-Core Processor
Platform #2: Intel(R) OpenCL Graphics
 `-- Device #0: Intel(R) Arc(TM) A770 Graphics

sycl-ls:

[opencl:acc:0] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device OpenCL 1.2  [2024.17.3.0.08_160000]
[opencl:cpu:1] Intel(R) OpenCL, AMD Ryzen 9 5950X 16-Core Processor             OpenCL 3.0 (Build 0) [2024.17.3.0.08_160000]
[opencl:gpu:2] Intel(R) OpenCL Graphics, Intel(R) Arc(TM) A770 Graphics OpenCL 3.0 NEO  [24.18.0]

ls-sycl-device sees three SYCL devices, one of which is the GPU.

found 3 SYCL devices:
|  |                  |                                             |Compute   |Max compute|Max work|Max sub|               |
|ID|       Device Type|                                         Name|capability|units      |group   |group  |Global mem size|
|--|------------------|---------------------------------------------|----------|-----------|--------|-------|---------------|
| 0|    [opencl:gpu:0]|               Intel(R) Arc(TM) A770 Graphics|       3.0|        512|    1024|     32|    16225243136|
| 1|    [opencl:cpu:0]|AMD Ryzen 9 5950X 16-Core Processor            |       3.0|         32|    8192|     64|   134983352320|
| 2|    [opencl:acc:0]|               Intel(R) FPGA Emulation Device|       1.2|         32|67108864|     64|   134983352320|

Now I don't see the level-zero device. I had it at some point, but then had no opencl:gpu device in exchange. With the level-zero device I had the same problem: the GPU activates for ~30 s and goes back to zero activity, while ./main stays on guard for hours if I don't cancel it.

I am running openSUSE Tumbleweed and installed Intel oneAPI locally using the online installer. I don't see compilation issues. I also compiled Neo and its requirements. All these packages are in my home directory, but that doesn't seem to be the issue, because previously installed Intel packages (via zypper) were available system-wide, with the same results.

I am really lost here because I don't seem to be getting any error. I am sure the GPU crashes, but I don't know why or where to look for this info. So I would really appreciate your help. I can test anything you want (this is not a production system or anything like that).

thanks in advance and best regards,

Sergio

@Jeximo
Contributor

Jeximo commented May 2, 2024

Now, the same build does work with -ngl 0. There I see the 32 cores being used and the model produces output.

If I understand correctly, the CPU works as expected, but not the GPU. I think more information is needed on how you built llama.cpp. CMake?

@SergioVargasRamirez
Author

Yes, exactly. I use CMake, and the CPU works. With the GPU, the program waits for something to happen but never finishes... it just waits. I see something happening on the GPU for about 30 s before usage drops to 0, and then nothing happens.

cmake -DLLAMA_SYCL=ON -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx -DCMAKE_BUILD_TYPE=Release -DLLAMA_SYCL_F16=ON ..

I tried with the Intel packages from the Intel repo, and with compiling them myself as well, and I get the same results: the GPU never finishes.
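
For reference, the full build sequence I use is roughly the following; a minimal sketch, assuming a default system-wide oneAPI install under /opt/intel/oneapi (adjust the setvars.sh path for a home-directory install):

# load the oneAPI compiler environment
source /opt/intel/oneapi/setvars.sh

# configure and build llama.cpp with the SYCL backend (build layout as of May 2024)
mkdir -p build && cd build
cmake .. -DLLAMA_SYCL=ON -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx -DCMAKE_BUILD_TYPE=Release -DLLAMA_SYCL_F16=ON
cmake --build . --config Release -j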

@jwhitehorn

Sounds like maybe the non-free firmware isn't installed.
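
A quick way to check is to look for the DG2 (A770) blobs and whether the kernel actually loaded them; a sketch, and note the exact file names vary by firmware version:

# are the DG2 firmware blobs installed?
ls /lib/firmware/i915/ | grep -i dg2

# did the kernel load the GuC/HuC firmware?
sudo dmesg | grep -i -E 'guc|huc'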

@sevragorgia

will check, thanks

@simonlui

simonlui commented May 3, 2024

You need the output to show level-zero or you won't be able to use the SYCL backend properly. There was a regression between Linux kernel 6.8.x and the compute runtime that caused that. The freeze also isn't a llama.cpp issue; it's a kernel issue with Intel's drivers and compute runtime. See the two issues below for more details.
intel/compute-runtime#710
intel/compute-runtime#726
Given the churn and everything, it's highly suggested to stay on Linux kernel 6.6.26 or a lower LTS kernel if you can, because running the latest kernel with the patches that solve both problems has reduced performance anyway, and there is a lot of code change and churn going on with whatever is happening on the Intel side to cause these issues.
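
To check whether you are affected, run sycl-ls and look for a Level Zero entry for the GPU; on a working setup it looks roughly like this (version strings illustrative):

[ext_oneapi_level_zero:gpu:0] Intel(R) Level-Zero, Intel(R) Arc(TM) A770 Graphics 1.3 [1.3.26241]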

@sevragorgia

Thanks @simonlui, that certainly makes my life a bit harder, since I installed Tumbleweed mainly to get driver support for the Arc A770. Would you suggest simply going for Ubuntu? I think support there is a bit better for Intel GPUs.

cheers

@simonlui

simonlui commented May 3, 2024

Ubuntu is the only consumer distro officially supported by Intel for Arc consumer GPUs. They also support Red Hat and SUSE Enterprise, but only if you pay for Arc datacenter GPUs. See their dGPU install guide for Linux.
I run Fedora, and the custom packages provided are fine. The only things I am missing are a way to run some things without containers, and some capabilities of the modified kernel modules that provide better system statistics for the GPU in monitoring tools like xpu-smi. But I am stuck with the latest kernel packages, like you are on Tumbleweed, so yes, that is also now a downside.

@jwhitehorn

jwhitehorn commented May 3, 2024

The kernel issue is a good point. I downgraded to 6.8.4 and it runs fine there. That's the most recent kernel that works; every newer revision has a regression, but...

The fact that @SergioVargasRamirez doesn't see the SYCL device isn't indicative of being on the wrong kernel. That issue manifests itself as the SYCL device being visible, but attempts to use it just hanging. The fact that the Arc A770 is missing as a SYCL device hints at something different.

In terms of OS, I find it easier to run my LLM work inside a Docker container. That way I can run the official tooling from Intel on Ubuntu 22 without having to worry about changing my host OS. All my host OS needs to worry about is ensuring that the non-free firmware is loaded and that I'm not running a kernel newer than 6.8.4.
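
For example, something along these lines; a sketch, assuming Intel's oneAPI Base Kit image (tag illustrative) with GPU passthrough via /dev/dri:

# Ubuntu 22.04 oneAPI container with the Arc GPU passed through
docker run -it --rm --device /dev/dri -v $HOME/LLModels:/models intel/oneapi-basekit:2024.1.0-devel-ubuntu22.04 /bin/bash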

@sevragorgia

I had a SYCL device with the packages from the openSUSE Intel repo. The kernel in Tumbleweed is 6.8.7, I think. I will need to check at home, but I am sure it is >6.8.4.

I will give it a try using Docker.

If I understand correctly, you don't need any of the Intel oneAPI etc. stuff on the host, just the firmware from Intel (i915, I guess) and the right kernel.

It is a bit silly of Intel to support only Ubuntu, I think. Why not offer support for .deb- and .rpm-based distros in general?

Thanks for all your help; this has helped me a lot.

@simonlui

simonlui commented May 3, 2024

Guys, you are mixing up issues. Showing up as SYCL with Level Zero in sycl-ls depends on either your Linux kernel being older than 6.8 or having a new enough compute runtime, which is what issue 710 is about. The hang happens if you are on 6.8.4 or higher, due to faulty kernel patches; that was only recently resolved with patches in Intel's own Linux build, which is what issue 726 was about. And once 726 was fixed, it lowered performance a lot.

@sevragorgia Arc by itself will work on any Linux distro that repackages the right packages, which all of them at this point do, but only Ubuntu will ever offer the "full experience" with the custom kernel modules and other (optional) packages. The oneAPI Base Toolkit and the compute runtime are the minimum required to compile and run llama.cpp, but that can be done either in Docker or on the host; Intel hosts custom .deb, .rpm, etc. repositories for oneAPI.

@NeoZhangJianyu
Collaborator

@SergioVargasRamirez

1. The SYCL backend is recommended to be used with a level-zero device; it has better performance than OpenCL on Intel GPUs. I suggest installing the Level Zero runtime and trying again (see the sketch at the end of this comment). I recently found that the backend doesn't run well on OpenCL, but I haven't checked the root cause or when it broke. OpenCL is not a focus of the SYCL backend.

2. If you want to run llama.cpp on OpenCL, using the CLBlast (OpenCL) backend is an option.

3. I have hit some hang issues recently; most of the solutions were to downgrade the kernel or driver. You could try upgrading to the latest kernel and driver, or downgrading to a stable kernel or driver.

4. If you still hit the issue, please try the latest llama.cpp code and paste the whole running log here. There is more debug info in the log.

Hope the above info helps!
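
A minimal sketch of installing the Level Zero runtime on Ubuntu, assuming Intel's graphics repository is already configured (package names as in Intel's dGPU install guide; check the guide for other distros):

# Level Zero loader plus the Level Zero GPU driver
sudo apt install level-zero intel-level-zero-gpu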

@SergioVargasRamirez
Author

Thanks. I installed Ubuntu 22 now, and IPEX-LLM works fine. I will try llama.cpp, probably today, but I don't see why it would not work out of the box, because the GPU already works fine with IPEX-LLM. Ollama also works.

If the compiled version of llama.cpp works, I will try to do as suggested above: go back to openSUSE (this time Leap 15.6 instead of Tumbleweed) and try to install llama.cpp and ipex-llm in a Docker container running Ubuntu 22, to see if I can get the GPU to work on openSUSE that way.

I am not too attached to any distro, but I use openSUSE at work and would like to have the same system at home. I can report back.

Now... on Ubuntu it works fine. But since I need an older version of the Intel packages to run ipex-llm, the system keeps asking me to upgrade, which is annoying. Therefore I tend to favor the containerized solution proposed by @jwhitehorn.
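
One way to stop apt from nagging about those upgrades is to hold the packages; a sketch, with an illustrative package name:

# pin the oneAPI packages at their current version
sudo apt-mark hold intel-oneapi-basekit

# undo later with:
sudo apt-mark unhold intel-oneapi-basekit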

@SergioVargasRamirez
Author

Just wanted to report back.

I compiled llama.cpp on Ubuntu 22 with the Intel packages from the Intel repo, installed via apt.

All working fine now: the Intel Arc A770 GPU is used to generate text. I tried different models, and some generate weird output, like nonsense, Cyrillic symbols, or German text mixed with English. No idea why. All with the GPU.

As I wrote before, I will try to containerize this to avoid Ubuntu constantly asking to upgrade the Intel packages and breaking ipex-llm or llama.cpp.

I will also try to run the containerized solution on openSUSE Leap 15.6 after I have tested the containers on Ubuntu.

thanks for all your help!

@NeoZhangJianyu
Collaborator

Thank you for the feedback!

@SergioVargasRamirez
Author

Gonna leave this here just to document the process in case it helps other users.

I installed openSUSE Leap 15.6 on my PC. The kernel version is 6.4.0; I understand the kernel pulls GPU support from backports. The system is pretty clean, because I have not installed much besides docker, git-lfs, intel-gpu-tools, and htop.

I get the Arc A770 fully supported out of the box. So, instead of going down the previous path of installing the Intel packages, I did all the LLM stuff via Docker, as recommended.

I can confirm that the IPEX-LLM Docker image sees the GPU via SYCL, which is not installed on my PC. I tested the GPU via the test script provided by the image and can confirm that the GPU is used by the chat.py script.
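
For the record, starting the container looks roughly like this; a sketch, assuming the intelanalytics/ipex-llm-xpu image (name and tag may differ for your setup):

# IPEX-LLM XPU container with the Arc GPU passed through
docker run -it --rm --device /dev/dri -v $HOME/LLModels:/models intelanalytics/ipex-llm-xpu:latest /bin/bash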

I cloned llama.cpp and compiled it in the container following https://github.com/ggerganov/llama.cpp/blob/master/README-sycl.md#linux and tested it with the instructions provided there, but using mistral-7b-instruct-v0.1.Q4_0.gguf, and:

  1. I don't see any errors during compilation,
  2. I can confirm ls-sycl-device works, listing both the level_zero and opencl devices,
  3. the GPU is computing :-)

I used the example call ("create a website in 10 steps") asking for 512 tokens and got 57.93 tokens per second for prompt eval and 19.46 tokens per second for eval. I ran another prompt in which I asked for a 10,000-word essay about AI. The GPU is used at about 21%, 2300/2400 MHz, and 166500 irqs/s (whatever that means; I am not familiar with this stuff) for most of the computation. I never get 10k words, and I have no idea why; I have not tried to change any flags yet. If somebody knows how to get 10,000 words back, I'd appreciate any advice.

So, I guess that's it from my side.

Again, thanks for all the help. I hope this info here helps other users.

cheers
Sergio

@NeoZhangJianyu
Collaborator

@SergioVargasRamirez
Thank you for the feedback!
-n controls the number of output tokens. Change it and try again.
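
For example (a sketch; the model may still stop early when it emits an end-of-sequence token unless --ignore-eos is passed, and the context size -c caps how far a single run can go):

./main -m mistral-7b-instruct-v0.1.Q4_0.gguf -p "Write a 10,000-word essay about AI." -n 4096 -c 4096 -ngl 33
# or add --ignore-eos to keep generating past the model's natural stopping point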

@SergioVargasRamirez
Author

I changed that, but the model kept outputting 400-600 words.
