
Flex 170 GPU: ollama unable to detect GPU, and sycl-ls also not detecting it #10801

Open
shailesh837 opened this issue Apr 18, 2024 · 3 comments

Comments

@shailesh837

I am testing an older ollama binary with the llama2:7b model on an Intel Flex 170 GPU.
I have followed the new ipex-llm documentation for the driver install and the remaining steps, but when I run ollama serve it does not detect the GPU, and I also see a dll error when ollama serve starts.
Attaching the ollama serve log to this issue:
ollama_serve_flex_170.log

spandey2@IMU-NEX-EMR1-SUT:~/LLM_SceneScape_ChatBot$ sudo xpu-smi discovery
+-----------+--------------------------------------------------------------------------------------+
| Device ID | Device Information                                                                   |
+-----------+--------------------------------------------------------------------------------------+
| 0         | Device Name: Intel(R) Data Center GPU Flex 170                                       |
|           | Vendor Name: Intel(R) Corporation                                                    |
|           | SOC UUID: 00000000-0000-0000-dc8d-d308092ec026                                       |
|           | PCI BDF Address: 0000:29:00.0                                                        |
|           | DRM Device: /dev/dri/card1                                                           |
|           | Function Type: physical                                                              |
+-----------+--------------------------------------------------------------------------------------+

spandey2@IMU-NEX-EMR1-SUT:~/LLM_SceneScape_ChatBot$ sycl-ls
[opencl:acc:0] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device OpenCL 1.2 [2023.16.12.0.12_195853.xmain-hotfix]
[opencl:cpu:1] Intel(R) OpenCL, INTEL(R) XEON(R) PLATINUM 8580 OpenCL 3.0 (Build 0) [2023.16.12.0.12_195853.xmain-hotfix]

Modelfile:
FROM llama2:latest
PARAMETER num_gpu 999
PARAMETER temperature 0
PARAMETER num_ctx 4096
PARAMETER use_mmap false
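
(For reference, a Modelfile like this would typically be registered and run with the standard Ollama CLI as below; the model name flex-llama2 is just an illustrative placeholder.)

# build a named model from the Modelfile above
ollama create flex-llama2 -f ./Modelfile
# run it; num_gpu 999 requests that all layers be offloaded to the GPU
ollama run flex-llama2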

Ollama serve log:
time=2024-04-18T23:44:00.314+02:00 level=INFO source=gpu.go:285 msg="Searching for GPU management library libze_intel_gpu.so"
time=2024-04-18T23:44:00.315+02:00 level=INFO source=gpu.go:331 msg="Discovered GPU libraries: [/usr/lib/x86_64-linux-gnu/libze_intel_gpu.so.1.3.27642.40]"
time=2024-04-18T23:44:00.345+02:00 level=INFO source=gpu.go:377 msg="Unable to load oneAPI management library /usr/lib/x86_64-linux-gnu/libze_intel_gpu.so.1.3.27642.40: oneapi vram init failure: 2013265921"
time=2024-04-18T23:44:00.345+02:00 level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
time=2024-04-18T23:44:00.345+02:00 level=INFO source=routes.go:1044 msg="no GPU detected"
[GIN] 2024/04/18 - 23:45:10 | 200 | 120.449µs | 127.0.0.1 | HEAD "/"
[GIN] 2024/04/18 - 23:45:10 | 200 | 1.357185ms | 127.0.0.1 | GET "/api/tags"
time=2024-04-18T23:46:49.533+02:00 level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
time=2024-04-18T23:46:49.533+02:00 level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
time=2024-04-18T23:46:49.533+02:00 level=INFO source=llm.go:77 msg="GPU not available, falling back to CPU"
loading library /tmp/ollama1871007333/cpu_avx2/libext_server.so

llm_load_print_meta: LF token = 13 '<0x0A>'
llm_load_tensors: ggml ctx size = 0.11 MiB
llm_load_tensors: offloading 0 repeating layers to GPU
llm_load_tensors: offloaded 0/33 layers to GPU
llm_load_tensors: CPU buffer size = 3647.87 MiB
..................................................................................................
llama_new_context_with_model: n_ctx = 4096
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CPU KV buffer size = 2048.00 MiB
llama_new_context_with_model: KV self size = 2048.00 MiB, K (f16): 1024.00 MiB, V (f16): 1024.00 MiB
llama_new_context_with_model: CPU input buffer size = 16.02 MiB
llama_new_context_with_model: CPU compute buffer size = 308.00 MiB

sudo xpu-smi stats -d 0
IMU-NEX-EMR1-SUT: Thu Apr 18 23:49:41 2024

+-----------------------------+--------------------------------------------------------------------+
| Device ID | 0 |
+-----------------------------+--------------------------------------------------------------------+
| GPU Utilization (%) | 0 |
| EU Array Active (%) | N/A |
| EU Array Stall (%) | N/A |
| EU Array Idle (%) | N/A |
| | |
| Compute Engine Util (%) | 0; Engine 0: 0, Engine 1: 0, Engine 2: 0, Engine 3: 0 |
| Render Engine Util (%) | 0; Engine 0: 0 |
| Media Engine Util (%) | 0 |
| Decoder Engine Util (%) | Engine 0: 0, Engine 1: 0 |
| Encoder Engine Util (%) | Engine 0: 0, Engine 1: 0 |
| Copy Engine Util (%) | 0; Engine 0: 0 |
| Media EM Engine Util (%) | Engine 0: 0, Engine 1: 0 |
| 3D Engine Util (%) | N/A |
+-----------------------------+--------------------------------------------------------------------+
| Reset | N/A |
| Programming Errors | N/A |
| Driver Errors | N/A |
| Cache Errors Correctable | N/A |
| Cache Errors Uncorrectable | N/A |
| Mem Errors Correctable | N/A |
| Mem Errors Uncorrectable | N/A |
+-----------------------------+--------------------------------------------------------------------+
| GPU Power (W) | 43 |
| GPU Frequency (MHz) | 2050 |
| Media Engine Freq (MHz) | 1025 |
| GPU Core Temperature (C) | 87 |
| GPU Memory Temperature (C) | N/A |
| GPU Memory Read (kB/s) | N/A |
| GPU Memory Write (kB/s)     | N/A                                                                |
| GPU Memory Bandwidth (%) | 0 |
| GPU Memory Used (MiB) | 31 |
| GPU Memory Util (%) | 0 |
| Xe Link Throughput (kB/s) | N/A |
+-----------------------------+--------------------------------------------------------------------+

@sgwhat (Contributor) commented Apr 19, 2024

Hi Shailesh, I don't see any GPU device in your sycl-ls output. Could you please check your oneAPI installation, and remember to run source /opt/intel/oneapi/setvars.sh?

spandey2@IMU-NEX-EMR1-SUT:~/LLM_SceneScape_ChatBot$ sycl-ls
[opencl:acc:0] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device OpenCL 1.2 [2023.16.12.0.12_195853.xmain-hotfix]
[opencl:cpu:1] Intel(R) OpenCL, INTEL(R) XEON(R) PLATINUM 8580 OpenCL 3.0 (Build 0) [2023.16.12.0.12_195853.xmain-hotfix]
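
With a working driver and oneAPI setup, sycl-ls should additionally list the Flex 170 under the Level Zero (and usually the OpenCL GPU) backend. A rough illustration of the missing entries is below; the exact backend labels and version strings are assumptions that vary by driver release:

source /opt/intel/oneapi/setvars.sh
sycl-ls
# should additionally show lines resembling (illustrative only):
# [ext_oneapi_level_zero:gpu:0] Intel(R) Level-Zero, Intel(R) Data Center GPU Flex 170 1.3 [1.3.27642]
# [opencl:gpu:2] Intel(R) OpenCL Graphics, Intel(R) Data Center GPU Flex 170 3.0 [23.35.27642]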

@shailesh837 (Author)

The issue was that libllama_bigdl_core.so was missing from the /usr/lib folder. But there are two important issues we are still seeing:
a) Why do we need to create a Modelfile with the parameters below?
FROM llama2:latest
PARAMETER num_gpu 999
PARAMETER temperature 0
PARAMETER num_ctx 4096
PARAMETER use_mmap false

The older version of ollama worked without a Modelfile and without setting num_gpu; it ran all layers on the GPU.

b) Why does loading the model with ollama serve take 50 seconds, and a response take 20 seconds?
The same Modelfile and the same ollama serve on an Arc A770 GPU [16 GB GPU memory, the same as the Flex 170] loads in 10 seconds or less, and a response takes 1-2 seconds.

@sgwhat (Contributor) commented Apr 22, 2024

Hi @shailesh837,

  1. You may switch to our latest release version of Ollama with the command pip install --pre --upgrade ipex-llm[cpp]. In this version, libllama_bigdl_core.so is no longer required. (A consolidated sketch of the upgrade steps follows this list.)
  2. In the latest version of Ollama, it is no longer necessary to set PARAMETER num_gpu 999 in the Modelfile. The usage otherwise remains the same as in previous versions.
  3. Please ensure that Ollama is running on a GPU device. We will investigate the causes of the reduced performance on the Flex 170.
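
For reference, the full upgrade-and-relaunch flow might look roughly like the sketch below, based on the ipex-llm Ollama quickstart. The init-ollama helper, the working-directory name, and the environment variables are taken from that documentation as I understand it and may differ between releases, so treat this as illustrative rather than authoritative:

# upgrade to the latest ipex-llm Ollama build (libllama_bigdl_core.so no longer needed)
pip install --pre --upgrade ipex-llm[cpp]

# make the Level Zero GPU runtime visible to the server
source /opt/intel/oneapi/setvars.sh

# create the ollama symlinks in a working directory (helper shipped with ipex-llm[cpp])
mkdir -p ~/ollama-ipex && cd ~/ollama-ipex
init-ollama

# request all layers on the GPU and enable sysman-based device discovery
export OLLAMA_NUM_GPU=999
export ZES_ENABLE_SYSMAN=1
./ollama serve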
