GGML_ASSERT: ggml-cuda.cu:9198: !"CUDA error" #403

Open

laooopooo opened this issue May 7, 2024 · 6 comments
laooopooo commented May 7, 2024

Windows 10
llamafile-0.8.1

llamafile.exe -m Damysus-2.7B-Chat.Q8_0.gguf -ngl 9999
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA P104-100, compute capability 6.1, VMM: yes
llm_load_tensors: ggml ctx size = 0.42 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors: CPU buffer size = 130.47 MiB
llm_load_tensors: CUDA0 buffer size = 2684.12 MiB
.............................................................................................
llama_new_context_with_model: n_ctx = 512
llama_new_context_with_model: n_batch = 512
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CUDA0 KV buffer size = 160.00 MiB
llama_new_context_with_model: KV self size = 160.00 MiB, K (f16): 80.00 MiB, V (f16): 80.00 MiB
llama_new_context_with_model: CUDA_Host output buffer size = 0.20 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 108.24 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 6.01 MiB
llama_new_context_with_model: graph nodes = 1161
llama_new_context_with_model: graph splits = 2
ggml-cuda.cu:1159: ERROR: CUDA kernel vec_dot_q8_0_q8_1_impl has no device code compatible with CUDA arch 600. ggml-cuda.cu was compiled for: 500,600,700,750,800,900
ggml-cuda.cu:1159: ERROR: CUDA kernel vec_dot_q8_0_q8_1_impl has no device code compatible with CUDA arch 600. ggml-cuda.cu was compiled for: 500,600,700,750,800,900
ggml-cuda.cu:1159: ERROR: CUDA kernel vec_dot_q8_0_q8_1_impl has no device code compatible with CUDA arch 600. ggml-cuda.cu was compiled for: 500,600,700,750,800,900
...
CUDA error: unspecified launch failure
current device: 0, in function ggml_cuda_op_mul_mat at ggml-cuda.cu:10723
ggml_cuda_cpy_tensor_2d(src0_dd_i, src0, i03, i02/i02_divisor, dev[id].row_low, dev[id].row_high, stream)
GGML_ASSERT: ggml-cuda.cu:9198: !"CUDA error"


but in the same environment, llamafile-0.8 is okay

log -----------------------------------------------------

llamafile-0.8.exe -m Damysus-2.7B-Chat.Q8_0.gguf -ngl 9999
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA P104-100, compute capability 6.1, VMM: yes
llm_load_tensors: ggml ctx size = 0.42 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors: CPU buffer size = 130.47 MiB
llm_load_tensors: CUDA0 buffer size = 2684.12 MiB
.............................................................................................
llama_new_context_with_model: n_ctx = 512
llama_new_context_with_model: n_batch = 512
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CUDA0 KV buffer size = 160.00 MiB
llama_new_context_with_model: KV self size = 160.00 MiB, K (f16): 80.00 MiB, V (f16): 80.00 MiB
llama_new_context_with_model: CUDA_Host output buffer size = 0.20 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 108.24 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 6.01 MiB
llama_new_context_with_model: graph nodes = 1161
llama_new_context_with_model: graph splits = 2
{"function":"initialize","level":"INFO","line":485,"msg":"initializing slots","n_slots":1,"tid":"9434528","timestamp":1715074257}
{"function":"initialize","level":"INFO","line":494,"msg":"new slot","n_ctx_slot":512,"slot_id":0,"tid":"9434528","timestamp":1715074257}
{"function":"server_cli","level":"INFO","line":3080,"msg":"model loaded","tid":"9434528","timestamp":1715074257}

llama server listening at http://127.0.0.1:8080

opening browser tab... (pass --nobrowser to disable)
failed to open http://127.0.0.1:8080/ in a browser tab using /c/windows/explorer.exe: process exited with non-zero status
{"function":"server_cli","hostname":"127.0.0.1","level":"INFO","line":3203,"msg":"HTTP server listening","port":"8080","tid":"9434528","timestamp":1715074258}
{"function":"update_slots","level":"INFO","line":1639,"msg":"all slots are idle and system prompt is empty, clear the KV cache","tid":"9434528","timestamp":1715074258}
{"function":"log_server_request","level":"INFO","line":2784,"method":"GET","msg":"request","params":{},"path":"/","remote_addr":"","remote_port":-1,"status":200,"tid":"17594334534256","timestamp":1715074258}
{"function":"log_server_request","level":"INFO","line":2784,"method":"GET","msg":"request","params":{},"path":"/completion.js","remote_addr":"","remote_port":-1,"status":200,"tid":"17594334538528","timestamp":1715074258}
{"function":"log_server_request","level":"INFO","line":2784,"method":"GET","msg":"request","params":{},"path":"/index.js","remote_addr":"","remote_port":-1,"status":200,"tid":"17594334534256","timestamp":1715074258}
{"function":"log_server_request","level":"INFO","line":2784,"method":"GET","msg":"request","params":{},"path":"/json-schema-to-grammar.mjs","remote_addr":"","remote_port":-1,"status":200,"tid":"17594334539824","timestamp":1715074258}
{"function":"log_server_request","level":"INFO","line":2784,"method":"GET","msg":"request","params":{},"path":"/history-template.txt","remote_addr":"","remote_port":-1,"status":200,"tid":"17594334534256","timestamp":1715074258}
{"function":"log_server_request","level":"INFO","line":2784,"method":"GET","msg":"request","params":{},"path":"/prompt-template.txt","remote_addr":"","remote_port":-1,"status":200,"tid":"17594334539824","timestamp":1715074258}

Janghou commented May 9, 2024

FYI, same error trying llamafile (0.8.1) on an AMD 4800U / Linux Ubuntu 22.04 with ROCm:
./Phi-3-mini-4k-instruct.Q4_K_M.llamafile -ngl 9999

ggml_cuda_compute_forward: RMS_NORM failed
CUDA error: invalid device function
  current device: 0, in function ggml_cuda_compute_forward at ggml-cuda.cu:11444
  err
GGML_ASSERT: ggml-cuda.cu:9198: !"CUDA error"

Gabri94x commented May 9, 2024

Same here, running Phi-3 on Windows.

Janghou commented May 9, 2024

Not exactly sure why, but after some experimenting and a reboot, it suddenly started working when I set this environment variable:

HSA_OVERRIDE_GFX_VERSION=9.0.0 ./Phi-3-mini-4k-instruct.Q6_K.llamafile -ngl 9999

That said, it's not really faster on the iGPU of an AMD 4800U, but CPU usage is much lower (only one thread), so that's the win here.

FYI, I installed ROCm 6.1.1:
https://rocm.docs.amd.com/projects/install-on-linux/en/latest/tutorial/quick-start.html

It seems gfx90c is not officially supported, but with the override it just works.

The max UMA FB (GPU memory) size I can set in BIOS is 4GB, so it can run the Q6_K model.
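
For anyone hitting the same thing on a Renoir-class APU, the workaround boils down to checking which gfx ISA the iGPU reports and overriding it to a supported one. A rough sketch, assuming ROCm's rocminfo is installed and that gfx90c can be mapped onto the gfx900 (9.0.0) ISA:

# Show which gfx target the APU reports (gfx90c on a 4800U)
rocminfo | grep -m1 gfx

# Tell the ROCm runtime to treat it as gfx900, then run the llamafile
export HSA_OVERRIDE_GFX_VERSION=9.0.0
./Phi-3-mini-4k-instruct.Q6_K.llamafile -ngl 9999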

laooopooo commented May 10, 2024

I took a closer look at the llama.cpp upgrade log (Flash Attention). This llama.cpp upgrade enables tensor cores on the GPU, which can sometimes crash. In llama.cpp the feature is not enabled by default, but llamafile seems to enable it by default, so the bug shows up there. In actual testing it also depends on the GGUF file (it may have something to do with the quantization): with FA enabled, some models don't crash and some do. All in all, this upgrade is not very friendly. The problem still exists in the new llamafile-0.8.2, and it seems the developers have not seen this issue yet.

jart (Collaborator) commented May 10, 2024

Thank you @laooopooo. Does it work for you if you add -DGGML_CUDA_FORCE_MMQ?

laooopooo commented May 11, 2024

> Thank you @laooopooo. Does it work for you if you add -DGGML_CUDA_FORCE_MMQ?

I referenced this: ggerganov/llama.cpp#6529

- Compiled with -arch=native: it works okay.
- Compiled with -arch=all: the binary is big, but it works okay.
- Compiled with -arch=all-major: it does not work.
- With or without -DGGML_CUDA_FORCE_MMQ: it works either way.

My guess is that my card's arch is not covered by all-major. That settles it, so this topic can be closed. Thank you. Here is the nvcc command I used:

nvcc --shared ^
     -arch=native ^
     --forward-unknown-to-host-compiler ^
     -Xcompiler="/nologo /EHsc /O2 /GR /MT" ^
     -DNDEBUG ^
     -DGGML_BUILD=1 ^
     -DGGML_SHARED=1 ^
     -DGGML_CUDA_MMV_Y=1 ^
     -DGGML_CUDA_FORCE_MMQ ^
     -DGGML_MULTIPLATFORM ^
     -DGGML_CUDA_DMMV_X=32 ^
     -DK_QUANTS_PER_ITERATION=2 ^
     -DGGML_CUDA_PEER_MAX_BATCH_SIZE=128 ^
     -DGGML_MINIMIZE_CODE_SIZE ^
     -DGGML_USE_TINYBLAS ^
     -o ggml-cuda.dll ^
     ggml-cuda.cu ^
     -lcuda
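
That result fits the card: -arch=all-major only emits device code for the major compute capabilities (sm_50, sm_60, sm_70, sm_80, sm_90), while the P104-100 is compute capability 6.1 and the dp4a-based quantized kernels need sm_61, which is presumably why the arch-600 build reports no compatible device code for vec_dot_q8_0_q8_1_impl. A quick way to check both sides locally, as a sketch (assumes a reasonably recent NVIDIA driver and CUDA toolkit):

# Report the GPU's compute capability (expected: 6.1 for a P104-100)
nvidia-smi --query-gpu=name,compute_cap --format=csv

# List the real GPU architectures this nvcc can emit code for; 61 needs to be
# in the -arch/-gencode set (or use -arch=native / -arch=all) for the dp4a
# quantized kernels to get device code
nvcc --list-gpu-code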
