GGML_ASSERT: ggml-cuda.cu:9198: !"CUDA error" #403

Open

laooopooo opened this issue May 7, 2024 · 6 comments
laooopooo commented May 7, 2024

Windows 10
llamafile-0.8.1

llamafile.exe -m Damysus-2.7B-Chat.Q8_0.gguf -ngl 9999
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA P104-100, compute capability 6.1, VMM: yes
llm_load_tensors: ggml ctx size = 0.42 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors: CPU buffer size = 130.47 MiB
llm_load_tensors: CUDA0 buffer size = 2684.12 MiB
.............................................................................................
llama_new_context_with_model: n_ctx = 512
llama_new_context_with_model: n_batch = 512
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CUDA0 KV buffer size = 160.00 MiB
llama_new_context_with_model: KV self size = 160.00 MiB, K (f16): 80.00 MiB, V (f16): 80.00 MiB
llama_new_context_with_model: CUDA_Host output buffer size = 0.20 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 108.24 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 6.01 MiB
llama_new_context_with_model: graph nodes = 1161
llama_new_context_with_model: graph splits = 2
ggml-cuda.cu:1159: ERROR: CUDA kernel vec_dot_q8_0_q8_1_impl has no device code compatible with CUDA arch 600. ggml-cuda.cu was compiled for: 500,600,700,750,800,900
ggml-cuda.cu:1159: ERROR: CUDA kernel vec_dot_q8_0_q8_1_impl has no device code compatible with CUDA arch 600. ggml-cuda.cu was compiled for: 500,600,700,750,800,900
ggml-cuda.cu:1159: ERROR: CUDA kernel vec_dot_q8_0_q8_1_impl has no device code compatible with CUDA arch 600. ggml-cuda.cu was compiled for: 500,600,700,750,800,900
...
CUDA error: unspecified launch failure
current device: 0, in function ggml_cuda_op_mul_mat at ggml-cuda.cu:10723
ggml_cuda_cpy_tensor_2d(src0_dd_i, src0, i03, i02/i02_divisor, dev[id].row_low, dev[id].row_high, stream)
GGML_ASSERT: ggml-cuda.cu:9198: !"CUDA error"


but in the same environment, llamafile-0.8 is okay

log -----------------------------------------------------

llamafile-0.8.exe -m Damysus-2.7B-Chat.Q8_0.gguf -ngl 9999
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA P104-100, compute capability 6.1, VMM: yes
llm_load_tensors: ggml ctx size = 0.42 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors: CPU buffer size = 130.47 MiB
llm_load_tensors: CUDA0 buffer size = 2684.12 MiB
.............................................................................................
llama_new_context_with_model: n_ctx = 512
llama_new_context_with_model: n_batch = 512
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CUDA0 KV buffer size = 160.00 MiB
llama_new_context_with_model: KV self size = 160.00 MiB, K (f16): 80.00 MiB, V (f16): 80.00 MiB
llama_new_context_with_model: CUDA_Host output buffer size = 0.20 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 108.24 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 6.01 MiB
llama_new_context_with_model: graph nodes = 1161
llama_new_context_with_model: graph splits = 2
{"function":"initialize","level":"INFO","line":485,"msg":"initializing slots","n_slots":1,"tid":"9434528","timestamp":1715074257}
{"function":"initialize","level":"INFO","line":494,"msg":"new slot","n_ctx_slot":512,"slot_id":0,"tid":"9434528","timestamp":1715074257}
{"function":"server_cli","level":"INFO","line":3080,"msg":"model loaded","tid":"9434528","timestamp":1715074257}

llama server listening at http://127.0.0.1:8080

opening browser tab... (pass --nobrowser to disable)
failed to open http://127.0.0.1:8080/ in a browser tab using /c/windows/explorer.exe: process exited with non-zero status
{"function":"server_cli","hostname":"127.0.0.1","level":"INFO","line":3203,"msg":"HTTP server listening","port":"8080","tid":"9434528","timestamp":1715074258}
{"function":"update_slots","level":"INFO","line":1639,"msg":"all slots are idle and system prompt is empty, clear the KV cache","tid":"9434528","timestamp":1715074258}
{"function":"log_server_request","level":"INFO","line":2784,"method":"GET","msg":"request","params":{},"path":"/","remote_addr":"","remote_port":-1,"status":200,"tid":"17594334534256","timestamp":1715074258}
{"function":"log_server_request","level":"INFO","line":2784,"method":"GET","msg":"request","params":{},"path":"/completion.js","remote_addr":"","remote_port":-1,"status":200,"tid":"17594334538528","timestamp":1715074258}
{"function":"log_server_request","level":"INFO","line":2784,"method":"GET","msg":"request","params":{},"path":"/index.js","remote_addr":"","remote_port":-1,"status":200,"tid":"17594334534256","timestamp":1715074258}
{"function":"log_server_request","level":"INFO","line":2784,"method":"GET","msg":"request","params":{},"path":"/json-schema-to-grammar.mjs","remote_addr":"","remote_port":-1,"status":200,"tid":"17594334539824","timestamp":1715074258}
{"function":"log_server_request","level":"INFO","line":2784,"method":"GET","msg":"request","params":{},"path":"/history-template.txt","remote_addr":"","remote_port":-1,"status":200,"tid":"17594334534256","timestamp":1715074258}
{"function":"log_server_request","level":"INFO","line":2784,"method":"GET","msg":"request","params":{},"path":"/prompt-template.txt","remote_addr":"","remote_port":-1,"status":200,"tid":"17594334539824","timestamp":1715074258}

Janghou commented May 9, 2024

FYI, same error trying llamafile (0.8.1) on an AMD 4800U / Linux Ubuntu 22.04 with ROCm:
./Phi-3-mini-4k-instruct.Q4_K_M.llamafile -ngl 9999

ggml_cuda_compute_forward: RMS_NORM failed
CUDA error: invalid device function
  current device: 0, in function ggml_cuda_compute_forward at ggml-cuda.cu:11444
  err
GGML_ASSERT: ggml-cuda.cu:9198: !"CUDA error"

Gabri94x commented May 9, 2024

Same here, running Phi-3 on Windows.

Janghou commented May 9, 2024

Not exactly sure why, but after some experimenting and a reboot, it suddenly started working when I set this environment variable:

HSA_OVERRIDE_GFX_VERSION=9.0.0 ./Phi-3-mini-4k-instruct.Q6_K.llamafile -ngl 9999

That said, it's not really faster on the iGPU of an AMD 4800U, but CPU usage is much lower (only one thread), so that's the win here.

FYI, I installed ROCm 6.1.1:
https://rocm.docs.amd.com/projects/install-on-linux/en/latest/tutorial/quick-start.html

It seems gfx90c is not officially supported, but with the override it just works.

The max UMA FB (GPU memory) size I can set in BIOS is 4GB, so it can run the Q6_K model.
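
For anyone hitting the same thing on a Renoir-class APU, the workaround boils down to checking which gfx ISA the iGPU reports and overriding it to a supported one. A rough sketch, assuming ROCm's rocminfo is installed and that gfx90c can be mapped onto the gfx900 (9.0.0) ISA:

# Show which gfx target the APU reports (gfx90c on a 4800U)
rocminfo | grep -m1 gfx

# Tell the ROCm runtime to treat it as gfx900, then run the llamafile
export HSA_OVERRIDE_GFX_VERSION=9.0.0
./Phi-3-mini-4k-instruct.Q6_K.llamafile -ngl 9999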

laooopooo commented May 10, 2024

I took a closer look at the llama.cpp upgrade log (Flash Attention). This llama.cpp upgrade enables tensor cores on the GPU, which can sometimes crash. In llama.cpp the feature is not enabled by default, but llamafile seems to enable it by default, so the bug shows up there. In actual testing it also depends on the GGUF file (it may have something to do with the quantization): with FA enabled, some models don't crash and some do. All in all, this upgrade is not very friendly. The problem still exists in the new llamafile-0.8.2, and it seems the developers have not seen this issue yet.

jart (Collaborator) commented May 10, 2024

Thank you @laooopooo. Does it work for you if you add -DGGML_CUDA_FORCE_MMQ?

laooopooo commented May 11, 2024

> Thank you @laooopooo. Does it work for you if you add -DGGML_CUDA_FORCE_MMQ?

I referenced this: ggerganov/llama.cpp#6529

- Compiled with -arch=native: it works okay.
- Compiled with -arch=all: the binary is big, but it works okay.
- Compiled with -arch=all-major: it does not work.
- With or without -DGGML_CUDA_FORCE_MMQ: it works either way.

My guess is that my card's arch is not covered by all-major. That settles it, so this topic can be closed. Thank you. Here is the nvcc command I used:

nvcc --shared ^
     -arch=native ^
     --forward-unknown-to-host-compiler ^
     -Xcompiler="/nologo /EHsc /O2 /GR /MT" ^
     -DNDEBUG ^
     -DGGML_BUILD=1 ^
     -DGGML_SHARED=1 ^
     -DGGML_CUDA_MMV_Y=1 ^
     -DGGML_CUDA_FORCE_MMQ ^
     -DGGML_MULTIPLATFORM ^
     -DGGML_CUDA_DMMV_X=32 ^
     -DK_QUANTS_PER_ITERATION=2 ^
     -DGGML_CUDA_PEER_MAX_BATCH_SIZE=128 ^
     -DGGML_MINIMIZE_CODE_SIZE ^
     -DGGML_USE_TINYBLAS ^
     -o ggml-cuda.dll ^
     ggml-cuda.cu ^
     -lcuda
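
That result fits the card: -arch=all-major only emits device code for the major compute capabilities (sm_50, sm_60, sm_70, sm_80, sm_90), while the P104-100 is compute capability 6.1 and the dp4a-based quantized kernels need sm_61, which is presumably why the arch-600 build reports no compatible device code for vec_dot_q8_0_q8_1_impl. A quick way to check both sides locally, as a sketch (assumes a reasonably recent NVIDIA driver and CUDA toolkit):

# Report the GPU's compute capability (expected: 6.1 for a P104-100)
nvidia-smi --query-gpu=name,compute_cap --format=csv

# List the real GPU architectures this nvcc can emit code for; 61 needs to be
# in the -arch/-gencode set (or use -arch=native / -arch=all) for the dp4a
# quantized kernels to get device code
nvcc --list-gpu-code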
