vulkan on win11 #826

Open
fenixlam opened this issue May 4, 2024 · 1 comment
Comments


fenixlam commented May 4, 2024

OS is Windows 11. I noticed that koboldcpp 1.64.1 has Vulkan support, so I gave it a try on my AMD 6800U with 32 GB RAM and 3 GB of dedicated VRAM plus GPU shared memory, which can boost total VRAM to 17 GB. The AMD software shows the Vulkan driver (Vulkan version 2.0.299, Vulkan API 1.3.277). This time I tried Loyal-Toppy-Bruins-Maid-7B-DARE-Q8_0-imatrix.gguf, which works with --useclblast 0 0.

I suspect the Vulkan backend does not use shared memory: Task Manager did not show the model being loaded into shared GPU memory. I am reporting this anyway, in case it comes from a bug.
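To cross-check the Task Manager reading, here is a minimal sketch of my own (not part of koboldcpp) that polls the Windows "GPU Adapter Memory" performance counters through PowerShell. It assumes those counters exist on this Windows 11 build and that the locale is English (counter names are localized):

import subprocess

# Assumption: the standard "GPU Adapter Memory" counter set is available.
# Values are reported in bytes per adapter instance.
PS_CMD = (
    "Get-Counter '\\GPU Adapter Memory(*)\\Dedicated Usage',"
    "'\\GPU Adapter Memory(*)\\Shared Usage' "
    "| Select-Object -ExpandProperty CounterSamples "
    "| Format-Table Path, CookedValue -AutoSize"
)

def gpu_memory_snapshot() -> str:
    # One-shot dump of dedicated vs. shared GPU memory usage for every adapter.
    out = subprocess.run(
        ["powershell", "-NoProfile", "-Command", PS_CMD],
        capture_output=True, text=True, check=True,
    )
    return out.stdout

if __name__ == "__main__":
    # Run this while koboldcpp is loading the model and watch the "Shared Usage" lines.
    print(gpu_memory_snapshot())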

koboldcpp_1641.exe --threads 12 --host 0.0.0.0 --port 5002 --noshift --smartcontext --contextsize 16384 --blasbatchsize 2048 --usevulkan 0 0 --gpulayers 2
***
Welcome to KoboldCpp - Version 1.64.1
For command line arguments, please refer to --help
***
Attempting to use Vulkan library for faster prompt ingestion. A compatible Vulkan will be required.
Initializing dynamic library: koboldcpp_vulkan.dll
==========
Namespace(benchmark=None, blasbatchsize=2048, blasthreads=12, chatcompletionsadapter='', config=None, contextsize=16384, debugmode=0, flashattention=False, forceversion=0, foreground=False, gpulayers=2, highpriority=False, hordeconfig=None, host='0.0.0.0', ignoremissing=False, launch=False, lora=None, mmproj='', model=None, model_param='D:/program/koboldcpp/Loyal-Toppy-Bruins-Maid-7B-DARE-Q8_0-imatrix.gguf', multiuser=0, noavx2=False, noblas=False, nocertify=False, nommap=False, noshift=True, onready='', password=None, port=5002, port_param=5001, preloadstory='', quiet=False, remotetunnel=False, ropeconfig=[0.0, 10000.0], sdconfig=None, skiplauncher=False, smartcontext=True, ssl=None, tensor_split=None, threads=12, useclblast=None, usecublas=None, usemlock=False, usevulkan=[0, 0])
==========
Loading model: D:\program\koboldcpp\Loyal-Toppy-Bruins-Maid-7B-DARE-Q8_0-imatrix.gguf
[Threads: 12, BlasThreads: 12, SmartContext: True, ContextShift: False]

The reported GGUF Arch is: llama

---
Identified as GGUF model: (ver 6)
Attempting to Load...
---
Using automatic RoPE scaling. If the model has customized RoPE settings, they will be used directly instead!
System Info: AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
llama_model_loader: loaded meta data with 22 key-value pairs and 291 tensors from D:\program\koboldcpp\Loyal-Toppy-Bruins-Maid-7B-DARE-Q8_0-imatrix.gguf (version GGUF V3 (latest))
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 32768
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 32768
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 8B
llm_load_print_meta: model ftype      = unknown, may not work
llm_load_print_meta: model params     = 7.24 B
llm_load_print_meta: model size       = 7.17 GiB (8.50 BPW)
llm_load_print_meta: general.name     = .
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
ggml_vulkan: Found 2 Vulkan devices:
Vulkan0: AMD Radeon(TM) Graphics | uma: 1 | fp16: 1 | warp size: 64
Vulkan1: AMD Radeon(TM) Graphics | uma: 1 | fp16: 1 | warp size: 64
llm_load_tensors: ggml ctx size =    0.69 MiB
llm_load_tensors: offloading 2 repeating layers to GPU
llm_load_tensors: offloaded 2/33 layers to GPU
llm_load_tensors:        CPU buffer size =  2283.62 MiB
llm_load_tensors:        CPU buffer size =  6763.75 MiB
llm_load_tensors:    Vulkan0 buffer size =   221.03 MiB
llm_load_tensors:    Vulkan1 buffer size =   221.03 MiB
...................................................................................................
Automatic RoPE Scaling: Using (scale:1.000, base:10000.0).
llama_new_context_with_model: n_ctx      = 16384
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:        CPU KV buffer size =  1920.00 MiB
llama_kv_cache_init:    Vulkan0 KV buffer size =    64.00 MiB
llama_kv_cache_init:    Vulkan1 KV buffer size =    64.00 MiB
llama_new_context_with_model: KV self size  = 2048.00 MiB, K (f16): 1024.00 MiB, V (f16): 1024.00 MiB
llama_new_context_with_model: Vulkan_Host  output buffer size =     0.12 MiB
llama_new_context_with_model:    Vulkan0 compute buffer size =  1153.00 MiB
llama_new_context_with_model:    Vulkan1 compute buffer size =  1088.00 MiB
llama_new_context_with_model: Vulkan_Host compute buffer size =    40.01 MiB
llama_new_context_with_model: graph nodes  = 1030
llama_new_context_with_model: graph splits = 335
Traceback (most recent call last):
  File "koboldcpp.py", line 3330, in <module>
  File "koboldcpp.py", line 3073, in main
  File "koboldcpp.py", line 396, in load_model
OSError: [WinError -1073741569] Windows Error 0xc00000ff
[130728] Failed to execute script 'koboldcpp' due to unhandled exception!
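For what it's worth, 0xC00000FF is an NTSTATUS code rather than a regular Win32 error code. A minimal sketch (my own, assuming Windows; not part of koboldcpp) to print the system's own description of it:

import ctypes
from ctypes import wintypes

# FormatMessageW flags (values from the Win32 API).
FORMAT_MESSAGE_FROM_HMODULE   = 0x00000800
FORMAT_MESSAGE_FROM_SYSTEM    = 0x00001000
FORMAT_MESSAGE_IGNORE_INSERTS = 0x00000200

kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
ntdll = ctypes.WinDLL("ntdll")  # NTSTATUS message strings live in ntdll's message table

def ntstatus_message(status: int) -> str:
    # Look up the descriptive text for an NTSTATUS code such as 0xC00000FF.
    buf = ctypes.create_unicode_buffer(512)
    n = kernel32.FormatMessageW(
        wintypes.DWORD(FORMAT_MESSAGE_FROM_HMODULE
                       | FORMAT_MESSAGE_FROM_SYSTEM
                       | FORMAT_MESSAGE_IGNORE_INSERTS),
        ctypes.c_void_p(ntdll._handle),    # module whose message table is searched
        wintypes.DWORD(status & 0xFFFFFFFF),
        wintypes.DWORD(0),                 # default language
        buf,
        wintypes.DWORD(len(buf)),
        None,
    )
    return buf.value.strip() if n else f"no message found for {status:#010x}"

if __name__ == "__main__":
    print(ntstatus_message(0xC00000FF))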

henk717 commented May 11, 2024

I have the same CPU as you, but my BIOS only allows 512 MB of dedicated VRAM, so I can confirm that Vulkan does use shared memory. Versions 1.62 to 1.64.1 are known to have Vulkan bugs, though, so try again with 1.65.
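For reference, the retry on 1.65 would be the same invocation with the newer binary (the filename below is hypothetical), keeping the flags unchanged:

koboldcpp_165.exe --threads 12 --host 0.0.0.0 --port 5002 --noshift --smartcontext --contextsize 16384 --blasbatchsize 2048 --usevulkan 0 0 --gpulayers 2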
