
Some Ollama models apparently affected by llama.cpp BPE pretokenization issue #4126

Open
sealad886 opened this issue May 3, 2024 · 11 comments
Labels
bug Something isn't working

Comments

@sealad886

sealad886 commented May 3, 2024

What is the issue?

See the following llama.cpp issues/PRs:

  • PR 6920: llama : improve BPE pre-processing + LLaMA 3 and Deepseek support
  • Issue 7030: Command-R GGUF conversion no longer working
  • Issue 7040: Command-R-Plus unable to convert or use after BPE pretokenizer update
  • many others regarding various models either producing gibberish or otherwise not working

After using updated llama.cpp builds and doing a little digging under the hood on the BPE issue, here is an example of the verbose output when starting ollama serve:

time=2024-05-03T14:01:02.120+01:00 level=INFO source=images.go:828 msg="total blobs: 36"
time=2024-05-03T14:01:02.124+01:00 level=INFO source=images.go:835 msg="total unused blobs removed: 0"
time=2024-05-03T14:01:02.125+01:00 level=INFO source=routes.go:1071 msg="Listening on 127.0.0.1:11434 (version 0.1.33)"
time=2024-05-03T14:01:02.125+01:00 level=INFO source=payload.go:30 msg="extracting embedded files" dir=/var/folders/b8/br9qpd7x3md9qcdzps_58h240000gn/T/ollama1317780243/runners
time=2024-05-03T14:01:02.153+01:00 level=INFO source=payload.go:44 msg="Dynamic LLM libraries [metal]"
time=2024-05-03T14:01:20.990+01:00 level=INFO source=memory.go:152 msg="offload to gpu" layers.real=-1 layers.estimate=41 memory.available="27648.0 MiB" memory.required.full="22869.9 MiB" memory.required.partial="22869.9 MiB" memory.required.kv="2560.0 MiB" memory.weights.total="19281.9 MiB" memory.weights.repeating="17641.2 MiB" memory.weights.nonrepeating="1640.7 MiB" memory.graph.full="516.0 MiB" memory.graph.partial="516.0 MiB"
time=2024-05-03T14:01:20.990+01:00 level=INFO source=memory.go:152 msg="offload to gpu" layers.real=-1 layers.estimate=41 memory.available="27648.0 MiB" memory.required.full="22869.9 MiB" memory.required.partial="22869.9 MiB" memory.required.kv="2560.0 MiB" memory.weights.total="19281.9 MiB" memory.weights.repeating="17641.2 MiB" memory.weights.nonrepeating="1640.7 MiB" memory.graph.full="516.0 MiB" memory.graph.partial="516.0 MiB"
time=2024-05-03T14:01:20.991+01:00 level=INFO source=server.go:289 msg="starting llama server" cmd="/var/folders/b8/br9qpd7x3md9qcdzps_58h240000gn/T/ollama1317780243/runners/metal/ollama_llama_server --model /Users/andrew/.ollama/models/blobs/sha256-8a9611e7bca168be635d39d21927d2b8e7e8ea0b5d0998b7d5980daf1f8d4205 --ctx-size 2048 --batch-size 512 --embedding --log-disable --n-gpu-layers 41 --parallel 1 --port 62223"
time=2024-05-03T14:01:21.030+01:00 level=INFO source=sched.go:340 msg="loaded runners" count=1
time=2024-05-03T14:01:21.030+01:00 level=INFO source=server.go:432 msg="waiting for llama runner to start responding"
{"function":"server_params_parse","level":"INFO","line":2606,"msg":"logging to file is disabled.","tid":"0x1f56dbac0","timestamp":1714741281}
{"build":2770,"commit":"952d03d","function":"main","level":"INFO","line":2823,"msg":"build info","tid":"0x1f56dbac0","timestamp":1714741281}
{"function":"main","level":"INFO","line":2830,"msg":"system info","n_threads":6,"n_threads_batch":-1,"system_info":"AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | ","tid":"0x1f56dbac0","timestamp":1714741281,"total_threads":12}
llama_model_loader: loaded meta data with 23 key-value pairs and 322 tensors from /Users/andrew/.ollama/models/blobs/sha256-8a9611e7bca168be635d39d21927d2b8e7e8ea0b5d0998b7d5980daf1f8d4205 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = command-r
llama_model_loader: - kv   1:                               general.name str              = c4ai-command-r-v01
llama_model_loader: - kv   2:                      command-r.block_count u32              = 40
llama_model_loader: - kv   3:                   command-r.context_length u32              = 131072
llama_model_loader: - kv   4:                 command-r.embedding_length u32              = 8192
llama_model_loader: - kv   5:              command-r.feed_forward_length u32              = 22528
llama_model_loader: - kv   6:             command-r.attention.head_count u32              = 64
llama_model_loader: - kv   7:          command-r.attention.head_count_kv u32              = 64
llama_model_loader: - kv   8:                   command-r.rope.freq_base f32              = 8000000.000000
llama_model_loader: - kv   9:     command-r.attention.layer_norm_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                          general.file_type u32              = 2
llama_model_loader: - kv  11:                      command-r.logit_scale f32              = 0.062500
llama_model_loader: - kv  12:                command-r.rope.scaling.type str              = none
llama_model_loader: - kv  13:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  14:                      tokenizer.ggml.tokens arr[str,256000]  = ["<PAD>", "<UNK>", "<CLS>", "<SEP>", ...
llama_model_loader: - kv  15:                  tokenizer.ggml.token_type arr[i32,256000]  = [3, 3, 3, 3, 3, 3, 3, 3, 1, 1, 1, 1, ...
llama_model_loader: - kv  16:                      tokenizer.ggml.merges arr[str,253333]  = ["Ġ Ġ", "Ġ t", "e r", "i n", "Ġ a...
llama_model_loader: - kv  17:                tokenizer.ggml.bos_token_id u32              = 5
llama_model_loader: - kv  18:                tokenizer.ggml.eos_token_id u32              = 255001
llama_model_loader: - kv  19:            tokenizer.ggml.padding_token_id u32              = 0
llama_model_loader: - kv  20:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  21:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  22:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   41 tensors
llama_model_loader: - type q4_0:  280 tensors
llama_model_loader: - type q6_K:    1 tensors
llm_load_vocab: missing pre-tokenizer type, using: 'default'
llm_load_vocab:                                             
llm_load_vocab: ************************************        
llm_load_vocab: GENERATION QUALITY WILL BE DEGRADED!        
llm_load_vocab: CONSIDER REGENERATING THE MODEL             
llm_load_vocab: ************************************        
llm_load_vocab:                                             
llm_load_vocab: special tokens definition check successful ( 1008/256000 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = command-r
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 256000
llm_load_print_meta: n_merges         = 253333
llm_load_print_meta: n_ctx_train      = 131072
llm_load_print_meta: n_embd           = 8192
llm_load_print_meta: n_head           = 64
llm_load_print_meta: n_head_kv        = 64
llm_load_print_meta: n_layer          = 40
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 8192
llm_load_print_meta: n_embd_v_gqa     = 8192
llm_load_print_meta: f_norm_eps       = 1.0e-05
llm_load_print_meta: f_norm_rms_eps   = 0.0e+00
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 6.2e-02
llm_load_print_meta: n_ff             = 22528
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = none
llm_load_print_meta: freq_base_train  = 8000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 131072
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 35B
llm_load_print_meta: model ftype      = Q4_0
llm_load_print_meta: model params     = 34.98 B
llm_load_print_meta: model size       = 18.83 GiB (4.62 BPW) 
llm_load_print_meta: general.name     = c4ai-command-r-v01
llm_load_print_meta: BOS token        = 5 '<BOS_TOKEN>'
llm_load_print_meta: EOS token        = 255001 '<|END_OF_TURN_TOKEN|>'
llm_load_print_meta: PAD token        = 0 '<PAD>'
llm_load_print_meta: LF token         = 136 'Ä'
llm_load_tensors: ggml ctx size =    0.34 MiB
ggml_backend_metal_buffer_from_ptr: allocated buffer, size = 19281.92 MiB, (19282.00 / 27648.00)
llm_load_tensors: offloading 40 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 41/41 layers to GPU
llm_load_tensors:        CPU buffer size =  1640.62 MiB
llm_load_tensors:      Metal buffer size = 19281.91 MiB
.......................................................................................
llama_new_context_with_model: n_ctx      = 2048
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: freq_base  = 8000000.0
llama_new_context_with_model: freq_scale = 1
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M3 Pro
ggml_metal_init: picking default device: Apple M3 Pro
ggml_metal_init: using embedded metal library
ggml_metal_init: GPU name:   Apple M3 Pro
ggml_metal_init: GPU family: MTLGPUFamilyApple9  (1009)
ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_init: GPU family: MTLGPUFamilyMetal3  (5001)
ggml_metal_init: simdgroup reduction support   = true
ggml_metal_init: simdgroup matrix mul. support = true
ggml_metal_init: hasUnifiedMemory              = true
ggml_metal_init: recommendedMaxWorkingSetSize  = 28991.03 MB
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size =  2560.00 MiB, (21847.88 / 27648.00)
llama_kv_cache_init:      Metal KV buffer size =  2560.00 MiB
llama_new_context_with_model: KV self size  = 2560.00 MiB, K (f16): 1280.00 MiB, V (f16): 1280.00 MiB
llama_new_context_with_model:        CPU  output buffer size =     1.01 MiB
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size =   516.00 MiB, (22363.88 / 27648.00)
llama_new_context_with_model:      Metal compute buffer size =   516.00 MiB
llama_new_context_with_model:        CPU compute buffer size =    20.01 MiB
llama_new_context_with_model: graph nodes  = 1208
llama_new_context_with_model: graph splits = 2
{"function":"initialize","level":"INFO","line":448,"msg":"initializing slots","n_slots":1,"tid":"0x1f56dbac0","timestamp":1714741287}
{"function":"initialize","level":"INFO","line":460,"msg":"new slot","n_ctx_slot":2048,"slot_id":0,"tid":"0x1f56dbac0","timestamp":1714741287}
{"function":"main","level":"INFO","line":3067,"msg":"model loaded","tid":"0x1f56dbac0","timestamp":1714741287}
{"function":"main","hostname":"127.0.0.1","level":"INFO","line":3270,"msg":"HTTP server listening","n_threads_http":"11","port":"62223","tid":"0x1f56dbac0","timestamp":1714741287}
{"function":"update_slots","level":"INFO","line":1581,"msg":"all slots are idle and system prompt is empty, clear the KV cache","tid":"0x1f56dbac0","timestamp":1714741287}

The calling Python code essentially distills down to:

response = ollama.generate('command-r', system=system, prompt=prompt, keep_alive='1m', stream=False, raw=False)['response']
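For reference, that Python call is roughly equivalent to hitting the Ollama REST API directly on the port shown in the log above; a minimal sketch (the system/prompt strings are placeholders):

curl http://127.0.0.1:11434/api/generate -d '{
  "model": "command-r",
  "system": "You are a helpful assistant.",
  "prompt": "Hello",
  "stream": false,
  "keep_alive": "1m"
}'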

I think the fix will be re-converting and re-quantizing all of these models, which is what the folks in the llama.cpp world are doing now.
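For anyone who wants to do that themselves rather than wait, the workflow in llama.cpp looks roughly like the sketch below. The paths, output names, and quant type are placeholders, and the exact converter script name can vary between llama.cpp versions, so treat this as an outline rather than exact commands:

# from an up-to-date llama.cpp checkout, with the original HF weights on disk
python convert-hf-to-gguf.py /path/to/hf/c4ai-command-r-v01 --outfile command-r-f16.gguf
./quantize command-r-f16.gguf command-r-Q4_0.gguf Q4_0

The resulting GGUF can then be imported into Ollama with a Modelfile (see the comment further down).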

OS

macOS

GPU

Apple

CPU

Apple

Ollama version

0.1.33

@sealad886 sealad886 added the bug Something isn't working label May 3, 2024
@Kalki5

Kalki5 commented May 9, 2024

Any update on this?

@samssausages

I found this post because I'm getting the same message and am trying to find a way to deal with it:

llm_load_vocab: missing pre-tokenizer type, using: 'default'
llm_load_vocab:
llm_load_vocab: ************************************
llm_load_vocab: GENERATION QUALITY WILL BE DEGRADED!
llm_load_vocab: CONSIDER REGENERATING THE MODEL
llm_load_vocab: ************************************
llm_load_vocab:
llm_load_vocab: special tokens definition check successful ( 1008/256000 ).

@dpublic

dpublic commented May 13, 2024

Will the llama.cpp merge ggerganov/llama.cpp#6965 fix this issue?
The llama.cpp commit linked in ollama is dated 4/30, and ggerganov/llama.cpp#6965 was merged into llama.cpp on 5/9.
So it doesn't look like this merge was included in the latest Ollama release, 0.1.37.
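One way to check this locally (assuming you have the ollama source cloned with submodules, and that the llama.cpp submodule still lives at llm/llama.cpp) is to look at the pinned submodule commit and compare its date against the merge date of ggerganov/llama.cpp#6965:

> git submodule status llm/llama.cpp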

@mjtechguy

Curious about this as well. Hopeful that the updated llama.cpp will be merged and the models updated.

@coxfrederic

I'm having the same issue

llm_load_vocab: missing pre-tokenizer type, using: 'default'
llm_load_vocab: GENERATION QUALITY WILL BE DEGRADED!

Does anyone know what can be done about it, or can someone explain the issue to a "newbie" in Ollama / AI?

@jakobthun

Seeing the same message. Running llama3:70b-instruct

@thiago-buarqque

Same here, llama3:8b

@hawat

hawat commented May 17, 2024

The same, using a derivative of llama3:

GENERATION QUALITY WILL BE DEGRADED!
CONSIDER REGENERATING THE MODEL

@sub37

sub37 commented May 18, 2024

llm_load_vocab: missing pre-tokenizer type, using: 'default'
llm_load_vocab:
llm_load_vocab: ************************************
llm_load_vocab: GENERATION QUALITY WILL BE DEGRADED!
llm_load_vocab: CONSIDER REGENERATING THE MODEL
llm_load_vocab: ************************************

@vroomfondel

vroomfondel commented May 25, 2024

Coming from here: https://www.reddit.com/r/LocalLLaMA/comments/1cg0z1i/comment/l1su102/?utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button

This led me to add the attached patch as "llm/patches/06-llama.cpp.diff" and then build ollama (trying to pass the override-kv through from llm/ext_server/server.cpp was a bit tedious, since that override would be of type str, which is not handled in the linked version of llama.cpp [although upstream has a fix for it]).

06-llama_cpp_RENAME_ME.txt

EDIT: Just saw that "llm/patches/05-default-pretokenizer.diff" in v0.1.39 does pretty much the same (and more).
EDIT 2: New patch for v0.1.39 attached:
06-llama_cpp_NEW_RENAME_ME.txt

@sealad886
Author

The crux of the matter is: all models have to be re-converted and then re-quantized. You can dive into the issues/PRs I initially posted to learn more, but that's the super-short version.

Until the underlying llama.cpp base gets updated and all of your models are re-converted, you may be best served by doing pieces of it yourself. It's pretty straightforward if you know a small amount of coding. I'll also note that if you build the ollama executable from the GitHub repo, you get a couple of cool features that aren't released yet (e.g. as of right now, Flash Attention).
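If you want to go the build-from-source route, the short version (a sketch, assuming Go and the usual build tools are installed) is roughly:

> git clone https://github.com/ollama/ollama.git
> cd ollama
> go generate ./...
> go build .

go generate builds the vendored llama.cpp runners (including any patches under llm/patches) before the Go build.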

Follow the instructions here to learn how to import models into Ollama from other formats (including those available on Hugging Face).

Run:

> ollama show --modelfile gemma:instruct   # <modelname> can be any model in your library/cache
# Modelfile generated by "ollama show"
# To build a new Modelfile based on this, replace FROM with:
# FROM gemma:instruct

FROM /Users/andrew/.ollama/models/blobs/sha256-ef311de6af9db043d51ca4b1e766c28e0a1ac41d60420fed5e001dc470c64b77
TEMPLATE "<start_of_turn>user
{{ if .System }}{{ .System }} {{ end }}{{ .Prompt }}<end_of_turn>
<start_of_turn>model
{{ .Response }}<end_of_turn>
"
PARAMETER penalize_newline false
PARAMETER repeat_penalty 1
PARAMETER stop <start_of_turn>
PARAMETER stop <end_of_turn>
LICENSE """Gemma Terms of Use 

Last modified: February 21, 2024

By using, reproducing, modifying, distributing, performing or displaying any portion or element of Gemma, Model Derivatives including via any Hosted Service, (each as defined below) (collectively, the "Gemma Services") or otherwise accepting the terms of this Agreement, you agree to be bound by this Agreement.
<license truncated for ease of reading>

Copy that entire output into your favorite text editor (e.g. nano, vim...), make a new file, and call it literally whatever you want. I have a folder in my home directory that's just random Modelfiles I can use to import small changes quickly (I think it's easier than having to ollama run <model> and then edit params, save, load the next one... but that might just be me).

Now you replace the first line of that file with a path to your converted GGUF file:

FROM /path/to/your/models/model.gguf
TEMPLATE "<start_of_turn>user
{{ if .System }}{{ .System }} {{ end }}{{ .Prompt }}<end_of_turn>
<start_of_turn>model
{{ .Response }}<end_of_turn>
"
PARAMETER penalize_newline false
PARAMETER repeat_penalty 1
PARAMETER stop <start_of_turn>
PARAMETER stop <end_of_turn>
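Save the file (the names below are just examples), then create and run the model with the usual Ollama commands:

> ollama create my-command-r -f ./my-command-r.Modelfile
> ollama run my-command-r

If the re-converted GGUF has the pre-tokenizer metadata set, the "missing pre-tokenizer type" warning should no longer appear when the model loads.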
