
bug: If device layers requested exceed model layers, host layers overflow #329

Open · polarathene opened this issue May 19, 2024 · 11 comments
Labels: bug (Something isn't working), resolved

@polarathene (Contributor) commented May 19, 2024

Describe the bug

If the number of device layers requested exceeds the model's layer count, then the host layer count to assign seems to wrap/overflow instead of being the expected 0.

NOTE: With llama-cpp you can configure a larger number of layers; host layers will remain 0 and only the layers the model actually has are used as device layers.

RUST_BACKTRACE=1 ./mistralrs-bench --num-device-layers 33 gguf -m . -t . -f Hermes-2-Pro-Mistral-7B.Q4_K_M.gguf
# ...
INFO mistralrs_core::device_map: Using 33 layers on device and 18446744073709551615 on host.

thread 'main' panicked at library/alloc/src/raw_vec.rs:25:5:
capacity overflow
stack backtrace:
   0: rust_begin_unwind
   1: core::panicking::panic_fmt
   2: alloc::raw_vec::capacity_overflow
   3: <T as alloc::vec::spec_from_elem::SpecFromElem>::from_elem
   4: mistralrs_core::device_map::DeviceMapMetadata::into_mapper
   5: mistralrs_core::models::quantized_llama::ModelWeights::from_gguf
   6: <mistralrs_core::pipeline::gguf::GGUFLoader as mistralrs_core::pipeline::Loader>::load_model_from_path
   7: <mistralrs_core::pipeline::gguf::GGUFLoader as mistralrs_core::pipeline::Loader>::load_model_from_hf
   8: mistralrs_bench::main
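
For reference, a minimal sketch of what I suspect is happening (hypothetical code, not mistral.rs's actual logic): if the host layer count comes from an unchecked unsigned subtraction, requesting more device layers than the model has wraps to usize::MAX in release builds, which matches the 18446744073709551615 in the log and the capacity-overflow panic when that value is later used to size a Vec.

fn main() {
    let model_layers: usize = 32;
    let requested_device_layers: usize = 33;
    // Release builds wrap on overflow instead of panicking:
    let host_layers = model_layers.wrapping_sub(requested_device_layers);
    println!("{host_layers}"); // 18446744073709551615 (usize::MAX on 64-bit)
    // Something like `vec![(); host_layers]` would then abort with
    // "capacity overflow", as in the backtrace above.
}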

Context:

  • Q4_K_M GGUF model used, but this applies to any model when the requested device layers exceed its actual layer count.
  • Since mistral.rs seems to enforce loading through HuggingFace API calls first, I worked around that by allowing the 401 (Unauthorized) panic as described here. Unlike llama-cpp, additional config files are also enforced... I sourced those from here, but it lacked a tokenizer.json file, so I gave it one from another model (this has no relevance to the error encountered).

Additional feedback

mistral.rs doesn't present information like layer counts as clearly to me as llama-cpp does, and I don't know if there's some sort of inspect command to output/query the metadata?
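
(For what it's worth, something like the sketch below, using candle's GGUF reader, is roughly what I'd hope an inspect command would expose; this is only my assumption of how it could look and I haven't checked mistral.rs's internals.)

use candle_core::quantized::gguf_file;
use std::fs::File;

fn main() -> anyhow::Result<()> {
    // Dump GGUF metadata (e.g. llama.block_count) without loading the model.
    let mut file = File::open("Hermes-2-Pro-Mistral-7B.Q4_K_M.gguf")?;
    let content = gguf_file::Content::read(&mut file)?;
    for (key, value) in &content.metadata {
        println!("{key}: {value:?}");
    }
    Ok(())
}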

I had thought it was 33 layers, but looking over the llama-cpp output again I see it's 32 with an extra layer appended afterwards:

ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4060 Laptop GPU, compute capability 8.9, VMM: yes
llm_load_tensors: ggml ctx size =    0.30 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors:        CPU buffer size =    70.38 MiB
llm_load_tensors:      CUDA0 buffer size =  4095.16 MiB

I find this sort of information quite helpful, so if mistral.rs could communicate that better too that'd be nice 👍

Communicating the device/GPU better, like above, would also be nice compared to what it currently displays:

mistralrs_bench: Loading model `.` on Cuda(CudaDevice(DeviceId(1)))...

Latest commit

v0.1.8: ca9bf7d

polarathene added the bug (Something isn't working) label May 19, 2024

@polarathene (Contributor, Author) commented:

Bench OOMs (while llama-cpp doesn't)

Additionally, when trying the exact layer count, these models (which had no issue with llama-cpp) would OOM with mistral.rs. I was able to avoid that with a lower layer count, but performance suffers notably (anything from 28 layers and above OOMed with mistral.rs).

With 32 layers:

ERROR mistralrs_core::engine: prompt step - Model failed with error: WithBacktrace { inner: Cuda(Cuda(DriverError(CUDA_ERROR_OUT_OF_MEMORY, "out of memory"))), backtrace: Backtrace [{ fn: "candle_core::error::Error::bt" }, { fn: "candle_core::quantized::cuda::QCudaStorage::dequantize_f16" }, { fn: "candle_core::quantized::QTensor::dequantize_f16" }, { fn: "candle_core::quantized::QMatMul::dequantize_f16" }, { fn: "candle_core::quantized::QMatMul::forward_via_f16" }, { fn: "mistralrs_core::layers::MatMul::qmatmul" }, { fn: "mistralrs_core::models::quantized_llama::ModelWeights::forward" }, { fn: "<mistralrs_core::pipeline::gguf::GGUFPipeline as mistralrs_core::pipeline::Pipeline>::forward_inputs" }, { fn: "mistralrs_core::pipeline::Pipeline::step::{{closure}}" }, { fn: "mistralrs_core::engine::Engine::run::{{closure}}" }, { fn: "tokio::runtime::runtime::Runtime::block_on" }, { fn: "std::sys_common::backtrace::__rust_begin_short_backtrace" }, { fn: "core::ops::function::FnOnce::call_once{{vtable.shim}}" }, { fn: "std::sys::pal::unix::thread::Thread::new::thread_start" }] }
thread 'main' panicked at mistralrs-bench/src/main.rs:107:61:
internal error: entered unreachable code
stack backtrace:
   0: rust_begin_unwind
   1: core::panicking::panic_fmt
   2: core::panicking::panic
   3: mistralrs_bench::run_bench
   4: mistralrs_bench::main

Might be partially due to the lack of batching? (-p 4096 OOMs on lower layer counts too; llama-cpp had no issue with -p 4096 -b 512, and it could also handle -b 4096 with 32 GPU layers.)

Performance vs llama-cpp

$ ./mistralrs-bench -p 512 -g 0 -r 1 -c 1 --num-device-layers 24 gguf -m . -t . -f model.gguf

+-------+---------+--------+--------------+--------------+-------------+--------------+
| model | backend | test   | t/s          | ms/t         | concurrency | throughput/s |
+-------+---------+--------+--------------+--------------+-------------+--------------+
| .     | CUDA    | pp 512 | 39.129±0.000 | 25.557±0.000 |           1 |    39.128773 |
+-------+---------+--------+--------------+--------------+-------------+--------------+

vs llama-cpp:

$ ./llama-bench -m ./model.gguf -r 1 -n 0 -p 512 -b 512

| model                          |       size |     params | backend    | ngl |    n_batch |          test |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ---------: | ------------: | ---------------: |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CUDA       |  99 |        512 |         pp512 |   1248.13 ± 0.00 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CUDA       |  99 |        512 |   pp512+tg128 |    169.45 ± 0.00 |

# With matching GPU layers (default is 99):
$ ./llama-bench -m ./model.gguf -r 1 -n 0 -p 512 -b 512 --n-gpu-layers 24

| model                          |       size |     params | backend    | ngl |    n_batch |          test |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ---------: | ------------: | ---------------: |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CUDA       |  24 |        512 |         pp512 |    940.59 ± 0.00 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CUDA       |  24 |        512 |   pp512+tg128 |     81.55 ± 0.00 |

mistral.rs was built in the same Dockerfile that llama-cpp provides; only the cuda feature was enabled when building mistral.rs, as flash attention was failing for some reason and cuDNN wasn't part of the image.

No idea why performance is so vastly different, although it gets somewhat closer with pp512+tg128 (I have no clue what that is or why it's significantly slower vs pp512).

Not sure if the bench settings are equivalent; I referenced this example, which used the above to compare, and I only added the explicit GPU layers since this model OOMs with mistral.rs otherwise.

UPDATE: I observed that during the bench, mistral.rs had 0% GPU usage after warmup. The memory was allocated according to nvidia-smi, but it wasn't putting the GPU to work like llama-cpp was 🤔


Additional failure for Hermes-2-Pro-Mistral-7B.Q4_K_M.gguf

In this case, the model encountered another failure, but I assume it's because of the unrelated tokenizer.json I added, or one of the files from the transformers/fp16 model? (I didn't encounter this when trying other GGUF models, like the Llama 8B above):

INFO mistralrs_core::device_map: Using 32 layers on device and 0 on host.
thread 'main' panicked at mistralrs-core/src/pipeline/gguf.rs:458:41:
called `Result::unwrap()` on an `Err` value: Error("duplicate field `clean_up_tokenization_spaces`", line: 290, column: 32)
stack backtrace:
   0: rust_begin_unwind
   1: core::panicking::panic_fmt
   2: core::result::unwrap_failed
   3: <mistralrs_core::pipeline::gguf::GGUFLoader as mistralrs_core::pipeline::Loader>::load_model_from_path
   4: <mistralrs_core::pipeline::gguf::GGUFLoader as mistralrs_core::pipeline::Loader>::load_model_from_hf
   5: mistralrs_bench::main

@EricLBuehler (Owner) commented:

Hi @polarathene!

Thank you for reporting this bug. I think that the UX part of device mapping isn't great currently, and hopefully #332 will improve that. Additionally, device mapping is not really optimized, and essentially consists of loading layers on different devices and then running them there. We copy the hidden states from device to host, which is probably the bottleneck as it requires a GPU-CPU synchronization.
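
Conceptually it's something like this in candle terms (illustrative only, not our actual code path):

use candle_core::{DType, Device, Tensor};

// Illustrative only: layers live on different devices and the hidden states
// are copied between them, which forces a GPU<->CPU synchronization.
fn forward_split(hidden: Tensor, gpu: &Device, cpu: &Device) -> candle_core::Result<Tensor> {
    let h = hidden.to_device(gpu)?; // run the device-mapped layers here
    // ... GPU layers ...
    let h = h.to_device(cpu)?; // copy back for the host layers (sync point)
    // ... CPU layers ...
    Ok(h)
}

fn main() -> candle_core::Result<()> {
    let cpu = Device::Cpu;
    let x = Tensor::zeros((1, 4), DType::F32, &cpu)?;
    // Using the CPU for both "devices" just to keep the example runnable anywhere.
    let _ = forward_split(x, &cpu, &cpu)?;
    Ok(())
}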

mistral.rs was built in the same Dockerfile that llama-cpp provides; only the cuda feature was enabled when building mistral.rs, as flash attention was failing for some reason and cuDNN wasn't part of the image.

Can you please confirm that it prints it is loading on device and host (so we know it is loading on the GPU)? #332 should make that clearer.

Not sure if the bench settings are equivalent; I referenced #153 (comment), which used the above to compare, and I only added the explicit GPU layers since this model OOMs with mistral.rs otherwise.

I think it should be. We don't have prompt chunking yet, and that's something I want to implement. The infrastructure is there; it was implemented by #242.

I observed that during the bench, mistral.rs had 0% GPU usage after warmup. The memory was allocated according to nvidia-smi, but it wasn't putting the GPU to work like llama-cpp was 🤔

I'm not sure why this is happening. If you use device mapping, it definitely uses the GPU for calculations, though. I'll take a look.

In this case, the model encountered another failure, but I assume it's because of the unrelated tokenizer.json I added, or one of the files from the transformers/fp16 model? (I didn't encounter this when trying other GGUF models, like the Llama 8B above):

I checked the tokenizer_config.json file, and there is actually a mistake in there. It may be best to raise an issue with their repo, because they do indeed define it twice. In the meantime, if you download it locally, you can just remove one of them.
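
The error itself should come from serde's derived deserializer, which rejects duplicate keys; a standalone repro (not our pipeline code):

use serde::Deserialize;

#[derive(Deserialize)]
#[allow(dead_code)]
struct TokenizerConfig {
    clean_up_tokenization_spaces: bool,
}

fn main() {
    let json = r#"{
        "clean_up_tokenization_spaces": false,
        "clean_up_tokenization_spaces": false
    }"#;
    let err = serde_json::from_str::<TokenizerConfig>(json).unwrap_err();
    println!("{err}"); // duplicate field `clean_up_tokenization_spaces`
}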

@EricLBuehler (Owner) commented:

@polarathene , this should be fixed now.

@polarathene (Contributor, Author) commented:

Thanks, I'll give it a try within the next few days 👍

Can you please confirm that it prints it is loading on device and host (so we know it is loading on the GPU)? #332 should make that clearer.

I did share this log line above, at the end of my last response:

INFO mistralrs_core::device_map: Using 32 layers on device and 0 on host.

This will need to wait until later, but I do know that my VRAM went up notably while the bench command was running, and as per the logs it had found the GPU just fine, so that was all working. It just wasn't utilizing the GPU for the benching for some reason.

I'm not sure why this is happening. If you use device mapping, it definitely uses the GPU for calculations, though. I'll take a look.

It may have been something specific to the Docker or WSL2 environment, when I have the time I'll look into better reproducing and can verify with builds on WSL2 and the Windows 11 host to see if it still occurs in those environments too.

I checked the tokenizer_config.json file, and there is actually a mistake in there.

Oh OK, well if it hasn't already been tackled, a more helpful error about that might be better UX? It wasn't clear that it was referring to a single file input.


If you don't need a response and consider this solved, feel free to close; otherwise I'll do that later this week.

@EricLBuehler (Owner) commented:

Oh OK, well if it hasn't already been tackled, a more helpful error about that might be better UX? It wasn't clear that it was referring to a single file input.

Yes, this error can be improved. Generally, we should start to remove some of the .unwrap()s related to serialization.
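
For example, instead of unwrapping the serde result we could attach context and propagate the error (a rough sketch with hypothetical names, not a committed design):

use anyhow::Context;
use serde::Deserialize;

#[derive(Deserialize)]
#[allow(dead_code)]
struct TokenizerConfig {
    #[serde(default)]
    clean_up_tokenization_spaces: bool,
}

fn load_tokenizer_config(path: &std::path::Path) -> anyhow::Result<TokenizerConfig> {
    let raw = std::fs::read_to_string(path)
        .with_context(|| format!("failed to read `{}`", path.display()))?;
    serde_json::from_str(&raw)
        .with_context(|| format!("failed to parse `{}` as tokenizer_config.json", path.display()))
}

fn main() {
    if let Err(e) = load_tokenizer_config(std::path::Path::new("tokenizer_config.json")) {
        eprintln!("{e:#}"); // the error now names the file and the failing step
    }
}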

I'll wait until you think it's good to close this issue.

@EricLBuehler (Owner) commented:

@polarathene, does this overflow error still occur?

@polarathene (Contributor, Author) commented:

@polarathene, does this overflow error still occur?

The CLI still won't load files locally at the moment; I'd have to do another build that bypasses the 401. I think I'll confirm once the CLI works without that workaround, which is probably more meaningful 😅

@polarathene (Contributor, Author) commented:

@polarathene, does this overflow error still occur?

It still fails when the given value exceeds the model's layer count, whereas it should just recognize it has enough and clamp to the amount needed? llama-cpp seems to take that approach, I think.

$ /mist/target/release/mistralrs-bench -p 512 -g 0 -r 1 -c 1 --num-device-layers 40 gguf -m . -f Hermes-2-Pro-Mistral-7B.Q4_K_M.gguf

2024-05-31T00:27:10.222222Z  INFO mistralrs_core::pipeline::gguf: Model config:
general.architecture: llama
general.file_type: 15
general.name: jeffq
general.quantization_version: 2
llama.attention.head_count: 32
llama.attention.head_count_kv: 8
llama.attention.layer_norm_rms_epsilon: 0.00001
llama.block_count: 32
llama.context_length: 32768
llama.embedding_length: 4096
llama.feed_forward_length: 14336
llama.rope.dimension_count: 128
llama.rope.freq_base: 10000
2024-05-31T00:27:10.259228Z  INFO mistralrs_core::pipeline::gguf_tokenizer: GGUF tokenizer model is `llama`, kind: `unigram`, num tokens: 32032, num added tokens: 0, num merges: 0, num scores: 32032
Error: Expected the number of GPU (40) and host layers (0) to sum to the number of model hidden layers (32)
   0: candle_core::error::Error::bt
   1: mistralrs_core::device_map::DeviceMapMetadata::into_mapper
   2: mistralrs_core::models::quantized_llama::ModelWeights::from_gguf
   3: <mistralrs_core::pipeline::gguf::GGUFLoader as mistralrs_core::pipeline::Loader>::load_model_from_path
   4: <mistralrs_core::pipeline::gguf::GGUFLoader as mistralrs_core::pipeline::Loader>::load_model_from_hf
   5: mistralrs_bench::main
   6: std::sys_common::backtrace::__rust_begin_short_backtrace
   7: std::rt::lang_start::{{closure}}
   8: std::rt::lang_start_internal
   9: main
  10: <unknown>
  11: __libc_start_main
  12: _start


Stack backtrace:
   0: anyhow::error::<impl core::convert::From<E> for anyhow::Error>::from
   1: <mistralrs_core::pipeline::gguf::GGUFLoader as mistralrs_core::pipeline::Loader>::load_model_from_path
   2: <mistralrs_core::pipeline::gguf::GGUFLoader as mistralrs_core::pipeline::Loader>::load_model_from_hf
   3: mistralrs_bench::main
   4: std::sys_common::backtrace::__rust_begin_short_backtrace
   5: std::rt::lang_start::{{closure}}
   6: std::rt::lang_start_internal
   7: main
   8: <unknown>
   9: __libc_start_main
  10: _start

It also seems to complain about tokenizer_config.json, followed by tokenizer.json, unless I use ... gguf -t /models/Hermes-2-Pro-Mistral-7B.Q4_K_M/?

Is the bench utility not in sync with the server command for local GGUF loading with the tokenizer?

$ RUST_BACKTRACE=1 target/release/mistralrs-bench -p 512 -g 0 -r 1 -c 1 --num-device-layers 40 gguf -m . -f /models/Hermes-2-Pro-Mistral-7B.Q4_K_M/Hermes-2-Pro-Mistral-7B.Q4_K_M.gguf

thread 'main' panicked at mistralrs-core/src/pipeline/gguf.rs:282:58:
File "tokenizer_config.json" not found at model id "."
stack backtrace:
   0: rust_begin_unwind
   1: core::panicking::panic_fmt
   2: <mistralrs_core::pipeline::gguf::GGUFLoader as mistralrs_core::pipeline::Loader>::load_model_from_hf
   3: mistralrs_bench::main

@polarathene (Contributor, Author) commented:

When providing the exact layer count it loads into VRAM, but then fails after hitting the earlier-mentioned duplicate field for this model:

$ RUST_BACKTRACE=1 /mist/target/release/mistralrs-bench -p 512 -g 0 -r 1 -c 1 --num-device-layers 32 gguf -m . -f Hermes-2-Pro-Mistral-7B.Q4_K_M.gguf

2024-05-31T00:32:56.871358Z  INFO mistralrs_core::pipeline::gguf: Model config:
general.architecture: llama
general.file_type: 15
general.name: jeffq
general.quantization_version: 2
llama.attention.head_count: 32
llama.attention.head_count_kv: 8
llama.attention.layer_norm_rms_epsilon: 0.00001
llama.block_count: 32
llama.context_length: 32768
llama.embedding_length: 4096
llama.feed_forward_length: 14336
llama.rope.dimension_count: 128
llama.rope.freq_base: 10000
2024-05-31T00:32:56.909242Z  INFO mistralrs_core::pipeline::gguf_tokenizer: GGUF tokenizer model is `llama`, kind: `unigram`, num tokens: 32032, num added tokens: 0, num merges: 0, num scores: 32032
2024-05-31T00:32:57.056662Z  INFO mistralrs_core::device_map: Model has 32 repeating layers.
2024-05-31T00:32:57.056724Z  INFO mistralrs_core::device_map: Using 32 repeating layers on GPU and 0 repeating layers on host.
thread 'main' panicked at mistralrs-core/src/pipeline/mod.rs:1323:80:
called `Result::unwrap()` on an `Err` value: Error("duplicate field `clean_up_tokenization_spaces`", line: 290, column: 32)
stack backtrace:
   0: rust_begin_unwind
   1: core::panicking::panic_fmt
   2: core::result::unwrap_failed
   3: mistralrs_core::pipeline::get_chat_template
   4: <mistralrs_core::pipeline::gguf::GGUFLoader as mistralrs_core::pipeline::Loader>::load_model_from_path
   5: <mistralrs_core::pipeline::gguf::GGUFLoader as mistralrs_core::pipeline::Loader>::load_model_from_hf
   6: mistralrs_bench::main

After editing tokenizer_config.json to remove the 2nd occurrence ("clean_up_tokenization_spaces": false,) it'll run:

$ /mist/target/release/mistralrs-bench -p 512 -g 0 -r 1 -c 1 --num-device-layers 32 gguf -m . -f Hermes-2-Pro-Mistral-7B.Q4_K_M.gguf

2024-05-31T01:08:36.913971Z  INFO mistralrs_core::pipeline::gguf: Model config:
general.architecture: llama
general.file_type: 15
general.name: jeffq
general.quantization_version: 2
llama.attention.head_count: 32
llama.attention.head_count_kv: 8
llama.attention.layer_norm_rms_epsilon: 0.00001
llama.block_count: 32
llama.context_length: 32768
llama.embedding_length: 4096
llama.feed_forward_length: 14336
llama.rope.dimension_count: 128
llama.rope.freq_base: 10000
2024-05-31T01:08:36.942180Z  INFO mistralrs_core::pipeline::gguf_tokenizer: GGUF tokenizer model is `llama`, kind: `unigram`, num tokens: 32032, num added tokens: 0, num merges: 0, num scores: 32032
2024-05-31T01:08:37.062847Z  INFO mistralrs_core::device_map: Model has 32 repeating layers.
2024-05-31T01:08:37.062905Z  INFO mistralrs_core::device_map: Using 32 repeating layers on GPU and 0 repeating layers on host.
2024-05-31T01:08:38.552757Z  INFO mistralrs_core::pipeline::chat_template: bos_toks = "<s>", eos_toks = "<|im_end|>", "<|im_end|>", unk_tok = <unk>
2024-05-31T01:08:38.558662Z  INFO mistralrs_bench: Model loaded.
2024-05-31T01:08:38.575306Z  INFO mistralrs_core::cublaslt: Initialized cuBLASlt handle
2024-05-31T01:08:38.575453Z  INFO mistralrs_bench: Starting warmup run.
2024-05-31T01:08:38.876483Z  INFO mistralrs_bench: Finished warmup run.
2024-05-31T01:08:38.876528Z  INFO mistralrs_bench: Starting benchmarks.
+-------+---------+--------+----------------+-------------+-------------+--------------+
| model | backend | test   | t/s            | ms/t        | concurrency | throughput/s |
+-------+---------+--------+----------------+-------------+-------------+--------------+
| .     | CUDA    | pp 512 | 1064.449±0.000 | 0.939±0.000 |           1 |    1064.4491 |
+-------+---------+--------+----------------+-------------+-------------+--------------+

Result: 1064.44 (mistral-rs) vs 1248.13 (llama-cpp) token/sec 👍

The bench went quite quickly; I'm not quite sure how to compare it against a longer llama-cpp bench I did, but I think that's due to mistral.rs not being able to match some of these options:

llama-bench -m /tmp/model.gguf -r 1 -fa 0,1 -n 0 -pg 0,0 -p 4096 -b 64,128,256,512,1024,2048,4096

@EricLBuehler (Owner) commented:

@polarathene I just merged #367 which clamps the number of device layers to the number of model layers. I think the functionality should work fully now?
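
Roughly, the clamping behaves like this (a simplified sketch, not the exact code from #367):

fn split_layers(model_layers: usize, requested_device_layers: usize) -> (usize, usize) {
    // Clamp the request to the model's actual layer count so the host side
    // can never underflow.
    let device = requested_device_layers.min(model_layers);
    (device, model_layers - device)
}

fn main() {
    assert_eq!(split_layers(32, 40), (32, 0)); // request exceeds the model: clamped
    assert_eq!(split_layers(32, 24), (24, 8)); // normal split
}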

@polarathene (Contributor, Author) commented Jun 1, 2024:

I think the functionality should work fully now?

Yes, I can provide a higher number than needed and no failures now 👍


I did go down a rabbit hole thinking a regression was introduced due to performance being halved, but that seems to be a WSL2 quirk with weird memory management from when I tested models that my system couldn't handle 😓

NOTE: mistralrs-bench seems to still require the additional files for GGUF to load, unlike mistralrs-server?

  • I thought to try a plain safetensors model, but this gave the 401 Unauthorized; despite the other config files being present, it tried to access HF for the local model name, which was output.safetensors.
  • I also noticed an issue with -m . that might need to be better documented (although technically it shouldn't be relevant for local GGUF support, once this is addressed?):
  • A recently introduced change prevents using models from a read-only volume: mistral.rs will attempt to write a tokenizer.json file.
  • Even when using -t . for a local tokenizer_config.json, if a chat_template is defined it is no longer detected, and it is logged that one doesn't exist (at least when run via mistralrs-bench).

The -m value needs to be the absolute/relative path to the model location referenced in -f? (EDIT: This doesn't affect mistralrs-server, so mistralrs-bench just needs the same local GGUF support):

$ target/release/mistralrs-bench -p 512 -g 0 -r 1 -c 1 --num-device-layers 40 gguf -m . -f /models/Hermes-2-Pro-Mistral-7B.Q4_K_M/Hermes-2-Pro-Mistral-7B.Q4_K_M.gguf

thread 'main' panicked at mistralrs-core/src/pipeline/gguf.rs:294:58:
File "tokenizer_config.json" not found at model id "."

# File does exist:
$ ls /models/Hermes-2-Pro-Mistral-7B.Q4_K_M/

Hermes-2-Pro-Mistral-7B.Q4_K_M.gguf  added_tokens.json  config.json  generation_config.json  special_tokens_map.json  tokenizer.model  tokenizer_config.json

Required:

# `-m` must point to directory of `-f`:
target/release/mistralrs-bench -p 512 -g 0 -r 1 -c 1 --num-device-layers 40 gguf -m /models/Hermes-2-Pro-Mistral-7B.Q4_K_M/ -f Hermes-2-Pro-Mistral-7B.Q4_K_M.gguf

# Unless you run the command from that location:
cd /models/Hermes-2-Pro-Mistral-7B.Q4_K_M
/app/target/release/mistralrs-bench -p 512 -g 0 -r 1 -c 1 --num-device-layers 40 gguf -m . -f /models/Hermes-2-Pro-Mistral-7B.Q4_K_M/Hermes-2-Pro-Mistral-7B.Q4_K_M.gguf

Probably the same for -t if not using the new GGUF support that permits omitting -t.


Collapsed - Resolved temporary regression

UPDATE: While writing this response, bench performance returned to being in line with prior results; I've no idea what changed.

  • I can only assume it's due to WSL2 and Windows memory management, since prior to running this bench I had the system under 99% memory load a few minutes earlier.
  • I've noticed that when WSL2 reads any file data it will keep memory allocated for a while even though it's only the file cache/buffer, and Windows seems to additionally write a similar amount of data to disk that isn't released until it's freed within WSL2 (and it's released slowly). That is probably the cause of the performance skew.

Original response follows.

Additionally, I'm not sure why, but the bench performance is half of what I reported previously:

$ mistralrs-bench -f Hermes-2-Pro-Mistral-7B.Q4_K_M.gguf \
  -r 1 -c 1 -p 512 -g 0 gguf -m . -t .

+-------+---------+--------+-----------------+-------------+-------------+--------------+
| model | backend | test   | t/s             | ms/t        | concurrency | throughput/s |
+-------+---------+--------+-----------------+-------------+-------------+--------------+
| .     | CUDA    | pp 512 | 614.646±0.000   | 1.627±0.000 |           1 |     614.6458 |
+-------+---------+--------+-----------------+-------------+-------------+--------------+
$ llama-bench -m Hermes-2-Pro-Mistral-7B.Q4_K_M.gguf  \
  -r 1 -p 512 -n 0 -b 512 -pg 0,0 -fa 0,1

ggml_cuda_init: GGML_CUDA_FORCE_MMQ:   no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4060 Laptop GPU, compute capability 8.9, VMM: yes
| model                          |       size |     params | backend    | ngl |    n_batch |         fa |          test |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ---------: | ---------: | ------------: | ---------------: |
| llama 7B Q4_K - Medium         |   4.07 GiB |     7.24 B | CUDA       |  99 |        512 |          0 |         pp512 |   1313.10 ± 0.00 |
| llama 7B Q4_K - Medium         |   4.07 GiB |     7.24 B | CUDA       |  99 |        512 |          1 |         pp512 |   1344.06 ± 0.00 |

For reference llama-bench options and defaults (in case any implicit config is relevant):

options:
  -h, --help
  -m, --model <filename>              (default: models/7B/ggml-model-q4_0.gguf)
  -p, --n-prompt <n>                  (default: 512)
  -n, --n-gen <n>                     (default: 128)
  -pg <pp,tg>                         (default: 512,128)
  -b, --batch-size <n>                (default: 2048)
  -ub, --ubatch-size <n>              (default: 512)
  -ctk, --cache-type-k <t>            (default: f16)
  -ctv, --cache-type-v <t>            (default: f16)
  -t, --threads <n>                   (default: 8)
  -ngl, --n-gpu-layers <n>            (default: 99)
  -sm, --split-mode <none|layer|row>  (default: layer)
  -mg, --main-gpu <i>                 (default: 0)
  -nkvo, --no-kv-offload <0|1>        (default: 0)
  -fa, --flash-attn <0|1>             (default: 0)
  -mmp, --mmap <0|1>                  (default: 1)
  --numa <distribute|isolate|numactl> (default: disabled)
  -embd, --embeddings <0|1>           (default: 0)
  -ts, --tensor-split <ts0/ts1/..>    (default: 0)
  -r, --repetitions <n>               (default: 5)
  -o, --output <csv|json|md|sql>      (default: md)
  -v, --verbose                       (default: 0)

vs current options for mistralrs-bench:

Options:
  -p, --n-prompt <N_PROMPT>
          Number of prompt tokens to run [default: 512]
  -g, --n-gen <N_GEN>
          Number of generations tokens to run [default: 128]
  -c, --concurrency <CONCURRENCY>
          Number of concurrent requests to run. Default is 1
  -r, --repetitions <REPETITIONS>
          Number of times to repeat each test [default: 5]
  -n, --num-device-layers <NUM_DEVICE_LAYERS>
          Number of device layers to load and run on the device. All others will be on the CPU

I tried to see if this was a regression by reverting to an earlier commit, but prior to your commit for the GGUF tokenizer I can't load that Hermes model with a tokenizer.json; or rather, I had already discarded the one I used and can't recall where I sourced a tokenizer.json that may have worked 🤷‍♂️

The author of the model states they don't know how to create a tokenizer.json.

Command + Additional info

All commands output the same model config lines, so I'll omit that from the output examples:

INFO mistralrs_core::pipeline::gguf: Model config:
general.architecture: llama
general.file_type: 15
general.name: jeffq
general.quantization_version: 2
llama.attention.head_count: 32
llama.attention.head_count_kv: 8
llama.attention.layer_norm_rms_epsilon: 0.00001
llama.block_count: 32
llama.context_length: 32768
llama.embedding_length: 4096
llama.feed_forward_length: 14336
llama.rope.dimension_count: 128
llama.rope.freq_base: 10000

The command is the same for each, with the exception of appending -t . for the 2nd run. I use the absolute path to the command and run it from the model directory.

# Short version:
mistralrs-bench -n 32 -p 512 -g 0 -r 1 -c 1 gguf -m . -t . -f Hermes-2-Pro-Mistral-7B.Q4_K_M.gguf

# Options expanded for clarity:
/app/target/release/mistralrs-bench \
    --num-device-layers 32 \
    --n-prompt 512 \
    --n-gen 0 \
    --repetitions 1 \
    --concurrency 1 \
    gguf -f Hermes-2-Pro-Mistral-7B.Q4_K_M.gguf -m . -t .

Master branch (currently commit: Prepare to accept multiple model types)

Without -t:

INFO mistralrs_core::pipeline::gguf_tokenizer: GGUF tokenizer model is `llama`, kind: `unigram`, num tokens: 32032, num added tokens: 0, num merges: 0, num scores: 32032

INFO mistralrs_core::device_map: Model has 32 repeating layers.
INFO mistralrs_core::device_map: Using 32 repeating layers on GPU and 0 repeating layers on host.
INFO mistralrs_core::pipeline::paths: `tokenizer_config.json` does not contain a chat template, attempting to use specified JINJA chat template.
INFO mistralrs_core::pipeline::paths: No specified chat template. No chat template will be used. Only prompts will be accepted, not messages.
INFO mistralrs_core::pipeline::chat_template: bos_toks = "<s>", eos_toks = "<|im_end|>", "<|im_end|>", unk_tok = <unk>
INFO mistralrs_bench: Model loaded.
INFO mistralrs_core: GEMM reduced precision in BF16 not supported.
INFO mistralrs_core::cublaslt: Initialized cuBLASlt handle
INFO mistralrs_bench: Starting warmup run.
INFO mistralrs_bench: Finished warmup run.
INFO mistralrs_bench: Starting benchmarks.

With -t .:

INFO mistralrs_core::device_map: Model has 32 repeating layers.
INFO mistralrs_core::device_map: Using 32 repeating layers on GPU and 0 repeating layers on host.
INFO mistralrs_core::pipeline::paths: `tokenizer_config.json` does not contain a chat template, attempting to use specified JINJA chat template.
INFO mistralrs_core::pipeline::paths: No specified chat template. No chat template will be used. Only prompts will be accepted, not messages.
INFO mistralrs_core::pipeline::chat_template: bos_toks = "<s>", eos_toks = "<|im_end|>", "<|eot_id|>", unk_tok = <unk>

thread 'main' panicked at mistralrs-core/src/pipeline/chat_template.rs:157:36:
Unable to extract `<|im_end|>` EOS token.

Slight difference in the log line: eos_toks = "<|im_end|>", "<|eot_id|>". The initial log line before "Model has 32 repeating layers" is also missing, which I assume came from the default GGUF tokenizer commit.

tokenizer_config.json does contain a chat template though (the same content you had previously suggested providing as a separate file for mistralrs-server --chat-template):

$ grep chat_template tokenizer_config.json

  "chat_template": "{{bos_token}}{% for message in messages %}{{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant\n' }}{% endif %}",

Before commit Allow default unigram unk token for GGUF

Built with commit Fix unauth check (#362).

Without -t:

thread 'main' panicked at mistralrs-core/src/pipeline/gguf_tokenizer.rs:65:31:
no entry found for key

stack backtrace:
   0: rust_begin_unwind
   1: core::panicking::panic_fmt
   2: core::panicking::panic_display
   3: core::option::expect_failed
   4: <mistralrs_core::pipeline::gguf::GGUFLoader as mistralrs_core::pipeline::Loader>::load_model_from_path
   5: <mistralrs_core::pipeline::gguf::GGUFLoader as mistralrs_core::pipeline::Loader>::load_model_from_hf
   6: mistralrs_bench::main

With -t .:

INFO mistralrs_core::device_map: Model has 32 repeating layers.
INFO mistralrs_core::device_map: Using 32 repeating layers on GPU and 0 repeating layers on host.
INFO mistralrs_core::pipeline::chat_template: bos_toks = "<s>", eos_toks = "<|im_end|>", "<|eot_id|>", unk_tok = <unk>

thread 'main' panicked at mistralrs-core/src/pipeline/chat_template.rs:155:36:
Unable to extract `<|im_end|>` EOS token.

Same failure as the master branch, but no log about the chat_template was raised.

Side-note - Not compatible with read-only locations

I also noticed it does not like running from the read-only Docker volume I use to access my models, as it apparently wants to write something...? (EDIT: Detailed below)

For these reproductions, I thus made a copy of the model within the container since I can't opt-out of the enforced write operation.

# This is from master branch

thread 'main' panicked at mistralrs-core/src/utils/tokenizer.rs:40:47:
called `Result::unwrap()` on an `Err` value: Os { code: 30, kind: ReadOnlyFilesystem, message: "Read-only file system" }
stack backtrace:
   0: rust_begin_unwind
   1: core::panicking::panic_fmt
   2: core::result::unwrap_failed
   3: mistralrs_core::utils::tokenizer::get_tokenizer
   4: <mistralrs_core::pipeline::gguf::GGUFLoader as mistralrs_core::pipeline::Loader>::load_model_from_path
   5: <mistralrs_core::pipeline::gguf::GGUFLoader as mistralrs_core::pipeline::Loader>::load_model_from_hf
   6: mistralrs_bench::main
Figured out why (`added_tokens[].content` is added as a vocab key mapping to its associated `id`)

let fixed_path = format!("{}_mistralrs_fixed", p.as_ref().display());
let fixed_path = Path::new(&fixed_path);

let raw_fixed = serde_json::to_vec_pretty(&tokenizer).unwrap();
std::fs::write(fixed_path, raw_fixed).unwrap();

Seems to be applied unconditionally to satisfy a potential problem: huggingface/tokenizers#1528

Should probably be a CLI option, or even a subcommand to fix a local config if necessary?

  • I see that it's not run again if the file exists already (not sure how compatible that is when a model updates though?).
  • The filename is not handled properly; .json should be the extension, but presently it's tokenizer.json_mistralrs_fixed when it should instead be more like tokenizer.patched_by_mistralrs.json, or tokenizer.json_mistralrs_fixed.json if you must keep the tokenizer.json prefix (see the sketch after this list).
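
Something along these lines would keep the extension intact (just a sketch of the naming I'd expect; fixed_tokenizer_path is a made-up helper, not existing code):

use std::path::{Path, PathBuf};

// Hypothetical helper: keep `.json` as the extension rather than appending a
// suffix after it.
fn fixed_tokenizer_path(original: &Path) -> PathBuf {
    let stem = original
        .file_stem()
        .and_then(|s| s.to_str())
        .unwrap_or("tokenizer");
    original.with_file_name(format!("{stem}_mistralrs_fixed.json"))
}

fn main() {
    let p = Path::new("/models/foo/tokenizer.json");
    assert_eq!(
        fixed_tokenizer_path(p),
        Path::new("/models/foo/tokenizer_mistralrs_fixed.json")
    );
}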

head -n 30 tokenizer.json:

{
  "version": "1.0",
  "truncation": null,
  "padding": null,
  "added_tokens": [
    {
      "id": 0,
      "content": "<unk>",
      "single_word": false,
      "lstrip": false,
      "rstrip": false,
      "normalized": false,
      "special": true
    },
    {
      "id": 1,
      "content": "<s>",
      "single_word": false,
      "lstrip": false,
      "rstrip": false,
      "normalized": false,
      "special": true
    },
    {
      "id": 2,
      "content": "</s>",
      "single_word": false,
      "lstrip": false,
      "rstrip": false,
      "normalized": false,

head -n 30 tokenizer.json_mistralrs_fixed:

{
  "added_tokens": [
    {
      "content": "<|begin_of_text|>",
      "id": 128000,
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    {
      "content": "<|end_of_text|>",
      "id": 128001,
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    {
      "content": "<|reserved_special_token_0|>",
      "id": 128002,
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    {
$ wc -l tokenizer.json tokenizer.json_mistralrs_fixed
   91121 tokenizer.json
  410759 tokenizer.json_mistralrs_fixed

$ jq '.added_tokens[].id' tokenizer.json
0
1
2

# Original ID not present:
$ jq '.added_tokens[].id' tokenizer.json_mistralrs_fixed
128000
128001
# ...
128255

# ID of the token with content '<unk>':
$ jq '.added_tokens[] | select(.content == "<unk>") | .id' tokenizer.json
0
# References:
$ grep '<unk>' tokenizer.json
      "content": "<unk>",
    "unk_token": "<unk>",
      "<unk>": 0,

# Token doesn't exist anymore (neither produce output):
$ jq '.added_tokens[] | select(.content == "<unk>") | .id' tokenizer.json_mistralrs_fixed
$ grep '<unk>' tokenizer.json_mistralrs_fixed

So I'm not quite sure what this similarly named file is doing; is it intended to remove data that was there?

I tried a tokenizer.json from a Llama 3 based model with an older commit, before the logic above was implemented. Even though it was not compatible with the Hermes model and resulted in the same failure, it did output these warning log lines:

WARN tokenizers::tokenizer::serialization: Warning: Token '<|begin_of_text|>' was expected to have ID '128000' but was given ID 'None'
WARN tokenizers::tokenizer::serialization: Warning: Token '<|end_of_text|>' was expected to have ID '128001' but was given ID 'None'
WARN tokenizers::tokenizer::serialization: Warning: Token '<|reserved_special_token_0|>' was expected to have ID '128002' but was given ID 'None'
WARN tokenizers::tokenizer::serialization: Warning: Token '<|reserved_special_token_1|>' was expected to have ID '128003' but was given ID 'None'
WARN tokenizers::tokenizer::serialization: Warning: Token '<|reserved_special_token_2|>' was expected to have ID '128004' but was given ID 'None'
WARN tokenizers::tokenizer::serialization: Warning: Token '<|reserved_special_token_3|>' was expected to have ID '128005' but was given ID 'None'
WARN tokenizers::tokenizer::serialization: Warning: Token '<|start_header_id|>' was expected to have ID '128006' but was given ID 'None'
WARN tokenizers::tokenizer::serialization: Warning: Token '<|end_header_id|>' was expected to have ID '128007' but was given ID 'None'
WARN tokenizers::tokenizer::serialization: Warning: Token '<|reserved_special_token_4|>' was expected to have ID '128008' but was given ID 'None'
WARN tokenizers::tokenizer::serialization: Warning: Token '<|eot_id|>' was expected to have ID '128009' but was given ID 'None'

Which makes sense for the error now, I guess 😅 The tokenizer.json files I've been randomly trying from HF had the other mentioned tokens (bos_token / unk_token), but since <|im_end|> wasn't present the failure shouldn't be a surprise 😑

# No output
$ grep 'im_end' tokenizer.json

So on the tokenizer.json that was emitting those warnings, we can look up the entry for the last warning, <|eot_id|>; the id field is what was expected, so I'm not sure why the warning says it got None 🤔

$ jq '.added_tokens[] | select(.content == "<|eot_id|>") | .' tokenizer.json
{
  "id": 128009,
  "content": "<|eot_id|>",
  "single_word": false,
  "lstrip": false,
  "rstrip": false,
  "normalized": false,
  "special": true
}

EDIT: I should have read the utils/tokenizer.rs code; the mapping of added_tokens[].content vocab keys to their respective id was missing:

for token in added_tokens {
    if !vocab.contains_key(&token.content) {
        tokenizer["model"]["vocab"]
            .as_object_mut()
            .unwrap()
            .insert(token.content, token.id.into())
            .ok_or(())
            .unwrap_err();
    }
}

I still don't think it should need to enforce the write as a way to cache, though?
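
For example, the patched JSON could be fed straight to the tokenizers crate in memory, so nothing is written next to a (possibly read-only) model directory. A rough sketch of what I mean, with the actual vocab patching elided; I haven't checked this against the real get_tokenizer code:

use tokenizers::Tokenizer;

fn tokenizer_from_patched_json(raw: &str) -> anyhow::Result<Tokenizer> {
    let mut value: serde_json::Value = serde_json::from_str(raw)?;
    // ... insert any missing `added_tokens[].content` -> `id` entries into
    //     value["model"]["vocab"] here, as the loop above does ...
    let _ = &mut value; // placeholder for the patch step
    let fixed = serde_json::to_vec(&value)?;
    Tokenizer::from_bytes(&fixed).map_err(|e| anyhow::anyhow!("{e}"))
}

fn main() -> anyhow::Result<()> {
    let raw = std::fs::read_to_string("tokenizer.json")?;
    let tokenizer = tokenizer_from_patched_json(&raw)?;
    println!("vocab size: {}", tokenizer.get_vocab_size(true));
    Ok(())
}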

In this case, perhaps due to the different layout (tokenizer.model instead of tokenizer.json), the required tokens were defined in a separate added_tokens.json file. Presumably when constructing a tokenizer.json it would get the tokens from that?

{
  "<|im_end|>": 32000,
  "<|im_start|>": 32001,
  // ...
}

Git history on master branch

I went back a bit further, to when I would get the 401 unauthorized issue, and applied the single-line opt-out workaround I shared at the end of this comment.

At this point I realized that I was selecting a commit that wasn't technically part of the master branch; it was interleaved with commits from other branches that were merged into master. So navigating the branch history via GitHub's web UI to select commits isn't that helpful 😓

So the main branch history has stuff like this (commits that are noisy in history, rather than meaningful):

[screenshot: merge commits from other branches interleaved in the master branch history]

EDIT: Seems like you've recently switched to using squash merge since #362 😎

Anyway, I went further back, before gguf_tokenizer.rs was introduced, and still had issues with tokenizer.json, so I must have found a somewhat compatible one somewhere on HF previously, but I can't recall where I got it from 🤦‍♂️
