
Add BPE pre-tokenization for Command-R. #7033

Closed
dranger003 wants to merge 3 commits

Conversation

@dranger003 (Contributor) commented May 2, 2024

I read #6920 and 120cf37 and am attempting to add Command-R support.

Closes #7030 and #7040.

./build/bin/test-tokenizer-0 models/ggml-vocab-command-r.gguf
...
Tests passed

EDIT: I also tested Command-R+ successfully using this PR.

github-actions (bot) commented May 2, 2024

📈 llama.cpp server for bench-server-baseline on Standard_NC4as_T4_v3 for phi-2-q4_0: 550 iterations 🚀

  • Concurrent users: 8, duration: 10m
  • HTTP requests: avg=8515.43ms p(95)=20551.35ms fails=, finish reason: stop=489 truncated=61
  • Prompt processing (pp): avg=99.05tk/s p(95)=446.8tk/s
  • Token generation (tg): avg=33.63tk/s p(95)=45.35tk/s
  • ggml-org/models/phi-2/ggml-model-q4_0.gguf parallel=8 ctx-size=16384 ngl=33 batch-size=2048 ubatch-size=256 pp=1024 pp+tg=2048 branch=bpe-pretok-command-r commit=9cbad1b2cf4852fc6cd7ff8eab3c41734cea6e07

[Benchmark charts omitted: prompt_tokens_seconds, predicted_tokens_seconds, kv_cache_usage_ratio, and requests_processing for llama.cpp bench-server-baseline on Standard_NC4as_T4_v3, duration=10m, 550 iterations]

@ggerganov (Owner)

I tried to run the convert-hf-to-gguf-update.py script, but it failed because tokenizer.json is stored in LFS in that repo. So we need to handle that:

diff --git a/convert-hf-to-gguf-update.py b/convert-hf-to-gguf-update.py
index f4774003..4ad4d867 100644
--- a/convert-hf-to-gguf-update.py
+++ b/convert-hf-to-gguf-update.py
@@ -95,6 +95,14 @@ for model in models:
     save_path = f"models/tokenizers/{name}/tokenizer.json"
     download_file_with_auth(url, token, save_path)
 
+    # if downloaded file is less than 1KB, we likely need to download an LFS instead
+    if os.path.getsize(save_path) < 1024:
+        # remove the file
+        os.remove(save_path)
+        url = f"{repo}/resolve/main/tokenizer.json"
+        save_path = f"models/tokenizers/{name}/tokenizer.json"
+        download_file_with_auth(url, token, save_path)
+
     if tokt == TOKENIZER_TYPE.SPM:
         url = f"{repo}/resolve/main/tokenizer.model"
         save_path = f"models/tokenizers/{name}/tokenizer.model"
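
As an aside: Git LFS pointer files are small text stubs that begin with a fixed header, so a content-based check is another option besides the size heuristic. A minimal sketch (the helper name is hypothetical, not part of the script):

def is_lfs_pointer(path: str) -> bool:
    # Git LFS pointer files are tiny text stubs that start with this header;
    # checking the content is slightly more robust than a size threshold.
    try:
        with open(path, "rb") as f:
            return f.read(100).startswith(b"version https://git-lfs.github.com/spec/v1")
    except OSError:
        return False

The size check in the diff above is simpler, though, and works here since a real tokenizer.json is always far larger than 1 KB.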

@drummerv commented May 2, 2024

Let's goooooO!

@sealad886 left a comment

I've had an a-ha moment, and these look good. I've run it and got the same results.

Optional: also add command-r-plus, with a nearly identical footprint, since the pre-tokenizers are the same (there are a couple of ways to make that happen).

One note: tokenizer_config.json specifies that the Digits pre-tokenizer is used with individual_digits=True. There isn't a regex pattern in llama.cpp explicitly matching individual digits. Even though the tests pass, would it be possible to add one for completeness' sake? (A quick illustration follows below.)
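
A standalone illustration of that distinction (this uses the third-party regex module for \p{N} support; it is not llama.cpp code):

import regex

text = "3 33 333 3333"

# individual_digits=True corresponds to splitting every digit on its own:
print(regex.findall(r"\p{N}", text))       # every digit separately: ['3', '3', '3', ...]

# whereas a pattern like \p{N}{1,3} (as in the llama-bpe regex quoted later
# in this thread) groups up to three digits per match:
print(regex.findall(r"\p{N}{1,3}", text))  # ['3', '33', '333', '333', '3']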

@Rotatingxenomorph

I get why it's changed, but does it mean the old quants are broken even when using older versions of llama.cpp? I'm not sure I can face downloading and quantizing it myself.

@sealad886

I get why it's changed, but does it mean the old quants are broken even when using older versions of llama.cpp? I'm not sure I can face downloading and quantizing it myself.

My understanding of the overall issue (which admittedly is not as nuanced as others') would suggest that, for the most part, yes: you should re-download, re-convert, and re-quantize your models. The longer version is that newer models are more likely to use newer pre-tokenizer splits.

Luckily, command-r and command-r-plus appear to be nearly identical to the default pre-tokenizer's splits, so for these models it's probably okay?

@Rotatingxenomorph commented May 2, 2024

I get why it's changed, but does it mean the old quants are broken even when using older versions of llama.cpp? I'm not sure I can face downloading and quantizing it myself.

My understanding of the overall issue (which admittedly is not as nuanced as others') would suggest that, for the most part, yes: you should re-download, re-convert, and re-quantize your models. The longer version is that newer models are more likely to use newer pre-tokenizer splits.

Luckily, command-r and command-r-plus appear to be nearly identical to the default pre-tokenizer's splits, so for these models it's probably okay?

I just remembered that the imat quants will definitely be incorrect for the new version of llama.cpp. I have a non-imat q6k that I was hoping to use until some kind person makes a new q8 for huggingface ;)

@sealad886

I just remembered that the imat quants will definitely be incorrect for the new version of llama.cpp. I have a non-imat q6k that I was hoping to use until some kind person makes a new q8 for huggingface ;)

Yeah, exactly... lots of compounding issues. I've ended up moving some of these over to Ollama as well, so now I have to untangle what came from me and what I pulled from them. But I wouldn't be surprised if the majority of Ollama models have this same issue and will need to be rebuilt.

@sealad886 left a comment

Actually, yeah. Let's just approve this to get it in there, and I'll add in command-r-plus separately.

@dranger003 (Contributor, Author) commented May 2, 2024

This PR works for both Command-R versions, since they have the same pre-tokenizer hash.
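
A sketch of how one might double-check that claim, mirroring the chkhsh logic from convert-hf-to-gguf-update.py (the command-r-plus path is an assumption; the script as quoted below only downloads command-r):

from hashlib import sha256
from transformers import AutoTokenizer

# the exact test string from convert-hf-to-gguf-update.py (quoted in full later in this thread)
chktxt = '\n \n\n \n\n\n \t \t\t \t\n  \n   \n    \n     \n🚀 (normal) 😶\u200d🌫️ (multiple emojis concatenated) ✅ 🦙🦙 3 33 333 3333 33333 333333 3333333 33333333 3.3 3..3 3...3 កាន់តែពិសេសអាច😁 ?我想在apple工作1314151天~ ------======= нещо на Български \'\'\'\'\'\'```````""""......!!!!!!?????? I\'ve been \'told he\'s there, \'RE you sure? \'M not sure I\'ll make it, \'D you like some tea? We\'Ve a\'lL'

hashes = set()
for path in ("models/tokenizers/command-r", "models/tokenizers/command-r-plus"):
    tokenizer = AutoTokenizer.from_pretrained(path)
    # hash the token ids, exactly as get_vocab_base_pre() does
    chkhsh = sha256(str(tokenizer.encode(chktxt)).encode()).hexdigest()
    print(f"{path}: {chkhsh}")
    hashes.add(chkhsh)

print("same pre-tokenizer hash" if len(hashes) == 1 else "hashes differ")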

@dranger003 (Contributor, Author)

I tried to run the convert-hf-to-gguf-update.py script, but it failed because tokenizer.json is stored in LFS in that repo. So we need to handle that:

diff --git a/convert-hf-to-gguf-update.py b/convert-hf-to-gguf-update.py
index f4774003..4ad4d867 100644
--- a/convert-hf-to-gguf-update.py
+++ b/convert-hf-to-gguf-update.py
@@ -95,6 +95,14 @@ for model in models:
     save_path = f"models/tokenizers/{name}/tokenizer.json"
     download_file_with_auth(url, token, save_path)
 
+    # if downloaded file is less than 1KB, we likely need to download an LFS instead
+    if os.path.getsize(save_path) < 1024:
+        # remove the file
+        os.remove(save_path)
+        url = f"{repo}/resolve/main/tokenizer.json"
+        save_path = f"models/tokenizers/{name}/tokenizer.json"
+        download_file_with_auth(url, token, save_path)
+
     if tokt == TOKENIZER_TYPE.SPM:
         url = f"{repo}/resolve/main/tokenizer.model"
         save_path = f"models/tokenizers/{name}/tokenizer.model"

Thanks, I added this update to the PR.

@drummerv commented May 2, 2024

python3 ./llama.cpp/convert-hf-to-gguf-update.py hf_token
Directory models/tokenizers/llama-spm already exists - skipping
Downloading llama-bpe to models/tokenizers/llama-bpe
Failed to download file. Status code: 403
Failed to download file. Status code: 403
Traceback (most recent call last):
  File "/workspace/./llama.cpp/convert-hf-to-gguf-update.py", line 99, in <module>
    if os.path.getsize(save_path) < 1024:
  File "/usr/lib/python3.10/genericpath.py", line 50, in getsize
    return os.stat(filename).st_size
FileNotFoundError: [Errno 2] No such file or directory: 'models/tokenizers/llama-bpe/tokenizer.json'
root@6efd24cdcbac:/workspace# 

I had to run this several times (it seems to create the folder and then error out).
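
The FileNotFoundError above happens because the script calls os.path.getsize() on a path whose download failed with a 403 and was never written. A guard like this sketch would avoid the crash (written against the diff quoted earlier in this thread, not the fix that actually landed; save_path, repo, token, name, and download_file_with_auth come from the surrounding loop):

import os

# only apply the LFS size check when the first download actually produced a file
if os.path.exists(save_path) and os.path.getsize(save_path) < 1024:
    os.remove(save_path)
    url = f"{repo}/resolve/main/tokenizer.json"
    save_path = f"models/tokenizers/{name}/tokenizer.json"
    download_file_with_auth(url, token, save_path)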

File models/tokenizers/command-r/tokenizer_config.json downloaded successfully
Traceback (most recent call last):
  File "/workspace/./llama.cpp/convert-hf-to-gguf-update.py", line 127, in <module>
    from transformers import AutoTokenizer
ModuleNotFoundError: No module named 'transformers'

I think cmd-r was already done at this point, but I just wanted to point this out.

Directory models/tokenizers/command-r already exists - skipping
Traceback (most recent call last):
  File "/workspace/./llama.cpp/convert-hf-to-gguf-update.py", line 128, in <module>
    tokenizer = AutoTokenizer.from_pretrained(f"models/tokenizers/{name}")
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/auto/tokenization_auto.py", line 819, in from_pretrained
    config = AutoConfig.from_pretrained(
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/auto/configuration_auto.py", line 928, in from_pretrained
    config_dict, unused_kwargs = PretrainedConfig.get_config_dict(pretrained_model_name_or_path, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/configuration_utils.py", line 631, in get_config_dict
    config_dict, kwargs = cls._get_config_dict(pretrained_model_name_or_path, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/configuration_utils.py", line 686, in _get_config_dict
    resolved_config_file = cached_file(
  File "/usr/local/lib/python3.10/dist-packages/transformers/utils/hub.py", line 369, in cached_file
    raise EnvironmentError(
OSError: models/tokenizers/llama-bpe does not appear to have a file named config.json. Checkout 'https://huggingface.co/models/tokenizers/llama-bpe/tree/None' for available files.

Blocked, but cmd-r is done anyway.

I managed to convert Command-R 35B into an f16 GGUF. I'm currently quantizing it and will test out Q5_K_M soon.

@dranger003 (Contributor, Author)

@drummerv I am not getting any errors.

convert-hf-to-gguf-update.py
$ python ./convert-hf-to-gguf-update.py hf_token
Downloading llama-spm to models/tokenizers/llama-spm
File models/tokenizers/llama-spm/config.json downloaded successfully
File models/tokenizers/llama-spm/tokenizer.json downloaded successfully
File models/tokenizers/llama-spm/tokenizer.model downloaded successfully
File models/tokenizers/llama-spm/tokenizer_config.json downloaded successfully
Downloading llama-bpe to models/tokenizers/llama-bpe
File models/tokenizers/llama-bpe/config.json downloaded successfully
File models/tokenizers/llama-bpe/tokenizer.json downloaded successfully
File models/tokenizers/llama-bpe/tokenizer_config.json downloaded successfully
Downloading phi-3 to models/tokenizers/phi-3
File models/tokenizers/phi-3/config.json downloaded successfully
File models/tokenizers/phi-3/tokenizer.json downloaded successfully
File models/tokenizers/phi-3/tokenizer.model downloaded successfully
File models/tokenizers/phi-3/tokenizer_config.json downloaded successfully
Downloading deepseek-llm to models/tokenizers/deepseek-llm
File models/tokenizers/deepseek-llm/config.json downloaded successfully
File models/tokenizers/deepseek-llm/tokenizer.json downloaded successfully
File models/tokenizers/deepseek-llm/tokenizer_config.json downloaded successfully
Downloading deepseek-coder to models/tokenizers/deepseek-coder
File models/tokenizers/deepseek-coder/config.json downloaded successfully
File models/tokenizers/deepseek-coder/tokenizer.json downloaded successfully
File models/tokenizers/deepseek-coder/tokenizer_config.json downloaded successfully
Downloading falcon to models/tokenizers/falcon
File models/tokenizers/falcon/config.json downloaded successfully
File models/tokenizers/falcon/tokenizer.json downloaded successfully
File models/tokenizers/falcon/tokenizer_config.json downloaded successfully
Downloading bert-bge to models/tokenizers/bert-bge
File models/tokenizers/bert-bge/config.json downloaded successfully
File models/tokenizers/bert-bge/tokenizer.json downloaded successfully
File models/tokenizers/bert-bge/tokenizer_config.json downloaded successfully
Downloading mpt to models/tokenizers/mpt
File models/tokenizers/mpt/config.json downloaded successfully
File models/tokenizers/mpt/tokenizer.json downloaded successfully
File models/tokenizers/mpt/tokenizer_config.json downloaded successfully
Downloading starcoder to models/tokenizers/starcoder
File models/tokenizers/starcoder/config.json downloaded successfully
File models/tokenizers/starcoder/tokenizer.json downloaded successfully
File models/tokenizers/starcoder/tokenizer_config.json downloaded successfully
Downloading gpt-2 to models/tokenizers/gpt-2
File models/tokenizers/gpt-2/config.json downloaded successfully
File models/tokenizers/gpt-2/tokenizer.json downloaded successfully
File models/tokenizers/gpt-2/tokenizer_config.json downloaded successfully
Downloading command-r to models/tokenizers/command-r
File models/tokenizers/command-r/config.json downloaded successfully
File models/tokenizers/command-r/tokenizer.json downloaded successfully
File models/tokenizers/command-r/tokenizer.json downloaded successfully
File models/tokenizers/command-r/tokenizer_config.json downloaded successfully
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
model: llama-bpe
tokt: 2
repo: https://huggingface.co/meta-llama/Meta-Llama-3-8B
chktok: [128000, 198, 4815, 15073, 66597, 8004, 1602, 2355, 79772, 11187, 9468, 248, 222, 320, 8416, 8, 27623, 114, 102470, 9468, 234, 104, 31643, 320, 36773, 100166, 98634, 8, 26602, 227, 11410, 99, 247, 9468, 99, 247, 220, 18, 220, 1644, 220, 8765, 220, 8765, 18, 220, 8765, 1644, 220, 8765, 8765, 220, 8765, 8765, 18, 220, 8765, 8765, 1644, 220, 18, 13, 18, 220, 18, 497, 18, 220, 18, 1131, 18, 220, 21549, 222, 98629, 241, 45358, 233, 21549, 237, 45358, 224, 21549, 244, 21549, 115, 21549, 253, 45358, 223, 21549, 253, 21549, 95, 98629, 227, 76460, 223, 949, 37046, 101067, 19000, 23182, 102301, 9263, 18136, 16, 36827, 21909, 56560, 54337, 19175, 102118, 13373, 64571, 34694, 3114, 112203, 80112, 3436, 106451, 14196, 14196, 74694, 3089, 3089, 29249, 17523, 3001, 27708, 7801, 358, 3077, 1027, 364, 83, 820, 568, 596, 1070, 11, 364, 793, 499, 2771, 30, 364, 44, 539, 2771, 358, 3358, 1304, 433, 11, 364, 35, 499, 1093, 1063, 15600, 30, 1226, 6, 43712, 264, 64966, 43]
chkhsh: 0ef9807a4087ebef797fc749390439009c3b9eda9ad1a097abbe738f486c01e5
pre_tokenizer: {
    "type": "Sequence",
    "pretokenizers": [
        {
            "type": "Split",
            "pattern": {
                "Regex": "(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+"
            },
            "behavior": "Isolated",
            "invert": false
        },
        {
            "type": "ByteLevel",
            "add_prefix_space": false,
            "trim_offsets": true,
            "use_regex": false
        }
    ]
}


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
model: deepseek-llm
tokt: 2
repo: https://huggingface.co/deepseek-ai/deepseek-llm-7b-base
chktok: [100000, 185, 207, 185, 185, 207, 185, 185, 185, 207, 11969, 486, 22504, 185, 243, 185, 300, 185, 251, 185, 663, 185, 10044, 95300, 334, 8754, 8, 33701, 114, 350, 222, 10044, 221, 104, 46713, 334, 34732, 996, 24250, 262, 80923, 8, 207, 37103, 214, 12356, 99, 234, 10044, 99, 234, 207, 18, 207, 18, 18, 207, 18, 18, 18, 207, 18, 18, 18, 18, 207, 18, 18, 18, 18, 18, 207, 18, 18, 18, 18, 18, 18, 207, 18, 18, 18, 18, 18, 18, 18, 207, 18, 18, 18, 18, 18, 18, 18, 18, 207, 18, 13, 18, 207, 18, 526, 18, 207, 18, 1204, 18, 207, 71374, 209, 71374, 114, 71374, 228, 155, 240, 220, 71374, 224, 155, 240, 211, 71374, 231, 71374, 115, 71374, 240, 155, 240, 210, 71374, 240, 71374, 95, 71374, 114, 71374, 214, 71899, 210, 3025, 19017, 612, 9407, 2681, 16, 18, 16, 19, 16, 20, 16, 1398, 68940, 239, 78827, 55170, 76659, 620, 91754, 31116, 36804, 4885, 4885, 10897, 4390, 4390, 41047, 15278, 3033, 14986, 5675, 304, 6, 313, 803, 655, 33326, 362, 6, 82, 745, 11, 655, 1374, 340, 2049, 30, 655, 44, 441, 2049, 304, 6, 647, 1099, 359, 11, 655, 35, 340, 837, 742, 10842, 30, 1003, 6, 10699, 245, 6, 75, 43]
chkhsh: 049ecf7629871e3041641907f3de7c733e4dbfdc736f57d882ba0b0845599754
pre_tokenizer: {
    "type": "Sequence",
    "pretokenizers": [
        {
            "type": "Split",
            "pattern": {
                "Regex": "[\r\n]"
            },
            "behavior": "Isolated",
            "invert": false
        },
        {
            "type": "Split",
            "pattern": {
                "Regex": "\\s?[A-Za-z\u00b5\u00c0-\u00d6\u00d8-\u00f6\u00f8-\u01ba\u01bc-\u01bf\u01c4-\u0293\u0295-\u02af\u0370-\u0373\u0376\u0377\u037b-\u037d\u037f\u0386\u0388-\u038a\u038c\u038e-\u03a1\u03a3-\u03f5\u03f7-\u0481\u048a-\u052f\u0531-\u0556\u10a0-\u10c5\u13a0-\u13f5\u13f8-\u13fd\u1c90-\u1cba\u1cbd-\u1cbf\u1d00-\u1d2b\u1d6b-\u1d77\u1d79-\u1d9a\u1e00-\u1f15\u1f18-\u1f1d\u1f20-\u1f45\u1f48-\u1f4d\u1f50-\u1f57\u1f59\u1f5b\u1f5d\u1f5f-\u1f7d\u1f80-\u1fb4\u1fb6-\u1fbc\u1fbe\u1fc2-\u1fc4\u1fc6-\u1fcc\u1fd0-\u1fd3\u1fd6-\u1fdb\u1fe0-\u1fec\u1ff2-\u1ff4\u1ff6-\u1ffc\u2102\u2107\u210a-\u2113\u2115\u2119-\u211d\u2124\u2126\u2128\u212a-\u212d\u212f-\u2134\u2139\u213c-\u213f\u2145-\u2149\u214e\u2183\u2184\u2c00-\u2c7b\u2c7e-\u2ce4\u2ceb-\u2cee\u2cf2\u2cf3\ua640-\ua66d\ua680-\ua69b\ua722-\ua76f\ua771-\ua787\ua78b-\ua78e\uab70-\uabbf\ufb00-\ufb06\ufb13-\ufb17\uff21-\uff3a\uff41-\uff5a\ud801\udc00-\ud801\udc4f\ud801\udcb0-\ud801\udcd3\ud801\udcd8-\ud801\udcfb\ud803\udc80-\ud803\udcb2\ud803\udcc0-\ud803\udcf2\ud806\udca0-\ud806\udcdf\ud83a\udd00-\ud83a\udd43]+"
            },
            "behavior": "Isolated",
            "invert": false
        },
        {
            "type": "Split",
            "pattern": {
                "Regex": "\\s?[!-/:-~\uff01-\uff0f\uff1a-\uff5e\u2018-\u201f\u3000-\u3002]+"
            },
            "behavior": "Isolated",
            "invert": false
        },
        {
            "type": "Split",
            "pattern": {
                "Regex": "\\s+$"
            },
            "behavior": "Isolated",
            "invert": false
        },
        {
            "type": "Split",
            "pattern": {
                "Regex": "[\u4e00-\u9fa5\u0800-\u4e00\uac00-\ud7ff]+"
            },
            "behavior": "Isolated",
            "invert": false
        },
        {
            "type": "Digits",
            "individual_digits": true
        },
        {
            "type": "ByteLevel",
            "add_prefix_space": false,
            "trim_offsets": true,
            "use_regex": false
        }
    ]
}


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
model: deepseek-coder
tokt: 2
repo: https://huggingface.co/deepseek-ai/deepseek-coder-6.7b-base
chktok: [32013, 185, 207, 185, 185, 207, 185, 185, 185, 207, 12405, 459, 22758, 185, 243, 185, 315, 185, 251, 185, 730, 185, 10047, 235, 209, 334, 8760, 8, 12394, 233, 114, 350, 222, 10047, 221, 104, 169, 116, 224, 334, 4684, 3909, 992, 24330, 262, 29651, 612, 8, 207, 156, 237, 214, 12394, 99, 234, 10047, 99, 234, 207, 18, 207, 18, 18, 207, 18, 18, 18, 207, 18, 18, 18, 18, 207, 18, 18, 18, 18, 18, 207, 18, 18, 18, 18, 18, 18, 207, 18, 18, 18, 18, 18, 18, 18, 207, 18, 18, 18, 18, 18, 18, 18, 18, 207, 18, 13, 18, 207, 18, 524, 18, 207, 18, 1202, 18, 207, 155, 239, 209, 155, 239, 114, 155, 239, 228, 155, 240, 220, 155, 239, 224, 155, 240, 211, 155, 239, 231, 155, 239, 115, 155, 239, 240, 155, 240, 210, 155, 239, 240, 155, 239, 95, 155, 239, 114, 155, 239, 214, 10047, 233, 210, 3015, 19100, 608, 9413, 2668, 16, 18, 16, 19, 16, 20, 16, 1393, 169, 121, 239, 18155, 374, 17194, 28, 2861, 6478, 616, 2251, 14994, 31269, 4191, 6, 4686, 4686, 10252, 3358, 3358, 3409, 524, 15330, 3023, 15031, 5668, 303, 6, 312, 798, 651, 83, 839, 362, 6, 82, 741, 11, 651, 1369, 340, 2037, 30, 651, 44, 441, 2037, 303, 6, 642, 1098, 359, 11, 651, 35, 340, 833, 738, 10860, 30, 998, 6, 10709, 245, 6, 75, 43]
chkhsh: 347715f544604f9118bb75ed199f68779f423cabb20db6de6f31b908d04d7821
pre_tokenizer: {
    "type": "Sequence",
    "pretokenizers": [
        {
            "type": "Split",
            "pattern": {
                "Regex": "[\r\n]"
            },
            "behavior": "Isolated",
            "invert": false
        },
        {
            "type": "Split",
            "pattern": {
                "Regex": "\\s?\\p{L}+"
            },
            "behavior": "Isolated",
            "invert": false
        },
        {
            "type": "Split",
            "pattern": {
                "Regex": "\\s?\\p{P}+"
            },
            "behavior": "Isolated",
            "invert": false
        },
        {
            "type": "Split",
            "pattern": {
                "Regex": "[\u4e00-\u9fa5\u0800-\u4e00\uac00-\ud7ff]+"
            },
            "behavior": "Isolated",
            "invert": false
        },
        {
            "type": "Digits",
            "individual_digits": true
        },
        {
            "type": "ByteLevel",
            "add_prefix_space": false,
            "trim_offsets": true,
            "use_regex": false
        }
    ]
}


model: falcon
tokt: 2
repo: https://huggingface.co/tiiuae/falcon-7b
chktok: [1212, 4824, 1001, 1212, 192, 204, 663, 49453, 2069, 742, 561, 1501, 193, 2571, 232, 206, 204, 19, 11003, 20, 8196, 126, 283, 219, 48778, 116, 13392, 204, 19, 51831, 732, 63209, 1741, 7955, 522, 20, 22438, 211, 3346, 111, 231, 2571, 111, 231, 204, 30, 204, 3138, 204, 22287, 204, 22287, 30, 204, 22287, 3138, 204, 22287, 22287, 204, 22287, 22287, 30, 204, 22287, 22287, 3138, 204, 30, 25, 30, 204, 30, 513, 30, 204, 30, 951, 30, 27171, 236, 206, 38154, 126, 38154, 225, 167, 237, 217, 38154, 221, 167, 237, 208, 38154, 228, 38154, 127, 38154, 237, 167, 237, 207, 38154, 237, 38154, 107, 38154, 126, 38154, 211, 20589, 207, 204, 42, 50087, 123, 2727, 20300, 32022, 133, 234, 17419, 30137, 28, 7858, 181, 133, 236, 204, 37057, 2228, 10666, 5052, 133, 6207, 151, 215, 150, 134, 5052, 133, 6279, 5052, 223, 151, 216, 49679, 123, 53110, 47043, 7795, 204, 7544, 7544, 7544, 8543, 8543, 17593, 3513, 3513, 12844, 51520, 17664, 4247, 295, 18, 298, 650, 204, 18, 95, 693, 332, 18, 94, 629, 23, 204, 18, 1553, 299, 1310, 42, 204, 18, 56, 416, 1310, 295, 18, 567, 717, 334, 23, 204, 18, 47, 299, 606, 596, 6696, 42, 703, 18, 16139, 241, 18, 87, 55]
chkhsh: 8aeee3860c56296a157a1fe2fad249ec40aa59b1bb5709f4ade11c4e6fe652ed
pre_tokenizer: {
    "type": "Sequence",
    "pretokenizers": [
        {
            "type": "Punctuation",
            "behavior": "Contiguous"
        },
        {
            "type": "ByteLevel",
            "add_prefix_space": false,
            "trim_offsets": true,
            "use_regex": true
        },
        {
            "type": "Digits",
            "individual_digits": false
        },
        {
            "type": "Split",
            "pattern": {
                "Regex": "[0-9][0-9][0-9]"
            },
            "behavior": "Isolated",
            "invert": false
        }
    ]
}


model: bert-bge
tokt: 3
repo: https://huggingface.co/BAAI/bge-small-en-v1.5
chktok: [101, 100, 1006, 3671, 1007, 100, 1006, 3674, 7861, 29147, 2483, 9530, 16280, 23854, 1007, 100, 100, 1017, 3943, 21211, 21211, 2509, 21211, 22394, 21211, 22394, 2509, 21211, 22394, 22394, 21211, 22394, 22394, 2509, 1017, 1012, 1017, 1017, 1012, 1012, 1017, 1017, 1012, 1012, 1012, 1017, 100, 1029, 1855, 100, 100, 6207, 100, 100, 14677, 23632, 22203, 1811, 1995, 1011, 1011, 1011, 1011, 1011, 1011, 1027, 1027, 1027, 1027, 1027, 1027, 1027, 1192, 15290, 29754, 14150, 1192, 10260, 1181, 29755, 29436, 29741, 10260, 16856, 29747, 23925, 10325, 1005, 1005, 1005, 1005, 1005, 1005, 1036, 1036, 1036, 1036, 1036, 1036, 1036, 1000, 1000, 1000, 1000, 1012, 1012, 1012, 1012, 1012, 1012, 999, 999, 999, 999, 999, 999, 1029, 1029, 1029, 1029, 1029, 1029, 1045, 1005, 2310, 2042, 1005, 2409, 2002, 1005, 1055, 2045, 1010, 1005, 2128, 2017, 2469, 1029, 1005, 1049, 2025, 2469, 1045, 1005, 2222, 2191, 2009, 1010, 1005, 1040, 2017, 2066, 2070, 5572, 1029, 2057, 1005, 2310, 1037, 1005, 2222, 102]
chkhsh: 0876d13b50744004aa9aeae05e7b0647eac9d801b5ba4668afc01e709c15e19f
pre_tokenizer: {
    "type": "BertPreTokenizer"
}


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
model: mpt
tokt: 2
repo: https://huggingface.co/mosaicml/mpt-7b
chktok: [586, 1744, 33525, 186, 209, 623, 28910, 187, 50276, 187, 50275, 187, 50274, 187, 50273, 187, 14931, 237, 211, 313, 6320, 10, 49042, 116, 325, 224, 14931, 223, 106, 171, 118, 226, 313, 34263, 802, 13511, 261, 32147, 456, 10, 3384, 239, 216, 22692, 101, 236, 14931, 101, 236, 495, 5922, 30057, 495, 20084, 495, 26409, 30057, 20084, 495, 26409, 1610, 495, 26409, 20084, 495, 15, 20, 495, 537, 20, 495, 1051, 20, 209, 18081, 211, 18081, 116, 18081, 230, 39936, 222, 18081, 226, 39936, 213, 18081, 233, 18081, 117, 18081, 242, 39936, 212, 18081, 242, 18081, 97, 18081, 116, 18081, 216, 14931, 235, 212, 3736, 15367, 41197, 13610, 19934, 41869, 21275, 1012, 1047, 18795, 40120, 20422, 241, 16081, 6877, 12880, 11514, 1068, 8713, 38177, 13396, 3415, 9925, 12559, 10453, 1389, 42011, 35033, 34842, 11202, 9739, 9739, 33021, 18963, 4672, 25561, 8220, 309, 1849, 644, 686, 42618, 344, 434, 627, 13, 686, 1848, 368, 2119, 32, 686, 46, 417, 2119, 309, 1833, 1056, 352, 13, 686, 37, 368, 751, 690, 10331, 32, 844, 8, 31516, 247, 8, 77, 45]
chkhsh: b6dc8df998e1cfbdc4eac8243701a65afe638679230920b50d6f17d81c098166
pre_tokenizer: {
    "type": "ByteLevel",
    "add_prefix_space": false,
    "trim_offsets": true,
    "use_regex": true
}


model: starcoder
tokt: 2
repo: https://huggingface.co/bigcode/starcoder2-3b
chktok: [353, 736, 8886, 221, 10883, 4238, 16101, 28540, 222, 3822, 272, 246, 327, 4434, 46, 18445, 152, 46030, 45022, 142, 13878, 327, 12585, 19884, 33773, 40920, 751, 46, 41839, 5954, 137, 271, 3822, 137, 271, 244, 56, 244, 56, 56, 244, 56, 56, 56, 244, 56, 56, 56, 56, 244, 56, 56, 56, 56, 56, 244, 56, 56, 56, 56, 56, 56, 244, 56, 56, 56, 56, 56, 56, 56, 244, 56, 56, 56, 56, 56, 56, 56, 56, 244, 56, 51, 56, 244, 56, 516, 56, 244, 56, 1198, 56, 244, 14566, 246, 14566, 152, 14566, 265, 30428, 257, 14566, 261, 30428, 248, 14566, 268, 14566, 153, 14566, 277, 30428, 247, 14566, 277, 14566, 133, 14566, 152, 14566, 251, 36570, 247, 1037, 4995, 13379, 2924, 9515, 17823, 54, 56, 54, 57, 54, 58, 54, 11904, 47892, 20895, 16625, 13047, 8389, 1059, 9504, 40216, 13858, 2073, 8983, 12571, 1539, 10721, 5918, 9643, 13298, 932, 31723, 31330, 9221, 3226, 35426, 10400, 457, 4783, 2602, 349, 121, 1477, 957, 1200, 2038, 49, 349, 632, 863, 3673, 68, 349, 82, 666, 3673, 457, 4650, 1949, 580, 49, 349, 73, 863, 2144, 1649, 35941, 68, 2726, 44, 7728, 331, 44, 113, 81]
chkhsh: 35d91631860c815f952d711435f48d356ebac988362536bed955d43bfa436e34
pre_tokenizer: {
    "type": "Sequence",
    "pretokenizers": [
        {
            "type": "Digits",
            "individual_digits": true
        },
        {
            "type": "ByteLevel",
            "add_prefix_space": false,
            "trim_offsets": true,
            "use_regex": true
        }
    ]
}


model: gpt-2
tokt: 2
repo: https://huggingface.co/openai-community/gpt2
chktok: [198, 220, 628, 220, 628, 198, 220, 197, 220, 197, 197, 220, 197, 198, 220, 220, 198, 220, 220, 220, 198, 220, 220, 220, 220, 198, 220, 220, 220, 220, 220, 198, 8582, 248, 222, 357, 11265, 8, 30325, 114, 447, 235, 8582, 234, 104, 37929, 357, 48101, 795, 13210, 271, 1673, 36686, 515, 8, 14519, 227, 12520, 99, 247, 8582, 99, 247, 513, 4747, 23460, 513, 20370, 23460, 2091, 23460, 20370, 23460, 24840, 23460, 2091, 20370, 513, 13, 18, 513, 492, 18, 513, 986, 18, 28053, 252, 222, 157, 252, 114, 157, 252, 241, 157, 253, 233, 157, 252, 237, 157, 253, 224, 157, 252, 244, 157, 252, 115, 157, 252, 253, 157, 253, 223, 157, 252, 253, 157, 252, 95, 157, 252, 114, 157, 252, 227, 47249, 223, 5633, 22755, 239, 46349, 111, 28839, 101, 18040, 32432, 98, 43291, 1485, 1415, 24309, 25465, 171, 121, 252, 40103, 1421, 18604, 12466, 121, 16843, 141, 231, 15166, 12466, 121, 16142, 12466, 239, 141, 232, 30143, 140, 111, 16142, 21169, 21727, 31583, 18849, 705, 39115, 6, 33153, 15506, 63, 15931, 15931, 16317, 13896, 3228, 9805, 3548, 314, 1053, 587, 705, 44040, 339, 338, 612, 11, 705, 2200, 345, 1654, 30, 705, 44, 407, 1654, 314, 1183, 787, 340, 11, 705, 35, 345, 588, 617, 8887, 30, 775, 6, 26979, 257, 6, 75, 43]
chkhsh: 3ce83efda5659b07b1ad37ca97ca5797ea4285d9b9ab0dc679e4a720c9da7454
pre_tokenizer: {
    "type": "ByteLevel",
    "add_prefix_space": false,
    "trim_offsets": true
}


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
model: command-r
tokt: 2
repo: https://huggingface.co/CohereForAI/c4ai-command-r-v01
chktok: [5, 127731, 51628, 205, 57788, 18494, 97469, 126134, 206, 2226, 256, 230, 1737, 18258, 16, 80503, 122, 35927, 2226, 242, 112, 57462, 1737, 54457, 223165, 106230, 2096, 16, 48389, 11254, 107, 255, 2226, 107, 255, 228, 26, 228, 26, 26, 228, 26, 26, 26, 228, 26, 26, 26, 26, 228, 26, 26, 26, 26, 26, 228, 26, 26, 26, 26, 26, 26, 228, 26, 26, 26, 26, 26, 26, 26, 228, 26, 26, 26, 26, 26, 26, 26, 26, 228, 26, 21, 26, 228, 26, 2271, 26, 228, 26, 3834, 26, 182018, 230, 174833, 38111, 249, 86325, 241, 38111, 245, 86325, 232, 38111, 252, 38111, 123, 38111, 261, 165, 24629, 38111, 261, 38111, 103, 174833, 38111, 235, 188568, 231, 5691, 12081, 13336, 2648, 29325, 14315, 24, 26, 24, 27, 24, 28, 24, 5123, 18372, 8391, 158343, 3512, 40071, 2196, 3236, 8750, 1764, 37097, 41168, 29721, 32797, 25646, 3802, 4975, 4975, 116167, 57178, 10251, 154048, 27292, 1767, 5125, 2632, 2155, 91, 2378, 1919, 1914, 2782, 19, 2155, 3354, 1933, 5470, 38, 2155, 52, 2068, 5470, 1767, 4961, 3059, 1894, 19, 2155, 43, 1933, 3026, 2725, 23186, 38, 2930, 14, 20676, 1671, 14, 83, 51]
chkhsh: 9c2227e4dd922002fb81bde4fc02b0483ca4f12911410dee2255e4987644e3f8
pre_tokenizer: {
    "type": "Sequence",
    "pretokenizers": [
        {
            "type": "Digits",
            "individual_digits": true
        },
        {
            "type": "ByteLevel",
            "add_prefix_space": false,
            "trim_offsets": true,
            "use_regex": true
        }
    ]
}


    def get_vocab_base_pre(self, tokenizer) -> str:
        # encoding this string and hashing the resulting tokens would (hopefully) give us a unique identifier that
        # is specific for the BPE pre-tokenizer used by the model
        # we will use this unique identifier to write a "tokenizer.ggml.pre" entry in the GGUF file which we can
        # use in llama.cpp to implement the same pre-tokenizer

        chktxt = '\n \n\n \n\n\n \t \t\t \t\n  \n   \n    \n     \n🚀 (normal) 😶\u200d🌫️ (multiple emojis concatenated) ✅ 🦙🦙 3 33 333 3333 33333 333333 3333333 33333333 3.3 3..3 3...3 កាន់តែពិសេសអាច😁 ?我想在apple工作1314151天~ ------======= нещо на Български \'\'\'\'\'\'```````""""......!!!!!!?????? I\'ve been \'told he\'s there, \'RE you sure? \'M not sure I\'ll make it, \'D you like some tea? We\'Ve a\'lL'

        chktok = tokenizer.encode(chktxt)
        chkhsh = sha256(str(chktok).encode()).hexdigest()

        print(f"chktok: {chktok}")
        print(f"chkhsh: {chkhsh}")

        res = None

        # NOTE: if you get an error here, you need to update the convert-hf-to-gguf-update.py script
        #       or pull the latest version of the model from Huggingface
        #       don't edit the hashes manually!
        if chkhsh == "0ef9807a4087ebef797fc749390439009c3b9eda9ad1a097abbe738f486c01e5":
            # ref: https://huggingface.co/meta-llama/Meta-Llama-3-8B
            res = "llama-bpe"
        if chkhsh == "049ecf7629871e3041641907f3de7c733e4dbfdc736f57d882ba0b0845599754":
            # ref: https://huggingface.co/deepseek-ai/deepseek-llm-7b-base
            res = "deepseek-llm"
        if chkhsh == "347715f544604f9118bb75ed199f68779f423cabb20db6de6f31b908d04d7821":
            # ref: https://huggingface.co/deepseek-ai/deepseek-coder-6.7b-base
            res = "deepseek-coder"
        if chkhsh == "8aeee3860c56296a157a1fe2fad249ec40aa59b1bb5709f4ade11c4e6fe652ed":
            # ref: https://huggingface.co/tiiuae/falcon-7b
            res = "falcon"
        if chkhsh == "0876d13b50744004aa9aeae05e7b0647eac9d801b5ba4668afc01e709c15e19f":
            # ref: https://huggingface.co/BAAI/bge-small-en-v1.5
            res = "bert-bge"
        if chkhsh == "b6dc8df998e1cfbdc4eac8243701a65afe638679230920b50d6f17d81c098166":
            # ref: https://huggingface.co/mosaicml/mpt-7b
            res = "mpt"
        if chkhsh == "35d91631860c815f952d711435f48d356ebac988362536bed955d43bfa436e34":
            # ref: https://huggingface.co/bigcode/starcoder2-3b
            res = "starcoder"
        if chkhsh == "3ce83efda5659b07b1ad37ca97ca5797ea4285d9b9ab0dc679e4a720c9da7454":
            # ref: https://huggingface.co/openai-community/gpt2
            res = "gpt-2"
        if chkhsh == "9c2227e4dd922002fb81bde4fc02b0483ca4f12911410dee2255e4987644e3f8":
            # ref: https://huggingface.co/CohereForAI/c4ai-command-r-v01
            res = "command-r"

        if res is None:
            print("\n")
            print("**************************************************************************************")
            print("** WARNING: The BPE pre-tokenizer was not recognized!")
            print("**          There are 2 possible reasons for this:")
            print("**          - the model has not been added to convert-hf-to-gguf-update.py yet")
            print("**          - the pre-tokenization config has changed upstream")
            print("**          Check your model files and convert-hf-to-gguf-update.py and update them accordingly.")
            print("** ref:     https://github.com/ggerganov/llama.cpp/pull/6920")
            print("**")
            print(f"** chkhsh:  {chkhsh}")
            print("**************************************************************************************")
            print("\n")
            raise NotImplementedError("BPE pre-tokenizer was not recognized - update get_vocab_base_pre()")

        print(f"tokenizer.ggml.pre: {res}")
        print(f"chkhsh: {chkhsh}")

        return res



!!! Copy-paste the function above into convert-hf-to-gguf.py !!!


Tests for llama-spm written in ./models/ggml-vocab-llama-spm.gguf.*
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Tests for llama-bpe written in ./models/ggml-vocab-llama-bpe.gguf.*
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Tests for phi-3 written in ./models/ggml-vocab-phi-3.gguf.*
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Tests for deepseek-llm written in ./models/ggml-vocab-deepseek-llm.gguf.*
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Tests for deepseek-coder written in ./models/ggml-vocab-deepseek-coder.gguf.*
Tests for falcon written in ./models/ggml-vocab-falcon.gguf.*
Tests for bert-bge written in ./models/ggml-vocab-bert-bge.gguf.*
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Tests for mpt written in ./models/ggml-vocab-mpt.gguf.*
Tests for starcoder written in ./models/ggml-vocab-starcoder.gguf.*
Tests for gpt-2 written in ./models/ggml-vocab-gpt-2.gguf.*
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Tests for command-r written in ./models/ggml-vocab-command-r.gguf.*

Run the following commands to generate the vocab files for testing:

python3 convert-hf-to-gguf.py models/tokenizers/llama-spm/ --outfile models/ggml-vocab-llama-spm.gguf --vocab-only
python3 convert-hf-to-gguf.py models/tokenizers/llama-bpe/ --outfile models/ggml-vocab-llama-bpe.gguf --vocab-only
python3 convert-hf-to-gguf.py models/tokenizers/phi-3/ --outfile models/ggml-vocab-phi-3.gguf --vocab-only
python3 convert-hf-to-gguf.py models/tokenizers/deepseek-llm/ --outfile models/ggml-vocab-deepseek-llm.gguf --vocab-only
python3 convert-hf-to-gguf.py models/tokenizers/deepseek-coder/ --outfile models/ggml-vocab-deepseek-coder.gguf --vocab-only
python3 convert-hf-to-gguf.py models/tokenizers/falcon/ --outfile models/ggml-vocab-falcon.gguf --vocab-only
python3 convert-hf-to-gguf.py models/tokenizers/bert-bge/ --outfile models/ggml-vocab-bert-bge.gguf --vocab-only
python3 convert-hf-to-gguf.py models/tokenizers/mpt/ --outfile models/ggml-vocab-mpt.gguf --vocab-only
python3 convert-hf-to-gguf.py models/tokenizers/starcoder/ --outfile models/ggml-vocab-starcoder.gguf --vocab-only
python3 convert-hf-to-gguf.py models/tokenizers/gpt-2/ --outfile models/ggml-vocab-gpt-2.gguf --vocab-only
python3 convert-hf-to-gguf.py models/tokenizers/command-r/ --outfile models/ggml-vocab-command-r.gguf --vocab-only

@dranger003 (Contributor, Author)

Superseded by PR #7063.

dranger003 closed this May 3, 2024
Successfully merging this pull request may close these issues.

Command-R GGUF conversion no longer working