
Add BPE pre-tokenization for Command-R. #7033

Closed
dranger003 wants to merge 3 commits

Conversation

@dranger003 (Contributor) commented May 2, 2024

I read #6920 and 120cf37 and am attempting to add Command-R support.

Closes #7030 and #7040.

./build/bin/test-tokenizer-0 models/ggml-vocab-command-r.gguf
...
Tests passed

EDIT: I also tested Command-R+ successfully using this PR.

github-actions (bot) commented May 2, 2024

📈 llama.cpp server for bench-server-baseline on Standard_NC4as_T4_v3 for phi-2-q4_0: 550 iterations 🚀

  • Concurrent users: 8, duration: 10m
  • HTTP requests: avg=8515.43ms p(95)=20551.35ms fails=, finish reason: stop=489 truncated=61
  • Prompt processing (pp): avg=99.05tk/s p(95)=446.8tk/s
  • Token generation (tg): avg=33.63tk/s p(95)=45.35tk/s
  • ggml-org/models/phi-2/ggml-model-q4_0.gguf parallel=8 ctx-size=16384 ngl=33 batch-size=2048 ubatch-size=256 pp=1024 pp+tg=2048 branch=bpe-pretok-command-r commit=9cbad1b2cf4852fc6cd7ff8eab3c41734cea6e07

[Benchmark charts omitted: prompt_tokens_seconds, predicted_tokens_seconds, kv_cache_usage_ratio, and requests_processing for llama.cpp bench-server-baseline on Standard_NC4as_T4_v3, duration=10m, 550 iterations]

@ggerganov (Owner)

I tried to run the convert-hf-to-gguf-update.py script, but it failed because tokenizer.json is stored in LFS in that repo. So we need to handle that:

diff --git a/convert-hf-to-gguf-update.py b/convert-hf-to-gguf-update.py
index f4774003..4ad4d867 100644
--- a/convert-hf-to-gguf-update.py
+++ b/convert-hf-to-gguf-update.py
@@ -95,6 +95,14 @@ for model in models:
     save_path = f"models/tokenizers/{name}/tokenizer.json"
     download_file_with_auth(url, token, save_path)
 
+    # if downloaded file is less than 1KB, we likely need to download an LFS instead
+    if os.path.getsize(save_path) < 1024:
+        # remove the file
+        os.remove(save_path)
+        url = f"{repo}/resolve/main/tokenizer.json"
+        save_path = f"models/tokenizers/{name}/tokenizer.json"
+        download_file_with_auth(url, token, save_path)
+
     if tokt == TOKENIZER_TYPE.SPM:
         url = f"{repo}/resolve/main/tokenizer.model"
         save_path = f"models/tokenizers/{name}/tokenizer.model"
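
As an aside: Git LFS pointer files are small text stubs that begin with a fixed header, so a content-based check is another option besides the size heuristic. A minimal sketch (the helper name is hypothetical, not part of the script):

def is_lfs_pointer(path: str) -> bool:
    # Git LFS pointer files are tiny text stubs that start with this header;
    # checking the content is slightly more robust than a size threshold.
    try:
        with open(path, "rb") as f:
            return f.read(100).startswith(b"version https://git-lfs.github.com/spec/v1")
    except OSError:
        return False

The size check in the diff above is simpler, though, and works here since a real tokenizer.json is always far larger than 1 KB.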

@drummerv commented May 2, 2024

Let's goooooO!

@sealad886 left a comment

I've had an a-ha moment, and these look good. I've run it and got the same results.

Optional: also add command-r-plus, with a nearly identical footprint, since the pre-tokenizers are the same (there are a couple of ways to make that happen).

One note: tokenizer_config.json specifies that the Digits pre-tokenizer is used with individual_digits=True. There isn't a regex pattern in llama.cpp explicitly matching individual digits. Even though the tests pass, would it be possible to add one for completeness' sake? (A quick illustration follows below.)
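
A standalone illustration of that distinction (this uses the third-party regex module for \p{N} support; it is not llama.cpp code):

import regex

text = "3 33 333 3333"

# individual_digits=True corresponds to splitting every digit on its own:
print(regex.findall(r"\p{N}", text))       # every digit separately: ['3', '3', '3', ...]

# whereas a pattern like \p{N}{1,3} (as in the llama-bpe regex quoted later
# in this thread) groups up to three digits per match:
print(regex.findall(r"\p{N}{1,3}", text))  # ['3', '33', '333', '333', '3']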

@Rotatingxenomorph

I get why it's changed, but does it mean the old quants are broken even when using older versions of llama.cpp? I'm not sure I can face downloading and quantizing it myself.

@sealad886

I get why it's changed, but does it mean the old quants are broken even when using older versions of llama.cpp? I'm not sure I can face downloading and quantizing it myself.

My understanding of the overall issue (which admittedly is not as nuanced as others') would suggest that, for the most part, yes: you should re-download, re-convert, and re-quantize your models. The longer version is that newer models are more likely to use newer pre-tokenizer splits.

Luckily, command-r and command-r-plus appear to be nearly identical to the default pre-tokenizer's splits, so for these models it's probably okay?

@Rotatingxenomorph commented May 2, 2024

I get why it's changed, but does it mean the old quants are broken even when using older versions of llama.cpp? I'm not sure I can face downloading and quantizing it myself.

My understanding of the overall issue (which admittedly is not as nuanced as others') would suggest that, for the most part, yes: you should re-download, re-convert, and re-quantize your models. The longer version is that newer models are more likely to use newer pre-tokenizer splits.

Luckily, command-r and command-r-plus appear to be nearly identical to the default pre-tokenizer's splits, so for these models it's probably okay?

I just remembered that the imat quants will definitely be incorrect for the new version of llama.cpp. I have a non-imat q6k that I was hoping to use until some kind person makes a new q8 for huggingface ;)

@sealad886

I just remembered that the imat quants will definitely be incorrect for the new version of llama.cpp. I have a non-imat q6k that I was hoping to use until some kind person makes a new q8 for huggingface ;)

Yeah, exactly... lots of compounding issues. I've ended up moving some of these over to Ollama as well, so now I have to untangle what came from me and what I pulled from them. But I wouldn't be surprised if the majority of Ollama models have this same issue and will need to be rebuilt.

@sealad886 left a comment

Actually, yeah. Let's just approve this to get it in there, and I'll add in command-r-plus separately.

@dranger003 (Contributor, Author) commented May 2, 2024

This PR works for both Command-R versions, since they have the same pre-tokenizer hash.
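
A sketch of how one might double-check that claim, mirroring the chkhsh logic from convert-hf-to-gguf-update.py (the command-r-plus path is an assumption; the script as quoted below only downloads command-r):

from hashlib import sha256
from transformers import AutoTokenizer

# the exact test string from convert-hf-to-gguf-update.py (quoted in full later in this thread)
chktxt = '\n \n\n \n\n\n \t \t\t \t\n  \n   \n    \n     \n🚀 (normal) 😶\u200d🌫️ (multiple emojis concatenated) ✅ 🦙🦙 3 33 333 3333 33333 333333 3333333 33333333 3.3 3..3 3...3 កាន់តែពិសេសអាច😁 ?我想在apple工作1314151天~ ------======= нещо на Български \'\'\'\'\'\'```````""""......!!!!!!?????? I\'ve been \'told he\'s there, \'RE you sure? \'M not sure I\'ll make it, \'D you like some tea? We\'Ve a\'lL'

hashes = set()
for path in ("models/tokenizers/command-r", "models/tokenizers/command-r-plus"):
    tokenizer = AutoTokenizer.from_pretrained(path)
    # hash the token ids, exactly as get_vocab_base_pre() does
    chkhsh = sha256(str(tokenizer.encode(chktxt)).encode()).hexdigest()
    print(f"{path}: {chkhsh}")
    hashes.add(chkhsh)

print("same pre-tokenizer hash" if len(hashes) == 1 else "hashes differ")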

@dranger003 (Contributor, Author)

I tried to run the convert-hf-to-gguf-update.py script, but it failed because tokenizer.json is stored in LFS in that repo. So we need to handle that:

diff --git a/convert-hf-to-gguf-update.py b/convert-hf-to-gguf-update.py
index f4774003..4ad4d867 100644
--- a/convert-hf-to-gguf-update.py
+++ b/convert-hf-to-gguf-update.py
@@ -95,6 +95,14 @@ for model in models:
     save_path = f"models/tokenizers/{name}/tokenizer.json"
     download_file_with_auth(url, token, save_path)
 
+    # if downloaded file is less than 1KB, we likely need to download an LFS instead
+    if os.path.getsize(save_path) < 1024:
+        # remove the file
+        os.remove(save_path)
+        url = f"{repo}/resolve/main/tokenizer.json"
+        save_path = f"models/tokenizers/{name}/tokenizer.json"
+        download_file_with_auth(url, token, save_path)
+
     if tokt == TOKENIZER_TYPE.SPM:
         url = f"{repo}/resolve/main/tokenizer.model"
         save_path = f"models/tokenizers/{name}/tokenizer.model"

Thanks, I added this update to the PR.

@drummerv commented May 2, 2024

python3 ./llama.cpp/convert-hf-to-gguf-update.py hf_token
Directory models/tokenizers/llama-spm already exists - skipping
Downloading llama-bpe to models/tokenizers/llama-bpe
Failed to download file. Status code: 403
Failed to download file. Status code: 403
Traceback (most recent call last):
  File "/workspace/./llama.cpp/convert-hf-to-gguf-update.py", line 99, in <module>
    if os.path.getsize(save_path) < 1024:
  File "/usr/lib/python3.10/genericpath.py", line 50, in getsize
    return os.stat(filename).st_size
FileNotFoundError: [Errno 2] No such file or directory: 'models/tokenizers/llama-bpe/tokenizer.json'
root@6efd24cdcbac:/workspace# 

I had to run this several times (it seems to create the folder and then error out).
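
The FileNotFoundError above happens because the script calls os.path.getsize() on a path whose download failed with a 403 and was never written. A guard like this sketch would avoid the crash (written against the diff quoted earlier in this thread, not the fix that actually landed; save_path, repo, token, name, and download_file_with_auth come from the surrounding loop):

import os

# only apply the LFS size check when the first download actually produced a file
if os.path.exists(save_path) and os.path.getsize(save_path) < 1024:
    os.remove(save_path)
    url = f"{repo}/resolve/main/tokenizer.json"
    save_path = f"models/tokenizers/{name}/tokenizer.json"
    download_file_with_auth(url, token, save_path)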

File models/tokenizers/command-r/tokenizer_config.json downloaded successfully
Traceback (most recent call last):
  File "/workspace/./llama.cpp/convert-hf-to-gguf-update.py", line 127, in <module>
    from transformers import AutoTokenizer
ModuleNotFoundError: No module named 'transformers'

I think cmd-r was already done at this point, but I just wanted to point this out.

Directory models/tokenizers/command-r already exists - skipping
Traceback (most recent call last):
  File "/workspace/./llama.cpp/convert-hf-to-gguf-update.py", line 128, in <module>
    tokenizer = AutoTokenizer.from_pretrained(f"models/tokenizers/{name}")
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/auto/tokenization_auto.py", line 819, in from_pretrained
    config = AutoConfig.from_pretrained(
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/auto/configuration_auto.py", line 928, in from_pretrained
    config_dict, unused_kwargs = PretrainedConfig.get_config_dict(pretrained_model_name_or_path, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/configuration_utils.py", line 631, in get_config_dict
    config_dict, kwargs = cls._get_config_dict(pretrained_model_name_or_path, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/configuration_utils.py", line 686, in _get_config_dict
    resolved_config_file = cached_file(
  File "/usr/local/lib/python3.10/dist-packages/transformers/utils/hub.py", line 369, in cached_file
    raise EnvironmentError(
OSError: models/tokenizers/llama-bpe does not appear to have a file named config.json. Checkout 'https://huggingface.co/models/tokenizers/llama-bpe/tree/None' for available files.

Blocked, but cmd-r is done anyway.

I managed to convert Command-R 35B into an f16 GGUF. I'm currently quantizing it and will test out Q5_K_M soon.

@dranger003 (Contributor, Author)

@drummerv I am not getting any errors.

convert-hf-to-gguf-update.py
$ python ./convert-hf-to-gguf-update.py hf_token
Downloading llama-spm to models/tokenizers/llama-spm
File models/tokenizers/llama-spm/config.json downloaded successfully
File models/tokenizers/llama-spm/tokenizer.json downloaded successfully
File models/tokenizers/llama-spm/tokenizer.model downloaded successfully
File models/tokenizers/llama-spm/tokenizer_config.json downloaded successfully
Downloading llama-bpe to models/tokenizers/llama-bpe
File models/tokenizers/llama-bpe/config.json downloaded successfully
File models/tokenizers/llama-bpe/tokenizer.json downloaded successfully
File models/tokenizers/llama-bpe/tokenizer_config.json downloaded successfully
Downloading phi-3 to models/tokenizers/phi-3
File models/tokenizers/phi-3/config.json downloaded successfully
File models/tokenizers/phi-3/tokenizer.json downloaded successfully
File models/tokenizers/phi-3/tokenizer.model downloaded successfully
File models/tokenizers/phi-3/tokenizer_config.json downloaded successfully
Downloading deepseek-llm to models/tokenizers/deepseek-llm
File models/tokenizers/deepseek-llm/config.json downloaded successfully
File models/tokenizers/deepseek-llm/tokenizer.json downloaded successfully
File models/tokenizers/deepseek-llm/tokenizer_config.json downloaded successfully
Downloading deepseek-coder to models/tokenizers/deepseek-coder
File models/tokenizers/deepseek-coder/config.json downloaded successfully
File models/tokenizers/deepseek-coder/tokenizer.json downloaded successfully
File models/tokenizers/deepseek-coder/tokenizer_config.json downloaded successfully
Downloading falcon to models/tokenizers/falcon
File models/tokenizers/falcon/config.json downloaded successfully
File models/tokenizers/falcon/tokenizer.json downloaded successfully
File models/tokenizers/falcon/tokenizer_config.json downloaded successfully
Downloading bert-bge to models/tokenizers/bert-bge
File models/tokenizers/bert-bge/config.json downloaded successfully
File models/tokenizers/bert-bge/tokenizer.json downloaded successfully
File models/tokenizers/bert-bge/tokenizer_config.json downloaded successfully
Downloading mpt to models/tokenizers/mpt
File models/tokenizers/mpt/config.json downloaded successfully
File models/tokenizers/mpt/tokenizer.json downloaded successfully
File models/tokenizers/mpt/tokenizer_config.json downloaded successfully
Downloading starcoder to models/tokenizers/starcoder
File models/tokenizers/starcoder/config.json downloaded successfully
File models/tokenizers/starcoder/tokenizer.json downloaded successfully
File models/tokenizers/starcoder/tokenizer_config.json downloaded successfully
Downloading gpt-2 to models/tokenizers/gpt-2
File models/tokenizers/gpt-2/config.json downloaded successfully
File models/tokenizers/gpt-2/tokenizer.json downloaded successfully
File models/tokenizers/gpt-2/tokenizer_config.json downloaded successfully
Downloading command-r to models/tokenizers/command-r
File models/tokenizers/command-r/config.json downloaded successfully
File models/tokenizers/command-r/tokenizer.json downloaded successfully
File models/tokenizers/command-r/tokenizer.json downloaded successfully
File models/tokenizers/command-r/tokenizer_config.json downloaded successfully
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
model: llama-bpe
tokt: 2
repo: https://huggingface.co/meta-llama/Meta-Llama-3-8B
chktok: [128000, 198, 4815, 15073, 66597, 8004, 1602, 2355, 79772, 11187, 9468, 248, 222, 320, 8416, 8, 27623, 114, 102470, 9468, 234, 104, 31643, 320, 36773, 100166, 98634, 8, 26602, 227, 11410, 99, 247, 9468, 99, 247, 220, 18, 220, 1644, 220, 8765, 220, 8765, 18, 220, 8765, 1644, 220, 8765, 8765, 220, 8765, 8765, 18, 220, 8765, 8765, 1644, 220, 18, 13, 18, 220, 18, 497, 18, 220, 18, 1131, 18, 220, 21549, 222, 98629, 241, 45358, 233, 21549, 237, 45358, 224, 21549, 244, 21549, 115, 21549, 253, 45358, 223, 21549, 253, 21549, 95, 98629, 227, 76460, 223, 949, 37046, 101067, 19000, 23182, 102301, 9263, 18136, 16, 36827, 21909, 56560, 54337, 19175, 102118, 13373, 64571, 34694, 3114, 112203, 80112, 3436, 106451, 14196, 14196, 74694, 3089, 3089, 29249, 17523, 3001, 27708, 7801, 358, 3077, 1027, 364, 83, 820, 568, 596, 1070, 11, 364, 793, 499, 2771, 30, 364, 44, 539, 2771, 358, 3358, 1304, 433, 11, 364, 35, 499, 1093, 1063, 15600, 30, 1226, 6, 43712, 264, 64966, 43]
chkhsh: 0ef9807a4087ebef797fc749390439009c3b9eda9ad1a097abbe738f486c01e5
pre_tokenizer: {
    "type": "Sequence",
    "pretokenizers": [
        {
            "type": "Split",
            "pattern": {
                "Regex": "(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+"
            },
            "behavior": "Isolated",
            "invert": false
        },
        {
            "type": "ByteLevel",
            "add_prefix_space": false,
            "trim_offsets": true,
            "use_regex": false
        }
    ]
}


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
model: deepseek-llm
tokt: 2
repo: https://huggingface.co/deepseek-ai/deepseek-llm-7b-base
chktok: [100000, 185, 207, 185, 185, 207, 185, 185, 185, 207, 11969, 486, 22504, 185, 243, 185, 300, 185, 251, 185, 663, 185, 10044, 95300, 334, 8754, 8, 33701, 114, 350, 222, 10044, 221, 104, 46713, 334, 34732, 996, 24250, 262, 80923, 8, 207, 37103, 214, 12356, 99, 234, 10044, 99, 234, 207, 18, 207, 18, 18, 207, 18, 18, 18, 207, 18, 18, 18, 18, 207, 18, 18, 18, 18, 18, 207, 18, 18, 18, 18, 18, 18, 207, 18, 18, 18, 18, 18, 18, 18, 207, 18, 18, 18, 18, 18, 18, 18, 18, 207, 18, 13, 18, 207, 18, 526, 18, 207, 18, 1204, 18, 207, 71374, 209, 71374, 114, 71374, 228, 155, 240, 220, 71374, 224, 155, 240, 211, 71374, 231, 71374, 115, 71374, 240, 155, 240, 210, 71374, 240, 71374, 95, 71374, 114, 71374, 214, 71899, 210, 3025, 19017, 612, 9407, 2681, 16, 18, 16, 19, 16, 20, 16, 1398, 68940, 239, 78827, 55170, 76659, 620, 91754, 31116, 36804, 4885, 4885, 10897, 4390, 4390, 41047, 15278, 3033, 14986, 5675, 304, 6, 313, 803, 655, 33326, 362, 6, 82, 745, 11, 655, 1374, 340, 2049, 30, 655, 44, 441, 2049, 304, 6, 647, 1099, 359, 11, 655, 35, 340, 837, 742, 10842, 30, 1003, 6, 10699, 245, 6, 75, 43]
chkhsh: 049ecf7629871e3041641907f3de7c733e4dbfdc736f57d882ba0b0845599754
pre_tokenizer: {
    "type": "Sequence",
    "pretokenizers": [
        {
            "type": "Split",
            "pattern": {
                "Regex": "[\r\n]"
            },
            "behavior": "Isolated",
            "invert": false
        },
        {
            "type": "Split",
            "pattern": {
                "Regex": "\\s?[A-Za-z\u00b5\u00c0-\u00d6\u00d8-\u00f6\u00f8-\u01ba\u01bc-\u01bf\u01c4-\u0293\u0295-\u02af\u0370-\u0373\u0376\u0377\u037b-\u037d\u037f\u0386\u0388-\u038a\u038c\u038e-\u03a1\u03a3-\u03f5\u03f7-\u0481\u048a-\u052f\u0531-\u0556\u10a0-\u10c5\u13a0-\u13f5\u13f8-\u13fd\u1c90-\u1cba\u1cbd-\u1cbf\u1d00-\u1d2b\u1d6b-\u1d77\u1d79-\u1d9a\u1e00-\u1f15\u1f18-\u1f1d\u1f20-\u1f45\u1f48-\u1f4d\u1f50-\u1f57\u1f59\u1f5b\u1f5d\u1f5f-\u1f7d\u1f80-\u1fb4\u1fb6-\u1fbc\u1fbe\u1fc2-\u1fc4\u1fc6-\u1fcc\u1fd0-\u1fd3\u1fd6-\u1fdb\u1fe0-\u1fec\u1ff2-\u1ff4\u1ff6-\u1ffc\u2102\u2107\u210a-\u2113\u2115\u2119-\u211d\u2124\u2126\u2128\u212a-\u212d\u212f-\u2134\u2139\u213c-\u213f\u2145-\u2149\u214e\u2183\u2184\u2c00-\u2c7b\u2c7e-\u2ce4\u2ceb-\u2cee\u2cf2\u2cf3\ua640-\ua66d\ua680-\ua69b\ua722-\ua76f\ua771-\ua787\ua78b-\ua78e\uab70-\uabbf\ufb00-\ufb06\ufb13-\ufb17\uff21-\uff3a\uff41-\uff5a\ud801\udc00-\ud801\udc4f\ud801\udcb0-\ud801\udcd3\ud801\udcd8-\ud801\udcfb\ud803\udc80-\ud803\udcb2\ud803\udcc0-\ud803\udcf2\ud806\udca0-\ud806\udcdf\ud83a\udd00-\ud83a\udd43]+"
            },
            "behavior": "Isolated",
            "invert": false
        },
        {
            "type": "Split",
            "pattern": {
                "Regex": "\\s?[!-/:-~\uff01-\uff0f\uff1a-\uff5e\u2018-\u201f\u3000-\u3002]+"
            },
            "behavior": "Isolated",
            "invert": false
        },
        {
            "type": "Split",
            "pattern": {
                "Regex": "\\s+$"
            },
            "behavior": "Isolated",
            "invert": false
        },
        {
            "type": "Split",
            "pattern": {
                "Regex": "[\u4e00-\u9fa5\u0800-\u4e00\uac00-\ud7ff]+"
            },
            "behavior": "Isolated",
            "invert": false
        },
        {
            "type": "Digits",
            "individual_digits": true
        },
        {
            "type": "ByteLevel",
            "add_prefix_space": false,
            "trim_offsets": true,
            "use_regex": false
        }
    ]
}


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
model: deepseek-coder
tokt: 2
repo: https://huggingface.co/deepseek-ai/deepseek-coder-6.7b-base
chktok: [32013, 185, 207, 185, 185, 207, 185, 185, 185, 207, 12405, 459, 22758, 185, 243, 185, 315, 185, 251, 185, 730, 185, 10047, 235, 209, 334, 8760, 8, 12394, 233, 114, 350, 222, 10047, 221, 104, 169, 116, 224, 334, 4684, 3909, 992, 24330, 262, 29651, 612, 8, 207, 156, 237, 214, 12394, 99, 234, 10047, 99, 234, 207, 18, 207, 18, 18, 207, 18, 18, 18, 207, 18, 18, 18, 18, 207, 18, 18, 18, 18, 18, 207, 18, 18, 18, 18, 18, 18, 207, 18, 18, 18, 18, 18, 18, 18, 207, 18, 18, 18, 18, 18, 18, 18, 18, 207, 18, 13, 18, 207, 18, 524, 18, 207, 18, 1202, 18, 207, 155, 239, 209, 155, 239, 114, 155, 239, 228, 155, 240, 220, 155, 239, 224, 155, 240, 211, 155, 239, 231, 155, 239, 115, 155, 239, 240, 155, 240, 210, 155, 239, 240, 155, 239, 95, 155, 239, 114, 155, 239, 214, 10047, 233, 210, 3015, 19100, 608, 9413, 2668, 16, 18, 16, 19, 16, 20, 16, 1393, 169, 121, 239, 18155, 374, 17194, 28, 2861, 6478, 616, 2251, 14994, 31269, 4191, 6, 4686, 4686, 10252, 3358, 3358, 3409, 524, 15330, 3023, 15031, 5668, 303, 6, 312, 798, 651, 83, 839, 362, 6, 82, 741, 11, 651, 1369, 340, 2037, 30, 651, 44, 441, 2037, 303, 6, 642, 1098, 359, 11, 651, 35, 340, 833, 738, 10860, 30, 998, 6, 10709, 245, 6, 75, 43]
chkhsh: 347715f544604f9118bb75ed199f68779f423cabb20db6de6f31b908d04d7821
pre_tokenizer: {
    "type": "Sequence",
    "pretokenizers": [
        {
            "type": "Split",
            "pattern": {
                "Regex": "[\r\n]"
            },
            "behavior": "Isolated",
            "invert": false
        },
        {
            "type": "Split",
            "pattern": {
                "Regex": "\\s?\\p{L}+"
            },
            "behavior": "Isolated",
            "invert": false
        },
        {
            "type": "Split",
            "pattern": {
                "Regex": "\\s?\\p{P}+"
            },
            "behavior": "Isolated",
            "invert": false
        },
        {
            "type": "Split",
            "pattern": {
                "Regex": "[\u4e00-\u9fa5\u0800-\u4e00\uac00-\ud7ff]+"
            },
            "behavior": "Isolated",
            "invert": false
        },
        {
            "type": "Digits",
            "individual_digits": true
        },
        {
            "type": "ByteLevel",
            "add_prefix_space": false,
            "trim_offsets": true,
            "use_regex": false
        }
    ]
}


model: falcon
tokt: 2
repo: https://huggingface.co/tiiuae/falcon-7b
chktok: [1212, 4824, 1001, 1212, 192, 204, 663, 49453, 2069, 742, 561, 1501, 193, 2571, 232, 206, 204, 19, 11003, 20, 8196, 126, 283, 219, 48778, 116, 13392, 204, 19, 51831, 732, 63209, 1741, 7955, 522, 20, 22438, 211, 3346, 111, 231, 2571, 111, 231, 204, 30, 204, 3138, 204, 22287, 204, 22287, 30, 204, 22287, 3138, 204, 22287, 22287, 204, 22287, 22287, 30, 204, 22287, 22287, 3138, 204, 30, 25, 30, 204, 30, 513, 30, 204, 30, 951, 30, 27171, 236, 206, 38154, 126, 38154, 225, 167, 237, 217, 38154, 221, 167, 237, 208, 38154, 228, 38154, 127, 38154, 237, 167, 237, 207, 38154, 237, 38154, 107, 38154, 126, 38154, 211, 20589, 207, 204, 42, 50087, 123, 2727, 20300, 32022, 133, 234, 17419, 30137, 28, 7858, 181, 133, 236, 204, 37057, 2228, 10666, 5052, 133, 6207, 151, 215, 150, 134, 5052, 133, 6279, 5052, 223, 151, 216, 49679, 123, 53110, 47043, 7795, 204, 7544, 7544, 7544, 8543, 8543, 17593, 3513, 3513, 12844, 51520, 17664, 4247, 295, 18, 298, 650, 204, 18, 95, 693, 332, 18, 94, 629, 23, 204, 18, 1553, 299, 1310, 42, 204, 18, 56, 416, 1310, 295, 18, 567, 717, 334, 23, 204, 18, 47, 299, 606, 596, 6696, 42, 703, 18, 16139, 241, 18, 87, 55]
chkhsh: 8aeee3860c56296a157a1fe2fad249ec40aa59b1bb5709f4ade11c4e6fe652ed
pre_tokenizer: {
    "type": "Sequence",
    "pretokenizers": [
        {
            "type": "Punctuation",
            "behavior": "Contiguous"
        },
        {
            "type": "ByteLevel",
            "add_prefix_space": false,
            "trim_offsets": true,
            "use_regex": true
        },
        {
            "type": "Digits",
            "individual_digits": false
        },
        {
            "type": "Split",
            "pattern": {
                "Regex": "[0-9][0-9][0-9]"
            },
            "behavior": "Isolated",
            "invert": false
        }
    ]
}


model: bert-bge
tokt: 3
repo: https://huggingface.co/BAAI/bge-small-en-v1.5
chktok: [101, 100, 1006, 3671, 1007, 100, 1006, 3674, 7861, 29147, 2483, 9530, 16280, 23854, 1007, 100, 100, 1017, 3943, 21211, 21211, 2509, 21211, 22394, 21211, 22394, 2509, 21211, 22394, 22394, 21211, 22394, 22394, 2509, 1017, 1012, 1017, 1017, 1012, 1012, 1017, 1017, 1012, 1012, 1012, 1017, 100, 1029, 1855, 100, 100, 6207, 100, 100, 14677, 23632, 22203, 1811, 1995, 1011, 1011, 1011, 1011, 1011, 1011, 1027, 1027, 1027, 1027, 1027, 1027, 1027, 1192, 15290, 29754, 14150, 1192, 10260, 1181, 29755, 29436, 29741, 10260, 16856, 29747, 23925, 10325, 1005, 1005, 1005, 1005, 1005, 1005, 1036, 1036, 1036, 1036, 1036, 1036, 1036, 1000, 1000, 1000, 1000, 1012, 1012, 1012, 1012, 1012, 1012, 999, 999, 999, 999, 999, 999, 1029, 1029, 1029, 1029, 1029, 1029, 1045, 1005, 2310, 2042, 1005, 2409, 2002, 1005, 1055, 2045, 1010, 1005, 2128, 2017, 2469, 1029, 1005, 1049, 2025, 2469, 1045, 1005, 2222, 2191, 2009, 1010, 1005, 1040, 2017, 2066, 2070, 5572, 1029, 2057, 1005, 2310, 1037, 1005, 2222, 102]
chkhsh: 0876d13b50744004aa9aeae05e7b0647eac9d801b5ba4668afc01e709c15e19f
pre_tokenizer: {
    "type": "BertPreTokenizer"
}


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
model: mpt
tokt: 2
repo: https://huggingface.co/mosaicml/mpt-7b
chktok: [586, 1744, 33525, 186, 209, 623, 28910, 187, 50276, 187, 50275, 187, 50274, 187, 50273, 187, 14931, 237, 211, 313, 6320, 10, 49042, 116, 325, 224, 14931, 223, 106, 171, 118, 226, 313, 34263, 802, 13511, 261, 32147, 456, 10, 3384, 239, 216, 22692, 101, 236, 14931, 101, 236, 495, 5922, 30057, 495, 20084, 495, 26409, 30057, 20084, 495, 26409, 1610, 495, 26409, 20084, 495, 15, 20, 495, 537, 20, 495, 1051, 20, 209, 18081, 211, 18081, 116, 18081, 230, 39936, 222, 18081, 226, 39936, 213, 18081, 233, 18081, 117, 18081, 242, 39936, 212, 18081, 242, 18081, 97, 18081, 116, 18081, 216, 14931, 235, 212, 3736, 15367, 41197, 13610, 19934, 41869, 21275, 1012, 1047, 18795, 40120, 20422, 241, 16081, 6877, 12880, 11514, 1068, 8713, 38177, 13396, 3415, 9925, 12559, 10453, 1389, 42011, 35033, 34842, 11202, 9739, 9739, 33021, 18963, 4672, 25561, 8220, 309, 1849, 644, 686, 42618, 344, 434, 627, 13, 686, 1848, 368, 2119, 32, 686, 46, 417, 2119, 309, 1833, 1056, 352, 13, 686, 37, 368, 751, 690, 10331, 32, 844, 8, 31516, 247, 8, 77, 45]
chkhsh: b6dc8df998e1cfbdc4eac8243701a65afe638679230920b50d6f17d81c098166
pre_tokenizer: {
    "type": "ByteLevel",
    "add_prefix_space": false,
    "trim_offsets": true,
    "use_regex": true
}


model: starcoder
tokt: 2
repo: https://huggingface.co/bigcode/starcoder2-3b
chktok: [353, 736, 8886, 221, 10883, 4238, 16101, 28540, 222, 3822, 272, 246, 327, 4434, 46, 18445, 152, 46030, 45022, 142, 13878, 327, 12585, 19884, 33773, 40920, 751, 46, 41839, 5954, 137, 271, 3822, 137, 271, 244, 56, 244, 56, 56, 244, 56, 56, 56, 244, 56, 56, 56, 56, 244, 56, 56, 56, 56, 56, 244, 56, 56, 56, 56, 56, 56, 244, 56, 56, 56, 56, 56, 56, 56, 244, 56, 56, 56, 56, 56, 56, 56, 56, 244, 56, 51, 56, 244, 56, 516, 56, 244, 56, 1198, 56, 244, 14566, 246, 14566, 152, 14566, 265, 30428, 257, 14566, 261, 30428, 248, 14566, 268, 14566, 153, 14566, 277, 30428, 247, 14566, 277, 14566, 133, 14566, 152, 14566, 251, 36570, 247, 1037, 4995, 13379, 2924, 9515, 17823, 54, 56, 54, 57, 54, 58, 54, 11904, 47892, 20895, 16625, 13047, 8389, 1059, 9504, 40216, 13858, 2073, 8983, 12571, 1539, 10721, 5918, 9643, 13298, 932, 31723, 31330, 9221, 3226, 35426, 10400, 457, 4783, 2602, 349, 121, 1477, 957, 1200, 2038, 49, 349, 632, 863, 3673, 68, 349, 82, 666, 3673, 457, 4650, 1949, 580, 49, 349, 73, 863, 2144, 1649, 35941, 68, 2726, 44, 7728, 331, 44, 113, 81]
chkhsh: 35d91631860c815f952d711435f48d356ebac988362536bed955d43bfa436e34
pre_tokenizer: {
    "type": "Sequence",
    "pretokenizers": [
        {
            "type": "Digits",
            "individual_digits": true
        },
        {
            "type": "ByteLevel",
            "add_prefix_space": false,
            "trim_offsets": true,
            "use_regex": true
        }
    ]
}


model: gpt-2
tokt: 2
repo: https://huggingface.co/openai-community/gpt2
chktok: [198, 220, 628, 220, 628, 198, 220, 197, 220, 197, 197, 220, 197, 198, 220, 220, 198, 220, 220, 220, 198, 220, 220, 220, 220, 198, 220, 220, 220, 220, 220, 198, 8582, 248, 222, 357, 11265, 8, 30325, 114, 447, 235, 8582, 234, 104, 37929, 357, 48101, 795, 13210, 271, 1673, 36686, 515, 8, 14519, 227, 12520, 99, 247, 8582, 99, 247, 513, 4747, 23460, 513, 20370, 23460, 2091, 23460, 20370, 23460, 24840, 23460, 2091, 20370, 513, 13, 18, 513, 492, 18, 513, 986, 18, 28053, 252, 222, 157, 252, 114, 157, 252, 241, 157, 253, 233, 157, 252, 237, 157, 253, 224, 157, 252, 244, 157, 252, 115, 157, 252, 253, 157, 253, 223, 157, 252, 253, 157, 252, 95, 157, 252, 114, 157, 252, 227, 47249, 223, 5633, 22755, 239, 46349, 111, 28839, 101, 18040, 32432, 98, 43291, 1485, 1415, 24309, 25465, 171, 121, 252, 40103, 1421, 18604, 12466, 121, 16843, 141, 231, 15166, 12466, 121, 16142, 12466, 239, 141, 232, 30143, 140, 111, 16142, 21169, 21727, 31583, 18849, 705, 39115, 6, 33153, 15506, 63, 15931, 15931, 16317, 13896, 3228, 9805, 3548, 314, 1053, 587, 705, 44040, 339, 338, 612, 11, 705, 2200, 345, 1654, 30, 705, 44, 407, 1654, 314, 1183, 787, 340, 11, 705, 35, 345, 588, 617, 8887, 30, 775, 6, 26979, 257, 6, 75, 43]
chkhsh: 3ce83efda5659b07b1ad37ca97ca5797ea4285d9b9ab0dc679e4a720c9da7454
pre_tokenizer: {
    "type": "ByteLevel",
    "add_prefix_space": false,
    "trim_offsets": true
}


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
model: command-r
tokt: 2
repo: https://huggingface.co/CohereForAI/c4ai-command-r-v01
chktok: [5, 127731, 51628, 205, 57788, 18494, 97469, 126134, 206, 2226, 256, 230, 1737, 18258, 16, 80503, 122, 35927, 2226, 242, 112, 57462, 1737, 54457, 223165, 106230, 2096, 16, 48389, 11254, 107, 255, 2226, 107, 255, 228, 26, 228, 26, 26, 228, 26, 26, 26, 228, 26, 26, 26, 26, 228, 26, 26, 26, 26, 26, 228, 26, 26, 26, 26, 26, 26, 228, 26, 26, 26, 26, 26, 26, 26, 228, 26, 26, 26, 26, 26, 26, 26, 26, 228, 26, 21, 26, 228, 26, 2271, 26, 228, 26, 3834, 26, 182018, 230, 174833, 38111, 249, 86325, 241, 38111, 245, 86325, 232, 38111, 252, 38111, 123, 38111, 261, 165, 24629, 38111, 261, 38111, 103, 174833, 38111, 235, 188568, 231, 5691, 12081, 13336, 2648, 29325, 14315, 24, 26, 24, 27, 24, 28, 24, 5123, 18372, 8391, 158343, 3512, 40071, 2196, 3236, 8750, 1764, 37097, 41168, 29721, 32797, 25646, 3802, 4975, 4975, 116167, 57178, 10251, 154048, 27292, 1767, 5125, 2632, 2155, 91, 2378, 1919, 1914, 2782, 19, 2155, 3354, 1933, 5470, 38, 2155, 52, 2068, 5470, 1767, 4961, 3059, 1894, 19, 2155, 43, 1933, 3026, 2725, 23186, 38, 2930, 14, 20676, 1671, 14, 83, 51]
chkhsh: 9c2227e4dd922002fb81bde4fc02b0483ca4f12911410dee2255e4987644e3f8
pre_tokenizer: {
    "type": "Sequence",
    "pretokenizers": [
        {
            "type": "Digits",
            "individual_digits": true
        },
        {
            "type": "ByteLevel",
            "add_prefix_space": false,
            "trim_offsets": true,
            "use_regex": true
        }
    ]
}


    def get_vocab_base_pre(self, tokenizer) -> str:
        # encoding this string and hashing the resulting tokens would (hopefully) give us a unique identifier that
        # is specific for the BPE pre-tokenizer used by the model
        # we will use this unique identifier to write a "tokenizer.ggml.pre" entry in the GGUF file which we can
        # use in llama.cpp to implement the same pre-tokenizer

        chktxt = '\n \n\n \n\n\n \t \t\t \t\n  \n   \n    \n     \n🚀 (normal) 😶\u200d🌫️ (multiple emojis concatenated) ✅ 🦙🦙 3 33 333 3333 33333 333333 3333333 33333333 3.3 3..3 3...3 កាន់តែពិសេសអាច😁 ?我想在apple工作1314151天~ ------======= нещо на Български \'\'\'\'\'\'```````""""......!!!!!!?????? I\'ve been \'told he\'s there, \'RE you sure? \'M not sure I\'ll make it, \'D you like some tea? We\'Ve a\'lL'

        chktok = tokenizer.encode(chktxt)
        chkhsh = sha256(str(chktok).encode()).hexdigest()

        print(f"chktok: {chktok}")
        print(f"chkhsh: {chkhsh}")

        res = None

        # NOTE: if you get an error here, you need to update the convert-hf-to-gguf-update.py script
        #       or pull the latest version of the model from Huggingface
        #       don't edit the hashes manually!
        if chkhsh == "0ef9807a4087ebef797fc749390439009c3b9eda9ad1a097abbe738f486c01e5":
            # ref: https://huggingface.co/meta-llama/Meta-Llama-3-8B
            res = "llama-bpe"
        if chkhsh == "049ecf7629871e3041641907f3de7c733e4dbfdc736f57d882ba0b0845599754":
            # ref: https://huggingface.co/deepseek-ai/deepseek-llm-7b-base
            res = "deepseek-llm"
        if chkhsh == "347715f544604f9118bb75ed199f68779f423cabb20db6de6f31b908d04d7821":
            # ref: https://huggingface.co/deepseek-ai/deepseek-coder-6.7b-base
            res = "deepseek-coder"
        if chkhsh == "8aeee3860c56296a157a1fe2fad249ec40aa59b1bb5709f4ade11c4e6fe652ed":
            # ref: https://huggingface.co/tiiuae/falcon-7b
            res = "falcon"
        if chkhsh == "0876d13b50744004aa9aeae05e7b0647eac9d801b5ba4668afc01e709c15e19f":
            # ref: https://huggingface.co/BAAI/bge-small-en-v1.5
            res = "bert-bge"
        if chkhsh == "b6dc8df998e1cfbdc4eac8243701a65afe638679230920b50d6f17d81c098166":
            # ref: https://huggingface.co/mosaicml/mpt-7b
            res = "mpt"
        if chkhsh == "35d91631860c815f952d711435f48d356ebac988362536bed955d43bfa436e34":
            # ref: https://huggingface.co/bigcode/starcoder2-3b
            res = "starcoder"
        if chkhsh == "3ce83efda5659b07b1ad37ca97ca5797ea4285d9b9ab0dc679e4a720c9da7454":
            # ref: https://huggingface.co/openai-community/gpt2
            res = "gpt-2"
        if chkhsh == "9c2227e4dd922002fb81bde4fc02b0483ca4f12911410dee2255e4987644e3f8":
            # ref: https://huggingface.co/CohereForAI/c4ai-command-r-v01
            res = "command-r"

        if res is None:
            print("\n")
            print("**************************************************************************************")
            print("** WARNING: The BPE pre-tokenizer was not recognized!")
            print("**          There are 2 possible reasons for this:")
            print("**          - the model has not been added to convert-hf-to-gguf-update.py yet")
            print("**          - the pre-tokenization config has changed upstream")
            print("**          Check your model files and convert-hf-to-gguf-update.py and update them accordingly.")
            print("** ref:     https://github.com/ggerganov/llama.cpp/pull/6920")
            print("**")
            print(f"** chkhsh:  {chkhsh}")
            print("**************************************************************************************")
            print("\n")
            raise NotImplementedError("BPE pre-tokenizer was not recognized - update get_vocab_base_pre()")

        print(f"tokenizer.ggml.pre: {res}")
        print(f"chkhsh: {chkhsh}")

        return res



!!! Copy-paste the function above into convert-hf-to-gguf.py !!!


Tests for llama-spm written in ./models/ggml-vocab-llama-spm.gguf.*
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Tests for llama-bpe written in ./models/ggml-vocab-llama-bpe.gguf.*
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Tests for phi-3 written in ./models/ggml-vocab-phi-3.gguf.*
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Tests for deepseek-llm written in ./models/ggml-vocab-deepseek-llm.gguf.*
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Tests for deepseek-coder written in ./models/ggml-vocab-deepseek-coder.gguf.*
Tests for falcon written in ./models/ggml-vocab-falcon.gguf.*
Tests for bert-bge written in ./models/ggml-vocab-bert-bge.gguf.*
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Tests for mpt written in ./models/ggml-vocab-mpt.gguf.*
Tests for starcoder written in ./models/ggml-vocab-starcoder.gguf.*
Tests for gpt-2 written in ./models/ggml-vocab-gpt-2.gguf.*
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Tests for command-r written in ./models/ggml-vocab-command-r.gguf.*

Run the following commands to generate the vocab files for testing:

python3 convert-hf-to-gguf.py models/tokenizers/llama-spm/ --outfile models/ggml-vocab-llama-spm.gguf --vocab-only
python3 convert-hf-to-gguf.py models/tokenizers/llama-bpe/ --outfile models/ggml-vocab-llama-bpe.gguf --vocab-only
python3 convert-hf-to-gguf.py models/tokenizers/phi-3/ --outfile models/ggml-vocab-phi-3.gguf --vocab-only
python3 convert-hf-to-gguf.py models/tokenizers/deepseek-llm/ --outfile models/ggml-vocab-deepseek-llm.gguf --vocab-only
python3 convert-hf-to-gguf.py models/tokenizers/deepseek-coder/ --outfile models/ggml-vocab-deepseek-coder.gguf --vocab-only
python3 convert-hf-to-gguf.py models/tokenizers/falcon/ --outfile models/ggml-vocab-falcon.gguf --vocab-only
python3 convert-hf-to-gguf.py models/tokenizers/bert-bge/ --outfile models/ggml-vocab-bert-bge.gguf --vocab-only
python3 convert-hf-to-gguf.py models/tokenizers/mpt/ --outfile models/ggml-vocab-mpt.gguf --vocab-only
python3 convert-hf-to-gguf.py models/tokenizers/starcoder/ --outfile models/ggml-vocab-starcoder.gguf --vocab-only
python3 convert-hf-to-gguf.py models/tokenizers/gpt-2/ --outfile models/ggml-vocab-gpt-2.gguf --vocab-only
python3 convert-hf-to-gguf.py models/tokenizers/command-r/ --outfile models/ggml-vocab-command-r.gguf --vocab-only

@dranger003 (Contributor, Author)

Superseded by PR #7063.

dranger003 closed this May 3, 2024
Successfully merging this pull request may close these issues.

Command-R GGUF conversion no longer working