
fix : lookup word in vocab before doing BPE merges #7193

Merged — 8 commits merged into ggerganov:master from llama3-tokenizer-ignore-merge, May 11, 2024

Conversation

@tonyfettes (Contributor) commented May 10, 2024

For llama-3, I found an inconsistency between llama.cpp's tokenizer and Huggingface's tokenizers. Example:

 Việt

llama.cpp:

 11655 -> ' Vi'
 26298 -> 'ệ'
    83 -> 't'

Huggingface's tokenizers with tokenizer.json from llama-3:

101798

After comparing the implementations, it seems that Huggingface's tokenizers first looks each pre-split word up in the vocabulary and, if found, pushes it to the result tokens directly; only when the lookup fails does it merge the word at byte level. In llama.cpp we always do the byte-level merge, hence the inconsistency.

This is a simple fix to the problem: just look the word up before doing the merges.
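
A minimal, self-contained sketch of the lookup-before-merge idea (the names `tokenize_word`, `run_bpe_merges` and `token_to_id` are illustrative only, and the merge loop is stubbed out; the actual patch is the llama.cpp diff quoted in the review below):

```cpp
#include <string>
#include <unordered_map>
#include <vector>

// Placeholder for the usual byte-level BPE merge loop (details omitted).
static std::vector<int> run_bpe_merges(const std::string & word) {
    std::vector<int> out;
    for (unsigned char c : word) {
        out.push_back(c); // stub: one token per byte
    }
    return out;
}

// With ignore_merges, a pre-split word that already exists in the
// vocabulary is emitted as a single token; the merge loop only runs for
// words that miss the lookup.
static std::vector<int> tokenize_word(
        const std::string & word,
        const std::unordered_map<std::string, int> & token_to_id,
        bool ignore_merges) {
    if (ignore_merges) {
        auto it = token_to_id.find(word);
        if (it != token_to_id.end()) {
            return { it->second }; // e.g. " Việt" -> 101798 in llama-3's vocab
        }
    }
    return run_bpe_merges(word); // out-of-vocabulary: merge at byte level
}
```

This likely also explains the speed-up reported further down: words that are already in the vocabulary skip the merge loop entirely.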

PS: I have checked with tiktoken, and it seems they do the same thing at src/lib.rs:228 in CoreBPE::_encode_native.

PPS: I searched the tokenizer.json of the other BPE models (some are license-walled, so I checked their variants) and it seems that llama-3 is the only one doing this?

| Model | tokenizer.json |
| --- | --- |
| DBRX (Walled) | https://huggingface.co/turboderp/dbrx-instruct-exl2/tree/2.3bpw |
| Deepseek LLM | https://huggingface.co/deepseek-ai/deepseek-llm-67b-chat/raw/main/tokenizer.json |
| Deepseek Coder | https://huggingface.co/deepseek-ai/deepseek-coder-33b-instruct/raw/main/tokenizer.json |
| Falcon | https://huggingface.co/tiiuae/falcon-7b/raw/main/tokenizer.json |
| Starcoder | https://huggingface.co/bigcode/starcoder/raw/main/tokenizer.json |
| Refact | https://huggingface.co/smallcloudai/Refact-1_6B-fim/blob/main/tokenizer.json |
| Command R+ | https://huggingface.co/CohereForAI/c4ai-command-r-plus/blob/main/tokenizer.json |
| GPT2 | https://huggingface.co/openai-community/gpt2/raw/main/tokenizer.json |
| OLMo | https://huggingface.co/allenai/OLMo-7B-Instruct/raw/main/tokenizer.json |
| Qwen2 (Qwen1.5) | https://huggingface.co/Qwen/Qwen1.5-110B-Chat/raw/main/tokenizer.json |

@tonyfettes marked this pull request as draft May 10, 2024 08:05
@tonyfettes changed the title from "Llama3 tokenizer ignore merge" to "fix : lookup word in vocab before doing BPE merges" May 10, 2024
@tonyfettes marked this pull request as ready for review May 10, 2024 08:48
@mofosyne added the review complexity : medium (generally requires more time to grok, but manageable by beginner-to-medium expertise level) and bugfix (fixes an issue or bug) labels May 10, 2024
@mofosyne requested a review from goerch May 10, 2024 10:27
@tonyfettes force-pushed the llama3-tokenizer-ignore-merge branch from 4ba2e5c to 63207d1 May 10, 2024 13:02
@ggerganov (Owner) commented:

This change not only fixed the llama3 tokenization, it also improved the performance by a factor of ~4:

./tests/test-tokenizer-0.sh llama-bpe ./build/wikitext-2-raw/wiki.train.raw
  • master
Testing llama-bpe on ./build/wikitext-2-raw/wiki.train.raw ...
main : tokenized in 3141.467 ms (py)
main : tokenized in 6085.319 ms (cpp)
1842692c1842692,1842694
< 101798
---
> 11655
> 26298
> 83
Tokenization differs!
  • PR
Testing llama-bpe on ./build/wikitext-2-raw/wiki.train.raw ...
main : tokenized in 3157.516 ms (py)
main : tokenized in 1408.991 ms (cpp)
Tokenization is correct!

We now tokenize wiki.train.raw 2x faster than Python AutoTokenizer

@ggerganov (Owner) commented:

> PPS: I searched the tokenizer.json of the other BPE models (some are license-walled, so I checked their variants) and it seems that llama-3 is the only one doing this?

Which parameter in the tokenizer config determines this behaviour?

@tonyfettes (Contributor, Author) replied:

@ggerganov "ignore_merges", under "model"

@ggerganov (Owner) left a review comment:

Let's merge after the CI is green.

@tonyfettes force-pushed the llama3-tokenizer-ignore-merge branch from 0f48f9e to 0c9a0ae May 11, 2024 01:52
Review thread on llama.cpp, lines 12302 to 12315 (outdated):
if (ignore_merges && vocab.token_to_id.find(word) != vocab.token_to_id.end()) {
    // The pre-split word is already a vocabulary token: emit it as a single
    // symbol and skip the byte-level BPE merges for this word.
    llm_symbol sym;
    sym.text = word.c_str();
    sym.n    = word.size();
    sym.prev = final_prev_index;
    sym.next = -1;
    if (final_prev_index != -1) {
        symbols_final[final_prev_index].next = symbols_final.size();
    }
    symbols_final.emplace_back(sym);
    final_prev_index = symbols_final.size() - 1;
    continue;
}

@ggerganov (Owner) replied:

Let's apply @jaime-m-p's suggestion here, to reduce the code duplication in this loop:

#6965 (comment)

@ggerganov merged commit f99e1e4 into ggerganov:master May 11, 2024
54 of 60 checks passed
Automated benchmark (Contributor) commented:

📈 llama.cpp server for bench-server-baseline on Standard_NC4as_T4_v3 for phi-2-q4_0: 551 iterations 🚀

Expand details for performance related PR only
  • Concurrent users: 8, duration: 10m
  • HTTP request : avg=8538.14ms p(95)=21008.61ms fails=, finish reason: stop=490 truncated=61
  • Prompt processing (pp): avg=104.75tk/s p(95)=461.86tk/s
  • Token generation (tg): avg=34.2tk/s p(95)=49.2tk/s
  • ggml-org/models/phi-2/ggml-model-q4_0.gguf parallel=8 ctx-size=16384 ngl=33 batch-size=2048 ubatch-size=256 pp=1024 pp+tg=2048 branch=llama3-tokenizer-ignore-merge commit=b8d3cd5337bfa74f816138af84e7181c5208f717

[Four benchmark time-series charts elided: prompt_tokens_seconds, predicted_tokens_seconds, kv_cache_usage_ratio, and requests_processing, each plotted over the 10-minute run on Standard_NC4as_T4_v3.]
