

phi3 : duplicate rope factors in each layer #7447

Merged — 2 commits merged into master from sl/phi3-fix on May 22, 2024
Conversation

@slaren (Collaborator) commented May 21, 2024

| GPU | Model | Test | t/s master | t/s sl/phi3-fix | Speedup |
| --- | --- | --- | --- | --- | --- |
| RTX 3090 Ti | phi3 14B Q8_0 | pp512 | 1655.80 | 2359.53 | 1.43 |
| RTX 3090 Ti | phi3 14B Q8_0 | tg128 | 16.97 | 53.37 | 3.14 |

* phi3 : set phi-3 model type as 14B
* model loader : simplify the process for duplicating model tensors
* llama-bench : remove default pg test
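
For context on the change list above: the gist of "duplicate rope factors in each layer" is that every layer gets its own copy of the small rope-scaling-factor tensors, so each copy can be allocated on the same backend as that layer's weights instead of all layers referencing one shared tensor that may live on another device. The toy C++ sketch below illustrates only that idea; the type and field names (toy_tensor, toy_layer, rope_factors_long) are made up for the example and are not the llama.cpp code.

```cpp
#include <cstdio>
#include <string>
#include <vector>

// Toy stand-ins for the real ggml/llama.cpp types; illustration only.
struct toy_tensor {
    std::string name;
    int         backend;   // device that holds this copy of the data
};

struct toy_layer {
    int        backend;             // device that holds this layer's weights
    toy_tensor rope_factors_long;   // this layer's own copy of the rope factors
};

int main() {
    const int n_layer = 4;
    std::vector<toy_layer> layers;

    for (int i = 0; i < n_layer; ++i) {
        // Give every layer its own copy of the small rope-factor tensor and
        // place it on the same backend as the layer, instead of having all
        // layers reference one shared copy that may live on another device.
        toy_layer l;
        l.backend = (i < 2) ? 1 /* GPU */ : 0 /* CPU */;   // pretend partial offload
        l.rope_factors_long = { "rope_factors_long.weight", l.backend };
        layers.push_back(l);
    }

    for (int i = 0; i < n_layer; ++i) {
        std::printf("layer %d: %s on backend %d\n", i,
                    layers[i].rope_factors_long.name.c_str(),
                    layers[i].rope_factors_long.backend);
    }
    return 0;
}
```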

github-actions bot commented May 21, 2024

📈 llama.cpp server for bench-server-baseline on Standard_NC4as_T4_v3 for phi-2-q4_0: 533 iterations 🚀

Details (performance-related PRs only):
  • Concurrent users: 8, duration: 10m
  • HTTP request : avg=8802.21ms p(95)=22484.64ms fails=, finish reason: stop=483 truncated=50
  • Prompt processing (pp): avg=98.96tk/s p(95)=442.33tk/s
  • Token generation (tg): avg=45.6tk/s p(95)=45.83tk/s
  • ggml-org/models/phi-2/ggml-model-q4_0.gguf parallel=8 ctx-size=16384 ngl=33 batch-size=2048 ubatch-size=256 pp=1024 pp+tg=2048 branch=sl/phi3-fix commit=ef8e9e72b45c559dd948ace8aa7519ef6fd59b2e

prompt_tokens_seconds — chart: "llama.cpp bench-server-baseline on Standard_NC4as_T4_v3, duration=10m, 533 iterations" (y-axis: llamacpp:prompt_tokens_seconds; time-series data omitted)
predicted_tokens_seconds — chart: same benchmark run (y-axis: llamacpp:predicted_tokens_seconds; time-series data omitted)

Details

kv_cache_usage_ratio — chart: same benchmark run (y-axis: llamacpp:kv_cache_usage_ratio; time-series data omitted)
requests_processing — chart: same benchmark run (y-axis: llamacpp:requests_processing; time-series data omitted)

llama.cpp (outdated diff)

-    model.output = ml.create_tensor(ctx_output, tn(LLM_TENSOR_TOKEN_EMBD, "weight"), {n_embd, n_vocab});
-    ml.n_created--; // artificial tensor
-    ml.size_data += ggml_nbytes(model.output);
+    model.output = ml.create_tensor(ctx_output, tn(LLM_TENSOR_TOKEN_EMBD, "weight"), {n_embd, n_vocab}, true);
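
For readers unfamiliar with the bookkeeping removed above: the old code manually decremented n_created (the duplicated output tensor is not a distinct tensor in the file) and added the tensor's size to size_data (its data is read a second time). The self-contained sketch below shows how a loader might fold that bookkeeping into a `duplicated` parameter, as the new call does; names and behavior are deliberately simplified and are not the actual llama_model_loader implementation.

```cpp
#include <cstddef>
#include <cstdio>

// Hypothetical, simplified loader: NOT the real llama_model_loader. It only
// mirrors the bookkeeping visible in the diff above (n_created, size_data).
struct toy_loader {
    int    n_created = 0;   // distinct tensors matched against the model file
    size_t size_data = 0;   // extra bytes to read beyond the file's own tensors

    void create_tensor(const char * name, size_t nbytes, bool required, bool duplicated) {
        if (duplicated) {
            // a duplicate reuses data already present in the file: it is not a
            // new tensor, but its data will be read (and uploaded) once more
            size_data += nbytes;
        } else {
            n_created++;
        }
        std::printf("%s: required=%d duplicated=%d\n", name, required, duplicated);
    }
};

int main() {
    toy_loader ml;
    // equivalent of the manual `ml.n_created--; ml.size_data += ...;` above,
    // expressed through the extra parameter instead
    ml.create_tensor("token_embd.weight (as output)", 1024, /*required=*/true, /*duplicated=*/true);
    std::printf("n_created=%d size_data=%zu\n", ml.n_created, ml.size_data);
    return 0;
}
```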
Collaborator


Was the intention to make the duplicated argument true? I assume so, because that would keep the old behavior.

As written, this sets required to true and leaves duplicated at false.

(this also applies to the other places where model.output is initialized from the token_embd tensor)

Suggested change
model.output = ml.create_tensor(ctx_output, tn(LLM_TENSOR_TOKEN_EMBD, "weight"), {n_embd, n_vocab}, true);
model.output = ml.create_tensor(ctx_output, tn(LLM_TENSOR_TOKEN_EMBD, "weight"), {n_embd, n_vocab}, true, true);

@slaren (Collaborator, Author) commented May 22, 2024

Thanks, I have replaced the boolean parameters with named flags that should make these errors easier to avoid in the future.
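
As an illustration of the "named flags instead of adjacent bool parameters" cleanup mentioned above, here is a minimal, self-contained C++ sketch. The flag names below express the intent (optional vs. duplicated tensors) but are illustrative; see the merged commit for the actual names used in llama_model_loader.

```cpp
#include <cstdint>
#include <cstdio>

// Illustrative flag names; the real llama_model_loader flags may differ.
enum tensor_flags : uint32_t {
    TENSOR_NOT_REQUIRED = 1u << 0,   // the tensor may be absent from the file
    TENSOR_DUPLICATED   = 1u << 1,   // the tensor is a second copy of existing data
};

static void create_tensor(const char * name, uint32_t flags = 0) {
    // The call site now states intent directly, instead of passing two adjacent
    // booleans whose order is easy to get wrong (the bug discussed above).
    std::printf("%s: required=%d duplicated=%d\n", name,
                !(flags & TENSOR_NOT_REQUIRED), !!(flags & TENSOR_DUPLICATED));
}

int main() {
    create_tensor("output.weight");                                   // default: required
    create_tensor("token_embd.weight", TENSOR_DUPLICATED);            // reused as the output tensor
    create_tensor("rope_factors_long.weight", TENSOR_NOT_REQUIRED);   // optional tensor
    return 0;
}
```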

@mofosyne added the labels "review complexity : medium" (generally requires more time to grok, but manageable by beginner to medium expertise level) and "model" (model specific) on May 22, 2024
@slaren slaren merged commit b18532a into master May 22, 2024
72 of 84 checks passed
@slaren slaren deleted the sl/phi3-fix branch May 22, 2024 14:10
Nexesenex pushed a commit to Nexesenex/kobold.cpp that referenced this pull request May 22, 2024
* phi3 : duplicate rope factors in each layer

phi3 : set phi-3 model type as 14B

model loader : simplify the process for duplicating model tensors

llama-bench : remove default pg test

* replace bool parameters in llama_model_loader with named flags
teleprint-me pushed a commit to teleprint-me/llama.cpp that referenced this pull request May 23, 2024
* phi3 : duplicate rope factors in each layer

phi3 : set phi-3 model type as 14B

model loader : simplify the process for duplicating model tensors

llama-bench : remove default pg test

* replace bool parameters in llama_model_loader with named flags
Labels: examples, model (Model specific), review complexity : medium

4 participants