

phi3 : duplicate rope factors in each layer #7447

Merged — 2 commits merged into master from sl/phi3-fix on May 22, 2024
Conversation

@slaren (Collaborator) commented May 21, 2024

| GPU | Model | Test | t/s master | t/s sl/phi3-fix | Speedup |
| --- | --- | --- | --- | --- | --- |
| RTX 3090 Ti | phi3 14B Q8_0 | pp512 | 1655.80 | 2359.53 | 1.43 |
| RTX 3090 Ti | phi3 14B Q8_0 | tg128 | 16.97 | 53.37 | 3.14 |

* phi3 : set phi-3 model type as 14B
* model loader : simplify the process for duplicating model tensors
* llama-bench : remove default pg test
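
For context on the change list above: the gist of "duplicate rope factors in each layer" is that every layer gets its own copy of the small rope-scaling-factor tensors, so each copy can be allocated on the same backend as that layer's weights instead of all layers referencing one shared tensor that may live on another device. The toy C++ sketch below illustrates only that idea; the type and field names (toy_tensor, toy_layer, rope_factors_long) are made up for the example and are not the llama.cpp code.

```cpp
#include <cstdio>
#include <string>
#include <vector>

// Toy stand-ins for the real ggml/llama.cpp types; illustration only.
struct toy_tensor {
    std::string name;
    int         backend;   // device that holds this copy of the data
};

struct toy_layer {
    int        backend;             // device that holds this layer's weights
    toy_tensor rope_factors_long;   // this layer's own copy of the rope factors
};

int main() {
    const int n_layer = 4;
    std::vector<toy_layer> layers;

    for (int i = 0; i < n_layer; ++i) {
        // Give every layer its own copy of the small rope-factor tensor and
        // place it on the same backend as the layer, instead of having all
        // layers reference one shared copy that may live on another device.
        toy_layer l;
        l.backend = (i < 2) ? 1 /* GPU */ : 0 /* CPU */;   // pretend partial offload
        l.rope_factors_long = { "rope_factors_long.weight", l.backend };
        layers.push_back(l);
    }

    for (int i = 0; i < n_layer; ++i) {
        std::printf("layer %d: %s on backend %d\n", i,
                    layers[i].rope_factors_long.name.c_str(),
                    layers[i].rope_factors_long.backend);
    }
    return 0;
}
```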

github-actions bot commented May 21, 2024

📈 llama.cpp server for bench-server-baseline on Standard_NC4as_T4_v3 for phi-2-q4_0: 533 iterations 🚀

Details (performance-related PRs only):
  • Concurrent users: 8, duration: 10m
  • HTTP request : avg=8802.21ms p(95)=22484.64ms fails=, finish reason: stop=483 truncated=50
  • Prompt processing (pp): avg=98.96tk/s p(95)=442.33tk/s
  • Token generation (tg): avg=45.6tk/s p(95)=45.83tk/s
  • ggml-org/models/phi-2/ggml-model-q4_0.gguf parallel=8 ctx-size=16384 ngl=33 batch-size=2048 ubatch-size=256 pp=1024 pp+tg=2048 branch=sl/phi3-fix commit=ef8e9e72b45c559dd948ace8aa7519ef6fd59b2e

prompt_tokens_seconds — chart: "llama.cpp bench-server-baseline on Standard_NC4as_T4_v3, duration=10m, 533 iterations" (y-axis: llamacpp:prompt_tokens_seconds; time-series data omitted)
predicted_tokens_seconds — chart: same benchmark run (y-axis: llamacpp:predicted_tokens_seconds; time-series data omitted)

Details

kv_cache_usage_ratio — chart: same benchmark run (y-axis: llamacpp:kv_cache_usage_ratio; time-series data omitted)
requests_processing — chart: same benchmark run (y-axis: llamacpp:requests_processing; time-series data omitted)

llama.cpp (outdated diff)

-    model.output = ml.create_tensor(ctx_output, tn(LLM_TENSOR_TOKEN_EMBD, "weight"), {n_embd, n_vocab});
-    ml.n_created--; // artificial tensor
-    ml.size_data += ggml_nbytes(model.output);
+    model.output = ml.create_tensor(ctx_output, tn(LLM_TENSOR_TOKEN_EMBD, "weight"), {n_embd, n_vocab}, true);
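
For readers unfamiliar with the bookkeeping removed above: the old code manually decremented n_created (the duplicated output tensor is not a distinct tensor in the file) and added the tensor's size to size_data (its data is read a second time). The self-contained sketch below shows how a loader might fold that bookkeeping into a `duplicated` parameter, as the new call does; names and behavior are deliberately simplified and are not the actual llama_model_loader implementation.

```cpp
#include <cstddef>
#include <cstdio>

// Hypothetical, simplified loader: NOT the real llama_model_loader. It only
// mirrors the bookkeeping visible in the diff above (n_created, size_data).
struct toy_loader {
    int    n_created = 0;   // distinct tensors matched against the model file
    size_t size_data = 0;   // extra bytes to read beyond the file's own tensors

    void create_tensor(const char * name, size_t nbytes, bool required, bool duplicated) {
        if (duplicated) {
            // a duplicate reuses data already present in the file: it is not a
            // new tensor, but its data will be read (and uploaded) once more
            size_data += nbytes;
        } else {
            n_created++;
        }
        std::printf("%s: required=%d duplicated=%d\n", name, required, duplicated);
    }
};

int main() {
    toy_loader ml;
    // equivalent of the manual `ml.n_created--; ml.size_data += ...;` above,
    // expressed through the extra parameter instead
    ml.create_tensor("token_embd.weight (as output)", 1024, /*required=*/true, /*duplicated=*/true);
    std::printf("n_created=%d size_data=%zu\n", ml.n_created, ml.size_data);
    return 0;
}
```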
Collaborator


Was the intention to make the duplicated argument true? I assume so, because that would keep the old behavior.

As written, this sets required to true and leaves duplicated at false.

(this also applies to the other places where model.output is initialized from the token_embd tensor)

Suggested change
model.output = ml.create_tensor(ctx_output, tn(LLM_TENSOR_TOKEN_EMBD, "weight"), {n_embd, n_vocab}, true);
model.output = ml.create_tensor(ctx_output, tn(LLM_TENSOR_TOKEN_EMBD, "weight"), {n_embd, n_vocab}, true, true);

@slaren (Collaborator, Author) commented May 22, 2024

Thanks, I have replaced the boolean parameters with named flags that should make these errors easier to avoid in the future.
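
As an illustration of the "named flags instead of adjacent bool parameters" cleanup mentioned above, here is a minimal, self-contained C++ sketch. The flag names below express the intent (optional vs. duplicated tensors) but are illustrative; see the merged commit for the actual names used in llama_model_loader.

```cpp
#include <cstdint>
#include <cstdio>

// Illustrative flag names; the real llama_model_loader flags may differ.
enum tensor_flags : uint32_t {
    TENSOR_NOT_REQUIRED = 1u << 0,   // the tensor may be absent from the file
    TENSOR_DUPLICATED   = 1u << 1,   // the tensor is a second copy of existing data
};

static void create_tensor(const char * name, uint32_t flags = 0) {
    // The call site now states intent directly, instead of passing two adjacent
    // booleans whose order is easy to get wrong (the bug discussed above).
    std::printf("%s: required=%d duplicated=%d\n", name,
                !(flags & TENSOR_NOT_REQUIRED), !!(flags & TENSOR_DUPLICATED));
}

int main() {
    create_tensor("output.weight");                                   // default: required
    create_tensor("token_embd.weight", TENSOR_DUPLICATED);            // reused as the output tensor
    create_tensor("rope_factors_long.weight", TENSOR_NOT_REQUIRED);   // optional tensor
    return 0;
}
```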

@mofosyne added the labels "review complexity : medium" (generally requires more time to grok, but manageable by beginner to medium expertise level) and "model" (model specific) on May 22, 2024
@slaren slaren merged commit b18532a into master May 22, 2024
72 of 84 checks passed
@slaren slaren deleted the sl/phi3-fix branch May 22, 2024 14:10
Nexesenex pushed a commit to Nexesenex/kobold.cpp that referenced this pull request May 22, 2024
* phi3 : duplicate rope factors in each layer

phi3 : set phi-3 model type as 14B

model loader : simplify the process for duplicating model tensors

llama-bench : remove default pg test

* replace bool parameters in llama_model_loader with named flags
teleprint-me pushed a commit to teleprint-me/llama.cpp that referenced this pull request May 23, 2024
* phi3 : duplicate rope factors in each layer

phi3 : set phi-3 model type as 14B

model loader : simplify the process for duplicating model tensors

llama-bench : remove default pg test

* replace bool parameters in llama_model_loader with named flags
Labels: examples, model (Model specific), review complexity : medium

4 participants