Benchmark results #89

Open
ggerganov opened this issue Oct 25, 2022 · 158 comments
Labels
performance CPU and memory usage - results and comparisons

Comments

@ggerganov
Owner

ggerganov commented Oct 25, 2022

Encoder

Collection of bench results for various platforms and devices.
If you want to submit info about your device, simply run the bench tool or the extra/bench-all.sh script and report the results in the comments below.

Suggestions for a better summary of the results are welcome
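For anyone new to the project, here is a minimal sketch of how to produce these numbers (it assumes a fresh checkout and a downloaded ggml model; adjust the model path and thread count for your machine):

# build the bench tool
make -j bench

# benchmark a single model with 8 threads
./bench -m models/ggml-base.en.bin -t 8

# or benchmark all downloaded models and print a ready-to-paste table
./extra/bench-all.sh 8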

CPU OS Config Model Th Load Enc. Commit
MacBook M1 Pro MacOS 13.0.1 NEON BLAS tiny 8 71 102 206fc93
MacBook M1 Pro MacOS 13.0.1 NEON BLAS base 8 96 220 206fc93
MacBook M1 Pro MacOS 13.0.1 NEON BLAS small 8 233 685 206fc93
MacBook M1 Pro MacOS 13.0.1 NEON BLAS medium 8 603 1928 206fc93
MacBook M1 Pro MacOS 13.0.1 NEON BLAS large 8 1158 3350 206fc93
---
MacBook M1 Pro MacOS 13.0.1 NEON BLAS small 1 251 2605 206fc93
MacBook M1 Pro MacOS 13.0.1 NEON BLAS small 4 255 884 206fc93
---
Mac Mini M1 MacOS NEON BLAS tiny 4 62 194 fcf515d
Mac Mini M1 MacOS NEON BLAS base 4 81 380 fcf515d
Mac Mini M1 MacOS NEON BLAS small 4 204 1249 fcf515d
Mac Mini M1 MacOS NEON BLAS medium 4 876 3980 fcf515d
Mac Mini M1 MacOS NEON BLAS large 4 1876 7979 fcf515d
---
Ryzen 9 3900X Ubuntu 20.04 AVX2 tiny 8 107 422 fcf515d
Ryzen 9 3900X Ubuntu 20.04 AVX2 base 8 137 880 fcf515d
Ryzen 9 3900X Ubuntu 20.04 AVX2 small 8 280 2874 fcf515d
Ryzen 9 3900X Ubuntu 20.04 AVX2 medium 8 692 9610 fcf515d
Ryzen 9 3900X Ubuntu 20.04 AVX2 large 8 1317 16917 fcf515d
---
Ryzen 9 3900X Ubuntu 20.04 AVX2 BLAS tiny 4 120 780 fcf515d
Ryzen 9 3900X Ubuntu 20.04 AVX2 BLAS base 4 151 1173 fcf515d
Ryzen 9 3900X Ubuntu 20.04 AVX2 BLAS small 4 289 3062 fcf515d
Ryzen 9 3900X Ubuntu 20.04 AVX2 BLAS medium 4 711 9175 fcf515d
Ryzen 9 3900X Ubuntu 20.04 AVX2 BLAS large 4 1282 16050 fcf515d
---
Ryzen 9 5950X Ubuntu 22.04 AVX2 tiny 8 135 197 fcf515d
Ryzen 9 5950X Ubuntu 22.04 AVX2 base 8 176 421 fcf515d
Ryzen 9 5950X Ubuntu 22.04 AVX2 small 8 357 1393 fcf515d
Ryzen 9 5950X Ubuntu 22.04 AVX2 medium 8 855 4404 fcf515d
Ryzen 9 5950X Ubuntu 22.04 AVX2 large 8 1576 8118 fcf515d
---
Raspberry Pi 4 NEON tiny 4 1436 13839 fcf515d
Raspberry Pi 4 NEON base 4 1894 30552 fcf515d
---
iPhone 13 Mini iOS 16.0 NEON BLAS base 4 97 1091 fcf515d
---
MacBook M1 Pro Vivaldi WASM tiny 8 133 3785 fcf515d
MacBook M1 Pro Vivaldi WASM base 8 172 8253 fcf515d
---
MacBook M1 Pro Chrome WASM tiny 8 134 3776 fcf515d
MacBook M1 Pro Chrome WASM base 8 168 8200 fcf515d
---
MacBook M1 Pro Firefox WASM tiny 8 137 2626 fcf515d
MacBook M1 Pro Firefox WASM base 8 183 6226 fcf515d

memcpy

MacBook M1 Pro

./bench -w 1 -t 1
memcpy: 37.59 GB/s

Ryzen 9 5950X

./bench -w 1 -t 1
memcpy: 16.74 GB/s

ggml_mul_mat

MacBook M1 Pro

./bench -w 2 -t 1
ggml_mul_mat:    64 x    64: F16    330.6 GFLOPS (128 runs) / F32    466.0 GFLOPS (128 runs)
ggml_mul_mat:   128 x   128: F16    737.5 GFLOPS (128 runs) / F32    838.9 GFLOPS (128 runs)
ggml_mul_mat:   256 x   256: F16    938.6 GFLOPS (128 runs) / F32   1062.3 GFLOPS (128 runs)
ggml_mul_mat:   512 x   512: F16   1312.5 GFLOPS (128 runs) / F32   1835.5 GFLOPS (128 runs)
ggml_mul_mat:  1024 x  1024: F16   1765.1 GFLOPS (128 runs) / F32   2041.4 GFLOPS (128 runs)
ggml_mul_mat:  2048 x  2048: F16   1784.3 GFLOPS (104 runs) / F32   1859.2 GFLOPS (109 runs)
ggml_mul_mat:  4096 x  4096: F16   1855.1 GFLOPS ( 14 runs) / F32   1873.3 GFLOPS ( 14 runs)

Ryzen 9 5950X

WHISPER_OPENBLAS=1 make -j bench && ./bench -w 2 -t 1
ggml_mul_mat:    64 x    64: F16     56.3 GFLOPS (128 runs) / F32     70.2 GFLOPS (128 runs)
ggml_mul_mat:   128 x   128: F16     47.8 GFLOPS (128 runs) / F32     67.0 GFLOPS (128 runs)
ggml_mul_mat:   256 x   256: F16    185.1 GFLOPS (128 runs) / F32    332.7 GFLOPS (128 runs)
ggml_mul_mat:   512 x   512: F16    386.4 GFLOPS (128 runs) / F32    658.6 GFLOPS (128 runs)
ggml_mul_mat:  1024 x  1024: F16    636.2 GFLOPS (128 runs) / F32   1012.0 GFLOPS (128 runs)
ggml_mul_mat:  2048 x  2048: F16    950.9 GFLOPS ( 56 runs) / F32   1296.8 GFLOPS ( 76 runs)
ggml_mul_mat:  4096 x  4096: F16   1168.6 GFLOPS (  9 runs) / F32   1403.1 GFLOPS ( 11 runs)
@cdosoftei
Contributor

cdosoftei commented Oct 25, 2022

Results for Intel(R) Core(TM) i7-4790K CPU @ 4.00GHz

CPU OS Config Model Threads Load [ms] Encode [ms]
i7-4790K Debian   tiny.en 4 165 808
i7-4790K Debian   tiny.en 8 165 783
i7-4790K Debian   base.en 4 212 1813
i7-4790K Debian   base.en 8 214 1746

@rjwilmsi

Results for Ryzen 5 4500U 6C/6T laptop CPU (I've just included one result for 8 threads as Encode time is much higher when threads > CPU cores).

CPU OS Config Model Threads Load [ms] Encode [ms]
Ryzen 5 4500U (6C/6T) Opensuse Leap tiny.en 4 170.00 829.43
Ryzen 5 4500U (6C/6T) Opensuse Leap tiny.en 6 143.03 671.74
Ryzen 5 4500U (6C/6T) Opensuse Leap base.en 4 305.92 2,092.39
Ryzen 5 4500U (6C/6T) Opensuse Leap base.en 6 188.05 1,495.61
Ryzen 5 4500U (6C/6T) Opensuse Leap small.en 4 408.03 6,919.31
Ryzen 5 4500U (6C/6T) Opensuse Leap small.en 6 359.23 6,370.83
Ryzen 5 4500U (6C/6T) Opensuse Leap medium.en 4 2,238.11 25,863.28
Ryzen 5 4500U (6C/6T) Opensuse Leap medium.en 6 1,113.04 19,672.63
Ryzen 5 4500U (6C/6T) Opensuse Leap medium.en 8 973.65 39,619.20

@ArtyomZemlyak

ArtyomZemlyak commented Oct 26, 2022

CPU OS Config Model Threads Load [ms] Encode [ms]
i7-11800H WSL2 Ubuntu AVX2 tiny 2 164.35 1087.61
i7-11800H WSL2 Ubuntu AVX2 tiny 4 128.94 733.24
i7-11800H WSL2 Ubuntu AVX2 tiny 8 137.57 619.88
i7-11800H WSL2 Ubuntu AVX2 AVX512 tiny 2 143.02 1087.15
i7-11800H WSL2 Ubuntu AVX2 AVX512 tiny 4 127.60 730.57
i7-11800H WSL2 Ubuntu AVX2 AVX512 tiny 8 125.62 616.27
i7-11800H WSL2 Ubuntu AVX2 AVX512 BLAS tiny 2 132.59 1511.38
i7-11800H WSL2 Ubuntu AVX2 AVX512 BLAS tiny 4 132.48 1407.49
i7-11800H WSL2 Ubuntu AVX2 AVX512 BLAS tiny 8 133.82 1458.27

@ArtyomZemlyak

ArtyomZemlyak commented Oct 26, 2022

CPU OS Config Model Threads Load [ms] Encode [ms]
i7-11800H WSL2 Ubuntu AVX2 base 2 174.34 2533.79
i7-11800H WSL2 Ubuntu AVX2 base 4 166.68 1830.67
i7-11800H WSL2 Ubuntu AVX2 base 8 165.53 1478.73
i7-11800H WSL2 Ubuntu AVX2 small 2 340.12 8714.24
i7-11800H WSL2 Ubuntu AVX2 small 4 394.32 6021.41
i7-11800H WSL2 Ubuntu AVX2 small 8 305.98 4828.84
i7-11800H WSL2 Ubuntu AVX2 large 2 3205.36 57109.10
i7-11800H WSL2 Ubuntu AVX2 large 4 2720.25 38519.89
i7-11800H WSL2 Ubuntu AVX2 large 8 3716.34 27739.99

@ArtyomZemlyak

ArtyomZemlyak commented Oct 26, 2022

CPU OS Config Model Threads Load [ms] Encode [ms]
i7-11800H WSL2 Ubuntu AVX2 AVX512 large 2 1954.21 54966.84
i7-11800H WSL2 Ubuntu AVX2 AVX512 large 4 1455.40 37320.62
i7-11800H WSL2 Ubuntu AVX2 AVX512 large 8 1372.58 27937.64

@ArtyomZemlyak

This performance is impressive!

M1 Pro | MacOS |   | large | 8 | 1973 | 4208

@ggerganov
Owner Author

This performance is impressive!

Yes, there is a huge performance boost due to using the built-in BLAS implementation on these devices. I will soon add OpenBLAS support for x86 architectures and see how this compares.

By the way, AVX-512 is not supported on master. I have added initial support here, but I am not sure if it works: #95
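If you want to check which AVX-512 subsets your CPU reports before trying it, a quick sketch on Linux:

# list the AVX-512 feature flags the kernel reports for this CPU
grep -o 'avx512[a-z0-9]*' /proc/cpuinfo | sort -u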

@cristianglezm

cristianglezm commented Oct 28, 2022

CPU OS Config Model Threads Load[ms] encode[ms]
Intel® Core™ i5-8250U Win11 Home AVX2 Large 8 2226.85 61547.61

compiled with MinGW64 gcc 11.3

@tazz4843
Contributor

tazz4843 commented Oct 29, 2022

Valve Jupiter (AMD Custom APU 0405, Zen 2 microarch, 4c8t, 16GB DDR5 @ 5200 MT/s)

CPU OS Config Model Threads Load[ms] encode[ms]
AMD Custom APU 0405 SteamOS 3.2 AVX2 Base 8 326.32 2592.96

Compiled with cc (GCC) 11.3.0

The performance gains on jfk.wav since my last test (two weeks or so ago) are extremely impressive: a ~10-20x speedup, from ~40 seconds down to 2-4 seconds.

@yujinqiu

CPU OS Config Model Threads Load [ms] Encode [ms]
MacBook M1 Max macOS Ventura BLAS small 1 299.09 4166.00
MacBook M1 Max macOS Ventura BLAS small 4 329.45 1304.32
MacBook M1 Max macOS Ventura BLAS base 1 139.10 1302.17
MacBook M1 Max macOS Ventura BLAS base 4 135.96 399.45


@ggerganov
Owner Author

@trholding
Thanks for the results.

You can generate a table with performance results by simply running the extra/bench-all.sh script.

Regarding the threads - yes, it seems that going beyond 8 threads does not help, regardless of how many cores you have. My guess is that the computation is memory-bound, which is why using more threads does not improve the performance.
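A simple way to see this plateau is to sweep the thread count with the bench tool (a sketch; it assumes a model at models/ggml-base.en.bin):

# encode time should stop improving somewhere around 8 threads
for t in 1 2 4 8 12 16; do
  ./bench -m models/ggml-base.en.bin -t "$t"
done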


@trholding
Contributor

You can generate a table with performance results by simply running the extra/bench-all.sh script.

Hey, sorry - that didn't pan out well. I ran the benchmark three times and my account got deleted without notice. I couldn't get the logs as it was a web terminal. On the other hand, I am happy that this happened: I was giving serious thought to purchasing a GPU+CPU plan there, so a performance check of the CPU was equally important. Technically it was probably my fault - I probably shouldn't have used a reverse shell and run benchmarks on a free trial, but how does one know if a service is really good or all just vapor...

@rgerganov
Contributor

Dell Precision 5560 laptop results:

CPU OS Config Model Threads Load [ms] Encode [ms]
i7-11850H Ubuntu AVX2 tiny 4 115.87 538.43
i7-11850H Ubuntu AVX2 base 4 145.14 1241.84
i7-11850H Ubuntu AVX2 small 4 299.30 4343.57
i7-11850H Ubuntu AVX2 medium 4 760.98 15238.31
i7-11850H Ubuntu AVX2 large 4 1404.32 27476.86
i7-11850H Ubuntu AVX2 tiny 8 131.96 358.81
i7-11850H Ubuntu AVX2 base 8 166.61 839.31
i7-11850H Ubuntu AVX2 small 8 320.29 2854.86
i7-11850H Ubuntu AVX2 medium 8 756.20 9829.62
i7-11850H Ubuntu AVX2 large 8 1382.38 19872.81

@jaybinks
Contributor

jaybinks commented Nov 5, 2022

CPU OS Config Model Threads Load [ms] Encode [ms]
i9-9900K WSL2 Ubuntu (GCC) AVX2  tiny.en 4 85.71 601.56
i9-9900K WSL2 Ubuntu (GCC) AVX2  small.en 4 212.59 5146.23
i9-9900K OSX 10.14.1 (hackintosh - GCC) AVX2  tiny.en 4 198.17 455.12
i9-9900K OSX 10.14.1 (hackintosh - GCC) AVX2  base.en 4 272.62 909.71
i9-9900K OSX 10.14.1 (hackintosh - GCC) AVX2 small.en 4 598.75 2968.75
Xeon(R) Silver 4210R CPU @ 2.40GHz Virtual Machine - Debian Stretch (GCC - master branch) AVX2 avx512f avx512dq avx512cd avx512bw avx512vl small.en 4 776.56 12340.41
Xeon(R) Silver 4210R CPU @ 2.40GHz Virtual Machine - Debian Stretch (GCC - master branch) AVX2 avx512f avx512dq avx512cd avx512bw avx512vl tiny.en 4 295.54 1710.46

@mark-beeby

CPU OS Config Model Threads Load [ms] Encode [ms]
i9-11950H Pop!_OS 22.04 LTS AVX2 Tiny 4 124.28 656.41
i9-11950H Pop!_OS 22.04 LTS AVX2 Tiny 8 123.70 696.41
i9-11950H Pop!_OS 22.04 LTS AVX2 Base 4 159.91 1754.44
i9-11950H Pop!_OS 22.04 LTS AVX2 Base 8 164.47 1658.55
i9-11950H Pop!_OS 22.04 LTS AVX2 Small 4 330.91 6161.86
i9-11950H Pop!_OS 22.04 LTS AVX2 Small 8 346.22 5187.85

@niksedk

niksedk commented Nov 9, 2022

CPU OS Config Model Threads Load [ms] Encode [ms]
i7-1065G7 Windows 11 - small.en 4 1,314.25 294,168.09

Compiled with VS 2022

Something is off, right?

@ggerganov
Owner Author

Yup - you are missing the AVX2 flag. See if some of the comments in #5 can help you resolve this.
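For reference, a quick sanity check that the CPU reports AVX2 at all (Linux/WSL), plus a hypothetical way to force the flag in an MSVC/CMake build - the exact Windows build options are the ones discussed in #5:

# check for AVX2 support
grep -m1 -o avx2 /proc/cpuinfo || echo "no AVX2 support"

# with MSVC, AVX2 code generation is enabled by /arch:AVX2, e.g. (hypothetical invocation):
#   cmake -B build -DCMAKE_C_FLAGS="/arch:AVX2" -DCMAKE_CXX_FLAGS="/arch:AVX2"
#   cmake --build build --config Release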

@niksedk

niksedk commented Nov 9, 2022

OK, the AVX2 flag seems to help :)

CPU OS Config Model Threads Load [ms] Encode [ms]
i7-1065G7 Windows 11 AVX2 small.en 4 527.59 9,648.67

Compiled with VS 2022

@j1nx

j1nx commented Nov 17, 2022

CPU OS Config Model Threads Load [ms] Encode [ms] Remarks
Raspberry Pi 4 - 2GB OpenVoiceOS NEON tiny 1 861.34 29428.21 With OVOS services running
Raspberry Pi 4 - 2GB OpenVoiceOS NEON BLAS tiny 1 843.80 16145.62 With OVOS services running
Raspberry Pi 4 - 2GB OpenVoiceOS NEON tiny 4 835.68 21509.08 With OVOS services running
Raspberry Pi 4 - 2GB OpenVoiceOS NEON BLAS tiny 4 824.24 13187.96 With OVOS services running
Raspberry Pi 4 - 2GB OpenVoiceOS NEON base 1 1146.02 87615.00 With OVOS services running
Raspberry Pi 4 - 2GB OpenVoiceOS NEON BLAS base 1 1103.39 52228.30 With OVOS services running
Raspberry Pi 4 - 2GB OpenVoiceOS NEON base 4 1183.47 55256.20 With OVOS services running
Raspberry Pi 4 - 2GB OpenVoiceOS NEON BLAS base 4 1161.32 29851.40 With OVOS services running
Raspberry Pi 4 - 2GB OpenVoiceOS NEON tiny 1 752.64 24018.10 Without OVOS services running
Raspberry Pi 4 - 2GB OpenVoiceOS NEON BLAS tiny 1 751.96 13082.95 Without OVOS services running
Raspberry Pi 4 - 2GB OpenVoiceOS NEON tiny 4 743.37 10122.80 Without OVOS services running
Raspberry Pi 4 - 2GB OpenVoiceOS NEON BLAS tiny 4 742.90 9564.89 Without OVOS services running
Raspberry Pi 4 - 2GB OpenVoiceOS NEON base 1 974.46 71587.61 Without OVOS services running
Raspberry Pi 4 - 2GB OpenVoiceOS NEON BLAS base 1 979.65 43852.07 Without OVOS services running
Raspberry Pi 4 - 2GB OpenVoiceOS NEON base 4 982.24 24814.62 Without OVOS services running
Raspberry Pi 4 - 2GB OpenVoiceOS NEON BLAS base 4 982.80 19910.19 Without OVOS services running

@StuartIanNaylor

StuartIanNaylor commented Nov 17, 2022

From the stream repo


CPU OS Config Model Threads Load [ms] Encode [ms]
RK3588 Ubuntu20.04 NEON tiny.en 4 243.54 ms 779.49 ms
RK3588 Ubuntu20.04 NEON base.en 4 316.52 ms 1821.06 ms
RK3588 Ubuntu20.04 NEON small.en 4 618.93 ms 7117.69 ms
RK3588 Ubuntu20.04 NEON medium.en 4 1514.88 ms 24139.92 ms
CPU OS Config Model Threads Load [ms] Encode [ms]
RK3588 Ubuntu20.04 NEON tiny 4 233.86 ms 791.01 ms
RK3588 Ubuntu20.04 NEON base 4 297.93 ms 1813.69 ms
RK3588 Ubuntu20.04 NEON small 4 592.18 ms 7102.28 ms
RK3588 Ubuntu20.04 NEON medium 4 1587.36 ms 24147.87 ms
CPU OS Config Model Threads Load [ms] Encode [ms]
RK3588 Ubuntu20.04 NEON tiny 8 226.48 ms 740.34 ms
RK3588 Ubuntu20.04 NEON base 8 300.48 ms 1723.42 ms
RK3588 Ubuntu20.04 NEON small 8 620.58 ms 6392.47 ms
RK3588 Ubuntu20.04 NEON medium 8 1533.75 ms 21899.08 ms

I still haven't worked out the little (0-3) / big (4-7) core layout on this thing. If I pin to the big cores with taskset -c 4-7 (see the sketch after the table):

CPU OS Config Model Threads Load [ms] Encode [ms]
RK3588 Ubuntu20.04 NEON tiny.en 4 234.14 ms 681.53 ms
RK3588 Ubuntu20.04 NEON base.en 4 297.08 ms 1679.75 ms
RK3588 Ubuntu20.04 NEON small.en 4 599.98 ms 6867.66 ms
RK3588 Ubuntu20.04 NEON medium.en 4 1492.73 ms 23600.45 ms
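For reference, the pinned run above corresponds to something like this (a sketch; same bench-all usage as elsewhere in this thread):

# pin the benchmark to the big cores (4-7), 4 threads
taskset -c 4-7 ./extra/bench-all.sh 4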

I tried to compile with OpenBLAS, but it seemed to break the make.


From the master repo, as I didn't think about which repo I was on after trying the streaming input:

CPU OS Config Model Threads Load [ms] Encode [ms]
RK3588 Ubuntu20.04 NEON tiny 8 226.48 ms 2681.05 ms
RK3588 Ubuntu20.04 NEON base 8 283.56 ms 6132.44 ms
RK3588 Ubuntu20.04 NEON small 8 583.39 ms 24397.78 ms
RK3588 Ubuntu20.04 NEON medium 8 1490.98 ms 85099.45 ms

@dodysw
Contributor

dodysw commented Nov 17, 2022

CPU OS Config Model Threads Load [ms] Encode [ms]
Ryzen 7 PRO 4750G Ubuntu 22.04 AVX2 tiny.en 8 136.29 454.52
Ryzen 7 PRO 4750G Ubuntu 22.04 AVX2 tiny 8 134.64 486.01
Ryzen 7 PRO 4750G Ubuntu 22.04 AVX2 base 8 180.22 1184.80
Ryzen 7 PRO 4750G Ubuntu 22.04 AVX2 base.en 8 192.86 1197.85
Ryzen 7 PRO 4750G Ubuntu 22.04 AVX2 small 8 367.55 4179.00
Ryzen 7 PRO 4750G Ubuntu 22.04 AVX2 small.en 8 378.27 4557.73
Ryzen 7 PRO 4750G Ubuntu 22.04 AVX2 medium 8 923.48 15552.61
Ryzen 7 PRO 4750G Ubuntu 22.04 AVX2 medium.en 8 952.48 15708.63
Ryzen 7 PRO 4750G Ubuntu 22.04 AVX2 large 8 1650.28 28357.09

8 threads seemed to be the fastest. However, I managed to squeeze a bit more performance by pinning CPUs:

$ taskset -c 0-15 ./extra/bench-all.sh 16
CPU OS Config Model Threads Load [ms] Encode [ms]
Ryzen 7 PRO 4750G Ubuntu 22.04 AVX2 tiny 16 143.17 437.73
Ryzen 7 PRO 4750G Ubuntu 22.04 AVX2 base 16 184.10 1061.14
Ryzen 7 PRO 4750G Ubuntu 22.04 AVX2 small 16 374.41 3645.64
Ryzen 7 PRO 4750G Ubuntu 22.04 AVX2 medium 16 935.45 13029.54

@matth

matth commented Nov 21, 2022

Results for AWS Graviton 3 Processor (c7g.4xlarge instance type).

Compiled with -march=native -ffast-math.

./extra/bench-all.sh 8

CPU OS Config Model Threads Load [ms] Encode [ms]
Graviton 3 Ubuntu 22.04 NEON tiny 8 125.92 230.33
Graviton 3 Ubuntu 22.04 NEON base 8 160.17 547.88
Graviton 3 Ubuntu 22.04 NEON small 8 299.59 2138.86
Graviton 3 Ubuntu 22.04 NEON medium 8 741.49 6999.33
Graviton 3 Ubuntu 22.04 NEON large 8 1313.95 14174.00

./extra/bench-all.sh 16

CPU OS Config Model Threads Load [ms] Encode [ms]
Graviton 3 Ubuntu 22.04 NEON tiny 16 121.92 158.61
Graviton 3 Ubuntu 22.04 NEON base 16 156.01 386.78
Graviton 3 Ubuntu 22.04 NEON small 16 299.85 1596.38
Graviton 3 Ubuntu 22.04 NEON medium 16 750.93 5351.24
Graviton 3 Ubuntu 22.04 NEON large 16 1313.82 11115.69

@ggerganov
Owner Author

@matth Do you observe significant performance difference with / without -march=native -ffast-math?

@matth

matth commented Nov 21, 2022

@ggerganov -ffast-math seems to make only a very small difference that could be noise between runs.

-march=native does seem to make a big difference; without it, FP16_VA is not reported as being enabled (I can get it with -march=armv8.4-a+bf16+fp16fml) - I think -march=native is enabling more intrinsics than this, though.
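A quick sketch for checking whether a given -march setting enables FP16 vector arithmetic, using the standard ACLE feature-test macro:

# compiles only if the target arch enables FP16 vector arithmetic
echo '#ifndef __ARM_FEATURE_FP16_VECTOR_ARITHMETIC
#error FP16 vector arithmetic not enabled
#endif
int main(void) { return 0; }' \
  | gcc -march=armv8.4-a+bf16+fp16fml -x c - -o /dev/null \
  && echo "FP16_VA available"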

Results without any -march or -ffast-math flags ...

./extra/bench-all.sh 16

CPU OS Config Model Threads Load [ms] Encode [ms]
Graviton 3 Ubuntu 22.04 NEON tiny 16 124.25 320.53
Graviton 3 Ubuntu 22.04 NEON base 16 156.91 734.22
Graviton 3 Ubuntu 22.04 NEON small 16 301.78 2812.75
Graviton 3 Ubuntu 22.04 NEON medium 16 714.23 9139.86
Graviton 3 Ubuntu 22.04 NEON large 16 1298.33 18147.47

I have tried to improve performance using OpenBLAS and armpl.h, but they both slow it down considerably - I'll keep trying with the latter.

Are there any possibilities for further optimisations in ggml.c that can take advantage of the situation where you have bf16 functions but not BLAS or Accelerate?

@letsgitcracking

ThinkPad P1 Gen 4

Intel Core i7-11800H | 64GB RAM | NVIDIA T1200 4GB GPU

Encoder

CPU OS Config Model Th Load Enc. Commit
i7-11800H Ubuntu 22.04.2 LTS AVX2 BLAS tiny 8 554 233 57543c1
i7-11800H Ubuntu 22.04.2 LTS AVX2 BLAS base 8 567 507 57543c1
i7-11800H Ubuntu 22.04.2 LTS AVX2 BLAS small 8 704 1827 57543c1
i7-11800H Ubuntu 22.04.2 LTS AVX2 BLAS medium 8 1088 5435 57543c1
i7-11800H Ubuntu 22.04.2 LTS AVX2 BLAS large 8 1648 10413 57543c1
---
i7-11800H Ubuntu 22.04.2 LTS AVX2 BLAS tiny 8 554 148 4774d2f
i7-11800H Ubuntu 22.04.2 LTS AVX2 BLAS base 8 577 299 4774d2f
i7-11800H Ubuntu 22.04.2 LTS AVX2 BLAS small 8 698 1056 4774d2f
i7-11800H Ubuntu 22.04.2 LTS AVX2 BLAS medium 8 1089 2761 4774d2f
i7-11800H Ubuntu 22.04.2 LTS AVX2 BLAS large 8 1646 5498 4774d2f
---
i7-11800H Ubuntu 22.04.2 LTS AVX2 BLAS tiny 8 98 162 a792c40
i7-11800H Ubuntu 22.04.2 LTS AVX2 BLAS base 8 131 319 a792c40
i7-11800H Ubuntu 22.04.2 LTS AVX2 BLAS small 8 237 1107 a792c40
i7-11800H Ubuntu 22.04.2 LTS AVX2 BLAS medium 8 627 2896 a792c40
i7-11800H Ubuntu 22.04.2 LTS AVX2 BLAS large 8 1279 5819 a792c40

memcpy

./bench -w 1 -t 1
memcpy: 18.38 GB/s (1 thread)

ggml_mul_mat

WHISPER_OPENBLAS=1 make -j bench && ./bench -w 2 -t 1
64 x   64: F16     15.8 GFLOPS (128 runs) | F32     18.3 GFLOPS (128 runs)
128 x  128: F16     82.8 GFLOPS (128 runs) | F32     85.3 GFLOPS (128 runs)
256 x  256: F16    208.0 GFLOPS (128 runs) | F32    212.0 GFLOPS (128 runs)
512 x  512: F16    523.1 GFLOPS (128 runs) | F32    490.4 GFLOPS (128 runs)
1024 x 1024: F16    984.3 GFLOPS (128 runs) | F32    940.6 GFLOPS (128 runs)
2048 x 2048: F16   1256.0 GFLOPS ( 74 runs) | F32   1232.0 GFLOPS ( 72 runs)
4096 x 4096: F16   1390.2 GFLOPS ( 11 runs) | F32   1380.1 GFLOPS ( 11 runs)

@marty1885

My friends and I did some benchmarking and profiling. Apparently llama.cpp only uses the GPU for matrix multiplication, and currently most of the CPU time is spent in Conv1D and softmax. So that would be a good starting point for improving the performance on CUDA/OpenCL.
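For anyone who wants to reproduce this kind of profile, a rough sketch using Linux perf on the bench tool (the model path is just an example):

# record a call-graph profile of the encoder benchmark
perf record -g ./bench -m models/ggml-base.en.bin -t 8
# the conv/softmax ggml kernels should show up near the top
perf report --sort=symbol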

@vieenrose

vieenrose commented Aug 6, 2023

Jetson Nano 4GB using cuBLAS

Compiler version

gcc (Ubuntu/Linaro 7.5.0-3ubuntu1~18.04) 7.5.0
Copyright (C) 2017 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

Memcpy

Running memcpy benchmark

memcpy: 3.18 GB/s (1 thread)
sum: 136902081526.000000

ggml_mul_mat

Running ggml_mul_mat benchmark with 4 threads

ggml_init_cublas: found 1 CUDA devices:
Device 0: NVIDIA Tegra X1
64 x 64: Q4_0 0.1 GFLOPS (128 runs) | Q4_1 0.1 GFLOPS (116 runs)
64 x 64: Q5_0 0.1 GFLOPS (103 runs) | Q5_1 0.1 GFLOPS (128 runs) | Q8_0 0.1 GFLOPS (128 runs)
64 x 64: F16 0.1 GFLOPS (128 runs) | F32 0.1 GFLOPS (122 runs)
128 x 128: Q4_0 0.4 GFLOPS (107 runs) | Q4_1 0.4 GFLOPS ( 91 runs)
128 x 128: Q5_0 0.3 GFLOPS ( 78 runs) | Q5_1 0.4 GFLOPS ( 90 runs) | Q8_0 0.4 GFLOPS ( 91 runs)
128 x 128: F16 0.4 GFLOPS (100 runs) | F32 0.4 GFLOPS ( 86 runs)
256 x 256: Q4_0 2.8 GFLOPS ( 85 runs) | Q4_1 2.7 GFLOPS ( 81 runs)
256 x 256: Q5_0 2.6 GFLOPS ( 78 runs) | Q5_1 2.7 GFLOPS ( 83 runs) | Q8_0 2.6 GFLOPS ( 77 runs)
256 x 256: F16 2.5 GFLOPS ( 74 runs) | F32 2.7 GFLOPS ( 81 runs)
512 x 512: Q4_0 12.5 GFLOPS ( 47 runs) | Q4_1 13.7 GFLOPS ( 52 runs)
512 x 512: Q5_0 14.4 GFLOPS ( 54 runs) | Q5_1 14.4 GFLOPS ( 54 runs) | Q8_0 14.6 GFLOPS ( 55 runs)
512 x 512: F16 15.2 GFLOPS ( 57 runs) | F32 17.5 GFLOPS ( 66 runs)
1024 x 1024: Q4_0 66.9 GFLOPS ( 32 runs) | Q4_1 76.9 GFLOPS ( 36 runs)
1024 x 1024: Q5_0 76.0 GFLOPS ( 36 runs) | Q5_1 75.8 GFLOPS ( 36 runs) | Q8_0 76.2 GFLOPS ( 36 runs)
1024 x 1024: F16 84.5 GFLOPS ( 40 runs) | F32 87.5 GFLOPS ( 41 runs)
2048 x 2048: Q4_0 150.3 GFLOPS ( 9 runs) | Q4_1 143.3 GFLOPS ( 9 runs)
2048 x 2048: Q5_0 143.8 GFLOPS ( 9 runs) | Q5_1 133.2 GFLOPS ( 8 runs) | Q8_0 132.8 GFLOPS ( 8 runs)
2048 x 2048: F16 137.4 GFLOPS ( 8 runs) | F32 135.6 GFLOPS ( 8 runs)
4096 x 4096: Q4_0 177.8 GFLOPS ( 3 runs) | Q4_1 168.4 GFLOPS ( 3 runs)
4096 x 4096: Q5_0 176.5 GFLOPS ( 3 runs) | Q5_1 173.4 GFLOPS ( 3 runs) | Q8_0 173.9 GFLOPS ( 3 runs)
4096 x 4096: F16 166.8 GFLOPS ( 3 runs) | F32 172.6 GFLOPS ( 3 runs)

Model benchmark

Running benchmark for all models
This can take a while!

CPU OS Config Model Th Load Enc. Commit
Jetson Nano JetPack 4.5.1 NEON BLAS tiny 4 1136 1998 a4bb2df
Jetson Nano JetPack 4.5.1 NEON BLAS tiny-q5_0 4 1118 2131 a4bb2df
Jetson Nano JetPack 4.5.1 NEON BLAS base 4 1176 3450 a4bb2df
Jetson Nano JetPack 4.5.1 NEON BLAS base-q5_0 4 1144 3711 a4bb2df
Jetson Nano JetPack 4.5.1 NEON BLAS small 4 19671 10162 a4bb2df
Jetson Nano JetPack 4.5.1 NEON BLAS small-q5_0 4 1209 9603 a4bb2df
Jetson Nano JetPack 4.5.1 NEON BLAS medium 4 67108 28672 a4bb2df
Jetson Nano JetPack 4.5.1 NEON BLAS medium-q5_0 4 1545 26479 a4bb2df
Jetson Nano JetPack 4.5.1 NEON BLAS tiny 1 1138 3306 a4bb2df
Jetson Nano JetPack 4.5.1 NEON BLAS tiny-q5_0 1 1132 3112 a4bb2df
Jetson Nano JetPack 4.5.1 NEON BLAS base 1 1172 5618 a4bb2df
Jetson Nano JetPack 4.5.1 NEON BLAS base-q5_0 1 1102 5516 a4bb2df
Jetson Nano JetPack 4.5.1 NEON BLAS small 1 1568 14845 a4bb2df
Jetson Nano JetPack 4.5.1 NEON BLAS small-q5_0 1 1225 14472 a4bb2df
Jetson Nano JetPack 4.5.1 NEON BLAS medium 1 66915 38635 a4bb2df
Jetson Nano JetPack 4.5.1 NEON BLAS medium-q5_0 1 1590 37408 a4bb2df

Running memcpy benchmark

memcpy: 3.21 GB/s (heat-up)
memcpy: 3.23 GB/s ( 1 thread)
memcpy: 3.21 GB/s ( 1 thread)
memcpy: 4.14 GB/s ( 2 thread)
memcpy: 4.56 GB/s ( 3 thread)
memcpy: 4.83 GB/s ( 4 thread)
sum: 783359998033.000000

Running ggml_mul_mat benchmark with 4 threads

ggml_init_cublas: GGML_CUDA_FORCE_MMQ: no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
Device 0: NVIDIA Tegra X1, compute capability 5.3, VMM: no
64 x 64: Q4_0 0.2 GFLOPS (128 runs) | Q4_1 0.3 GFLOPS (128 runs)
64 x 64: Q5_0 0.3 GFLOPS (128 runs) | Q5_1 0.4 GFLOPS (128 runs) | Q8_0 0.5 GFLOPS (128 runs)
64 x 64: F16 0.3 GFLOPS (128 runs) | F32 0.3 GFLOPS (128 runs)
128 x 128: Q4_0 3.4 GFLOPS (128 runs) | Q4_1 3.4 GFLOPS (128 runs)
128 x 128: Q5_0 1.6 GFLOPS (128 runs) | Q5_1 6.1 GFLOPS (128 runs) | Q8_0 5.3 GFLOPS (128 runs)
128 x 128: F16 3.6 GFLOPS (128 runs) | F32 1.5 GFLOPS (128 runs)
256 x 256: Q4_0 13.0 GFLOPS (128 runs) | Q4_1 10.3 GFLOPS (128 runs)
256 x 256: Q5_0 17.0 GFLOPS (128 runs) | Q5_1 12.2 GFLOPS (128 runs) | Q8_0 19.6 GFLOPS (128 runs)
256 x 256: F16 20.5 GFLOPS (128 runs) | F32 11.9 GFLOPS (128 runs)
512 x 512: Q4_0 43.6 GFLOPS (128 runs) | Q4_1 49.0 GFLOPS (128 runs)
512 x 512: Q5_0 52.1 GFLOPS (128 runs) | Q5_1 50.9 GFLOPS (128 runs) | Q8_0 50.0 GFLOPS (128 runs)
512 x 512: F16 49.8 GFLOPS (128 runs) | F32 48.0 GFLOPS (128 runs)
1024 x 1024: Q4_0 80.2 GFLOPS ( 38 runs) | Q4_1 110.6 GFLOPS ( 52 runs)
1024 x 1024: Q5_0 120.7 GFLOPS ( 57 runs) | Q5_1 108.5 GFLOPS ( 51 runs) | Q8_0 123.2 GFLOPS ( 58 runs)
1024 x 1024: F16 112.0 GFLOPS ( 53 runs) | F32 108.3 GFLOPS ( 51 runs)
2048 x 2048: Q4_0 147.9 GFLOPS ( 9 runs) | Q4_1 151.4 GFLOPS ( 9 runs)
2048 x 2048: Q5_0 159.3 GFLOPS ( 10 runs) | Q5_1 141.2 GFLOPS ( 9 runs) | Q8_0 143.1 GFLOPS ( 9 runs)
2048 x 2048: F16 151.4 GFLOPS ( 9 runs) | F32 140.4 GFLOPS ( 9 runs)
4096 x 4096: Q4_0 176.2 GFLOPS ( 3 runs) | Q4_1 180.7 GFLOPS ( 3 runs)
4096 x 4096: Q5_0 180.7 GFLOPS ( 3 runs) | Q5_1 182.2 GFLOPS ( 3 runs) | Q8_0 183.6 GFLOPS ( 3 runs)
4096 x 4096: F16 179.4 GFLOPS ( 3 runs) | F32 174.2 GFLOPS ( 3 runs)

Running benchmark for all models
This can take a while!

CPU OS Config Model Th Enc. Dec. Bch5 PP Commit
Jetson Nano JetPack 4.5.1 NEON BLAS CUDA tiny 4 11.36 15.47 7.38 0.63 3d42463
Jetson Nano JetPack 4.5.1 NEON BLAS CUDA tiny-q5_1 4 11.32 19.04 7.46 0.63 3d42463
Jetson Nano JetPack 4.5.1 NEON BLAS CUDA base 4 18.98 25.56 12.59 1.10 3d42463
Jetson Nano JetPack 4.5.1 NEON BLAS CUDA base-q5_1 4 18.58 32.62 12.09 1.10 3d42463
Jetson Nano JetPack 4.5.1 NEON BLAS CUDA small 4 872.71 70.89 35.06 3.10 3d42463
Jetson Nano JetPack 4.5.1 NEON BLAS CUDA small-q5_1 4 871.33 90.27 33.91 3.09 3d42463
Jetson Nano JetPack 4.5.1 NEON BLAS CUDA medium 4 6930.15 183.71 95.69 8.64 3d42463
Jetson Nano JetPack 4.5.1 NEON BLAS CUDA medium-q5_0 4 6889.33 230.09 91.06 8.54 3d42463

@vieenrose

Jetson Orin Nano Developer Kit using cuBLAS

Compiler version

gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
Copyright (C) 2019 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

Memcpy

Running memcpy benchmark

memcpy: 6.10 GB/s (1 thread)
sum: 136902081526.000000

ggml_mul_mat

Running ggml_mul_mat benchmark with 6 threads

ggml_init_cublas: found 1 CUDA devices:
Device 0: Orin
64 x 64: Q4_0 0.7 GFLOPS (128 runs) | Q4_1 0.7 GFLOPS (128 runs)
64 x 64: Q5_0 0.6 GFLOPS (128 runs) | Q5_1 0.6 GFLOPS (128 runs) | Q8_0 0.7 GFLOPS (128 runs)
64 x 64: F16 0.7 GFLOPS (128 runs) | F32 0.8 GFLOPS (128 runs)
128 x 128: Q4_0 4.4 GFLOPS (128 runs) | Q4_1 3.6 GFLOPS (128 runs)
128 x 128: Q5_0 3.2 GFLOPS (128 runs) | Q5_1 3.7 GFLOPS (128 runs) | Q8_0 3.8 GFLOPS (128 runs)
128 x 128: F16 4.0 GFLOPS (128 runs) | F32 3.8 GFLOPS (128 runs)
256 x 256: Q4_0 16.0 GFLOPS (128 runs) | Q4_1 25.7 GFLOPS (128 runs)
256 x 256: Q5_0 25.8 GFLOPS (128 runs) | Q5_1 29.6 GFLOPS (128 runs) | Q8_0 27.6 GFLOPS (128 runs)
256 x 256: F16 30.2 GFLOPS (128 runs) | F32 19.0 GFLOPS (128 runs)
512 x 512: Q4_0 170.9 GFLOPS (128 runs) | Q4_1 167.9 GFLOPS (128 runs)
512 x 512: Q5_0 168.7 GFLOPS (128 runs) | Q5_1 119.6 GFLOPS (128 runs) | Q8_0 134.6 GFLOPS (128 runs)
512 x 512: F16 150.6 GFLOPS (128 runs) | F32 166.4 GFLOPS (128 runs)
1024 x 1024: Q4_0 354.5 GFLOPS (128 runs) | Q4_1 571.7 GFLOPS (128 runs)
1024 x 1024: Q5_0 558.6 GFLOPS (128 runs) | Q5_1 520.5 GFLOPS (128 runs) | Q8_0 514.3 GFLOPS (128 runs)
1024 x 1024: F16 491.5 GFLOPS (128 runs) | F32 505.7 GFLOPS (128 runs)
2048 x 2048: Q4_0 1006.5 GFLOPS ( 59 runs) | Q4_1 1123.9 GFLOPS ( 66 runs)
2048 x 2048: Q5_0 1144.5 GFLOPS ( 67 runs) | Q5_1 1125.0 GFLOPS ( 66 runs) | Q8_0 1102.3 GFLOPS ( 65 runs)
2048 x 2048: F16 1035.8 GFLOPS ( 61 runs) | F32 947.1 GFLOPS ( 56 runs)
4096 x 4096: Q4_0 1735.8 GFLOPS ( 14 runs) | Q4_1 1744.1 GFLOPS ( 13 runs)
4096 x 4096: Q5_0 1613.4 GFLOPS ( 12 runs) | Q5_1 1529.1 GFLOPS ( 12 runs) | Q8_0 1621.6 GFLOPS ( 12 runs)
4096 x 4096: F16 1543.4 GFLOPS ( 12 runs) | F32 1488.8 GFLOPS ( 11 runs)

Model benchmark

Running benchmark for all models
This can take a while!

CPU OS Config Model Th Load Enc. Commit
Jetson Orin Nano JetPack 5.1.1 NEON BLAS tiny 6 1399 537 a4bb2df
Jetson Orin Nano JetPack 5.1.1 NEON BLAS tiny-q5_0 6 1401 486 a4bb2df
Jetson Orin Nano JetPack 5.1.1 NEON BLAS base 6 1523 1101 a4bb2df
Jetson Orin Nano JetPack 5.1.1 NEON BLAS base-q5_0 6 1420 992 a4bb2df
Jetson Orin Nano JetPack 5.1.1 NEON BLAS small 6 2735 2861 a4bb2df
Jetson Orin Nano JetPack 5.1.1 NEON BLAS small-q5_0 6 1514 2656 a4bb2df
Jetson Orin Nano JetPack 5.1.1 NEON BLAS medium 6 5639 7836 a4bb2df
Jetson Orin Nano JetPack 5.1.1 NEON BLAS medium-q5_0 6 1820 8267 a4bb2df
Jetson Orin Nano JetPack 5.1.1 NEON BLAS large 6 11349 23323 a4bb2df
Jetson Orin Nano JetPack 5.1.1 NEON BLAS large-q5_0 6 2241 13197 a4bb2df
Jetson Orin Nano JetPack 5.1.1 NEON BLAS tiny 1 1418 1339 a4bb2df
Jetson Orin Nano JetPack 5.1.1 NEON BLAS tiny-q5_0 1 1391 1382 a4bb2df
Jetson Orin Nano JetPack 5.1.1 NEON BLAS base 1 1503 2422 a4bb2df
Jetson Orin Nano JetPack 5.1.1 NEON BLAS base-q5_0 1 1385 2405 a4bb2df
Jetson Orin Nano JetPack 5.1.1 NEON BLAS small 1 2798 6193 a4bb2df
Jetson Orin Nano JetPack 5.1.1 NEON BLAS small-q5_0 1 1635 6175 a4bb2df
Jetson Orin Nano JetPack 5.1.1 NEON BLAS medium 1 6368 14623 a4bb2df
Jetson Orin Nano JetPack 5.1.1 NEON BLAS medium-q5_0 1 1856 14424 a4bb2df
Jetson Orin Nano JetPack 5.1.1 NEON BLAS large 1 11055 25367 a4bb2df
Jetson Orin Nano JetPack 5.1.1 NEON BLAS large-q5_0 1 2226 24363 a4bb2df

@vieenrose

vieenrose commented Aug 7, 2023

Intel Core i5-8400 CPU + NVIDIA GeForce GTX 1070 with cuBLAS

Compiler version

gcc (conda-forge gcc 10.4.0-19) 10.4.0
Copyright (C) 2020 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Tue_May__3_18:49:52_PDT_2022
Cuda compilation tools, release 11.7, V11.7.64
Build cuda_11.7.r11.7/compiler.31294372_0

Memcpy

Running memcpy benchmark

memcpy: 13.17 GB/s (1 thread)
sum: -536869898.000000

ggml_mul_mat

Running ggml_mul_mat benchmark with 6 threads

ggml_init_cublas: found 1 CUDA devices:
Device 0: NVIDIA GeForce GTX 1070
64 x 64: Q4_0 0.2 GFLOPS (128 runs) | Q4_1 0.1 GFLOPS (128 runs)
64 x 64: Q5_0 0.2 GFLOPS (128 runs) | Q5_1 0.2 GFLOPS (128 runs) | Q8_0 0.2 GFLOPS (128 runs)
64 x 64: F16 0.3 GFLOPS (128 runs) | F32 0.2 GFLOPS (128 runs)
128 x 128: Q4_0 1.8 GFLOPS (128 runs) | Q4_1 2.0 GFLOPS (128 runs)
128 x 128: Q5_0 2.2 GFLOPS (128 runs) | Q5_1 2.0 GFLOPS (128 runs) | Q8_0 1.8 GFLOPS (128 runs)
128 x 128: F16 2.0 GFLOPS (128 runs) | F32 2.1 GFLOPS (128 runs)
256 x 256: Q4_0 15.8 GFLOPS (128 runs) | Q4_1 13.3 GFLOPS (128 runs)
256 x 256: Q5_0 15.6 GFLOPS (128 runs) | Q5_1 16.5 GFLOPS (128 runs) | Q8_0 13.3 GFLOPS (128 runs)
256 x 256: F16 16.1 GFLOPS (128 runs) | F32 15.9 GFLOPS (128 runs)
512 x 512: Q4_0 115.3 GFLOPS (128 runs) | Q4_1 92.0 GFLOPS (128 runs)
512 x 512: Q5_0 62.0 GFLOPS (128 runs) | Q5_1 35.4 GFLOPS (128 runs) | Q8_0 41.0 GFLOPS (128 runs)
512 x 512: F16 86.7 GFLOPS (128 runs) | F32 122.5 GFLOPS (128 runs)
1024 x 1024: Q4_0 640.1 GFLOPS (128 runs) | Q4_1 594.0 GFLOPS (128 runs)
1024 x 1024: Q5_0 645.8 GFLOPS (128 runs) | Q5_1 648.0 GFLOPS (128 runs) | Q8_0 545.0 GFLOPS (128 runs)
1024 x 1024: F16 535.3 GFLOPS (128 runs) | F32 428.5 GFLOPS (128 runs)
2048 x 2048: Q4_0 1123.6 GFLOPS ( 66 runs) | Q4_1 848.5 GFLOPS ( 50 runs)
2048 x 2048: Q5_0 1245.0 GFLOPS ( 73 runs) | Q5_1 1795.5 GFLOPS (105 runs) | Q8_0 938.0 GFLOPS ( 55 runs)
2048 x 2048: F16 1363.3 GFLOPS ( 80 runs) | F32 1288.0 GFLOPS ( 75 runs)
4096 x 4096: Q4_0 2804.1 GFLOPS ( 21 runs) | Q4_1 3196.2 GFLOPS ( 24 runs)
4096 x 4096: Q5_0 3149.1 GFLOPS ( 23 runs) | Q5_1 2907.3 GFLOPS ( 22 runs) | Q8_0 2949.2 GFLOPS ( 22 runs)
4096 x 4096: F16 2864.8 GFLOPS ( 21 runs) | F32 2960.5 GFLOPS ( 22 runs)

Model benchmark

Running benchmark for all models
This can take a while!

CPU OS Config Model Th Load Enc. Commit
i5-8400 + GTX 1070 Ubuntu 18.04.6 LTS AVX2 BLAS tiny 6 432 276 a4bb2df
i5-8400 + GTX 1070 Ubuntu 18.04.6 LTS AVX2 BLAS tiny-q5_0 6 382 198 a4bb2df
i5-8400 + GTX 1070 Ubuntu 18.04.6 LTS AVX2 BLAS base 6 544 531 a4bb2df
i5-8400 + GTX 1070 Ubuntu 18.04.6 LTS AVX2 BLAS base-q5_0 6 511 388 a4bb2df
i5-8400 + GTX 1070 Ubuntu 18.04.6 LTS AVX2 BLAS small 6 780 1659 a4bb2df
i5-8400 + GTX 1070 Ubuntu 18.04.6 LTS AVX2 BLAS small-q5_0 6 527 1100 a4bb2df
i5-8400 + GTX 1070 Ubuntu 18.04.6 LTS AVX2 BLAS medium 6 1510 2954 a4bb2df
i5-8400 + GTX 1070 Ubuntu 18.04.6 LTS AVX2 BLAS medium-q5_0 6 795 3315 a4bb2df
i5-8400 + GTX 1070 Ubuntu 18.04.6 LTS AVX2 BLAS large 6 2611 5024 a4bb2df
i5-8400 + GTX 1070 Ubuntu 18.04.6 LTS AVX2 BLAS large-q5_0 6 1205 5231 a4bb2df
i5-8400 + GTX 1070 Ubuntu 18.04.6 LTS AVX2 BLAS tiny 1 442 397 a4bb2df
i5-8400 + GTX 1070 Ubuntu 18.04.6 LTS AVX2 BLAS tiny-q5_0 1 406 391 a4bb2df
i5-8400 + GTX 1070 Ubuntu 18.04.6 LTS AVX2 BLAS base 1 509 744 a4bb2df
i5-8400 + GTX 1070 Ubuntu 18.04.6 LTS AVX2 BLAS base-q5_0 1 425 738 a4bb2df
i5-8400 + GTX 1070 Ubuntu 18.04.6 LTS AVX2 BLAS small 1 774 2140 a4bb2df
i5-8400 + GTX 1070 Ubuntu 18.04.6 LTS AVX2 BLAS small-q5_0 1 516 2050 a4bb2df
i5-8400 + GTX 1070 Ubuntu 18.04.6 LTS AVX2 BLAS medium 1 1509 5569 a4bb2df
i5-8400 + GTX 1070 Ubuntu 18.04.6 LTS AVX2 BLAS medium-q5_0 1 795 5611 a4bb2df
i5-8400 + GTX 1070 Ubuntu 18.04.6 LTS AVX2 BLAS large 1 2630 9155 a4bb2df
i5-8400 + GTX 1070 Ubuntu 18.04.6 LTS AVX2 BLAS large-q5_0 1 1173 9078 a4bb2df

@tazz4843
Contributor

Valve Jupiter (AMD Custom APU 0405)

An update since the last time I ran this (that run was CPU-only due to being on SteamOS): #89 (comment)

Running memcpy benchmark

memcpy: 12.87 GB/s (1 thread)
sum:    -536869898.000000

CPU only

Running ggml_mul_mat benchmark with 8 threads

  64 x   64: Q4_0     2.3 GFLOPS (128 runs) | Q4_1     2.2 GFLOPS (128 runs)
  64 x   64: Q5_0     2.3 GFLOPS (128 runs) | Q5_1     1.4 GFLOPS (128 runs) | Q8_0     2.5 GFLOPS (128 runs)
  64 x   64: F16      2.4 GFLOPS (128 runs) | F32      2.5 GFLOPS (128 runs)
 128 x  128: Q4_0    16.1 GFLOPS (128 runs) | Q4_1    14.7 GFLOPS (128 runs)
 128 x  128: Q5_0    15.4 GFLOPS (128 runs) | Q5_1    15.3 GFLOPS (128 runs) | Q8_0    16.8 GFLOPS (128 runs)
 128 x  128: F16     16.2 GFLOPS (128 runs) | F32     14.7 GFLOPS (128 runs)
 256 x  256: Q4_0    61.3 GFLOPS (128 runs) | Q4_1    58.5 GFLOPS (128 runs)
 256 x  256: Q5_0    55.1 GFLOPS (128 runs) | Q5_1    53.3 GFLOPS (128 runs) | Q8_0    65.9 GFLOPS (128 runs)
 256 x  256: F16     55.2 GFLOPS (128 runs) | F32     54.8 GFLOPS (128 runs)
 512 x  512: Q4_0   107.3 GFLOPS (128 runs) | Q4_1   109.4 GFLOPS (128 runs)
 512 x  512: Q5_0    88.9 GFLOPS (128 runs) | Q5_1    84.6 GFLOPS (128 runs) | Q8_0   129.5 GFLOPS (128 runs)
 512 x  512: F16     77.7 GFLOPS (128 runs) | F32     83.7 GFLOPS (128 runs)
1024 x 1024: Q4_0   127.9 GFLOPS ( 60 runs) | Q4_1   132.8 GFLOPS ( 62 runs)
1024 x 1024: Q5_0   103.0 GFLOPS ( 48 runs) | Q5_1   102.6 GFLOPS ( 48 runs) | Q8_0   159.2 GFLOPS ( 75 runs)
1024 x 1024: F16     82.5 GFLOPS ( 39 runs) | F32     84.6 GFLOPS ( 40 runs)
2048 x 2048: Q4_0   136.0 GFLOPS (  8 runs) | Q4_1   141.5 GFLOPS (  9 runs)
2048 x 2048: Q5_0   107.6 GFLOPS (  7 runs) | Q5_1   108.3 GFLOPS (  7 runs) | Q8_0   164.3 GFLOPS ( 10 runs)
2048 x 2048: F16     83.9 GFLOPS (  5 runs) | F32     92.2 GFLOPS (  6 runs)
4096 x 4096: Q4_0   138.2 GFLOPS (  3 runs) | Q4_1   144.9 GFLOPS (  3 runs)
4096 x 4096: Q5_0   109.5 GFLOPS (  3 runs) | Q5_1   110.5 GFLOPS (  3 runs) | Q8_0   167.6 GFLOPS (  3 runs)
4096 x 4096: F16     83.0 GFLOPS (  3 runs) | F32     74.1 GFLOPS (  3 runs)
CPU OS Config Model Th Enc. Dec. PP Commit
AMD Custom APU 0405 Arch Linux AVX2 tiny 8 630.13 3.49 105.05 903c957
AMD Custom APU 0405 Arch Linux AVX2 tiny-q5_1 8 590.61 2.08 96.99 903c957
AMD Custom APU 0405 Arch Linux AVX2 base 8 1465.93 5.90 248.97 903c957
AMD Custom APU 0405 Arch Linux AVX2 base-q5_1 8 1333.47 3.43 224.44 903c957
AMD Custom APU 0405 Arch Linux AVX2 small 8 5504.17 15.86 938.16 903c957
AMD Custom APU 0405 Arch Linux AVX2 small-q5_1 8 4814.22 9.51 814.53 903c957

using iGPU

Running ggml_mul_mat benchmark with 8 threads

ggml_opencl: selecting platform: 'AMD Accelerated Parallel Processing'
ggml_opencl: selecting device: 'gfx1033'
ggml_opencl: device FP16 support: true
  64 x   64: Q4_0     0.5 GFLOPS (128 runs) | Q4_1     0.5 GFLOPS (128 runs)
  64 x   64: Q5_0     0.5 GFLOPS (128 runs) | Q5_1     0.5 GFLOPS (128 runs) | Q8_0     0.5 GFLOPS (128 runs)
  64 x   64: F16      0.5 GFLOPS (128 runs) | F32      0.5 GFLOPS (128 runs)
 128 x  128: Q4_0     4.0 GFLOPS (128 runs) | Q4_1     3.8 GFLOPS (128 runs)
 128 x  128: Q5_0     3.9 GFLOPS (128 runs) | Q5_1     3.9 GFLOPS (128 runs) | Q8_0     3.8 GFLOPS (128 runs)
 128 x  128: F16      3.8 GFLOPS (128 runs) | F32      4.0 GFLOPS (128 runs)
 256 x  256: Q4_0    24.4 GFLOPS (128 runs) | Q4_1    23.9 GFLOPS (128 runs)
 256 x  256: Q5_0    23.3 GFLOPS (128 runs) | Q5_1    23.9 GFLOPS (128 runs) | Q8_0    23.3 GFLOPS (128 runs)
 256 x  256: F16     25.9 GFLOPS (128 runs) | F32     24.3 GFLOPS (128 runs)
 512 x  512: Q4_0    70.6 GFLOPS (128 runs) | Q4_1    70.3 GFLOPS (128 runs)
 512 x  512: Q5_0    69.2 GFLOPS (128 runs) | Q5_1    70.7 GFLOPS (128 runs) | Q8_0    68.3 GFLOPS (128 runs)
 512 x  512: F16    140.9 GFLOPS (128 runs) | F32     67.0 GFLOPS (128 runs)
1024 x 1024: Q4_0   267.7 GFLOPS (125 runs) | Q4_1   268.2 GFLOPS (125 runs)
1024 x 1024: Q5_0   266.0 GFLOPS (124 runs) | Q5_1   265.2 GFLOPS (124 runs) | Q8_0   278.5 GFLOPS (128 runs)
1024 x 1024: F16    304.0 GFLOPS (128 runs) | F32    273.9 GFLOPS (128 runs)
2048 x 2048: Q4_0   622.1 GFLOPS ( 37 runs) | Q4_1   635.6 GFLOPS ( 38 runs)
2048 x 2048: Q5_0   632.1 GFLOPS ( 37 runs) | Q5_1   634.8 GFLOPS ( 37 runs) | Q8_0   627.2 GFLOPS ( 37 runs)
2048 x 2048: F16    524.5 GFLOPS ( 31 runs) | F32    625.6 GFLOPS ( 37 runs)
4096 x 4096: Q4_0   724.7 GFLOPS (  6 runs) | Q4_1   728.6 GFLOPS (  6 runs)
4096 x 4096: Q5_0   723.8 GFLOPS (  6 runs) | Q5_1   726.4 GFLOPS (  6 runs) | Q8_0   722.9 GFLOPS (  6 runs)
4096 x 4096: F16    857.6 GFLOPS (  7 runs) | F32    725.6 GFLOPS (  6 runs)
CPU OS Config Model Th Enc. Dec. PP Commit
AMD Custom APU 0405 (iGPU) Arch Linux AVX2 BLAS tiny 8 574.83 3.67 206.55 903c957
AMD Custom APU 0405 (iGPU) Arch Linux AVX2 BLAS tiny-q5_1 8 599.32 2.49 231.62 903c957
AMD Custom APU 0405 (iGPU) Arch Linux AVX2 BLAS base 8 1119.73 6.06 379.41 903c957
AMD Custom APU 0405 (iGPU) Arch Linux AVX2 BLAS base-q5_1 8 1168.92 4.01 484.14 903c957
AMD Custom APU 0405 (iGPU) Arch Linux AVX2 BLAS small 8 3492.86 16.07 1093.41 903c957
AMD Custom APU 0405 (iGPU) Arch Linux AVX2 BLAS small-q5_1 8 3476.28 11.12 1403.84 903c957

Didn't test anything more as I didn't have the disk space to download them.

@henry2man

Hi there. These are some tests with a Mac Studio M1 Ultra with MacOS Sonoma 14.0.

Running memcpy benchmark

memcpy: 36.64 GB/s (1 thread)
sum:    -536871564.000000

Running ggml_mul_mat benchmark with 20 threads

  64 x   64: Q4_0     1.7 GFLOPS (128 runs) | Q4_1     2.1 GFLOPS (128 runs)
  64 x   64: Q5_0     2.5 GFLOPS (128 runs) | Q5_1     2.8 GFLOPS (128 runs) | Q8_0     2.8 GFLOPS (128 runs)
  64 x   64: F16      2.7 GFLOPS (128 runs) | F32      2.1 GFLOPS (128 runs)
 128 x  128: Q4_0    15.0 GFLOPS (128 runs) | Q4_1    20.1 GFLOPS (128 runs)
 128 x  128: Q5_0    19.5 GFLOPS (128 runs) | Q5_1    18.4 GFLOPS (128 runs) | Q8_0    17.2 GFLOPS (128 runs)
 128 x  128: F16     19.0 GFLOPS (128 runs) | F32     21.1 GFLOPS (128 runs)
 256 x  256: Q4_0   101.8 GFLOPS (128 runs) | Q4_1    94.0 GFLOPS (128 runs)
 256 x  256: Q5_0   100.0 GFLOPS (128 runs) | Q5_1    98.2 GFLOPS (128 runs) | Q8_0    97.4 GFLOPS (128 runs)
 256 x  256: F16     97.1 GFLOPS (128 runs) | F32     92.9 GFLOPS (128 runs)
 512 x  512: Q4_0   388.1 GFLOPS (128 runs) | Q4_1   365.4 GFLOPS (128 runs)
 512 x  512: Q5_0   412.8 GFLOPS (128 runs) | Q5_1   432.6 GFLOPS (128 runs) | Q8_0   405.2 GFLOPS (128 runs)
 512 x  512: F16    444.4 GFLOPS (128 runs) | F32    491.0 GFLOPS (128 runs)
1024 x 1024: Q4_0  1172.6 GFLOPS (128 runs) | Q4_1  1235.8 GFLOPS (128 runs)
1024 x 1024: Q5_0  1462.5 GFLOPS (128 runs) | Q5_1  1483.5 GFLOPS (128 runs) | Q8_0  1432.8 GFLOPS (128 runs)
1024 x 1024: F16   1515.2 GFLOPS (128 runs) | F32   1738.2 GFLOPS (128 runs)
2048 x 2048: Q4_0  2588.0 GFLOPS (128 runs) | Q4_1  2340.6 GFLOPS (128 runs)
2048 x 2048: Q5_0  2508.7 GFLOPS (128 runs) | Q5_1  2476.3 GFLOPS (128 runs) | Q8_0  2717.1 GFLOPS (128 runs)
2048 x 2048: F16   2707.1 GFLOPS (128 runs) | F32   2898.0 GFLOPS (128 runs)
4096 x 4096: Q4_0  2698.6 GFLOPS ( 20 runs) | Q4_1  2538.5 GFLOPS ( 19 runs)
4096 x 4096: Q5_0  2593.5 GFLOPS ( 19 runs) | Q5_1  2594.1 GFLOPS ( 19 runs) | Q8_0  2664.0 GFLOPS ( 20 runs)
4096 x 4096: F16   2748.2 GFLOPS ( 20 runs) | F32   2825.4 GFLOPS ( 21 runs)

Running benchmark for all models
This can take a while!

CPU OS Config Model Th Enc. Dec. PP Commit
Apple M1 Ultra MacOS 14.0 NEON BLAS METAL tiny 20 22.85 1.97 4.22 c76c11e
Apple M1 Ultra MacOS 14.0 NEON BLAS METAL base 20 31.55 2.78 6.48 c76c11e
Apple M1 Ultra MacOS 14.0 NEON BLAS METAL small 20 81.78 4.91 17.00 c76c11e
Apple M1 Ultra MacOS 14.0 NEON BLAS METAL medium 20 217.00 10.54 41.68 c76c11e
Apple M1 Ultra MacOS 14.0 NEON BLAS METAL large 20 358.84 14.48 73.92 c76c11e

@Bad-Science

@henry2man amazing results. What compiler options are you using / are you using ANE?

@henry2man

henry2man commented Oct 12, 2023

@henry2man amazing results. What compiler options are you using / are you using ANE?

I was simply experimenting with the project, so as far as I remember I just used the extra/bench-all.sh shell script without any other tweaks, just following the given instructions.

If you want me to execute specific tests please tell me and I'll be glad to contribute the results here.

EDIT: after deeper research, I think I didn't make use of the ANE, but I'm not 100% sure.

@nickovs

nickovs commented Nov 3, 2023

Results for the new Raspberry Pi 5. Tests performed on a board with the active cooler. uname -a output is:

Linux newpi 6.1.0-rpi4-rpi-2712 #1 SMP PREEMPT Debian 1:6.1.54-1+rpt2 (2023-10-05) aarch64 GNU/Linux
CPU OS Config Threads Model Encode Decode Commit
BCM2712 Bookworm 12.2 NEON 4 tiny 1106.11 183.67 54c978c
BCM2712 Bookworm 12.2 NEON 4 tiny.en 1109.66 201.3 54c978c
BCM2712 Bookworm 12.2 NEON 4 base 2479.82 346.65 54c978c
BCM2712 Bookworm 12.2 NEON 4 base.en 2465.12 363.86 54c978c
BCM2712 Bookworm 12.2 NEON 4 small 8308.3 963.24 54c978c
BCM2712 Bookworm 12.2 NEON 4 small.en 8342.25 1119.25 54c978c
BCM2712 Bookworm 12.2 NEON 4 medium.en 26407.77 2893.55 54c978c
BCM2712 Bookworm 12.2 NEON 4 medium 26468.86 2919.43 54c978c

These results are 4.5 to 6.2 times faster than the Raspberry Pi 4.

NOTE: The packaged version of OpenBLAS has not been recompiled for the new CPU architecture, so it is about 50% slower than whisper.cpp's native NEON implementation. I will post benchmarks using OpenBLAS once I have built a version for the new CPU.
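As a sketch of what that build might look like (assumptions on my part: the OpenMathLib repo URL and the NEOVERSEN1 target, the closest relative of the Pi 5's Cortex-A76 I would expect in the OpenBLAS TargetList; the install prefix may also need adjusting for the whisper.cpp Makefile to find the library):

# build OpenBLAS tuned for a Cortex-A76-class core
git clone https://github.com/OpenMathLib/OpenBLAS
cd OpenBLAS
make -j TARGET=NEOVERSEN1
sudo make install

# rebuild whisper.cpp against it
cd ../whisper.cpp
make clean
WHISPER_OPENBLAS=1 make -j bench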

The memcpy and ggml_mul_mat benchmarks show:

memcpy: 4.64 GB/s (1 thread)
sum:    136902081526.000000

  64 x   64: Q4_0     5.5 GFLOPS (128 runs) | Q4_1     5.1 GFLOPS (128 runs)
  64 x   64: Q5_0     4.7 GFLOPS (128 runs) | Q5_1     4.9 GFLOPS (128 runs) | Q8_0     5.0 GFLOPS (128 runs)
  64 x   64: F16      5.0 GFLOPS (128 runs) | F32      4.9 GFLOPS (128 runs)
 128 x  128: Q4_0    22.9 GFLOPS (128 runs) | Q4_1    22.6 GFLOPS (128 runs)
 128 x  128: Q5_0    19.7 GFLOPS (128 runs) | Q5_1    20.3 GFLOPS (128 runs) | Q8_0    23.9 GFLOPS (128 runs)
 128 x  128: F16     26.3 GFLOPS (128 runs) | F32     13.3 GFLOPS (128 runs)
 256 x  256: Q4_0    39.0 GFLOPS (128 runs) | Q4_1    49.4 GFLOPS (128 runs)
 256 x  256: Q5_0    33.0 GFLOPS (128 runs) | Q5_1    37.5 GFLOPS (128 runs) | Q8_0    58.6 GFLOPS (128 runs)
 256 x  256: F16     64.1 GFLOPS (128 runs) | F32     48.4 GFLOPS (128 runs)
 512 x  512: Q4_0    62.6 GFLOPS (128 runs) | Q4_1    62.3 GFLOPS (128 runs)
 512 x  512: Q5_0    49.9 GFLOPS (128 runs) | Q5_1    46.1 GFLOPS (128 runs) | Q8_0    76.2 GFLOPS (128 runs)
 512 x  512: F16     80.1 GFLOPS (128 runs) | F32     51.1 GFLOPS (128 runs)
1024 x 1024: Q4_0    67.9 GFLOPS ( 32 runs) | Q4_1    67.6 GFLOPS ( 32 runs)
1024 x 1024: Q5_0    53.5 GFLOPS ( 25 runs) | Q5_1    50.4 GFLOPS ( 24 runs) | Q8_0    85.4 GFLOPS ( 40 runs)
1024 x 1024: F16     92.9 GFLOPS ( 44 runs) | F32     48.0 GFLOPS ( 23 runs)
2048 x 2048: Q4_0    71.0 GFLOPS (  5 runs) | Q4_1    72.2 GFLOPS (  5 runs)
2048 x 2048: Q5_0    55.7 GFLOPS (  4 runs) | Q5_1    52.3 GFLOPS (  4 runs) | Q8_0    87.6 GFLOPS (  6 runs)
2048 x 2048: F16     93.1 GFLOPS (  6 runs) | F32     43.9 GFLOPS (  3 runs)
4096 x 4096: Q4_0    72.2 GFLOPS (  3 runs) | Q4_1    73.9 GFLOPS (  3 runs)
4096 x 4096: Q5_0    55.9 GFLOPS (  3 runs) | Q5_1    52.7 GFLOPS (  3 runs) | Q8_0    86.9 GFLOPS (  3 runs)
4096 x 4096: F16     86.8 GFLOPS (  3 runs) | F32     38.4 GFLOPS (  3 runs)

@marjisound

CPU details: Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz
GPU name: NVIDIA Tesla T4
OS: Linux 14 22.04.1-Ubuntu
Compiler: cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0

WHISPER_CUBLAS=1 make -j bench && ./extra/bench-all.sh

I whisper.cpp build info:
I UNAME_S:  Linux
I UNAME_P:  x86_64
I UNAME_M:  x86_64
I CFLAGS:   -I.              -O3 -DNDEBUG -std=c11   -fPIC -pthread -mavx2 -mfma -mf16c -mavx -msse3 -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include
I CXXFLAGS: -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -pthread -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include
I LDFLAGS:  -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/opt/cuda/lib64 -L/targets/x86_64-linux/lib
I CC:       cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
I CXX:      g++ (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0

make: 'bench' is up to date.
Usage: ./bench.sh [n_threads]

Running memcpy benchmark with 1 thread

memcpy: 5.05 GB/s
sum:    -536870997.000000

Running ggml_mul_mat benchmark with 4 threads

ggml_mul_mat:   64 x   64: Q4_0     3.8 GFLOPS (128 runs) / Q4_1     3.8 GFLOPS (128 runs) / F16     3.8 GFLOPS (128 runs) / F32     3.9 GFLOPS (128 runs)
ggml_mul_mat:  128 x  128: Q4_0    23.6 GFLOPS (128 runs) / Q4_1    24.0 GFLOPS (128 runs) / F16    22.1 GFLOPS (128 runs) / F32    22.4 GFLOPS (128 runs)
ggml_mul_mat:  256 x  256: Q4_0    90.3 GFLOPS (128 runs) / Q4_1   100.0 GFLOPS (128 runs) / F16    92.0 GFLOPS (128 runs) / F32    92.3 GFLOPS (128 runs)
ggml_mul_mat:  512 x  512: Q4_0   278.8 GFLOPS (128 runs) / Q4_1   277.6 GFLOPS (128 runs) / F16   244.9 GFLOPS (128 runs) / F32   242.1 GFLOPS (128 runs)
ggml_mul_mat: 1024 x 1024: Q4_0   859.2 GFLOPS (128 runs) / Q4_1   853.6 GFLOPS (128 runs) / F16   648.3 GFLOPS (128 runs) / F32   685.4 GFLOPS (128 runs)
ggml_mul_mat: 2048 x 2048: Q4_0  1583.4 GFLOPS ( 93 runs) / Q4_1  1585.1 GFLOPS ( 93 runs) / F16  1383.9 GFLOPS ( 81 runs) / F32  1359.7 GFLOPS ( 80 runs)
ggml_mul_mat: 4096 x 4096: Q4_0  2525.9 GFLOPS ( 19 runs) / Q4_1  2658.6 GFLOPS ( 20 runs) / F16  2716.0 GFLOPS ( 20 runs) / F32  2302.7 GFLOPS ( 17 runs)

Running benchmark for all models
This can take a while!

CPU OS Config Model Th Load Enc. Commit
Xeon(R) Ubuntu AVX2 BLAS tiny 4 429 550 fa8dbdc
Xeon(R) Ubuntu AVX2 BLAS base 4 521 1133 fa8dbdc
Xeon(R) Ubuntu AVX2 BLAS small 4 798 3025 fa8dbdc
Xeon(R) Ubuntu AVX2 BLAS medium 4 1701 7639 fa8dbdc
Xeon(R) Ubuntu AVX2 BLAS large 4 2966 12927 fa8dbdc

@StuartIanNaylor

StuartIanNaylor commented Nov 3, 2023

What's happening with commit 8a2bee6?
I was just interested in comparing the same master build on the Opi5 vs the Rpi5, but I seem to have an extra PP column that I am sure I will find a use for.
Rpi 5gb
Linux raspberrypi 6.1.0-rpi4-rpi-2712 #1 SMP PREEMPT Debian 1:6.1.54-1+rpt2 (2023-10-05) aarch64 GNU/Linux

memcpy: 5.32 GB/s (1 thread)
sum:    136902081526.000000

Running ggml_mul_mat benchmark with 4 threads

  64 x   64: Q4_0     6.0 GFLOPS (128 runs) | Q4_1     5.9 GFLOPS (128 runs)
  64 x   64: Q5_0     5.3 GFLOPS (128 runs) | Q5_1     4.9 GFLOPS (128 runs) | Q8_0     1.9 GFLOPS (128 runs)
  64 x   64: F16      6.0 GFLOPS (128 runs) | F32      5.8 GFLOPS (128 runs)
 128 x  128: Q4_0    23.9 GFLOPS (128 runs) | Q4_1    22.6 GFLOPS (128 runs)
 128 x  128: Q5_0    21.4 GFLOPS (128 runs) | Q5_1    20.4 GFLOPS (128 runs) | Q8_0    11.4 GFLOPS (128 runs)
 128 x  128: F16     28.6 GFLOPS (128 runs) | F32     26.2 GFLOPS (128 runs)
 256 x  256: Q4_0    49.8 GFLOPS (128 runs) | Q4_1    49.6 GFLOPS (128 runs)
 256 x  256: Q5_0    40.9 GFLOPS (128 runs) | Q5_1    24.8 GFLOPS (128 runs) | Q8_0    59.0 GFLOPS (128 runs)
 256 x  256: F16     63.0 GFLOPS (128 runs) | F32     29.6 GFLOPS (128 runs)
 512 x  512: Q4_0    56.6 GFLOPS (128 runs) | Q4_1    56.5 GFLOPS (128 runs)
 512 x  512: Q5_0    30.4 GFLOPS (114 runs) | Q5_1    36.5 GFLOPS (128 runs) | Q8_0    71.2 GFLOPS (128 runs)
 512 x  512: F16     64.6 GFLOPS (128 runs) | F32     35.2 GFLOPS (128 runs)
1024 x 1024: Q4_0    67.4 GFLOPS ( 32 runs) | Q4_1    68.7 GFLOPS ( 32 runs)
1024 x 1024: Q5_0    38.1 GFLOPS ( 18 runs) | Q5_1    32.3 GFLOPS ( 16 runs) | Q8_0    61.3 GFLOPS ( 29 runs)
1024 x 1024: F16     71.7 GFLOPS ( 34 runs) | F32     35.1 GFLOPS ( 17 runs)
2048 x 2048: Q4_0    71.4 GFLOPS (  5 runs) | Q4_1    71.5 GFLOPS (  5 runs)
2048 x 2048: Q5_0    38.1 GFLOPS (  3 runs) | Q5_1    36.9 GFLOPS (  3 runs) | Q8_0    63.5 GFLOPS (  4 runs)
2048 x 2048: F16     68.6 GFLOPS (  4 runs) | F32     32.0 GFLOPS (  3 runs)
4096 x 4096: Q4_0    66.8 GFLOPS (  3 runs) | Q4_1    62.4 GFLOPS (  3 runs)
4096 x 4096: Q5_0    39.5 GFLOPS (  3 runs) | Q5_1    37.0 GFLOPS (  3 runs) | Q8_0    62.7 GFLOPS (  3 runs)
4096 x 4096: F16     61.5 GFLOPS (  3 runs) | F32     29.1 GFLOPS (  3 runs)

Running benchmark for all models
This can take a while!

|    CPU |     OS |           Config |       Model |  Th |    Enc. |    Dec. |      PP |  Commit |
|    --- |    --- |              --- |         --- | --- |     --- |     --- |     --- |     --- |
| Rpi5 BCM2712 | bookworm |             NEON |        tiny |   4 | 1206.23 |    6.67 |  198.84 | 8a2bee6 |
| Rpi5 BCM2712 | bookworm |             NEON |        base |   4 | 2862.56 |   11.74 |  466.51 | 8a2bee6 |
| Rpi5 BCM2712 | bookworm |             NEON |       small |   4 | 9630.88 |   32.81 | 1650.18 | 8a2bee6 |
| Rpi5 BCM2712 | bookworm |             NEON |      medium |   4 |      ms |   99.64 | 5601.57 | 8a2bee6 |

Opi5 4gb
Linux ubuntu 6.6.0 #1 SMP PREEMPT Mon Oct 30 22:54:25 GMT 2023 aarch64 aarch64 aarch64 GNU/Linux
Mainline Linux rather than the Rockchip BSP: https://github.com/Joshua-Riek/ubuntu-rockchip/releases/tag/v1.29.1

memcpy: 10.93 GB/s (1 thread)
sum:    136902081526.000000

Running ggml_mul_mat benchmark with 4 threads

  64 x   64: Q4_0     6.8 GFLOPS (128 runs) | Q4_1     4.1 GFLOPS (128 runs)
  64 x   64: Q5_0     5.9 GFLOPS (128 runs) | Q5_1     6.0 GFLOPS (128 runs) | Q8_0     6.6 GFLOPS (128 runs)
  64 x   64: F16      4.1 GFLOPS (128 runs) | F32      6.8 GFLOPS (128 runs)
 128 x  128: Q4_0    14.0 GFLOPS (128 runs) | Q4_1    19.1 GFLOPS (128 runs)
 128 x  128: Q5_0    15.5 GFLOPS (128 runs) | Q5_1    12.7 GFLOPS (128 runs) | Q8_0    26.6 GFLOPS (128 runs)
 128 x  128: F16     22.1 GFLOPS (128 runs) | F32     21.2 GFLOPS (128 runs)
 256 x  256: Q4_0    45.0 GFLOPS (128 runs) | Q4_1    45.0 GFLOPS (128 runs)
 256 x  256: Q5_0    29.0 GFLOPS (128 runs) | Q5_1    29.6 GFLOPS (128 runs) | Q8_0    42.8 GFLOPS (128 runs)
 256 x  256: F16     42.5 GFLOPS (128 runs) | F32     42.6 GFLOPS (128 runs)
 512 x  512: Q4_0    55.8 GFLOPS (128 runs) | Q4_1    56.0 GFLOPS (128 runs)
 512 x  512: Q5_0    35.5 GFLOPS (128 runs) | Q5_1    36.7 GFLOPS (128 runs) | Q8_0    61.9 GFLOPS (128 runs)
 512 x  512: F16     80.7 GFLOPS (128 runs) | F32     49.6 GFLOPS (128 runs)
1024 x 1024: Q4_0    60.6 GFLOPS ( 29 runs) | Q4_1    61.4 GFLOPS ( 29 runs)
1024 x 1024: Q5_0    37.6 GFLOPS ( 18 runs) | Q5_1    39.3 GFLOPS ( 19 runs) | Q8_0    68.2 GFLOPS ( 32 runs)
1024 x 1024: F16     93.1 GFLOPS ( 44 runs) | F32     46.4 GFLOPS ( 22 runs)
2048 x 2048: Q4_0    63.1 GFLOPS (  4 runs) | Q4_1    64.1 GFLOPS (  4 runs)
2048 x 2048: Q5_0    39.2 GFLOPS (  3 runs) | Q5_1    41.0 GFLOPS (  3 runs) | Q8_0    70.9 GFLOPS (  5 runs)
2048 x 2048: F16     87.9 GFLOPS (  6 runs) | F32     41.4 GFLOPS (  3 runs)
4096 x 4096: Q4_0    64.2 GFLOPS (  3 runs) | Q4_1    65.3 GFLOPS (  3 runs)
4096 x 4096: Q5_0    39.7 GFLOPS (  3 runs) | Q5_1    41.7 GFLOPS (  3 runs) | Q8_0    70.7 GFLOPS (  3 runs)
4096 x 4096: F16     80.7 GFLOPS (  3 runs) | F32     38.7 GFLOPS (  3 runs)

Running benchmark for all models
This can take a while!

|    CPU |     OS |           Config |       Model |  Th |    Enc. |    Dec. |      PP |  Commit |
|    --- |    --- |              --- |         --- | --- |     --- |     --- |     --- |     --- |
| Opi5 Rk3588s | 22.04.3 LTS (Jammy Jellyfish) |             NEON |        tiny |   4 |  782.52 |    3.10 |  135.25 | 8a2bee6 |
| Opi5 Rk3588s | 22.04.3 LTS (Jammy Jellyfish) |             NEON |        base |   4 | 1754.69 |   11.81 |  304.06 | 8a2bee6 |
| Opi5 Rk3588s | 22.04.3 LTS (Jammy Jellyfish) |             NEON |       small |   4 | 6226.10 |   15.26 | 1075.54 | 8a2bee6 |
| Opi5 Rk3588s | 22.04.3 LTS (Jammy Jellyfish) |             NEON |      medium |   4 |      ms |   44.75 | 3425.05 | 8a2bee6 |

@ggerganov
Owner Author

ggerganov commented Nov 3, 2023

@nickovs These are some very interesting results. Looking forward to the OpenBLAS results as well.

@StuartIanNaylor The PP timing is the "prompt processing" time for a prompt of 256 tokens. As we transcribe with whisper, the context (i.e. the previously transcribed text) grows up to n_text_ctx. For each new audio segment that we process, we have to process the context. This processing is very similar to the token-by-token text generation during decoding, but it is much faster since we process 256 tokens at once.

@nickovs

nickovs commented Nov 3, 2023

By way of comparison to the benchmarks I posted above, here are the matrix multiplication numbers for the same Raspberry Pi 5 using OpenBLAS. It is notable that Whisper.cpp's native NEON code outperforms OpenBLAS on the Pi5 for everything except FP32, where OpenBLAS wins by some margin.

  64 x   64: Q4_0     4.4 GFLOPS (128 runs) | Q4_1     4.3 GFLOPS (128 runs)
  64 x   64: Q5_0     3.7 GFLOPS (128 runs) | Q5_1     4.2 GFLOPS (128 runs) | Q8_0     4.1 GFLOPS (128 runs)
  64 x   64: F16      4.0 GFLOPS (128 runs) | F32      4.1 GFLOPS (128 runs)
 128 x  128: Q4_0     0.9 GFLOPS (128 runs) | Q4_1     0.9 GFLOPS (128 runs)
 128 x  128: Q5_0     0.9 GFLOPS (128 runs) | Q5_1     0.9 GFLOPS (128 runs) | Q8_0     0.9 GFLOPS (128 runs)
 128 x  128: F16      0.9 GFLOPS (128 runs) | F32      0.9 GFLOPS (128 runs)
 256 x  256: Q4_0     6.3 GFLOPS (128 runs) | Q4_1     6.4 GFLOPS (128 runs)
 256 x  256: Q5_0     6.4 GFLOPS (128 runs) | Q5_1     6.3 GFLOPS (128 runs) | Q8_0     6.4 GFLOPS (128 runs)
 256 x  256: F16      6.4 GFLOPS (128 runs) | F32      6.5 GFLOPS (128 runs)
 512 x  512: Q4_0    19.7 GFLOPS ( 74 runs) | Q4_1    20.4 GFLOPS ( 76 runs)
 512 x  512: Q5_0    23.7 GFLOPS ( 89 runs) | Q5_1    23.5 GFLOPS ( 89 runs) | Q8_0    23.7 GFLOPS ( 89 runs)
 512 x  512: F16     24.0 GFLOPS ( 90 runs) | F32     25.3 GFLOPS ( 95 runs)
1024 x 1024: Q4_0    35.5 GFLOPS ( 17 runs) | Q4_1    36.5 GFLOPS ( 17 runs)
1024 x 1024: Q5_0    38.9 GFLOPS ( 19 runs) | Q5_1    39.1 GFLOPS ( 19 runs) | Q8_0    38.7 GFLOPS ( 19 runs)
1024 x 1024: F16     39.3 GFLOPS ( 19 runs) | F32     40.9 GFLOPS ( 20 runs)
2048 x 2048: Q4_0    52.8 GFLOPS (  4 runs) | Q4_1    55.4 GFLOPS (  4 runs)
2048 x 2048: Q5_0    56.8 GFLOPS (  4 runs) | Q5_1    55.6 GFLOPS (  4 runs) | Q8_0    56.5 GFLOPS (  4 runs)
2048 x 2048: F16     56.1 GFLOPS (  4 runs) | F32     56.4 GFLOPS (  4 runs)
4096 x 4096: Q4_0    55.3 GFLOPS (  3 runs) | Q4_1    56.9 GFLOPS (  3 runs)
4096 x 4096: Q5_0    58.9 GFLOPS (  3 runs) | Q5_1    60.0 GFLOPS (  3 runs) | Q8_0    61.4 GFLOPS (  3 runs)
4096 x 4096: F16     59.3 GFLOPS (  3 runs) | F32     60.4 GFLOPS (  3 runs)

I have not tried all the tuning options in OpenBLAS, but the options I did try didn't really change the performance compared to the pre-packaged version.
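
(If you want a like-for-like run, one knob worth fixing is the OpenBLAS thread count so its pool matches the -t 4 used above; a minimal sketch, assuming OpenBLAS's public C API:)

```cpp
// Minimal sketch: pin OpenBLAS to 4 threads before benchmarking so its
// thread pool matches whisper.cpp's -t 4 and doesn't oversubscribe cores.
extern "C" void openblas_set_num_threads(int num_threads); // OpenBLAS C API

int main() {
    openblas_set_num_threads(4);
    // ... run the matrix multiplication benchmark from here ...
    return 0;
}
```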

@StuartIanNaylor

StuartIanNaylor commented Nov 4, 2023

> I have not tried all the tuning options in OpenBLAS, but the options I did try didn't really change the performance compared to the pre-packaged version.

I think this is where we benefit from Armv8.2, being in a subgroup of the Apple Silicon first-class citizens that are optimized via ARM NEON.
If you run lscpu you get:
Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp
So I guess we benefit from GGML being optimised around the Armv8.2+ architecture.
What should be interesting with https://github.com/ggerganov/whisper.cpp#opencl-gpu-support-via-clblast is that the GPU on the Pi5 & RK3588(s) should be able to use OpenCL, but in testing I am finding the same there and wondering if the cause is similar.
I never worked out whether, due to the serial nature of Whisper, you only get a speedup if the GPU is faster than the CPU; in my testing I get a huge slowdown, whilst in other ML tests the supposed 610.6 FP32 GFLOPS of the Mali G610 works mightily, at approx 75% of the CPU, with ArmNN tests using the GPU TFLite OpenCL delegate.
I am presuming CLBlast is somewhat similar and may not be well optimised for some data types?

> These results are 4.5 to 6.2 times faster than the Raspberry Pi 4.

Not too sure about that, as the same commit would likely have to be tested; I seem to remember pegging the RK3588s at < 5x a Pi4 and, likely due to memory bandwidth, quite a bit faster than a Pi5.

Linux ubuntu 6.6.0 #1 SMP PREEMPT Opi5 4GB performance governor

memcpy: 10.50 GB/s (1 thread)
sum:    136902081526.000000

Running ggml_mul_mat benchmark with 4 threads

  64 x   64: Q4_0     3.5 GFLOPS (128 runs) | Q4_1     3.2 GFLOPS (128 runs)
  64 x   64: Q5_0     2.8 GFLOPS (128 runs) | Q5_1     2.7 GFLOPS (128 runs) | Q8_0     3.5 GFLOPS (128 runs)
  64 x   64: F16      3.4 GFLOPS (128 runs) | F32      3.4 GFLOPS (128 runs)
 128 x  128: Q4_0     7.9 GFLOPS (128 runs) | Q4_1     8.1 GFLOPS (128 runs)
 128 x  128: Q5_0     6.2 GFLOPS (128 runs) | Q5_1     6.5 GFLOPS (128 runs) | Q8_0     7.9 GFLOPS (128 runs)
 128 x  128: F16      9.4 GFLOPS (128 runs) | F32      7.5 GFLOPS (128 runs)
 256 x  256: Q4_0    10.5 GFLOPS (128 runs) | Q4_1    11.1 GFLOPS (128 runs)
 256 x  256: Q5_0     7.9 GFLOPS (128 runs) | Q5_1     8.5 GFLOPS (128 runs) | Q8_0    10.3 GFLOPS (128 runs)
 256 x  256: F16     14.5 GFLOPS (128 runs) | F32      9.3 GFLOPS (128 runs)
 512 x  512: Q4_0    11.7 GFLOPS ( 44 runs) | Q4_1    12.4 GFLOPS ( 47 runs)
 512 x  512: Q5_0     8.8 GFLOPS ( 33 runs) | Q5_1     9.7 GFLOPS ( 37 runs) | Q8_0    11.4 GFLOPS ( 43 runs)
 512 x  512: F16     17.8 GFLOPS ( 67 runs) | F32      9.2 GFLOPS ( 35 runs)
1024 x 1024: Q4_0    32.2 GFLOPS ( 15 runs) | Q4_1    33.2 GFLOPS ( 16 runs)
1024 x 1024: Q5_0    24.9 GFLOPS ( 12 runs) | Q5_1    25.7 GFLOPS ( 12 runs) | Q8_0    35.2 GFLOPS ( 17 runs)
1024 x 1024: F16     38.0 GFLOPS ( 18 runs) | F32     27.5 GFLOPS ( 13 runs)
2048 x 2048: Q4_0    57.7 GFLOPS (  4 runs) | Q4_1    59.5 GFLOPS (  4 runs)
2048 x 2048: Q5_0    38.0 GFLOPS (  3 runs) | Q5_1    39.3 GFLOPS (  3 runs) | Q8_0    64.3 GFLOPS (  4 runs)
2048 x 2048: F16     77.9 GFLOPS (  5 runs) | F32     38.4 GFLOPS (  3 runs)
4096 x 4096: Q4_0    63.4 GFLOPS (  3 runs) | Q4_1    64.7 GFLOPS (  3 runs)
4096 x 4096: Q5_0    39.9 GFLOPS (  3 runs) | Q5_1    41.7 GFLOPS (  3 runs) | Q8_0    70.3 GFLOPS (  3 runs)
4096 x 4096: F16     78.6 GFLOPS (  3 runs) | F32     37.2 GFLOPS (  3 runs)

Running benchmark for all models
This can take a while!

|    CPU |     OS |           Config |       Model |  Th |    Enc. |    Dec. |      PP |  Commit |
|    --- |    --- |              --- |         --- | --- |     --- |     --- |     --- |     --- |
| <todo> | <todo> |             NEON |        tiny |   4 |  853.56 |    7.37 |  161.81 | f96e1c5 |
| <todo> | <todo> |             NEON |        base |   4 | 1847.86 |   13.00 |  338.18 | f96e1c5 |
| <todo> | <todo> |             NEON |       small |   4 | 6289.17 |   39.19 | 1109.25 | f96e1c5 |
| <todo> | <todo> |             NEON |      medium |   4 |      ms |   67.99 | 3454.96 | f96e1c5 |
| <todo> | <todo> |             NEON |       large |   4 |      ms |  107.50 | 6541.15 | f96e1c5 |

Linux raspberrypi 6.1.0-rpi4-rpi-2712 Rpi5 4GB performance governor

memcpy: 6.03 GB/s (1 thread)
sum:    136902081526.000000

Running ggml_mul_mat benchmark with 4 threads

  64 x   64: Q4_0     5.7 GFLOPS (128 runs) | Q4_1     5.5 GFLOPS (128 runs)
  64 x   64: Q5_0     5.3 GFLOPS (128 runs) | Q5_1     5.1 GFLOPS (128 runs) | Q8_0     5.6 GFLOPS (128 runs)
  64 x   64: F16      5.6 GFLOPS (128 runs) | F32      5.7 GFLOPS (128 runs)
 128 x  128: Q4_0    22.8 GFLOPS (128 runs) | Q4_1    24.1 GFLOPS (128 runs)
 128 x  128: Q5_0    12.3 GFLOPS (128 runs) | Q5_1    11.8 GFLOPS (128 runs) | Q8_0    11.3 GFLOPS (128 runs)
 128 x  128: F16     15.4 GFLOPS (128 runs) | F32     26.5 GFLOPS (128 runs)
 256 x  256: Q4_0    49.7 GFLOPS (128 runs) | Q4_1    50.3 GFLOPS (128 runs)
 256 x  256: Q5_0    41.8 GFLOPS (128 runs) | Q5_1    39.0 GFLOPS (128 runs) | Q8_0    59.7 GFLOPS (128 runs)
 256 x  256: F16     65.2 GFLOPS (128 runs) | F32     48.7 GFLOPS (128 runs)
 512 x  512: Q4_0    63.0 GFLOPS (128 runs) | Q4_1    63.6 GFLOPS (128 runs)
 512 x  512: Q5_0    50.5 GFLOPS (128 runs) | Q5_1    47.3 GFLOPS (128 runs) | Q8_0    77.7 GFLOPS (128 runs)
 512 x  512: F16     85.6 GFLOPS (128 runs) | F32     53.3 GFLOPS (128 runs)
1024 x 1024: Q4_0    68.1 GFLOPS ( 32 runs) | Q4_1    69.8 GFLOPS ( 33 runs)
1024 x 1024: Q5_0    54.1 GFLOPS ( 26 runs) | Q5_1    51.2 GFLOPS ( 24 runs) | Q8_0    86.0 GFLOPS ( 41 runs)
1024 x 1024: F16     93.6 GFLOPS ( 44 runs) | F32     49.0 GFLOPS ( 23 runs)
2048 x 2048: Q4_0    70.8 GFLOPS (  5 runs) | Q4_1    72.8 GFLOPS (  5 runs)
2048 x 2048: Q5_0    56.1 GFLOPS (  4 runs) | Q5_1    53.0 GFLOPS (  4 runs) | Q8_0    88.1 GFLOPS (  6 runs)
2048 x 2048: F16     93.7 GFLOPS (  6 runs) | F32     44.4 GFLOPS (  3 runs)
4096 x 4096: Q4_0    72.6 GFLOPS (  3 runs) | Q4_1    74.8 GFLOPS (  3 runs)
4096 x 4096: Q5_0    56.2 GFLOPS (  3 runs) | Q5_1    53.3 GFLOPS (  3 runs) | Q8_0    88.4 GFLOPS (  3 runs)
4096 x 4096: F16     86.7 GFLOPS (  3 runs) | F32     39.7 GFLOPS (  3 runs)

Running benchmark for all models
This can take a while!

|    CPU |     OS |           Config |       Model |  Th |    Enc. |    Dec. |      PP |  Commit |
|    --- |    --- |              --- |         --- | --- |     --- |     --- |     --- |     --- |
| <todo> | <todo> |             NEON |        tiny |   4 | 1049.00 |    6.74 |  149.32 | f96e1c5 |
| <todo> | <todo> |             NEON |        base |   4 | 2362.92 |   12.60 |  361.37 | f96e1c5 |
| <todo> | <todo> |             NEON |       small |   4 | 8081.87 |   35.65 | 1283.34 | f96e1c5 |
| <todo> | <todo> |             NEON |      medium |   4 |      ms |  105.77 | 4360.80 | f96e1c5 |
| <todo> | <todo> |             NEON |       large |   4 |      ms |  189.93 | 8158.78 | f96e1c5 |

To be honest I don't know why the GFLOPS are higher on the Pi5 whilst the Enc. (the biggest chunk of the process) is faster on the Opi5; maybe memory bandwidth?
It is a like-for-like comparison with the performance governor, given the preference for running Whisper that way (race-till-idle).

@nickovs

nickovs commented Nov 4, 2023

@StuartIanNaylor Here is a straight-up comparison of the same 54c978c commit between the Pi4 and the Pi5: first running the code compiled on the Pi4 on the Pi5, and then recompiling the same commit on the Pi5.

| Model | Pi4 Enc. | Pi4 Dec. | Pi4 code on Pi5 Enc. | Pi4 code on Pi5 Dec. | Speedup (same build) Enc. | Speedup (same build) Dec. | Recompiled on Pi5 Enc. | Recompiled on Pi5 Dec. | Speedup (recompiled) Enc. |
| --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| tiny | 5246.14 | 510.57 | 2694.38 | 188.38 | 1.95 | 2.71 | 1106.11 | 183.67 | 4.74 |
| tiny.en | 5264.76 | 551.17 | 2744.80 | 203.94 | 1.92 | 2.70 | 1109.66 | 201.3 | 4.74 |
| base.en | 12473.07 | 1004.23 | 6345.28 | 363.15 | 1.97 | 2.77 | 2479.82 | 346.65 | 5.03 |
| base | 12453.04 | 972.29 | 6399.54 | 348.33 | 1.95 | 2.79 | 2465.12 | 363.86 | 5.05 |
| small.en | 48849.9 | 3316.15 | 24127.58 | 961.75 | 2.02 | 3.45 | 8308.3 | 963.24 | 5.88 |
| small | 49671.25 | 2953 | 24134.46 | 1109.70 | 2.06 | 2.66 | 8342.25 | 1119.25 | 5.95 |
| medium.en | 169889.39 | 8451.51 | 79045.66 | 2815.81 | 2.15 | 3.00 | 26407.77 | 2893.55 | 6.43 |
| medium | 173236.92 | 8531.94 | 79075.19 | 2836.38 | 2.19 | 3.01 | 26468.86 | 2919.43 | 6.54 |

This suggests that there is a little better than a 2-fold performance improvement on encode, and more like a 2.8-fold improvement on decode, just from moving the code from the Pi4 to the Pi5. Recompiling on the Pi5 raises the encode performance to between 4.74 and 6.54 times faster than on the Pi4, but the decode performance remains only about 2.8 times faster than the Pi4 and doesn't benefit a great deal from the recompilation.

(Note that this table hits GitHub's 10 column limit, so the decode speedup may not be displayed, but the numbers are in the comment source.)

The key thing here as far as I'm concerned is that on the Pi5 the small model runs in better than real time, whereas on the Pi4 you were stuck using the tiny model for real-time work.
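
(For scale, assuming the encoder's fixed 30 s audio window: the ~8.3 s small.en encode on the Pi5 is a real-time factor of roughly 8.3/30 ≈ 0.28, while the ~48.8 s encode on the Pi4 is an RTF of about 1.6, i.e. slower than real time.)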

@jwinarske

It would be great to have a test-results DB for this. I'm thinking of something similar to what DRM info does.
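
For instance (field names purely illustrative), each submission could be one record pinned to a commit, mirroring the columns already used in this thread:

```cpp
#include <string>

// Hypothetical record for a shared results DB; the names are illustrative,
// the fields mirror the table columns reported in this thread.
struct BenchResult {
    std::string cpu;       // e.g. "Ryzen 9 5950X", "Opi5 Rk3588s"
    std::string os;        // e.g. "Ubuntu 22.04"
    std::string config;    // e.g. "NEON", "AVX2 BLAS"
    std::string model;     // tiny / base / small / medium / large
    int         n_threads; // Th
    double      enc_ms;    // Enc.
    double      dec_ms;    // Dec.
    double      pp_ms;     // PP (256-token prompt)
    std::string commit;    // results are commit-sensitive
};
```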

@StuartIanNaylor

StuartIanNaylor commented Nov 5, 2023

@jwinarske That would be great, maybe as a separate repo pinned to fixed commits, since we are not benching the software but the hardware.
The llama.cpp bench would be a good inclusion, as openLlama3b-q4 manages 20 tokens/s on an RK3588s-4gb.
I also like https://github.com/Tencent/ncnn/tree/master/benchmark as it is a pretty easy install and has a ready-made list of smaller YOLO-type models.

Linux ubuntu 6.6.0 #1 SMP PREEMPT Opi5 4GB performance governor 54c978c

memcpy: 11.18 GB/s (1 thread)
sum:    136902081526.000000

Running ggml_mul_mat benchmark with 4 threads

  64 x   64: Q4_0     3.5 GFLOPS (128 runs) | Q4_1     3.2 GFLOPS (128 runs)
  64 x   64: Q5_0     2.7 GFLOPS (128 runs) | Q5_1     2.8 GFLOPS (128 runs) | Q8_0     3.1 GFLOPS (128 runs)
  64 x   64: F16      3.3 GFLOPS (128 runs) | F32      3.2 GFLOPS (128 runs)
 128 x  128: Q4_0     7.8 GFLOPS (128 runs) | Q4_1     8.0 GFLOPS (128 runs)
 128 x  128: Q5_0     6.2 GFLOPS (128 runs) | Q5_1     6.5 GFLOPS (128 runs) | Q8_0     7.8 GFLOPS (128 runs)
 128 x  128: F16      9.5 GFLOPS (128 runs) | F32      7.5 GFLOPS (128 runs)
 256 x  256: Q4_0    10.6 GFLOPS (128 runs) | Q4_1    11.0 GFLOPS (128 runs)
 256 x  256: Q5_0     7.9 GFLOPS (128 runs) | Q5_1     8.4 GFLOPS (128 runs) | Q8_0    10.3 GFLOPS (128 runs)
 256 x  256: F16     14.8 GFLOPS (128 runs) | F32      9.3 GFLOPS (128 runs)
 512 x  512: Q4_0    11.8 GFLOPS ( 44 runs) | Q4_1    12.4 GFLOPS ( 47 runs)
 512 x  512: Q5_0     8.9 GFLOPS ( 34 runs) | Q5_1     9.7 GFLOPS ( 37 runs) | Q8_0    11.5 GFLOPS ( 43 runs)
 512 x  512: F16     17.8 GFLOPS ( 67 runs) | F32      9.6 GFLOPS ( 36 runs)
1024 x 1024: Q4_0    32.7 GFLOPS ( 16 runs) | Q4_1    33.3 GFLOPS ( 16 runs)
1024 x 1024: Q5_0    25.2 GFLOPS ( 12 runs) | Q5_1    27.0 GFLOPS ( 13 runs) | Q8_0    36.0 GFLOPS ( 17 runs)
1024 x 1024: F16     39.4 GFLOPS ( 19 runs) | F32     28.1 GFLOPS ( 14 runs)
2048 x 2048: Q4_0    58.2 GFLOPS (  4 runs) | Q4_1    60.0 GFLOPS (  4 runs)
2048 x 2048: Q5_0    37.2 GFLOPS (  3 runs) | Q5_1    38.8 GFLOPS (  3 runs) | Q8_0    63.3 GFLOPS (  4 runs)
2048 x 2048: F16     78.3 GFLOPS (  5 runs) | F32     38.0 GFLOPS (  3 runs)
4096 x 4096: Q4_0    63.9 GFLOPS (  3 runs) | Q4_1    64.9 GFLOPS (  3 runs)
4096 x 4096: Q5_0    39.6 GFLOPS (  3 runs) | Q5_1    41.5 GFLOPS (  3 runs) | Q8_0    70.3 GFLOPS (  3 runs)
4096 x 4096: F16     78.6 GFLOPS (  3 runs) | F32     35.1 GFLOPS (  3 runs)

Running benchmark for all models
This can take a while!

|    CPU |     OS |           Config |       Model |  Th |    Enc. |    Dec. |      PP |  Commit |
|    --- |    --- |              --- |         --- | --- |     --- |     --- |     --- |     --- |
| <todo> | <todo> |             NEON |        tiny |   4 |  885.27 |    7.35 |  166.54 | 54c978c |
| <todo> | <todo> |             NEON |        base |   4 | 1888.93 |   12.61 |  347.61 | 54c978c |
| <todo> | <todo> |             NEON |       small |   4 | 6397.88 |   38.49 | 1111.82 | 54c978c |
| <todo> | <todo> |             NEON |      medium |   4 |      ms |   68.98 | 3511.72 | 54c978c |

@nickovs Dunno; as before, the A76 gets vector mat/mul and the code is optimised for Armv8.2+, so the poor Pi4 with OpenBLAS was approx < 5 times slower than an RK3588s.
The above is just the same commit on an Opi5-4gb, so zram and swap come into play with the bigger models, but from audio in to text out I last pegged the Pi4 as approx just less than 5x slower, ignoring models it didn't manage in real time.
I guess further optimisations have happened; the decode is less important to the overall time than the Enc., as that is the biggest part of the process.

(venv) pi@raspberrypi:~/llama.cpp $ ./llama-bench -m  models/3b/open-llama-3b-q4_0.gguf -t 4
| model                          |       size |     params | backend    |    threads | test       |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ---------: | ---------- | ---------------: |
| llama 3B mostly Q4_0           |   1.84 GiB |     3.43 B | CPU        |          4 | pp 512     |      9.77 ± 0.01 |
| llama 3B mostly Q4_0           |   1.84 GiB |     3.43 B | CPU        |          4 | tg 128     |      5.42 ± 0.00 |

build: c41ea36 (1487)

ubuntu@ubuntu:~/llama.cpp$ ./llama-bench -m models/3b/open-llama-3b-q4_0.gguf -t 4
| model                          |       size |     params | backend    |    threads | test       |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ---------: | ---------- | ---------------: |
| llama 3B mostly Q4_0           |   1.84 GiB |     3.43 B | CPU        |          4 | pp 512     |      9.14 ± 0.01 |
| llama 3B mostly Q4_0           |   1.84 GiB |     3.43 B | CPU        |          4 | tg 128     |      7.06 ± 0.05 |

ryanrapp pushed a commit to ryanrapp/attentional-ios that referenced this issue Jan 9, 2024
"lib" is needed for windows.

With this change, you can build whisper.cpp with OpenBLAS's prebuilt DLL.
1. extract a zip from https://github.com/xianyi/OpenBLAS/releases
2. copy the headers in (openblas)/include to the root directory of whisper.cpp
3. invoke cmake with -DCMAKE_LIBRARY_PATH=(openblas)\lib -DWHISPER_SUPPORT_OPENBLAS=ON
4. copy (openblas)/bin/libopenblas.dll to the same directory of whisper.dll after msbuild

ggerganov/whisper.cpp#89 (comment)
@petterreinholdtsen
Contributor

Here is the result for an NVIDIA GeForce GT 755M on Debian GNU/Linux 12 Bookworm, using GCC 12.2.0 and built with -DWHISPER_CLBLAST=ON:

whisper_init_from_file_with_params_no_state: loading model from '../nb-large-ggml-model.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab       = 51866
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 1280
whisper_model_load: n_audio_head  = 20
whisper_model_load: n_audio_layer = 32
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 1280
whisper_model_load: n_text_head   = 20
whisper_model_load: n_text_layer  = 32
whisper_model_load: n_mels        = 128
whisper_model_load: ftype         = 1
whisper_model_load: qntvr         = 0
whisper_model_load: type          = 5 (large v3)
whisper_model_load: adding 1609 extra tokens
whisper_model_load: n_langs       = 100
ggml_opencl: selecting platform: 'NVIDIA CUDA'
ggml_opencl: selecting device: 'NVIDIA GeForce GT 755M'
ggml_opencl: device FP16 support: false
whisper_model_load:      CPU buffer size =  3094.86 MB
whisper_model_load: model size    = 3094.36 MB
whisper_init_state: kv self size  =  220.20 MB
whisper_init_state: kv cross size =  245.76 MB
whisper_init_state: compute buffer (conv)   =   32.42 MB
whisper_init_state: compute buffer (encode) =  212.42 MB
whisper_init_state: compute buffer (cross)  =    9.38 MB
whisper_init_state: compute buffer (decode) =   99.24 MB

system_info: n_threads = 4 / 4 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | METAL = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | CUDA = 0 | COREML = 0 | OPENVINO = 0 | 

whisper_print_timings:     load time =   712.98 ms
whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:      mel time =     0.00 ms
whisper_print_timings:   sample time =     0.00 ms /     1 runs (    0.00 ms per run)
whisper_print_timings:   encode time = 29405.07 ms /     1 runs (29405.07 ms per run)
whisper_print_timings:   decode time = 25138.65 ms /   256 runs (   98.20 ms per run)
whisper_print_timings:   batchd time = 15522.25 ms /   320 runs (   48.51 ms per run)
whisper_print_timings:   prompt time = 120379.20 ms /  4096 runs (   29.39 ms per run)
whisper_print_timings:    total time = 190447.95 ms

@zhouwg
Contributor

zhouwg commented Mar 6, 2024

Benchmark result with an 11th Gen Intel Core(TM) i7-11700F @ 2.50GHz + Ubuntu 20.04 + gcc version 9.4.0:


| CPU | OS | Config | Model | Th | Load [ms] | Encode [ms] |
| --- | --- | --- | --- | --- | ---: | ---: |
| i7-11700F | Ubuntu 20.04 |  | tiny.en | 4 | 46.72 | 4654.39 |
| i7-11700F | Ubuntu 20.04 |  | tiny.en | 8 | 49.85 | 2981.43 |
| i7-11700F | Ubuntu 20.04 |  | small.en | 4 | 175.02 | 51381.51 |
| i7-11700F | Ubuntu 20.04 |  | small.en | 8 | 161.98 | 29662.80 |

./bench  -m ./models/ggml-small.en.bin -t 8 -w 2
  64 x   64: Q4_0     4.3 GFLOPS (128 runs) | Q4_1     4.4 GFLOPS (128 runs)
  64 x   64: Q5_0     4.0 GFLOPS (128 runs) | Q5_1     3.5 GFLOPS (128 runs) | Q8_0     4.7 GFLOPS (128 runs)
  64 x   64: F16      4.2 GFLOPS (128 runs) | F32      2.1 GFLOPS (128 runs)
 128 x  128: Q4_0    15.0 GFLOPS (128 runs) | Q4_1    15.3 GFLOPS (128 runs)
 128 x  128: Q5_0    11.9 GFLOPS (128 runs) | Q5_1    12.3 GFLOPS (128 runs) | Q8_0    21.0 GFLOPS (128 runs)
 128 x  128: F16     11.1 GFLOPS (128 runs) | F32      8.7 GFLOPS (128 runs)
 256 x  256: Q4_0    25.4 GFLOPS (128 runs) | Q4_1    29.1 GFLOPS (128 runs)
 256 x  256: Q5_0    17.4 GFLOPS (128 runs) | Q5_1    18.7 GFLOPS (128 runs) | Q8_0    49.1 GFLOPS (128 runs)
 256 x  256: F16     13.8 GFLOPS (128 runs) | F32     10.4 GFLOPS (128 runs)
 512 x  512: Q4_0    31.1 GFLOPS (116 runs) | Q4_1    33.0 GFLOPS (124 runs)
 512 x  512: Q5_0    17.1 GFLOPS ( 64 runs) | Q5_1    20.5 GFLOPS ( 77 runs) | Q8_0    66.3 GFLOPS (128 runs)
 512 x  512: F16     14.0 GFLOPS ( 53 runs) | F32      9.3 GFLOPS ( 35 runs)
1024 x 1024: Q4_0    31.9 GFLOPS ( 16 runs) | Q4_1    31.0 GFLOPS ( 15 runs)
1024 x 1024: Q5_0    20.0 GFLOPS ( 10 runs) | Q5_1    22.9 GFLOPS ( 11 runs) | Q8_0    80.1 GFLOPS ( 38 runs)
1024 x 1024: F16     14.6 GFLOPS (  7 runs) | F32      8.9 GFLOPS (  5 runs)
2048 x 2048: Q4_0    35.9 GFLOPS (  3 runs) | Q4_1    40.1 GFLOPS (  3 runs)
2048 x 2048: Q5_0    21.2 GFLOPS (  3 runs) | Q5_1    23.6 GFLOPS (  3 runs) | Q8_0    88.0 GFLOPS (  6 runs)
2048 x 2048: F16     14.4 GFLOPS (  3 runs) | F32      8.6 GFLOPS (  3 runs)
4096 x 4096: Q4_0    35.4 GFLOPS (  3 runs) | Q4_1    39.2 GFLOPS (  3 runs)
4096 x 4096: Q5_0    20.0 GFLOPS (  3 runs) | Q5_1    21.2 GFLOPS (  3 runs) | Q8_0    85.0 GFLOPS (  3 runs)
4096 x 4096: F16     13.5 GFLOPS (  3 runs) | F32      8.2 GFLOPS (  3 runs)

./bench  -m ./models/ggml-small.en.bin -t 8 -w 1
memcpy:    9.43 GB/s (heat-up)
memcpy:    9.31 GB/s ( 1 thread)
memcpy:    9.15 GB/s ( 1 thread)
memcpy:    8.74 GB/s ( 2 thread)
memcpy:    8.67 GB/s ( 3 thread)
memcpy:    8.43 GB/s ( 4 thread)
memcpy:    8.42 GB/s ( 5 thread)
memcpy:    8.70 GB/s ( 6 thread)
memcpy:    8.63 GB/s ( 7 thread)
memcpy:    8.32 GB/s ( 8 thread)
sum:    -5119997019.000000
 ./bench-all.sh 
Usage: ./bench.sh [n_threads] [encoder-only]

Running memcpy benchmark

memcpy:    9.38 GB/s (heat-up)
memcpy:    9.39 GB/s ( 1 thread)
memcpy:    9.39 GB/s ( 1 thread)
memcpy:    9.12 GB/s ( 2 thread)
memcpy:    9.05 GB/s ( 3 thread)
memcpy:    8.68 GB/s ( 4 thread)
sum:    -3071998678.000000


Running ggml_mul_mat benchmark with 4 threads

  64 x   64: Q4_0     7.3 GFLOPS (128 runs) | Q4_1     7.8 GFLOPS (128 runs)
  64 x   64: Q5_0     6.3 GFLOPS (128 runs) | Q5_1     6.5 GFLOPS (128 runs) | Q8_0     9.4 GFLOPS (128 runs)
  64 x   64: F16      6.2 GFLOPS (128 runs) | F32      2.4 GFLOPS (128 runs)
 128 x  128: Q4_0    15.4 GFLOPS (128 runs) | Q4_1    16.6 GFLOPS (128 runs)
 128 x  128: Q5_0    10.6 GFLOPS (128 runs) | Q5_1    11.5 GFLOPS (128 runs) | Q8_0    25.9 GFLOPS (128 runs)
 128 x  128: F16      9.0 GFLOPS (128 runs) | F32      5.8 GFLOPS (128 runs)
 256 x  256: Q4_0    19.9 GFLOPS (128 runs) | Q4_1    22.8 GFLOPS (128 runs)
 256 x  256: Q5_0    12.8 GFLOPS (128 runs) | Q5_1    13.9 GFLOPS (128 runs) | Q8_0    44.2 GFLOPS (128 runs)
 256 x  256: F16      9.4 GFLOPS (128 runs) | F32      7.6 GFLOPS (128 runs)
 512 x  512: Q4_0    21.7 GFLOPS ( 81 runs) | Q4_1    23.0 GFLOPS ( 86 runs)
 512 x  512: Q5_0    12.9 GFLOPS ( 48 runs) | Q5_1    13.9 GFLOPS ( 52 runs) | Q8_0    48.6 GFLOPS (128 runs)
 512 x  512: F16      8.9 GFLOPS ( 34 runs) | F32      6.8 GFLOPS ( 26 runs)
1024 x 1024: Q4_0    22.1 GFLOPS ( 11 runs) | Q4_1    24.9 GFLOPS ( 12 runs)
1024 x 1024: Q5_0    13.1 GFLOPS (  7 runs) | Q5_1    14.0 GFLOPS (  7 runs) | Q8_0    53.4 GFLOPS ( 25 runs)
1024 x 1024: F16      8.8 GFLOPS (  5 runs) | F32      6.5 GFLOPS (  4 runs)
2048 x 2048: Q4_0    22.6 GFLOPS (  3 runs) | Q4_1    25.7 GFLOPS (  3 runs)
2048 x 2048: Q5_0    13.1 GFLOPS (  3 runs) | Q5_1    14.7 GFLOPS (  3 runs) | Q8_0    57.1 GFLOPS (  4 runs)
2048 x 2048: F16      8.7 GFLOPS (  3 runs) | F32      6.3 GFLOPS (  3 runs)
4096 x 4096: Q4_0    21.5 GFLOPS (  3 runs) | Q4_1    23.4 GFLOPS (  3 runs)
4096 x 4096: Q5_0    12.2 GFLOPS (  3 runs) | Q5_1    13.5 GFLOPS (  3 runs) | Q8_0    53.9 GFLOPS (  3 runs)
4096 x 4096: F16      8.0 GFLOPS (  3 runs) | F32      5.7 GFLOPS (  3 runs)

Running benchmark for all models
This can take a while!
| CPU | OS | Config | Model | Th | Enc. | Dec. | Bch5 | PP | Commit |
| --- | --- | --- | --- | --- | ---: | ---: | ---: | ---: | --- |
| i7-11700F | Ubuntu 20.04 |  | base | 4 | ms | 15.82 | 15.05 | 15.71 | 31989a5a |

There is an impressive benchmark result (compared to the above result on a PC purchased for RMB 12000, about USD 1700, a few years ago) with the Xiaomi 14's powerful mobile SoC: Qualcomm SM8650-AB Snapdragon 8 Gen 3 (4 nm) + Xiaomi's HyperOS (derived from Android 14) + Android NDK r21e:

(benchmark screenshot attached)

Updated on 03-20-2024: Xiaomi 14 + Android NDK r26c (NDK r26c is required for a special build optimization: https://github.com/cdeos/kantv/blob/master/external/whispercpp/CMakeLists.txt#L60)

(benchmark screenshots attached)

@obeone

obeone commented Apr 24, 2024

| CPU | OS | Config | Model | Th | Enc. | Dec. | Bch5 | PP | Commit |
| --- | --- | --- | --- | --- | ---: | ---: | ---: | ---: | --- |
| MacBook M3 Pro | Sonoma 14.5 | NEON BLAS METAL | tiny | 4 | 34.15 | 1.45 | 0.47 | 0.03 | 858452d |
| MacBook M3 Pro | Sonoma 14.5 | NEON BLAS METAL | base | 4 | 59.32 | 2.27 | 0.79 | 0.05 | 858452d |
| MacBook M3 Pro | Sonoma 14.5 | NEON BLAS METAL | small | 4 | 200.45 | 5.50 | 1.75 | 0.15 | 858452d |
| MacBook M3 Pro | Sonoma 14.5 | NEON BLAS METAL | medium | 4 | 534.54 | 12.88 | 3.90 | 0.37 | 858452d |
| MacBook M3 Pro | Sonoma 14.5 | NEON BLAS METAL | large-v1 | 4 | 989.45 | 22.29 | 6.58 | 0.64 | 858452d |
| MacBook M3 Pro | Sonoma 14.5 | NEON BLAS METAL | large-v2 | 4 | 962.34 | 22.38 | 6.61 | 0.64 | 858452d |
| MacBook M3 Pro | Sonoma 14.5 | NEON BLAS METAL | large-v3 | 4 | 969.27 | 22.23 | 6.59 | 0.64 | 858452d |

@nanocosmos-ol

Different results for different code commits: the older version is much faster!

CPU: AMD Ryzen 9 7950X3D 16-Core

  • commit 858452d Date: Wed Apr 24 14:56:30 2024 +0300

system_info: n_threads = 4 / 32 | AVX = 1 | AVX2 = 1 | AVX512 = 1 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | METAL = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | CUDA = 0 | COREML = 0 | OPENVINO = 0

whisper_print_timings: load time = 64.61 ms
whisper_print_timings: fallbacks = 0 p / 0 h
whisper_print_timings: mel time = 0.00 ms
whisper_print_timings: sample time = 0.00 ms / 1 runs ( 0.00 ms per run)
whisper_print_timings: encode time = 878.59 ms / 1 runs ( 878.59 ms per run)
whisper_print_timings: decode time = 935.20 ms / 256 runs ( 3.65 ms per run)
whisper_print_timings: batchd time = 544.69 ms / 320 runs ( 1.70 ms per run)
whisper_print_timings: prompt time = 3865.51 ms / 4096 runs ( 0.94 ms per run)
whisper_print_timings: total time = 6225.76 ms

  • commit d03c60d Date: Wed Nov 8 04:53:31 2023 +0700

system_info: n_threads = 4 / 32 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | METAL = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | COREML = 0 | OPENVINO = 0 |

whisper_print_timings: load time = 83.24 ms
whisper_print_timings: fallbacks = 0 p / 0 h
whisper_print_timings: mel time = 0.00 ms
whisper_print_timings: sample time = 0.00 ms / 1 runs ( 0.00 ms per run)
whisper_print_timings: encode time = 693.48 ms / 1 runs ( 693.48 ms per run)
whisper_print_timings: decode time = 874.80 ms / 256 runs ( 3.42 ms per run)
whisper_print_timings: prompt time = 2249.08 ms / 16 runs ( 140.57 ms per run)
whisper_print_timings: total time = 3817.54 ms
