Benchmark results #89

Open
ggerganov opened this issue Oct 25, 2022 · 158 comments
Labels
performance CPU and memory usage - results and comparisons

Comments

@ggerganov
Owner

ggerganov commented Oct 25, 2022

Encoder

Collection of bench results for various platforms and devices.
If you want to submit info about your device, simply run the bench tool or the extra/bench-all.sh script and report the results in the comments below.

Suggestions for a better summary of the results are welcome
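For anyone new to the project, here is a minimal sketch of how to produce these numbers (it assumes a fresh checkout and a downloaded ggml model; adjust the model path and thread count for your machine):

# build the bench tool
make -j bench

# benchmark a single model with 8 threads
./bench -m models/ggml-base.en.bin -t 8

# or benchmark all downloaded models and print a ready-to-paste table
./extra/bench-all.sh 8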

CPU OS Config Model Th Load Enc. Commit
MacBook M1 Pro MacOS 13.0.1 NEON BLAS tiny 8 71 102 206fc93
MacBook M1 Pro MacOS 13.0.1 NEON BLAS base 8 96 220 206fc93
MacBook M1 Pro MacOS 13.0.1 NEON BLAS small 8 233 685 206fc93
MacBook M1 Pro MacOS 13.0.1 NEON BLAS medium 8 603 1928 206fc93
MacBook M1 Pro MacOS 13.0.1 NEON BLAS large 8 1158 3350 206fc93
---
MacBook M1 Pro MacOS 13.0.1 NEON BLAS small 1 251 2605 206fc93
MacBook M1 Pro MacOS 13.0.1 NEON BLAS small 4 255 884 206fc93
---
Mac Mini M1 MacOS NEON BLAS tiny 4 62 194 fcf515d
Mac Mini M1 MacOS NEON BLAS base 4 81 380 fcf515d
Mac Mini M1 MacOS NEON BLAS small 4 204 1249 fcf515d
Mac Mini M1 MacOS NEON BLAS medium 4 876 3980 fcf515d
Mac Mini M1 MacOS NEON BLAS large 4 1876 7979 fcf515d
---
Ryzen 9 3900X Ubuntu 20.04 AVX2 tiny 8 107 422 fcf515d
Ryzen 9 3900X Ubuntu 20.04 AVX2 base 8 137 880 fcf515d
Ryzen 9 3900X Ubuntu 20.04 AVX2 small 8 280 2874 fcf515d
Ryzen 9 3900X Ubuntu 20.04 AVX2 medium 8 692 9610 fcf515d
Ryzen 9 3900X Ubuntu 20.04 AVX2 large 8 1317 16917 fcf515d
---
Ryzen 9 3900X Ubuntu 20.04 AVX2 BLAS tiny 4 120 780 fcf515d
Ryzen 9 3900X Ubuntu 20.04 AVX2 BLAS base 4 151 1173 fcf515d
Ryzen 9 3900X Ubuntu 20.04 AVX2 BLAS small 4 289 3062 fcf515d
Ryzen 9 3900X Ubuntu 20.04 AVX2 BLAS medium 4 711 9175 fcf515d
Ryzen 9 3900X Ubuntu 20.04 AVX2 BLAS large 4 1282 16050 fcf515d
---
Ryzen 9 5950X Ubuntu 22.04 AVX2 tiny 8 135 197 fcf515d
Ryzen 9 5950X Ubuntu 22.04 AVX2 base 8 176 421 fcf515d
Ryzen 9 5950X Ubuntu 22.04 AVX2 small 8 357 1393 fcf515d
Ryzen 9 5950X Ubuntu 22.04 AVX2 medium 8 855 4404 fcf515d
Ryzen 9 5950X Ubuntu 22.04 AVX2 large 8 1576 8118 fcf515d
---
Raspberry Pi 4 NEON tiny 4 1436 13839 fcf515d
Raspberry Pi 4 NEON base 4 1894 30552 fcf515d
---
iPhone 13 Mini iOS 16.0 NEON BLAS base 4 97 1091 fcf515d
---
MacBook M1 Pro Vivaldi WASM tiny 8 133 3785 fcf515d
MacBook M1 Pro Vivaldi WASM base 8 172 8253 fcf515d
---
MacBook M1 Pro Chrome WASM tiny 8 134 3776 fcf515d
MacBook M1 Pro Chrome WASM base 8 168 8200 fcf515d
---
MacBook M1 Pro Firefox WASM tiny 8 137 2626 fcf515d
MacBook M1 Pro Firefox WASM base 8 183 6226 fcf515d

memcpy

MacBook M1 Pro

./bench -w 1 -t 1
memcpy: 37.59 GB/s

Ryzen 9 5950X

./bench -w 1 -t 1
memcpy: 16.74 GB/s

ggml_mul_mat

MacBook M1 Pro

./bench -w 2 -t 1
ggml_mul_mat:    64 x    64: F16    330.6 GFLOPS (128 runs) / F32    466.0 GFLOPS (128 runs)
ggml_mul_mat:   128 x   128: F16    737.5 GFLOPS (128 runs) / F32    838.9 GFLOPS (128 runs)
ggml_mul_mat:   256 x   256: F16    938.6 GFLOPS (128 runs) / F32   1062.3 GFLOPS (128 runs)
ggml_mul_mat:   512 x   512: F16   1312.5 GFLOPS (128 runs) / F32   1835.5 GFLOPS (128 runs)
ggml_mul_mat:  1024 x  1024: F16   1765.1 GFLOPS (128 runs) / F32   2041.4 GFLOPS (128 runs)
ggml_mul_mat:  2048 x  2048: F16   1784.3 GFLOPS (104 runs) / F32   1859.2 GFLOPS (109 runs)
ggml_mul_mat:  4096 x  4096: F16   1855.1 GFLOPS ( 14 runs) / F32   1873.3 GFLOPS ( 14 runs)

Ryzen 9 5950X

WHISPER_OPENBLAS=1 make -j bench && ./bench -w 2 -t 1
ggml_mul_mat:    64 x    64: F16     56.3 GFLOPS (128 runs) / F32     70.2 GFLOPS (128 runs)
ggml_mul_mat:   128 x   128: F16     47.8 GFLOPS (128 runs) / F32     67.0 GFLOPS (128 runs)
ggml_mul_mat:   256 x   256: F16    185.1 GFLOPS (128 runs) / F32    332.7 GFLOPS (128 runs)
ggml_mul_mat:   512 x   512: F16    386.4 GFLOPS (128 runs) / F32    658.6 GFLOPS (128 runs)
ggml_mul_mat:  1024 x  1024: F16    636.2 GFLOPS (128 runs) / F32   1012.0 GFLOPS (128 runs)
ggml_mul_mat:  2048 x  2048: F16    950.9 GFLOPS ( 56 runs) / F32   1296.8 GFLOPS ( 76 runs)
ggml_mul_mat:  4096 x  4096: F16   1168.6 GFLOPS (  9 runs) / F32   1403.1 GFLOPS ( 11 runs)
@cdosoftei
Contributor

cdosoftei commented Oct 25, 2022

Results for Intel(R) Core(TM) i7-4790K CPU @ 4.00GHz

CPU OS Config Model Threads Load [ms] Encode [ms]
i7-4790K Debian   tiny.en 4 165 808
i7-4790K Debian   tiny.en 8 165 783
i7-4790K Debian   base.en 4 212 1813
i7-4790K Debian   base.en 8 214 1746

@rjwilmsi

Results for Ryzen 5 4500U 6C/6T laptop CPU (I've just included one result for 8 threads as Encode time is much higher when threads > CPU cores).

CPU OS Config Model Threads Load [ms] Encode [ms]
Ryzen 5 4500U (6C/6T) Opensuse Leap tiny.en 4 170.00 829.43
Ryzen 5 4500U (6C/6T) Opensuse Leap tiny.en 6 143.03 671.74
Ryzen 5 4500U (6C/6T) Opensuse Leap base.en 4 305.92 2,092.39
Ryzen 5 4500U (6C/6T) Opensuse Leap base.en 6 188.05 1,495.61
Ryzen 5 4500U (6C/6T) Opensuse Leap small.en 4 408.03 6,919.31
Ryzen 5 4500U (6C/6T) Opensuse Leap small.en 6 359.23 6,370.83
Ryzen 5 4500U (6C/6T) Opensuse Leap medium.en 4 2,238.11 25,863.28
Ryzen 5 4500U (6C/6T) Opensuse Leap medium.en 6 1,113.04 19,672.63
Ryzen 5 4500U (6C/6T) Opensuse Leap medium.en 8 973.65 39,619.20

@ArtyomZemlyak

ArtyomZemlyak commented Oct 26, 2022

CPU OS Config Model Threads Load [ms] Encode [ms]
i7-11800H WSL2 Ubuntu AVX2 tiny 2 164.35 1087.61
i7-11800H WSL2 Ubuntu AVX2 tiny 4 128.94 733.24
i7-11800H WSL2 Ubuntu AVX2 tiny 8 137.57 619.88
i7-11800H WSL2 Ubuntu AVX2 AVX512 tiny 2 143.02 1087.15
i7-11800H WSL2 Ubuntu AVX2 AVX512 tiny 4 127.60 730.57
i7-11800H WSL2 Ubuntu AVX2 AVX512 tiny 8 125.62 616.27
i7-11800H WSL2 Ubuntu AVX2 AVX512 BLAS tiny 2 132.59 1511.38
i7-11800H WSL2 Ubuntu AVX2 AVX512 BLAS tiny 4 132.48 1407.49
i7-11800H WSL2 Ubuntu AVX2 AVX512 BLAS tiny 8 133.82 1458.27

@ArtyomZemlyak

ArtyomZemlyak commented Oct 26, 2022

CPU OS Config Model Threads Load [ms] Encode [ms]
i7-11800H WSL2 Ubuntu AVX2 base 2 174.34 2533.79
i7-11800H WSL2 Ubuntu AVX2 base 4 166.68 1830.67
i7-11800H WSL2 Ubuntu AVX2 base 8 165.53 1478.73
i7-11800H WSL2 Ubuntu AVX2 small 2 340.12 8714.24
i7-11800H WSL2 Ubuntu AVX2 small 4 394.32 6021.41
i7-11800H WSL2 Ubuntu AVX2 small 8 305.98 4828.84
i7-11800H WSL2 Ubuntu AVX2 large 2 3205.36 57109.10
i7-11800H WSL2 Ubuntu AVX2 large 4 2720.25 38519.89
i7-11800H WSL2 Ubuntu AVX2 large 8 3716.34 27739.99

@ArtyomZemlyak

ArtyomZemlyak commented Oct 26, 2022

CPU OS Config Model Threads Load [ms] Encode [ms]
i7-11800H WSL2 Ubuntu AVX2 AVX512 large 2 1954.21 54966.84
i7-11800H WSL2 Ubuntu AVX2 AVX512 large 4 1455.40 37320.62
i7-11800H WSL2 Ubuntu AVX2 AVX512 large 8 1372.58 27937.64

@ArtyomZemlyak

This performance is impressive!

M1 Pro | MacOS |   | large | 8 | 1973 | 4208

@ggerganov
Owner Author

This performance is impressive!

Yes, there is a huge performance boost due to using the built-in BLAS implementation on these devices. I will soon add OpenBLAS support for x86 architectures and see how this compares.

By the way, AVX-512 is not supported on master. I have added initial support here, but I am not sure if it works: #95
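If you want to check which AVX-512 subsets your CPU reports before trying it, a quick sketch on Linux:

# list the AVX-512 feature flags the kernel reports for this CPU
grep -o 'avx512[a-z0-9]*' /proc/cpuinfo | sort -u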

@cristianglezm

cristianglezm commented Oct 28, 2022

CPU OS Config Model Threads Load[ms] encode[ms]
Intel® Core™ i5-8250U Win11 Home AVX2 Large 8 2226.85 61547.61

compiled with MinGW64 gcc 11.3

@tazz4843
Contributor

tazz4843 commented Oct 29, 2022

Valve Jupiter (AMD Custom APU 0405, Zen 2 microarch, 4c8t, 16GB DDR5 @ 5200 MT/s)

CPU OS Config Model Threads Load[ms] encode[ms]
AMD Custom APU 0405 SteamOS 3.2 AVX2 Base 8 326.32 2592.96

Compiled with cc (GCC) 11.3.0

The performance gains on jfk.wav since my last test (two weeks or so ago) are extremely impressive: a ~10-20x speedup, from ~40 seconds down to 2-4 seconds.

@yujinqiu

CPU OS Config Model Threads Load [ms] Encode [ms]
MacBook M1 Max macOS Ventura BLAS small 1 299.09 4166.00
MacBook M1 Max macOS Ventura BLAS small 4 329.45 1304.32
MacBook M1 Max macOS Ventura BLAS base 1 139.10 1302.17
MacBook M1 Max macOS Ventura BLAS base 4 135.96 399.45


@ggerganov
Owner Author

@trholding
Thanks for the results.

You can generate a table with performance results by simply running the extra/bench-all.sh script.

Regarding the threads - yes, it seems that going beyond 8 threads does not help, regardless of how many cores you have. My guess is that the computation is memory-bound, which is why using more threads does not improve the performance.
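A simple way to see this plateau is to sweep the thread count with the bench tool (a sketch; it assumes a model at models/ggml-base.en.bin):

# encode time should stop improving somewhere around 8 threads
for t in 1 2 4 8 12 16; do
  ./bench -m models/ggml-base.en.bin -t "$t"
done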


@trholding
Contributor

You can generate a table with performance results by simply running the extra/bench-all.sh script.

Hey, sorry - that didn't pan out well. I ran the benchmark three times and my account got deleted without notice. I couldn't get the logs as it was a web terminal. On the other hand, I am happy that this happened: I was giving serious thought to purchasing a GPU+CPU plan there, so a performance check of the CPU was equally important. Technically it was probably my fault - I probably shouldn't have used a reverse shell and run benchmarks on a free trial, but how does one know if a service is really good or all just vapor...

@rgerganov
Contributor

Dell Precision 5560 laptop results:

CPU OS Config Model Threads Load [ms] Encode [ms]
i7-11850H Ubuntu AVX2 tiny 4 115.87 538.43
i7-11850H Ubuntu AVX2 base 4 145.14 1241.84
i7-11850H Ubuntu AVX2 small 4 299.30 4343.57
i7-11850H Ubuntu AVX2 medium 4 760.98 15238.31
i7-11850H Ubuntu AVX2 large 4 1404.32 27476.86
i7-11850H Ubuntu AVX2 tiny 8 131.96 358.81
i7-11850H Ubuntu AVX2 base 8 166.61 839.31
i7-11850H Ubuntu AVX2 small 8 320.29 2854.86
i7-11850H Ubuntu AVX2 medium 8 756.20 9829.62
i7-11850H Ubuntu AVX2 large 8 1382.38 19872.81

@jaybinks
Contributor

jaybinks commented Nov 5, 2022

CPU OS Config Model Threads Load [ms] Encode [ms]
i9-9900K WSL2 Ubuntu (GCC) AVX2  tiny.en 4 85.71 601.56
i9-9900K WSL2 Ubuntu (GCC) AVX2  small.en 4 212.59 5146.23
i9-9900K OSX 10.14.1 (hackintosh - GCC) AVX2  tiny.en 4 198.17 455.12
i9-9900K OSX 10.14.1 (hackintosh - GCC) AVX2  base.en 4 272.62 909.71
i9-9900K OSX 10.14.1 (hackintosh - GCC) AVX2 small.en 4 598.75 2968.75
Xeon(R) Silver 4210R CPU @ 2.40GHz Virtual Machine - Debian Stretch (GCC - master branch) AVX2 avx512f avx512dq avx512cd avx512bw avx512vl small.en 4 776.56 12340.41
Xeon(R) Silver 4210R CPU @ 2.40GHz Virtual Machine - Debian Stretch (GCC - master branch) AVX2 avx512f avx512dq avx512cd avx512bw avx512vl tiny.en 4 295.54 1710.46

@mark-beeby

CPU OS Config Model Threads Load [ms] Encode [ms]
i9-11950H Pop!_OS 22.04 LTS AVX2 Tiny 4 124.28 656.41
i9-11950H Pop!_OS 22.04 LTS AVX2 Tiny 8 123.70 696.41
i9-11950H Pop!_OS 22.04 LTS AVX2 Base 4 159.91 1754.44
i9-11950H Pop!_OS 22.04 LTS AVX2 Base 8 164.47 1658.55
i9-11950H Pop!_OS 22.04 LTS AVX2 Small 4 330.91 6161.86
i9-11950H Pop!_OS 22.04 LTS AVX2 Small 8 346.22 5187.85

@niksedk

niksedk commented Nov 9, 2022

CPU OS Config Model Threads Load [ms] Encode [ms]
i7-1065G7 Windows 11 - small.en 4 1,314.25 294,168.09

Compiled with VS 2022

Something is off, right?

@ggerganov
Owner Author

Yup - you are missing the AVX2 flag. See if some of the comments in #5 can help you resolve this.
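For reference, a quick sanity check that the CPU reports AVX2 at all (Linux/WSL), plus a hypothetical way to force the flag in an MSVC/CMake build - the exact Windows build options are the ones discussed in #5:

# check for AVX2 support
grep -m1 -o avx2 /proc/cpuinfo || echo "no AVX2 support"

# with MSVC, AVX2 code generation is enabled by /arch:AVX2, e.g. (hypothetical invocation):
#   cmake -B build -DCMAKE_C_FLAGS="/arch:AVX2" -DCMAKE_CXX_FLAGS="/arch:AVX2"
#   cmake --build build --config Release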

@niksedk

niksedk commented Nov 9, 2022

OK, the AVX2 flag seems to help :)

CPU OS Config Model Threads Load [ms] Encode [ms]
i7-1065G7 Windows 11 AVX2 small.en 4 527.59 9,648.67

Compiled with VS 2022

@j1nx

j1nx commented Nov 17, 2022

CPU OS Config Model Threads Load [ms] Encode [ms] Remarks
Raspberry Pi 4 - 2GB OpenVoiceOS NEON tiny 1 861.34 29428.21 With OVOS services running
Raspberry Pi 4 - 2GB OpenVoiceOS NEON BLAS tiny 1 843.80 16145.62 With OVOS services running
Raspberry Pi 4 - 2GB OpenVoiceOS NEON tiny 4 835.68 21509.08 With OVOS services running
Raspberry Pi 4 - 2GB OpenVoiceOS NEON BLAS tiny 4 824.24 13187.96 With OVOS services running
Raspberry Pi 4 - 2GB OpenVoiceOS NEON base 1 1146.02 87615.00 With OVOS services running
Raspberry Pi 4 - 2GB OpenVoiceOS NEON BLAS base 1 1103.39 52228.30 With OVOS services running
Raspberry Pi 4 - 2GB OpenVoiceOS NEON base 4 1183.47 55256.20 With OVOS services running
Raspberry Pi 4 - 2GB OpenVoiceOS NEON BLAS base 4 1161.32 29851.40 With OVOS services running
Raspberry Pi 4 - 2GB OpenVoiceOS NEON tiny 1 752.64 24018.10 Without OVOS services running
Raspberry Pi 4 - 2GB OpenVoiceOS NEON BLAS tiny 1 751.96 13082.95 Without OVOS services running
Raspberry Pi 4 - 2GB OpenVoiceOS NEON tiny 4 743.37 10122.80 Without OVOS services running
Raspberry Pi 4 - 2GB OpenVoiceOS NEON BLAS tiny 4 742.90 9564.89 Without OVOS services running
Raspberry Pi 4 - 2GB OpenVoiceOS NEON base 1 974.46 71587.61 Without OVOS services running
Raspberry Pi 4 - 2GB OpenVoiceOS NEON BLAS base 1 979.65 43852.07 Without OVOS services running
Raspberry Pi 4 - 2GB OpenVoiceOS NEON base 4 982.24 24814.62 Without OVOS services running
Raspberry Pi 4 - 2GB OpenVoiceOS NEON BLAS base 4 982.80 19910.19 Without OVOS services running

@StuartIanNaylor

StuartIanNaylor commented Nov 17, 2022

From the stream repo


CPU OS Config Model Threads Load [ms] Encode [ms]
RK3588 Ubuntu20.04 NEON tiny.en 4 243.54 ms 779.49 ms
RK3588 Ubuntu20.04 NEON base.en 4 316.52 ms 1821.06 ms
RK3588 Ubuntu20.04 NEON small.en 4 618.93 ms 7117.69 ms
RK3588 Ubuntu20.04 NEON medium.en 4 1514.88 ms 24139.92 ms
CPU OS Config Model Threads Load [ms] Encode [ms]
RK3588 Ubuntu20.04 NEON tiny 4 233.86 ms 791.01 ms
RK3588 Ubuntu20.04 NEON base 4 297.93 ms 1813.69 ms
RK3588 Ubuntu20.04 NEON small 4 592.18 ms 7102.28 ms
RK3588 Ubuntu20.04 NEON medium 4 1587.36 ms 24147.87 ms
CPU OS Config Model Threads Load [ms] Encode [ms]
RK3588 Ubuntu20.04 NEON tiny 8 226.48 ms 740.34 ms
RK3588 Ubuntu20.04 NEON base 8 300.48 ms 1723.42 ms
RK3588 Ubuntu20.04 NEON small 8 620.58 ms 6392.47 ms
RK3588 Ubuntu20.04 NEON medium 8 1533.75 ms 21899.08 ms

I still haven't worked out the little (0-3) / big (4-7) core layout on this thing. If I pin to the big cores with taskset -c 4-7 (see the sketch after the table):

CPU OS Config Model Threads Load [ms] Encode [ms]
RK3588 Ubuntu20.04 NEON tiny.en 4 234.14 ms 681.53 ms
RK3588 Ubuntu20.04 NEON base.en 4 297.08 ms 1679.75 ms
RK3588 Ubuntu20.04 NEON small.en 4 599.98 ms 6867.66 ms
RK3588 Ubuntu20.04 NEON medium.en 4 1492.73 ms 23600.45 ms
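For reference, the pinned run above corresponds to something like this (a sketch; same bench-all usage as elsewhere in this thread):

# pin the benchmark to the big cores (4-7), 4 threads
taskset -c 4-7 ./extra/bench-all.sh 4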

I tried to compile with OpenBLAS, but it seemed to break the make.


From the master repo, as I didn't think about which repo I was on after trying the streaming input:

CPU OS Config Model Threads Load [ms] Encode [ms]
RK3588 Ubuntu20.04 NEON tiny 8 226.48 ms 2681.05 ms
RK3588 Ubuntu20.04 NEON base 8 283.56 ms 6132.44 ms
RK3588 Ubuntu20.04 NEON small 8 583.39 ms 24397.78 ms
RK3588 Ubuntu20.04 NEON medium 8 1490.98 ms 85099.45 ms

@dodysw
Contributor

dodysw commented Nov 17, 2022

CPU OS Config Model Threads Load [ms] Encode [ms]
Ryzen 7 PRO 4750G Ubuntu 22.04 AVX2 tiny.en 8 136.29 454.52
Ryzen 7 PRO 4750G Ubuntu 22.04 AVX2 tiny 8 134.64 486.01
Ryzen 7 PRO 4750G Ubuntu 22.04 AVX2 base 8 180.22 1184.80
Ryzen 7 PRO 4750G Ubuntu 22.04 AVX2 base.en 8 192.86 1197.85
Ryzen 7 PRO 4750G Ubuntu 22.04 AVX2 small 8 367.55 4179.00
Ryzen 7 PRO 4750G Ubuntu 22.04 AVX2 small.en 8 378.27 4557.73
Ryzen 7 PRO 4750G Ubuntu 22.04 AVX2 medium 8 923.48 15552.61
Ryzen 7 PRO 4750G Ubuntu 22.04 AVX2 medium.en 8 952.48 15708.63
Ryzen 7 PRO 4750G Ubuntu 22.04 AVX2 large 8 1650.28 28357.09

8 threads seemed to be the fastest. However, I managed to squeeze a bit more performance by pinning CPUs:

$ taskset -c 0-15 ./extra/bench-all.sh 16
CPU OS Config Model Threads Load [ms] Encode [ms]
Ryzen 7 PRO 4750G Ubuntu 22.04 AVX2 tiny 16 143.17 437.73
Ryzen 7 PRO 4750G Ubuntu 22.04 AVX2 base 16 184.10 1061.14
Ryzen 7 PRO 4750G Ubuntu 22.04 AVX2 small 16 374.41 3645.64
Ryzen 7 PRO 4750G Ubuntu 22.04 AVX2 medium 16 935.45 13029.54

@matth

matth commented Nov 21, 2022

Results for AWS Graviton 3 Processor (c7g.4xlarge instance type).

Compiled with -march=native -ffast-math.

./extra/bench-all.sh 8

CPU OS Config Model Threads Load [ms] Encode [ms]
Graviton 3 Ubuntu 22.04 NEON tiny 8 125.92 230.33
Graviton 3 Ubuntu 22.04 NEON base 8 160.17 547.88
Graviton 3 Ubuntu 22.04 NEON small 8 299.59 2138.86
Graviton 3 Ubuntu 22.04 NEON medium 8 741.49 6999.33
Graviton 3 Ubuntu 22.04 NEON large 8 1313.95 14174.00

./extra/bench-all.sh 16

CPU OS Config Model Threads Load [ms] Encode [ms]
Graviton 3 Ubuntu 22.04 NEON tiny 16 121.92 158.61
Graviton 3 Ubuntu 22.04 NEON base 16 156.01 386.78
Graviton 3 Ubuntu 22.04 NEON small 16 299.85 1596.38
Graviton 3 Ubuntu 22.04 NEON medium 16 750.93 5351.24
Graviton 3 Ubuntu 22.04 NEON large 16 1313.82 11115.69

@ggerganov
Owner Author

@matth Do you observe significant performance difference with / without -march=native -ffast-math?

@matth

matth commented Nov 21, 2022

@ggerganov -ffast-math seems to make only a very small difference that could be noise between runs.

-march=native does seem to make a big difference; without it, FP16_VA is not reported as being enabled (I can get it with -march=armv8.4-a+bf16+fp16fml) - I think -march=native is enabling more intrinsics than this, though.
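A quick sketch for checking whether a given -march setting enables FP16 vector arithmetic, using the standard ACLE feature-test macro:

# compiles only if the target arch enables FP16 vector arithmetic
echo '#ifndef __ARM_FEATURE_FP16_VECTOR_ARITHMETIC
#error FP16 vector arithmetic not enabled
#endif
int main(void) { return 0; }' \
  | gcc -march=armv8.4-a+bf16+fp16fml -x c - -o /dev/null \
  && echo "FP16_VA available"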

Results without any -march or -ffast-math flags ...

./extra/bench-all.sh 16

CPU OS Config Model Threads Load [ms] Encode [ms]
Graviton 3 Ubuntu 22.04 NEON tiny 16 124.25 320.53
Graviton 3 Ubuntu 22.04 NEON base 16 156.91 734.22
Graviton 3 Ubuntu 22.04 NEON small 16 301.78 2812.75
Graviton 3 Ubuntu 22.04 NEON medium 16 714.23 9139.86
Graviton 3 Ubuntu 22.04 NEON large 16 1298.33 18147.47

I have tried to improve performance using OpenBLAS and armpl.h, but they both slow it down considerably - I'll keep trying with the latter.

Are there any possibilities for further optimisations in ggml.c that can take advantage of the situation where you have bf16 functions but not BLAS or Accelerate?

@letsgitcracking

ThinkPad P1 Gen 4

Intel Core i7-11800H | 64GB RAM | NVIDIA T1200 4GB GPU

Encoder

CPU OS Config Model Th Load Enc. Commit
i7-11800H Ubuntu 22.04.2 LTS AVX2 BLAS tiny 8 554 233 57543c1
i7-11800H Ubuntu 22.04.2 LTS AVX2 BLAS base 8 567 507 57543c1
i7-11800H Ubuntu 22.04.2 LTS AVX2 BLAS small 8 704 1827 57543c1
i7-11800H Ubuntu 22.04.2 LTS AVX2 BLAS medium 8 1088 5435 57543c1
i7-11800H Ubuntu 22.04.2 LTS AVX2 BLAS large 8 1648 10413 57543c1
---
i7-11800H Ubuntu 22.04.2 LTS AVX2 BLAS tiny 8 554 148 4774d2f
i7-11800H Ubuntu 22.04.2 LTS AVX2 BLAS base 8 577 299 4774d2f
i7-11800H Ubuntu 22.04.2 LTS AVX2 BLAS small 8 698 1056 4774d2f
i7-11800H Ubuntu 22.04.2 LTS AVX2 BLAS medium 8 1089 2761 4774d2f
i7-11800H Ubuntu 22.04.2 LTS AVX2 BLAS large 8 1646 5498 4774d2f
---
i7-11800H Ubuntu 22.04.2 LTS AVX2 BLAS tiny 8 98 162 a792c40
i7-11800H Ubuntu 22.04.2 LTS AVX2 BLAS base 8 131 319 a792c40
i7-11800H Ubuntu 22.04.2 LTS AVX2 BLAS small 8 237 1107 a792c40
i7-11800H Ubuntu 22.04.2 LTS AVX2 BLAS medium 8 627 2896 a792c40
i7-11800H Ubuntu 22.04.2 LTS AVX2 BLAS large 8 1279 5819 a792c40

memcpy

./bench -w 1 -t 1
memcpy: 18.38 GB/s (1 thread)

ggml_mul_mat

WHISPER_OPENBLAS=1 make -j bench && ./bench -w 2 -t 1
64 x   64: F16     15.8 GFLOPS (128 runs) | F32     18.3 GFLOPS (128 runs)
128 x  128: F16     82.8 GFLOPS (128 runs) | F32     85.3 GFLOPS (128 runs)
256 x  256: F16    208.0 GFLOPS (128 runs) | F32    212.0 GFLOPS (128 runs)
512 x  512: F16    523.1 GFLOPS (128 runs) | F32    490.4 GFLOPS (128 runs)
1024 x 1024: F16    984.3 GFLOPS (128 runs) | F32    940.6 GFLOPS (128 runs)
2048 x 2048: F16   1256.0 GFLOPS ( 74 runs) | F32   1232.0 GFLOPS ( 72 runs)
4096 x 4096: F16   1390.2 GFLOPS ( 11 runs) | F32   1380.1 GFLOPS ( 11 runs)

@marty1885

My friends and I did some benchmarking and profiling. Apparently llama.cpp only uses the GPU for matrix multiplication, and currently most of the CPU time is spent in Conv1D and softmax. So that would be a good starting point for improving the performance on CUDA/OpenCL.
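For anyone who wants to reproduce this kind of profile, a rough sketch using Linux perf on the bench tool (the model path is just an example):

# record a call-graph profile of the encoder benchmark
perf record -g ./bench -m models/ggml-base.en.bin -t 8
# the conv/softmax ggml kernels should show up near the top
perf report --sort=symbol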

@vieenrose

vieenrose commented Aug 6, 2023

Jetson Nano 4GB using cuBLAS

Compiler version

gcc (Ubuntu/Linaro 7.5.0-3ubuntu1~18.04) 7.5.0
Copyright (C) 2017 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

Memcpy

Running memcpy benchmark

memcpy: 3.18 GB/s (1 thread)
sum: 136902081526.000000

ggml_mul_mat

Running ggml_mul_mat benchmark with 4 threads

ggml_init_cublas: found 1 CUDA devices:
Device 0: NVIDIA Tegra X1
64 x 64: Q4_0 0.1 GFLOPS (128 runs) | Q4_1 0.1 GFLOPS (116 runs)
64 x 64: Q5_0 0.1 GFLOPS (103 runs) | Q5_1 0.1 GFLOPS (128 runs) | Q8_0 0.1 GFLOPS (128 runs)
64 x 64: F16 0.1 GFLOPS (128 runs) | F32 0.1 GFLOPS (122 runs)
128 x 128: Q4_0 0.4 GFLOPS (107 runs) | Q4_1 0.4 GFLOPS ( 91 runs)
128 x 128: Q5_0 0.3 GFLOPS ( 78 runs) | Q5_1 0.4 GFLOPS ( 90 runs) | Q8_0 0.4 GFLOPS ( 91 runs)
128 x 128: F16 0.4 GFLOPS (100 runs) | F32 0.4 GFLOPS ( 86 runs)
256 x 256: Q4_0 2.8 GFLOPS ( 85 runs) | Q4_1 2.7 GFLOPS ( 81 runs)
256 x 256: Q5_0 2.6 GFLOPS ( 78 runs) | Q5_1 2.7 GFLOPS ( 83 runs) | Q8_0 2.6 GFLOPS ( 77 runs)
256 x 256: F16 2.5 GFLOPS ( 74 runs) | F32 2.7 GFLOPS ( 81 runs)
512 x 512: Q4_0 12.5 GFLOPS ( 47 runs) | Q4_1 13.7 GFLOPS ( 52 runs)
512 x 512: Q5_0 14.4 GFLOPS ( 54 runs) | Q5_1 14.4 GFLOPS ( 54 runs) | Q8_0 14.6 GFLOPS ( 55 runs)
512 x 512: F16 15.2 GFLOPS ( 57 runs) | F32 17.5 GFLOPS ( 66 runs)
1024 x 1024: Q4_0 66.9 GFLOPS ( 32 runs) | Q4_1 76.9 GFLOPS ( 36 runs)
1024 x 1024: Q5_0 76.0 GFLOPS ( 36 runs) | Q5_1 75.8 GFLOPS ( 36 runs) | Q8_0 76.2 GFLOPS ( 36 runs)
1024 x 1024: F16 84.5 GFLOPS ( 40 runs) | F32 87.5 GFLOPS ( 41 runs)
2048 x 2048: Q4_0 150.3 GFLOPS ( 9 runs) | Q4_1 143.3 GFLOPS ( 9 runs)
2048 x 2048: Q5_0 143.8 GFLOPS ( 9 runs) | Q5_1 133.2 GFLOPS ( 8 runs) | Q8_0 132.8 GFLOPS ( 8 runs)
2048 x 2048: F16 137.4 GFLOPS ( 8 runs) | F32 135.6 GFLOPS ( 8 runs)
4096 x 4096: Q4_0 177.8 GFLOPS ( 3 runs) | Q4_1 168.4 GFLOPS ( 3 runs)
4096 x 4096: Q5_0 176.5 GFLOPS ( 3 runs) | Q5_1 173.4 GFLOPS ( 3 runs) | Q8_0 173.9 GFLOPS ( 3 runs)
4096 x 4096: F16 166.8 GFLOPS ( 3 runs) | F32 172.6 GFLOPS ( 3 runs)

Model benchmark

Running benchmark for all models
This can take a while!

CPU OS Config Model Th Load Enc. Commit
Jetson Nano JetPack 4.5.1 NEON BLAS tiny 4 1136 1998 a4bb2df
Jetson Nano JetPack 4.5.1 NEON BLAS tiny-q5_0 4 1118 2131 a4bb2df
Jetson Nano JetPack 4.5.1 NEON BLAS base 4 1176 3450 a4bb2df
Jetson Nano JetPack 4.5.1 NEON BLAS base-q5_0 4 1144 3711 a4bb2df
Jetson Nano JetPack 4.5.1 NEON BLAS small 4 19671 10162 a4bb2df
Jetson Nano JetPack 4.5.1 NEON BLAS small-q5_0 4 1209 9603 a4bb2df
Jetson Nano JetPack 4.5.1 NEON BLAS medium 4 67108 28672 a4bb2df
Jetson Nano JetPack 4.5.1 NEON BLAS medium-q5_0 4 1545 26479 a4bb2df
Jetson Nano JetPack 4.5.1 NEON BLAS tiny 1 1138 3306 a4bb2df
Jetson Nano JetPack 4.5.1 NEON BLAS tiny-q5_0 1 1132 3112 a4bb2df
Jetson Nano JetPack 4.5.1 NEON BLAS base 1 1172 5618 a4bb2df
Jetson Nano JetPack 4.5.1 NEON BLAS base-q5_0 1 1102 5516 a4bb2df
Jetson Nano JetPack 4.5.1 NEON BLAS small 1 1568 14845 a4bb2df
Jetson Nano JetPack 4.5.1 NEON BLAS small-q5_0 1 1225 14472 a4bb2df
Jetson Nano JetPack 4.5.1 NEON BLAS medium 1 66915 38635 a4bb2df
Jetson Nano JetPack 4.5.1 NEON BLAS medium-q5_0 1 1590 37408 a4bb2df

Running memcpy benchmark

memcpy: 3.21 GB/s (heat-up)
memcpy: 3.23 GB/s ( 1 thread)
memcpy: 3.21 GB/s ( 1 thread)
memcpy: 4.14 GB/s ( 2 thread)
memcpy: 4.56 GB/s ( 3 thread)
memcpy: 4.83 GB/s ( 4 thread)
sum: 783359998033.000000

Running ggml_mul_mat benchmark with 4 threads

ggml_init_cublas: GGML_CUDA_FORCE_MMQ: no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
Device 0: NVIDIA Tegra X1, compute capability 5.3, VMM: no
64 x 64: Q4_0 0.2 GFLOPS (128 runs) | Q4_1 0.3 GFLOPS (128 runs)
64 x 64: Q5_0 0.3 GFLOPS (128 runs) | Q5_1 0.4 GFLOPS (128 runs) | Q8_0 0.5 GFLOPS (128 runs)
64 x 64: F16 0.3 GFLOPS (128 runs) | F32 0.3 GFLOPS (128 runs)
128 x 128: Q4_0 3.4 GFLOPS (128 runs) | Q4_1 3.4 GFLOPS (128 runs)
128 x 128: Q5_0 1.6 GFLOPS (128 runs) | Q5_1 6.1 GFLOPS (128 runs) | Q8_0 5.3 GFLOPS (128 runs)
128 x 128: F16 3.6 GFLOPS (128 runs) | F32 1.5 GFLOPS (128 runs)
256 x 256: Q4_0 13.0 GFLOPS (128 runs) | Q4_1 10.3 GFLOPS (128 runs)
256 x 256: Q5_0 17.0 GFLOPS (128 runs) | Q5_1 12.2 GFLOPS (128 runs) | Q8_0 19.6 GFLOPS (128 runs)
256 x 256: F16 20.5 GFLOPS (128 runs) | F32 11.9 GFLOPS (128 runs)
512 x 512: Q4_0 43.6 GFLOPS (128 runs) | Q4_1 49.0 GFLOPS (128 runs)
512 x 512: Q5_0 52.1 GFLOPS (128 runs) | Q5_1 50.9 GFLOPS (128 runs) | Q8_0 50.0 GFLOPS (128 runs)
512 x 512: F16 49.8 GFLOPS (128 runs) | F32 48.0 GFLOPS (128 runs)
1024 x 1024: Q4_0 80.2 GFLOPS ( 38 runs) | Q4_1 110.6 GFLOPS ( 52 runs)
1024 x 1024: Q5_0 120.7 GFLOPS ( 57 runs) | Q5_1 108.5 GFLOPS ( 51 runs) | Q8_0 123.2 GFLOPS ( 58 runs)
1024 x 1024: F16 112.0 GFLOPS ( 53 runs) | F32 108.3 GFLOPS ( 51 runs)
2048 x 2048: Q4_0 147.9 GFLOPS ( 9 runs) | Q4_1 151.4 GFLOPS ( 9 runs)
2048 x 2048: Q5_0 159.3 GFLOPS ( 10 runs) | Q5_1 141.2 GFLOPS ( 9 runs) | Q8_0 143.1 GFLOPS ( 9 runs)
2048 x 2048: F16 151.4 GFLOPS ( 9 runs) | F32 140.4 GFLOPS ( 9 runs)
4096 x 4096: Q4_0 176.2 GFLOPS ( 3 runs) | Q4_1 180.7 GFLOPS ( 3 runs)
4096 x 4096: Q5_0 180.7 GFLOPS ( 3 runs) | Q5_1 182.2 GFLOPS ( 3 runs) | Q8_0 183.6 GFLOPS ( 3 runs)
4096 x 4096: F16 179.4 GFLOPS ( 3 runs) | F32 174.2 GFLOPS ( 3 runs)

Running benchmark for all models
This can take a while!

CPU OS Config Model Th Enc. Dec. Bch5 PP Commit
Jetson Nano JetPack 4.5.1 NEON BLAS CUDA tiny 4 11.36 15.47 7.38 0.63 3d42463
Jetson Nano JetPack 4.5.1 NEON BLAS CUDA tiny-q5_1 4 11.32 19.04 7.46 0.63 3d42463
Jetson Nano JetPack 4.5.1 NEON BLAS CUDA base 4 18.98 25.56 12.59 1.10 3d42463
Jetson Nano JetPack 4.5.1 NEON BLAS CUDA base-q5_1 4 18.58 32.62 12.09 1.10 3d42463
Jetson Nano JetPack 4.5.1 NEON BLAS CUDA small 4 872.71 70.89 35.06 3.10 3d42463
Jetson Nano JetPack 4.5.1 NEON BLAS CUDA small-q5_1 4 871.33 90.27 33.91 3.09 3d42463
Jetson Nano JetPack 4.5.1 NEON BLAS CUDA medium 4 6930.15 183.71 95.69 8.64 3d42463
Jetson Nano JetPack 4.5.1 NEON BLAS CUDA medium-q5_0 4 6889.33 230.09 91.06 8.54 3d42463

@vieenrose

Jetson Orin Nano Developer Kit using cuBLAS

Compiler version

gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
Copyright (C) 2019 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

Memcpy

Running memcpy benchmark

memcpy: 6.10 GB/s (1 thread)
sum: 136902081526.000000

ggml_mul_mat

Running ggml_mul_mat benchmark with 6 threads

ggml_init_cublas: found 1 CUDA devices:
Device 0: Orin
64 x 64: Q4_0 0.7 GFLOPS (128 runs) | Q4_1 0.7 GFLOPS (128 runs)
64 x 64: Q5_0 0.6 GFLOPS (128 runs) | Q5_1 0.6 GFLOPS (128 runs) | Q8_0 0.7 GFLOPS (128 runs)
64 x 64: F16 0.7 GFLOPS (128 runs) | F32 0.8 GFLOPS (128 runs)
128 x 128: Q4_0 4.4 GFLOPS (128 runs) | Q4_1 3.6 GFLOPS (128 runs)
128 x 128: Q5_0 3.2 GFLOPS (128 runs) | Q5_1 3.7 GFLOPS (128 runs) | Q8_0 3.8 GFLOPS (128 runs)
128 x 128: F16 4.0 GFLOPS (128 runs) | F32 3.8 GFLOPS (128 runs)
256 x 256: Q4_0 16.0 GFLOPS (128 runs) | Q4_1 25.7 GFLOPS (128 runs)
256 x 256: Q5_0 25.8 GFLOPS (128 runs) | Q5_1 29.6 GFLOPS (128 runs) | Q8_0 27.6 GFLOPS (128 runs)
256 x 256: F16 30.2 GFLOPS (128 runs) | F32 19.0 GFLOPS (128 runs)
512 x 512: Q4_0 170.9 GFLOPS (128 runs) | Q4_1 167.9 GFLOPS (128 runs)
512 x 512: Q5_0 168.7 GFLOPS (128 runs) | Q5_1 119.6 GFLOPS (128 runs) | Q8_0 134.6 GFLOPS (128 runs)
512 x 512: F16 150.6 GFLOPS (128 runs) | F32 166.4 GFLOPS (128 runs)
1024 x 1024: Q4_0 354.5 GFLOPS (128 runs) | Q4_1 571.7 GFLOPS (128 runs)
1024 x 1024: Q5_0 558.6 GFLOPS (128 runs) | Q5_1 520.5 GFLOPS (128 runs) | Q8_0 514.3 GFLOPS (128 runs)
1024 x 1024: F16 491.5 GFLOPS (128 runs) | F32 505.7 GFLOPS (128 runs)
2048 x 2048: Q4_0 1006.5 GFLOPS ( 59 runs) | Q4_1 1123.9 GFLOPS ( 66 runs)
2048 x 2048: Q5_0 1144.5 GFLOPS ( 67 runs) | Q5_1 1125.0 GFLOPS ( 66 runs) | Q8_0 1102.3 GFLOPS ( 65 runs)
2048 x 2048: F16 1035.8 GFLOPS ( 61 runs) | F32 947.1 GFLOPS ( 56 runs)
4096 x 4096: Q4_0 1735.8 GFLOPS ( 14 runs) | Q4_1 1744.1 GFLOPS ( 13 runs)
4096 x 4096: Q5_0 1613.4 GFLOPS ( 12 runs) | Q5_1 1529.1 GFLOPS ( 12 runs) | Q8_0 1621.6 GFLOPS ( 12 runs)
4096 x 4096: F16 1543.4 GFLOPS ( 12 runs) | F32 1488.8 GFLOPS ( 11 runs)

Model benchmark

Running benchmark for all models
This can take a while!

CPU OS Config Model Th Load Enc. Commit
Jetson Orin Nano JetPack 5.1.1 NEON BLAS tiny 6 1399 537 a4bb2df
Jetson Orin Nano JetPack 5.1.1 NEON BLAS tiny-q5_0 6 1401 486 a4bb2df
Jetson Orin Nano JetPack 5.1.1 NEON BLAS base 6 1523 1101 a4bb2df
Jetson Orin Nano JetPack 5.1.1 NEON BLAS base-q5_0 6 1420 992 a4bb2df
Jetson Orin Nano JetPack 5.1.1 NEON BLAS small 6 2735 2861 a4bb2df
Jetson Orin Nano JetPack 5.1.1 NEON BLAS small-q5_0 6 1514 2656 a4bb2df
Jetson Orin Nano JetPack 5.1.1 NEON BLAS medium 6 5639 7836 a4bb2df
Jetson Orin Nano JetPack 5.1.1 NEON BLAS medium-q5_0 6 1820 8267 a4bb2df
Jetson Orin Nano JetPack 5.1.1 NEON BLAS large 6 11349 23323 a4bb2df
Jetson Orin Nano JetPack 5.1.1 NEON BLAS large-q5_0 6 2241 13197 a4bb2df
Jetson Orin Nano JetPack 5.1.1 NEON BLAS tiny 1 1418 1339 a4bb2df
Jetson Orin Nano JetPack 5.1.1 NEON BLAS tiny-q5_0 1 1391 1382 a4bb2df
Jetson Orin Nano JetPack 5.1.1 NEON BLAS base 1 1503 2422 a4bb2df
Jetson Orin Nano JetPack 5.1.1 NEON BLAS base-q5_0 1 1385 2405 a4bb2df
Jetson Orin Nano JetPack 5.1.1 NEON BLAS small 1 2798 6193 a4bb2df
Jetson Orin Nano JetPack 5.1.1 NEON BLAS small-q5_0 1 1635 6175 a4bb2df
Jetson Orin Nano JetPack 5.1.1 NEON BLAS medium 1 6368 14623 a4bb2df
Jetson Orin Nano JetPack 5.1.1 NEON BLAS medium-q5_0 1 1856 14424 a4bb2df
Jetson Orin Nano JetPack 5.1.1 NEON BLAS large 1 11055 25367 a4bb2df
Jetson Orin Nano JetPack 5.1.1 NEON BLAS large-q5_0 1 2226 24363 a4bb2df

@vieenrose

vieenrose commented Aug 7, 2023

Intel Core i5-8400 CPU + NVIDIA GeForce GTX 1070 with cuBLAS

Compiler version

gcc (conda-forge gcc 10.4.0-19) 10.4.0
Copyright (C) 2020 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Tue_May__3_18:49:52_PDT_2022
Cuda compilation tools, release 11.7, V11.7.64
Build cuda_11.7.r11.7/compiler.31294372_0

Memcpy

Running memcpy benchmark

memcpy: 13.17 GB/s (1 thread)
sum: -536869898.000000

ggml_mul_mat

Running ggml_mul_mat benchmark with 6 threads

ggml_init_cublas: found 1 CUDA devices:
Device 0: NVIDIA GeForce GTX 1070
64 x 64: Q4_0 0.2 GFLOPS (128 runs) | Q4_1 0.1 GFLOPS (128 runs)
64 x 64: Q5_0 0.2 GFLOPS (128 runs) | Q5_1 0.2 GFLOPS (128 runs) | Q8_0 0.2 GFLOPS (128 runs)
64 x 64: F16 0.3 GFLOPS (128 runs) | F32 0.2 GFLOPS (128 runs)
128 x 128: Q4_0 1.8 GFLOPS (128 runs) | Q4_1 2.0 GFLOPS (128 runs)
128 x 128: Q5_0 2.2 GFLOPS (128 runs) | Q5_1 2.0 GFLOPS (128 runs) | Q8_0 1.8 GFLOPS (128 runs)
128 x 128: F16 2.0 GFLOPS (128 runs) | F32 2.1 GFLOPS (128 runs)
256 x 256: Q4_0 15.8 GFLOPS (128 runs) | Q4_1 13.3 GFLOPS (128 runs)
256 x 256: Q5_0 15.6 GFLOPS (128 runs) | Q5_1 16.5 GFLOPS (128 runs) | Q8_0 13.3 GFLOPS (128 runs)
256 x 256: F16 16.1 GFLOPS (128 runs) | F32 15.9 GFLOPS (128 runs)
512 x 512: Q4_0 115.3 GFLOPS (128 runs) | Q4_1 92.0 GFLOPS (128 runs)
512 x 512: Q5_0 62.0 GFLOPS (128 runs) | Q5_1 35.4 GFLOPS (128 runs) | Q8_0 41.0 GFLOPS (128 runs)
512 x 512: F16 86.7 GFLOPS (128 runs) | F32 122.5 GFLOPS (128 runs)
1024 x 1024: Q4_0 640.1 GFLOPS (128 runs) | Q4_1 594.0 GFLOPS (128 runs)
1024 x 1024: Q5_0 645.8 GFLOPS (128 runs) | Q5_1 648.0 GFLOPS (128 runs) | Q8_0 545.0 GFLOPS (128 runs)
1024 x 1024: F16 535.3 GFLOPS (128 runs) | F32 428.5 GFLOPS (128 runs)
2048 x 2048: Q4_0 1123.6 GFLOPS ( 66 runs) | Q4_1 848.5 GFLOPS ( 50 runs)
2048 x 2048: Q5_0 1245.0 GFLOPS ( 73 runs) | Q5_1 1795.5 GFLOPS (105 runs) | Q8_0 938.0 GFLOPS ( 55 runs)
2048 x 2048: F16 1363.3 GFLOPS ( 80 runs) | F32 1288.0 GFLOPS ( 75 runs)
4096 x 4096: Q4_0 2804.1 GFLOPS ( 21 runs) | Q4_1 3196.2 GFLOPS ( 24 runs)
4096 x 4096: Q5_0 3149.1 GFLOPS ( 23 runs) | Q5_1 2907.3 GFLOPS ( 22 runs) | Q8_0 2949.2 GFLOPS ( 22 runs)
4096 x 4096: F16 2864.8 GFLOPS ( 21 runs) | F32 2960.5 GFLOPS ( 22 runs)

Model benchmark

Running benchmark for all models
This can take a while!

CPU OS Config Model Th Load Enc. Commit
i5-8400 + GTX 1070 Ubuntu 18.04.6 LTS AVX2 BLAS tiny 6 432 276 a4bb2df
i5-8400 + GTX 1070 Ubuntu 18.04.6 LTS AVX2 BLAS tiny-q5_0 6 382 198 a4bb2df
i5-8400 + GTX 1070 Ubuntu 18.04.6 LTS AVX2 BLAS base 6 544 531 a4bb2df
i5-8400 + GTX 1070 Ubuntu 18.04.6 LTS AVX2 BLAS base-q5_0 6 511 388 a4bb2df
i5-8400 + GTX 1070 Ubuntu 18.04.6 LTS AVX2 BLAS small 6 780 1659 a4bb2df
i5-8400 + GTX 1070 Ubuntu 18.04.6 LTS AVX2 BLAS small-q5_0 6 527 1100 a4bb2df
i5-8400 + GTX 1070 Ubuntu 18.04.6 LTS AVX2 BLAS medium 6 1510 2954 a4bb2df
i5-8400 + GTX 1070 Ubuntu 18.04.6 LTS AVX2 BLAS medium-q5_0 6 795 3315 a4bb2df
i5-8400 + GTX 1070 Ubuntu 18.04.6 LTS AVX2 BLAS large 6 2611 5024 a4bb2df
i5-8400 + GTX 1070 Ubuntu 18.04.6 LTS AVX2 BLAS large-q5_0 6 1205 5231 a4bb2df
i5-8400 + GTX 1070 Ubuntu 18.04.6 LTS AVX2 BLAS tiny 1 442 397 a4bb2df
i5-8400 + GTX 1070 Ubuntu 18.04.6 LTS AVX2 BLAS tiny-q5_0 1 406 391 a4bb2df
i5-8400 + GTX 1070 Ubuntu 18.04.6 LTS AVX2 BLAS base 1 509 744 a4bb2df
i5-8400 + GTX 1070 Ubuntu 18.04.6 LTS AVX2 BLAS base-q5_0 1 425 738 a4bb2df
i5-8400 + GTX 1070 Ubuntu 18.04.6 LTS AVX2 BLAS small 1 774 2140 a4bb2df
i5-8400 + GTX 1070 Ubuntu 18.04.6 LTS AVX2 BLAS small-q5_0 1 516 2050 a4bb2df
i5-8400 + GTX 1070 Ubuntu 18.04.6 LTS AVX2 BLAS medium 1 1509 5569 a4bb2df
i5-8400 + GTX 1070 Ubuntu 18.04.6 LTS AVX2 BLAS medium-q5_0 1 795 5611 a4bb2df
i5-8400 + GTX 1070 Ubuntu 18.04.6 LTS AVX2 BLAS large 1 2630 9155 a4bb2df
i5-8400 + GTX 1070 Ubuntu 18.04.6 LTS AVX2 BLAS large-q5_0 1 1173 9078 a4bb2df

@tazz4843
Contributor

Valve Jupiter (AMD Custom APU 0405)

An update since the last time I ran this (that run was CPU-only due to being on SteamOS): #89 (comment)

Running memcpy benchmark

memcpy: 12.87 GB/s (1 thread)
sum:    -536869898.000000

CPU only

Running ggml_mul_mat benchmark with 8 threads

  64 x   64: Q4_0     2.3 GFLOPS (128 runs) | Q4_1     2.2 GFLOPS (128 runs)
  64 x   64: Q5_0     2.3 GFLOPS (128 runs) | Q5_1     1.4 GFLOPS (128 runs) | Q8_0     2.5 GFLOPS (128 runs)
  64 x   64: F16      2.4 GFLOPS (128 runs) | F32      2.5 GFLOPS (128 runs)
 128 x  128: Q4_0    16.1 GFLOPS (128 runs) | Q4_1    14.7 GFLOPS (128 runs)
 128 x  128: Q5_0    15.4 GFLOPS (128 runs) | Q5_1    15.3 GFLOPS (128 runs) | Q8_0    16.8 GFLOPS (128 runs)
 128 x  128: F16     16.2 GFLOPS (128 runs) | F32     14.7 GFLOPS (128 runs)
 256 x  256: Q4_0    61.3 GFLOPS (128 runs) | Q4_1    58.5 GFLOPS (128 runs)
 256 x  256: Q5_0    55.1 GFLOPS (128 runs) | Q5_1    53.3 GFLOPS (128 runs) | Q8_0    65.9 GFLOPS (128 runs)
 256 x  256: F16     55.2 GFLOPS (128 runs) | F32     54.8 GFLOPS (128 runs)
 512 x  512: Q4_0   107.3 GFLOPS (128 runs) | Q4_1   109.4 GFLOPS (128 runs)
 512 x  512: Q5_0    88.9 GFLOPS (128 runs) | Q5_1    84.6 GFLOPS (128 runs) | Q8_0   129.5 GFLOPS (128 runs)
 512 x  512: F16     77.7 GFLOPS (128 runs) | F32     83.7 GFLOPS (128 runs)
1024 x 1024: Q4_0   127.9 GFLOPS ( 60 runs) | Q4_1   132.8 GFLOPS ( 62 runs)
1024 x 1024: Q5_0   103.0 GFLOPS ( 48 runs) | Q5_1   102.6 GFLOPS ( 48 runs) | Q8_0   159.2 GFLOPS ( 75 runs)
1024 x 1024: F16     82.5 GFLOPS ( 39 runs) | F32     84.6 GFLOPS ( 40 runs)
2048 x 2048: Q4_0   136.0 GFLOPS (  8 runs) | Q4_1   141.5 GFLOPS (  9 runs)
2048 x 2048: Q5_0   107.6 GFLOPS (  7 runs) | Q5_1   108.3 GFLOPS (  7 runs) | Q8_0   164.3 GFLOPS ( 10 runs)
2048 x 2048: F16     83.9 GFLOPS (  5 runs) | F32     92.2 GFLOPS (  6 runs)
4096 x 4096: Q4_0   138.2 GFLOPS (  3 runs) | Q4_1   144.9 GFLOPS (  3 runs)
4096 x 4096: Q5_0   109.5 GFLOPS (  3 runs) | Q5_1   110.5 GFLOPS (  3 runs) | Q8_0   167.6 GFLOPS (  3 runs)
4096 x 4096: F16     83.0 GFLOPS (  3 runs) | F32     74.1 GFLOPS (  3 runs)
CPU OS Config Model Th Enc. Dec. PP Commit
AMD Custom APU 0405 Arch Linux AVX2 tiny 8 630.13 3.49 105.05 903c957
AMD Custom APU 0405 Arch Linux AVX2 tiny-q5_1 8 590.61 2.08 96.99 903c957
AMD Custom APU 0405 Arch Linux AVX2 base 8 1465.93 5.90 248.97 903c957
AMD Custom APU 0405 Arch Linux AVX2 base-q5_1 8 1333.47 3.43 224.44 903c957
AMD Custom APU 0405 Arch Linux AVX2 small 8 5504.17 15.86 938.16 903c957
AMD Custom APU 0405 Arch Linux AVX2 small-q5_1 8 4814.22 9.51 814.53 903c957

using iGPU

Running ggml_mul_mat benchmark with 8 threads

ggml_opencl: selecting platform: 'AMD Accelerated Parallel Processing'
ggml_opencl: selecting device: 'gfx1033'
ggml_opencl: device FP16 support: true
  64 x   64: Q4_0     0.5 GFLOPS (128 runs) | Q4_1     0.5 GFLOPS (128 runs)
  64 x   64: Q5_0     0.5 GFLOPS (128 runs) | Q5_1     0.5 GFLOPS (128 runs) | Q8_0     0.5 GFLOPS (128 runs)
  64 x   64: F16      0.5 GFLOPS (128 runs) | F32      0.5 GFLOPS (128 runs)
 128 x  128: Q4_0     4.0 GFLOPS (128 runs) | Q4_1     3.8 GFLOPS (128 runs)
 128 x  128: Q5_0     3.9 GFLOPS (128 runs) | Q5_1     3.9 GFLOPS (128 runs) | Q8_0     3.8 GFLOPS (128 runs)
 128 x  128: F16      3.8 GFLOPS (128 runs) | F32      4.0 GFLOPS (128 runs)
 256 x  256: Q4_0    24.4 GFLOPS (128 runs) | Q4_1    23.9 GFLOPS (128 runs)
 256 x  256: Q5_0    23.3 GFLOPS (128 runs) | Q5_1    23.9 GFLOPS (128 runs) | Q8_0    23.3 GFLOPS (128 runs)
 256 x  256: F16     25.9 GFLOPS (128 runs) | F32     24.3 GFLOPS (128 runs)
 512 x  512: Q4_0    70.6 GFLOPS (128 runs) | Q4_1    70.3 GFLOPS (128 runs)
 512 x  512: Q5_0    69.2 GFLOPS (128 runs) | Q5_1    70.7 GFLOPS (128 runs) | Q8_0    68.3 GFLOPS (128 runs)
 512 x  512: F16    140.9 GFLOPS (128 runs) | F32     67.0 GFLOPS (128 runs)
1024 x 1024: Q4_0   267.7 GFLOPS (125 runs) | Q4_1   268.2 GFLOPS (125 runs)
1024 x 1024: Q5_0   266.0 GFLOPS (124 runs) | Q5_1   265.2 GFLOPS (124 runs) | Q8_0   278.5 GFLOPS (128 runs)
1024 x 1024: F16    304.0 GFLOPS (128 runs) | F32    273.9 GFLOPS (128 runs)
2048 x 2048: Q4_0   622.1 GFLOPS ( 37 runs) | Q4_1   635.6 GFLOPS ( 38 runs)
2048 x 2048: Q5_0   632.1 GFLOPS ( 37 runs) | Q5_1   634.8 GFLOPS ( 37 runs) | Q8_0   627.2 GFLOPS ( 37 runs)
2048 x 2048: F16    524.5 GFLOPS ( 31 runs) | F32    625.6 GFLOPS ( 37 runs)
4096 x 4096: Q4_0   724.7 GFLOPS (  6 runs) | Q4_1   728.6 GFLOPS (  6 runs)
4096 x 4096: Q5_0   723.8 GFLOPS (  6 runs) | Q5_1   726.4 GFLOPS (  6 runs) | Q8_0   722.9 GFLOPS (  6 runs)
4096 x 4096: F16    857.6 GFLOPS (  7 runs) | F32    725.6 GFLOPS (  6 runs)
CPU OS Config Model Th Enc. Dec. PP Commit
AMD Custom APU 0405 (iGPU) Arch Linux AVX2 BLAS tiny 8 574.83 3.67 206.55 903c957
AMD Custom APU 0405 (iGPU) Arch Linux AVX2 BLAS tiny-q5_1 8 599.32 2.49 231.62 903c957
AMD Custom APU 0405 (iGPU) Arch Linux AVX2 BLAS base 8 1119.73 6.06 379.41 903c957
AMD Custom APU 0405 (iGPU) Arch Linux AVX2 BLAS base-q5_1 8 1168.92 4.01 484.14 903c957
AMD Custom APU 0405 (iGPU) Arch Linux AVX2 BLAS small 8 3492.86 16.07 1093.41 903c957
AMD Custom APU 0405 (iGPU) Arch Linux AVX2 BLAS small-q5_1 8 3476.28 11.12 1403.84 903c957

Didn't test anything more as I didn't have the disk space to download them.

@henry2man

Hi there. These are some tests with a Mac Studio M1 Ultra with MacOS Sonoma 14.0.

Running memcpy benchmark

memcpy: 36.64 GB/s (1 thread)
sum:    -536871564.000000

Running ggml_mul_mat benchmark with 20 threads

  64 x   64: Q4_0     1.7 GFLOPS (128 runs) | Q4_1     2.1 GFLOPS (128 runs)
  64 x   64: Q5_0     2.5 GFLOPS (128 runs) | Q5_1     2.8 GFLOPS (128 runs) | Q8_0     2.8 GFLOPS (128 runs)
  64 x   64: F16      2.7 GFLOPS (128 runs) | F32      2.1 GFLOPS (128 runs)
 128 x  128: Q4_0    15.0 GFLOPS (128 runs) | Q4_1    20.1 GFLOPS (128 runs)
 128 x  128: Q5_0    19.5 GFLOPS (128 runs) | Q5_1    18.4 GFLOPS (128 runs) | Q8_0    17.2 GFLOPS (128 runs)
 128 x  128: F16     19.0 GFLOPS (128 runs) | F32     21.1 GFLOPS (128 runs)
 256 x  256: Q4_0   101.8 GFLOPS (128 runs) | Q4_1    94.0 GFLOPS (128 runs)
 256 x  256: Q5_0   100.0 GFLOPS (128 runs) | Q5_1    98.2 GFLOPS (128 runs) | Q8_0    97.4 GFLOPS (128 runs)
 256 x  256: F16     97.1 GFLOPS (128 runs) | F32     92.9 GFLOPS (128 runs)
 512 x  512: Q4_0   388.1 GFLOPS (128 runs) | Q4_1   365.4 GFLOPS (128 runs)
 512 x  512: Q5_0   412.8 GFLOPS (128 runs) | Q5_1   432.6 GFLOPS (128 runs) | Q8_0   405.2 GFLOPS (128 runs)
 512 x  512: F16    444.4 GFLOPS (128 runs) | F32    491.0 GFLOPS (128 runs)
1024 x 1024: Q4_0  1172.6 GFLOPS (128 runs) | Q4_1  1235.8 GFLOPS (128 runs)
1024 x 1024: Q5_0  1462.5 GFLOPS (128 runs) | Q5_1  1483.5 GFLOPS (128 runs) | Q8_0  1432.8 GFLOPS (128 runs)
1024 x 1024: F16   1515.2 GFLOPS (128 runs) | F32   1738.2 GFLOPS (128 runs)
2048 x 2048: Q4_0  2588.0 GFLOPS (128 runs) | Q4_1  2340.6 GFLOPS (128 runs)
2048 x 2048: Q5_0  2508.7 GFLOPS (128 runs) | Q5_1  2476.3 GFLOPS (128 runs) | Q8_0  2717.1 GFLOPS (128 runs)
2048 x 2048: F16   2707.1 GFLOPS (128 runs) | F32   2898.0 GFLOPS (128 runs)
4096 x 4096: Q4_0  2698.6 GFLOPS ( 20 runs) | Q4_1  2538.5 GFLOPS ( 19 runs)
4096 x 4096: Q5_0  2593.5 GFLOPS ( 19 runs) | Q5_1  2594.1 GFLOPS ( 19 runs) | Q8_0  2664.0 GFLOPS ( 20 runs)
4096 x 4096: F16   2748.2 GFLOPS ( 20 runs) | F32   2825.4 GFLOPS ( 21 runs)

Running benchmark for all models
This can take a while!

CPU OS Config Model Th Enc. Dec. PP Commit
Apple M1 Ultra MacOS 14.0 NEON BLAS METAL tiny 20 22.85 1.97 4.22 c76c11e
Apple M1 Ultra MacOS 14.0 NEON BLAS METAL base 20 31.55 2.78 6.48 c76c11e
Apple M1 Ultra MacOS 14.0 NEON BLAS METAL small 20 81.78 4.91 17.00 c76c11e
Apple M1 Ultra MacOS 14.0 NEON BLAS METAL medium 20 217.00 10.54 41.68 c76c11e
Apple M1 Ultra MacOS 14.0 NEON BLAS METAL large 20 358.84 14.48 73.92 c76c11e

@Bad-Science

@henry2man amazing results. What compiler options are you using / are you using ANE?

@henry2man

henry2man commented Oct 12, 2023

@henry2man amazing results. What compiler options are you using / are you using ANE?

I was simply experimenting with the project, so as far as I remember I just used the extra/bench-all.sh shell script without any other tweaks, just following the given instructions.

If you want me to execute specific tests please tell me and I'll be glad to contribute the results here.

EDIT: after deeper research, I think I didn't make use of the ANE, but I'm not 100% sure.

@nickovs

nickovs commented Nov 3, 2023

Results for the new Raspberry Pi 5. Tests performed on a board with the active cooler. uname -a output is:

Linux newpi 6.1.0-rpi4-rpi-2712 #1 SMP PREEMPT Debian 1:6.1.54-1+rpt2 (2023-10-05) aarch64 GNU/Linux
CPU OS Config Threads Model Encode Decode Commit
BCM2712 Bookworm 12.2 NEON 4 tiny 1106.11 183.67 54c978c
BCM2712 Bookworm 12.2 NEON 4 tiny.en 1109.66 201.3 54c978c
BCM2712 Bookworm 12.2 NEON 4 base 2479.82 346.65 54c978c
BCM2712 Bookworm 12.2 NEON 4 base.en 2465.12 363.86 54c978c
BCM2712 Bookworm 12.2 NEON 4 small 8308.3 963.24 54c978c
BCM2712 Bookworm 12.2 NEON 4 small.en 8342.25 1119.25 54c978c
BCM2712 Bookworm 12.2 NEON 4 medium.en 26407.77 2893.55 54c978c
BCM2712 Bookworm 12.2 NEON 4 medium 26468.86 2919.43 54c978c

These results are 4.5 to 6.2 times faster than the Raspberry Pi 4.

NOTE: The packaged version of OpenBLAS has not been recompiled for the new CPU architecture, so it is about 50% slower than whisper.cpp's native NEON implementation. I will post benchmarks using OpenBLAS once I have built a version for the new CPU.
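As a sketch of what that build might look like (assumptions on my part: the OpenMathLib repo URL and the NEOVERSEN1 target, the closest relative of the Pi 5's Cortex-A76 I would expect in the OpenBLAS TargetList; the install prefix may also need adjusting for the whisper.cpp Makefile to find the library):

# build OpenBLAS tuned for a Cortex-A76-class core
git clone https://github.com/OpenMathLib/OpenBLAS
cd OpenBLAS
make -j TARGET=NEOVERSEN1
sudo make install

# rebuild whisper.cpp against it
cd ../whisper.cpp
make clean
WHISPER_OPENBLAS=1 make -j bench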

The memcpy and ggml_mul_mat benchmarks show:

memcpy: 4.64 GB/s (1 thread)
sum:    136902081526.000000

  64 x   64: Q4_0     5.5 GFLOPS (128 runs) | Q4_1     5.1 GFLOPS (128 runs)
  64 x   64: Q5_0     4.7 GFLOPS (128 runs) | Q5_1     4.9 GFLOPS (128 runs) | Q8_0     5.0 GFLOPS (128 runs)
  64 x   64: F16      5.0 GFLOPS (128 runs) | F32      4.9 GFLOPS (128 runs)
 128 x  128: Q4_0    22.9 GFLOPS (128 runs) | Q4_1    22.6 GFLOPS (128 runs)
 128 x  128: Q5_0    19.7 GFLOPS (128 runs) | Q5_1    20.3 GFLOPS (128 runs) | Q8_0    23.9 GFLOPS (128 runs)
 128 x  128: F16     26.3 GFLOPS (128 runs) | F32     13.3 GFLOPS (128 runs)
 256 x  256: Q4_0    39.0 GFLOPS (128 runs) | Q4_1    49.4 GFLOPS (128 runs)
 256 x  256: Q5_0    33.0 GFLOPS (128 runs) | Q5_1    37.5 GFLOPS (128 runs) | Q8_0    58.6 GFLOPS (128 runs)
 256 x  256: F16     64.1 GFLOPS (128 runs) | F32     48.4 GFLOPS (128 runs)
 512 x  512: Q4_0    62.6 GFLOPS (128 runs) | Q4_1    62.3 GFLOPS (128 runs)
 512 x  512: Q5_0    49.9 GFLOPS (128 runs) | Q5_1    46.1 GFLOPS (128 runs) | Q8_0    76.2 GFLOPS (128 runs)
 512 x  512: F16     80.1 GFLOPS (128 runs) | F32     51.1 GFLOPS (128 runs)
1024 x 1024: Q4_0    67.9 GFLOPS ( 32 runs) | Q4_1    67.6 GFLOPS ( 32 runs)
1024 x 1024: Q5_0    53.5 GFLOPS ( 25 runs) | Q5_1    50.4 GFLOPS ( 24 runs) | Q8_0    85.4 GFLOPS ( 40 runs)
1024 x 1024: F16     92.9 GFLOPS ( 44 runs) | F32     48.0 GFLOPS ( 23 runs)
2048 x 2048: Q4_0    71.0 GFLOPS (  5 runs) | Q4_1    72.2 GFLOPS (  5 runs)
2048 x 2048: Q5_0    55.7 GFLOPS (  4 runs) | Q5_1    52.3 GFLOPS (  4 runs) | Q8_0    87.6 GFLOPS (  6 runs)
2048 x 2048: F16     93.1 GFLOPS (  6 runs) | F32     43.9 GFLOPS (  3 runs)
4096 x 4096: Q4_0    72.2 GFLOPS (  3 runs) | Q4_1    73.9 GFLOPS (  3 runs)
4096 x 4096: Q5_0    55.9 GFLOPS (  3 runs) | Q5_1    52.7 GFLOPS (  3 runs) | Q8_0    86.9 GFLOPS (  3 runs)
4096 x 4096: F16     86.8 GFLOPS (  3 runs) | F32     38.4 GFLOPS (  3 runs)

@marjisound

CPU details: Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz
GPU name: NVIDIA Tesla T4
OS: Linux 14 22.04.1-Ubuntu
Compiler: cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0

WHISPER_CUBLAS=1 make -j bench && ./extra/bench-all.sh

I whisper.cpp build info:
I UNAME_S:  Linux
I UNAME_P:  x86_64
I UNAME_M:  x86_64
I CFLAGS:   -I.              -O3 -DNDEBUG -std=c11   -fPIC -pthread -mavx2 -mfma -mf16c -mavx -msse3 -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include
I CXXFLAGS: -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -pthread -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include
I LDFLAGS:  -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/opt/cuda/lib64 -L/targets/x86_64-linux/lib
I CC:       cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
I CXX:      g++ (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0

make: 'bench' is up to date.
Usage: ./bench.sh [n_threads]

Running memcpy benchmark with 1 thread

memcpy: 5.05 GB/s
sum:    -536870997.000000

Running ggml_mul_mat benchmark with 4 threads

ggml_mul_mat:   64 x   64: Q4_0     3.8 GFLOPS (128 runs) / Q4_1     3.8 GFLOPS (128 runs) / F16     3.8 GFLOPS (128 runs) / F32     3.9 GFLOPS (128 runs)
ggml_mul_mat:  128 x  128: Q4_0    23.6 GFLOPS (128 runs) / Q4_1    24.0 GFLOPS (128 runs) / F16    22.1 GFLOPS (128 runs) / F32    22.4 GFLOPS (128 runs)
ggml_mul_mat:  256 x  256: Q4_0    90.3 GFLOPS (128 runs) / Q4_1   100.0 GFLOPS (128 runs) / F16    92.0 GFLOPS (128 runs) / F32    92.3 GFLOPS (128 runs)
ggml_mul_mat:  512 x  512: Q4_0   278.8 GFLOPS (128 runs) / Q4_1   277.6 GFLOPS (128 runs) / F16   244.9 GFLOPS (128 runs) / F32   242.1 GFLOPS (128 runs)
ggml_mul_mat: 1024 x 1024: Q4_0   859.2 GFLOPS (128 runs) / Q4_1   853.6 GFLOPS (128 runs) / F16   648.3 GFLOPS (128 runs) / F32   685.4 GFLOPS (128 runs)
ggml_mul_mat: 2048 x 2048: Q4_0  1583.4 GFLOPS ( 93 runs) / Q4_1  1585.1 GFLOPS ( 93 runs) / F16  1383.9 GFLOPS ( 81 runs) / F32  1359.7 GFLOPS ( 80 runs)
ggml_mul_mat: 4096 x 4096: Q4_0  2525.9 GFLOPS ( 19 runs) / Q4_1  2658.6 GFLOPS ( 20 runs) / F16  2716.0 GFLOPS ( 20 runs) / F32  2302.7 GFLOPS ( 17 runs)

Running benchmark for all models
This can take a while!

CPU OS Config Model Th Load Enc. Commit
Xeon(R) Ubuntu AVX2 BLAS tiny 4 429 550 fa8dbdc
Xeon(R) Ubuntu AVX2 BLAS base 4 521 1133 fa8dbdc
Xeon(R) Ubuntu AVX2 BLAS small 4 798 3025 fa8dbdc
Xeon(R) Ubuntu AVX2 BLAS medium 4 1701 7639 fa8dbdc
Xeon(R) Ubuntu AVX2 BLAS large 4 2966 12927 fa8dbdc

@StuartIanNaylor

StuartIanNaylor commented Nov 3, 2023

What's happening with commit 8a2bee6?
I was just interested in comparing the same master build on the Opi5 vs the Rpi5, but I seem to have an extra PP column that I am sure I will find a use for.
Rpi 5gb
Linux raspberrypi 6.1.0-rpi4-rpi-2712 #1 SMP PREEMPT Debian 1:6.1.54-1+rpt2 (2023-10-05) aarch64 GNU/Linux

memcpy: 5.32 GB/s (1 thread)
sum:    136902081526.000000

Running ggml_mul_mat benchmark with 4 threads

  64 x   64: Q4_0     6.0 GFLOPS (128 runs) | Q4_1     5.9 GFLOPS (128 runs)
  64 x   64: Q5_0     5.3 GFLOPS (128 runs) | Q5_1     4.9 GFLOPS (128 runs) | Q8_0     1.9 GFLOPS (128 runs)
  64 x   64: F16      6.0 GFLOPS (128 runs) | F32      5.8 GFLOPS (128 runs)
 128 x  128: Q4_0    23.9 GFLOPS (128 runs) | Q4_1    22.6 GFLOPS (128 runs)
 128 x  128: Q5_0    21.4 GFLOPS (128 runs) | Q5_1    20.4 GFLOPS (128 runs) | Q8_0    11.4 GFLOPS (128 runs)
 128 x  128: F16     28.6 GFLOPS (128 runs) | F32     26.2 GFLOPS (128 runs)
 256 x  256: Q4_0    49.8 GFLOPS (128 runs) | Q4_1    49.6 GFLOPS (128 runs)
 256 x  256: Q5_0    40.9 GFLOPS (128 runs) | Q5_1    24.8 GFLOPS (128 runs) | Q8_0    59.0 GFLOPS (128 runs)
 256 x  256: F16     63.0 GFLOPS (128 runs) | F32     29.6 GFLOPS (128 runs)
 512 x  512: Q4_0    56.6 GFLOPS (128 runs) | Q4_1    56.5 GFLOPS (128 runs)
 512 x  512: Q5_0    30.4 GFLOPS (114 runs) | Q5_1    36.5 GFLOPS (128 runs) | Q8_0    71.2 GFLOPS (128 runs)
 512 x  512: F16     64.6 GFLOPS (128 runs) | F32     35.2 GFLOPS (128 runs)
1024 x 1024: Q4_0    67.4 GFLOPS ( 32 runs) | Q4_1    68.7 GFLOPS ( 32 runs)
1024 x 1024: Q5_0    38.1 GFLOPS ( 18 runs) | Q5_1    32.3 GFLOPS ( 16 runs) | Q8_0    61.3 GFLOPS ( 29 runs)
1024 x 1024: F16     71.7 GFLOPS ( 34 runs) | F32     35.1 GFLOPS ( 17 runs)
2048 x 2048: Q4_0    71.4 GFLOPS (  5 runs) | Q4_1    71.5 GFLOPS (  5 runs)
2048 x 2048: Q5_0    38.1 GFLOPS (  3 runs) | Q5_1    36.9 GFLOPS (  3 runs) | Q8_0    63.5 GFLOPS (  4 runs)
2048 x 2048: F16     68.6 GFLOPS (  4 runs) | F32     32.0 GFLOPS (  3 runs)
4096 x 4096: Q4_0    66.8 GFLOPS (  3 runs) | Q4_1    62.4 GFLOPS (  3 runs)
4096 x 4096: Q5_0    39.5 GFLOPS (  3 runs) | Q5_1    37.0 GFLOPS (  3 runs) | Q8_0    62.7 GFLOPS (  3 runs)
4096 x 4096: F16     61.5 GFLOPS (  3 runs) | F32     29.1 GFLOPS (  3 runs)

Running benchmark for all models
This can take a while!

|    CPU |     OS |           Config |       Model |  Th |    Enc. |    Dec. |      PP |  Commit |
|    --- |    --- |              --- |         --- | --- |     --- |     --- |     --- |     --- |
| Rpi5 BCM2712 | bookworm |             NEON |        tiny |   4 | 1206.23 |    6.67 |  198.84 | 8a2bee6 |
| Rpi5 BCM2712 | bookworm |             NEON |        base |   4 | 2862.56 |   11.74 |  466.51 | 8a2bee6 |
| Rpi5 BCM2712 | bookworm |             NEON |       small |   4 | 9630.88 |   32.81 | 1650.18 | 8a2bee6 |
| Rpi5 BCM2712 | bookworm |             NEON |      medium |   4 |      ms |   99.64 | 5601.57 | 8a2bee6 |

Opi5 4gb
Linux ubuntu 6.6.0 #1 SMP PREEMPT Mon Oct 30 22:54:25 GMT 2023 aarch64 aarch64 aarch64 GNU/Linux
Mainline Linux rather than the Rockchip BSP: https://github.com/Joshua-Riek/ubuntu-rockchip/releases/tag/v1.29.1

memcpy: 10.93 GB/s (1 thread)
sum:    136902081526.000000

Running ggml_mul_mat benchmark with 4 threads

  64 x   64: Q4_0     6.8 GFLOPS (128 runs) | Q4_1     4.1 GFLOPS (128 runs)
  64 x   64: Q5_0     5.9 GFLOPS (128 runs) | Q5_1     6.0 GFLOPS (128 runs) | Q8_0     6.6 GFLOPS (128 runs)
  64 x   64: F16      4.1 GFLOPS (128 runs) | F32      6.8 GFLOPS (128 runs)
 128 x  128: Q4_0    14.0 GFLOPS (128 runs) | Q4_1    19.1 GFLOPS (128 runs)
 128 x  128: Q5_0    15.5 GFLOPS (128 runs) | Q5_1    12.7 GFLOPS (128 runs) | Q8_0    26.6 GFLOPS (128 runs)
 128 x  128: F16     22.1 GFLOPS (128 runs) | F32     21.2 GFLOPS (128 runs)
 256 x  256: Q4_0    45.0 GFLOPS (128 runs) | Q4_1    45.0 GFLOPS (128 runs)
 256 x  256: Q5_0    29.0 GFLOPS (128 runs) | Q5_1    29.6 GFLOPS (128 runs) | Q8_0    42.8 GFLOPS (128 runs)
 256 x  256: F16     42.5 GFLOPS (128 runs) | F32     42.6 GFLOPS (128 runs)
 512 x  512: Q4_0    55.8 GFLOPS (128 runs) | Q4_1    56.0 GFLOPS (128 runs)
 512 x  512: Q5_0    35.5 GFLOPS (128 runs) | Q5_1    36.7 GFLOPS (128 runs) | Q8_0    61.9 GFLOPS (128 runs)
 512 x  512: F16     80.7 GFLOPS (128 runs) | F32     49.6 GFLOPS (128 runs)
1024 x 1024: Q4_0    60.6 GFLOPS ( 29 runs) | Q4_1    61.4 GFLOPS ( 29 runs)
1024 x 1024: Q5_0    37.6 GFLOPS ( 18 runs) | Q5_1    39.3 GFLOPS ( 19 runs) | Q8_0    68.2 GFLOPS ( 32 runs)
1024 x 1024: F16     93.1 GFLOPS ( 44 runs) | F32     46.4 GFLOPS ( 22 runs)
2048 x 2048: Q4_0    63.1 GFLOPS (  4 runs) | Q4_1    64.1 GFLOPS (  4 runs)
2048 x 2048: Q5_0    39.2 GFLOPS (  3 runs) | Q5_1    41.0 GFLOPS (  3 runs) | Q8_0    70.9 GFLOPS (  5 runs)
2048 x 2048: F16     87.9 GFLOPS (  6 runs) | F32     41.4 GFLOPS (  3 runs)
4096 x 4096: Q4_0    64.2 GFLOPS (  3 runs) | Q4_1    65.3 GFLOPS (  3 runs)
4096 x 4096: Q5_0    39.7 GFLOPS (  3 runs) | Q5_1    41.7 GFLOPS (  3 runs) | Q8_0    70.7 GFLOPS (  3 runs)
4096 x 4096: F16     80.7 GFLOPS (  3 runs) | F32     38.7 GFLOPS (  3 runs)

Running benchmark for all models
This can take a while!

|    CPU |     OS |           Config |       Model |  Th |    Enc. |    Dec. |      PP |  Commit |
|    --- |    --- |              --- |         --- | --- |     --- |     --- |     --- |     --- |
| Opi5 Rk3588s | 22.04.3 LTS (Jammy Jellyfish) |             NEON |        tiny |   4 |  782.52 |    3.10 |  135.25 | 8a2bee6 |
| Opi5 Rk3588s | 22.04.3 LTS (Jammy Jellyfish) |             NEON |        base |   4 | 1754.69 |   11.81 |  304.06 | 8a2bee6 |
| Opi5 Rk3588s | 22.04.3 LTS (Jammy Jellyfish) |             NEON |       small |   4 | 6226.10 |   15.26 | 1075.54 | 8a2bee6 |
| Opi5 Rk3588s | 22.04.3 LTS (Jammy Jellyfish) |             NEON |      medium |   4 |      ms |   44.75 | 3425.05 | 8a2bee6 |

@ggerganov
Owner Author

ggerganov commented Nov 3, 2023

@nickovs These are some very interesting results. Looking forward to the OpenBLAS results as well.

@StuartIanNaylor The PP timing is the "prompt processing" time for a prompt of 256 tokens. As we transcribe with whisper, the context (i.e. the previously transcribed text) grows up to n_text_ctx. For each new audio segment that we process, we have to process the context. This processing is very similar to the token-by-token text generation during decoding, but it is much faster since we process 256 tokens at once.

@nickovs

nickovs commented Nov 3, 2023

By way of comparison to the benchmarks I posted above, here are the matrix multiplication numbers for the same Raspberry Pi 5 using OpenBLAS. It is notable that Whisper.cpp's native NEON code outperforms OpenBLAS on the Pi5 for everything except FP32, where OpenBLAS wins by some margin.

  64 x   64: Q4_0     4.4 GFLOPS (128 runs) | Q4_1     4.3 GFLOPS (128 runs)
  64 x   64: Q5_0     3.7 GFLOPS (128 runs) | Q5_1     4.2 GFLOPS (128 runs) | Q8_0     4.1 GFLOPS (128 runs)
  64 x   64: F16      4.0 GFLOPS (128 runs) | F32      4.1 GFLOPS (128 runs)
 128 x  128: Q4_0     0.9 GFLOPS (128 runs) | Q4_1     0.9 GFLOPS (128 runs)
 128 x  128: Q5_0     0.9 GFLOPS (128 runs) | Q5_1     0.9 GFLOPS (128 runs) | Q8_0     0.9 GFLOPS (128 runs)
 128 x  128: F16      0.9 GFLOPS (128 runs) | F32      0.9 GFLOPS (128 runs)
 256 x  256: Q4_0     6.3 GFLOPS (128 runs) | Q4_1     6.4 GFLOPS (128 runs)
 256 x  256: Q5_0     6.4 GFLOPS (128 runs) | Q5_1     6.3 GFLOPS (128 runs) | Q8_0     6.4 GFLOPS (128 runs)
 256 x  256: F16      6.4 GFLOPS (128 runs) | F32      6.5 GFLOPS (128 runs)
 512 x  512: Q4_0    19.7 GFLOPS ( 74 runs) | Q4_1    20.4 GFLOPS ( 76 runs)
 512 x  512: Q5_0    23.7 GFLOPS ( 89 runs) | Q5_1    23.5 GFLOPS ( 89 runs) | Q8_0    23.7 GFLOPS ( 89 runs)
 512 x  512: F16     24.0 GFLOPS ( 90 runs) | F32     25.3 GFLOPS ( 95 runs)
1024 x 1024: Q4_0    35.5 GFLOPS ( 17 runs) | Q4_1    36.5 GFLOPS ( 17 runs)
1024 x 1024: Q5_0    38.9 GFLOPS ( 19 runs) | Q5_1    39.1 GFLOPS ( 19 runs) | Q8_0    38.7 GFLOPS ( 19 runs)
1024 x 1024: F16     39.3 GFLOPS ( 19 runs) | F32     40.9 GFLOPS ( 20 runs)
2048 x 2048: Q4_0    52.8 GFLOPS (  4 runs) | Q4_1    55.4 GFLOPS (  4 runs)
2048 x 2048: Q5_0    56.8 GFLOPS (  4 runs) | Q5_1    55.6 GFLOPS (  4 runs) | Q8_0    56.5 GFLOPS (  4 runs)
2048 x 2048: F16     56.1 GFLOPS (  4 runs) | F32     56.4 GFLOPS (  4 runs)
4096 x 4096: Q4_0    55.3 GFLOPS (  3 runs) | Q4_1    56.9 GFLOPS (  3 runs)
4096 x 4096: Q5_0    58.9 GFLOPS (  3 runs) | Q5_1    60.0 GFLOPS (  3 runs) | Q8_0    61.4 GFLOPS (  3 runs)
4096 x 4096: F16     59.3 GFLOPS (  3 runs) | F32     60.4 GFLOPS (  3 runs)

I have not tried all the tuning options in OpenBLAS, but the options I did try didn't really change the performance compared to the pre-packaged version.
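
(If you want a like-for-like run, one knob worth fixing is the OpenBLAS thread count so its pool matches the -t 4 used above; a minimal sketch, assuming OpenBLAS's public C API:)

```cpp
// Minimal sketch: pin OpenBLAS to 4 threads before benchmarking so its
// thread pool matches whisper.cpp's -t 4 and doesn't oversubscribe cores.
extern "C" void openblas_set_num_threads(int num_threads); // OpenBLAS C API

int main() {
    openblas_set_num_threads(4);
    // ... run the matrix multiplication benchmark from here ...
    return 0;
}
```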

@StuartIanNaylor

StuartIanNaylor commented Nov 4, 2023

> I have not tried all the tuning options in OpenBLAS, but the options I did try didn't really change the performance compared to the pre-packaged version.

I think this is where we benefit from Armv8.2, being in a subgroup of the Apple Silicon first-class citizens that are optimized via ARM NEON.
If you run lscpu you get:
Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp
So I guess we benefit from GGML being optimised around the Armv8.2+ architecture.
What should be interesting with https://github.com/ggerganov/whisper.cpp#opencl-gpu-support-via-clblast is that the GPU on the Pi5 & RK3588(s) should be able to use OpenCL, but in testing I am finding the same there and wondering if the cause is similar.
I never worked out whether, due to the serial nature of Whisper, you only get a speedup if the GPU is faster than the CPU; in my testing I get a huge slowdown, whilst in other ML tests the supposed 610.6 FP32 GFLOPS of the Mali G610 works mightily, at approx 75% of the CPU, with ArmNN tests using the GPU TFLite OpenCL delegate.
I am presuming CLBlast is somewhat similar and may not be well optimised for some data types?

> These results are 4.5 to 6.2 times faster than the Raspberry Pi 4.

Not too sure about that, as the same commit would likely have to be tested; I seem to remember pegging the RK3588s at < 5x a Pi4 and, likely due to memory bandwidth, quite a bit faster than a Pi5.

Linux ubuntu 6.6.0 #1 SMP PREEMPT Opi5 4GB performance governor

memcpy: 10.50 GB/s (1 thread)
sum:    136902081526.000000

Running ggml_mul_mat benchmark with 4 threads

  64 x   64: Q4_0     3.5 GFLOPS (128 runs) | Q4_1     3.2 GFLOPS (128 runs)
  64 x   64: Q5_0     2.8 GFLOPS (128 runs) | Q5_1     2.7 GFLOPS (128 runs) | Q8_0     3.5 GFLOPS (128 runs)
  64 x   64: F16      3.4 GFLOPS (128 runs) | F32      3.4 GFLOPS (128 runs)
 128 x  128: Q4_0     7.9 GFLOPS (128 runs) | Q4_1     8.1 GFLOPS (128 runs)
 128 x  128: Q5_0     6.2 GFLOPS (128 runs) | Q5_1     6.5 GFLOPS (128 runs) | Q8_0     7.9 GFLOPS (128 runs)
 128 x  128: F16      9.4 GFLOPS (128 runs) | F32      7.5 GFLOPS (128 runs)
 256 x  256: Q4_0    10.5 GFLOPS (128 runs) | Q4_1    11.1 GFLOPS (128 runs)
 256 x  256: Q5_0     7.9 GFLOPS (128 runs) | Q5_1     8.5 GFLOPS (128 runs) | Q8_0    10.3 GFLOPS (128 runs)
 256 x  256: F16     14.5 GFLOPS (128 runs) | F32      9.3 GFLOPS (128 runs)
 512 x  512: Q4_0    11.7 GFLOPS ( 44 runs) | Q4_1    12.4 GFLOPS ( 47 runs)
 512 x  512: Q5_0     8.8 GFLOPS ( 33 runs) | Q5_1     9.7 GFLOPS ( 37 runs) | Q8_0    11.4 GFLOPS ( 43 runs)
 512 x  512: F16     17.8 GFLOPS ( 67 runs) | F32      9.2 GFLOPS ( 35 runs)
1024 x 1024: Q4_0    32.2 GFLOPS ( 15 runs) | Q4_1    33.2 GFLOPS ( 16 runs)
1024 x 1024: Q5_0    24.9 GFLOPS ( 12 runs) | Q5_1    25.7 GFLOPS ( 12 runs) | Q8_0    35.2 GFLOPS ( 17 runs)
1024 x 1024: F16     38.0 GFLOPS ( 18 runs) | F32     27.5 GFLOPS ( 13 runs)
2048 x 2048: Q4_0    57.7 GFLOPS (  4 runs) | Q4_1    59.5 GFLOPS (  4 runs)
2048 x 2048: Q5_0    38.0 GFLOPS (  3 runs) | Q5_1    39.3 GFLOPS (  3 runs) | Q8_0    64.3 GFLOPS (  4 runs)
2048 x 2048: F16     77.9 GFLOPS (  5 runs) | F32     38.4 GFLOPS (  3 runs)
4096 x 4096: Q4_0    63.4 GFLOPS (  3 runs) | Q4_1    64.7 GFLOPS (  3 runs)
4096 x 4096: Q5_0    39.9 GFLOPS (  3 runs) | Q5_1    41.7 GFLOPS (  3 runs) | Q8_0    70.3 GFLOPS (  3 runs)
4096 x 4096: F16     78.6 GFLOPS (  3 runs) | F32     37.2 GFLOPS (  3 runs)

Running benchmark for all models
This can take a while!

|    CPU |     OS |           Config |       Model |  Th |    Enc. |    Dec. |      PP |  Commit |
|    --- |    --- |              --- |         --- | --- |     --- |     --- |     --- |     --- |
| <todo> | <todo> |             NEON |        tiny |   4 |  853.56 |    7.37 |  161.81 | f96e1c5 |
| <todo> | <todo> |             NEON |        base |   4 | 1847.86 |   13.00 |  338.18 | f96e1c5 |
| <todo> | <todo> |             NEON |       small |   4 | 6289.17 |   39.19 | 1109.25 | f96e1c5 |
| <todo> | <todo> |             NEON |      medium |   4 |      ms |   67.99 | 3454.96 | f96e1c5 |
| <todo> | <todo> |             NEON |       large |   4 |      ms |  107.50 | 6541.15 | f96e1c5 |

Linux raspberrypi 6.1.0-rpi4-rpi-2712 Rpi5 4GB performance governor

memcpy: 6.03 GB/s (1 thread)
sum:    136902081526.000000

Running ggml_mul_mat benchmark with 4 threads

  64 x   64: Q4_0     5.7 GFLOPS (128 runs) | Q4_1     5.5 GFLOPS (128 runs)
  64 x   64: Q5_0     5.3 GFLOPS (128 runs) | Q5_1     5.1 GFLOPS (128 runs) | Q8_0     5.6 GFLOPS (128 runs)
  64 x   64: F16      5.6 GFLOPS (128 runs) | F32      5.7 GFLOPS (128 runs)
 128 x  128: Q4_0    22.8 GFLOPS (128 runs) | Q4_1    24.1 GFLOPS (128 runs)
 128 x  128: Q5_0    12.3 GFLOPS (128 runs) | Q5_1    11.8 GFLOPS (128 runs) | Q8_0    11.3 GFLOPS (128 runs)
 128 x  128: F16     15.4 GFLOPS (128 runs) | F32     26.5 GFLOPS (128 runs)
 256 x  256: Q4_0    49.7 GFLOPS (128 runs) | Q4_1    50.3 GFLOPS (128 runs)
 256 x  256: Q5_0    41.8 GFLOPS (128 runs) | Q5_1    39.0 GFLOPS (128 runs) | Q8_0    59.7 GFLOPS (128 runs)
 256 x  256: F16     65.2 GFLOPS (128 runs) | F32     48.7 GFLOPS (128 runs)
 512 x  512: Q4_0    63.0 GFLOPS (128 runs) | Q4_1    63.6 GFLOPS (128 runs)
 512 x  512: Q5_0    50.5 GFLOPS (128 runs) | Q5_1    47.3 GFLOPS (128 runs) | Q8_0    77.7 GFLOPS (128 runs)
 512 x  512: F16     85.6 GFLOPS (128 runs) | F32     53.3 GFLOPS (128 runs)
1024 x 1024: Q4_0    68.1 GFLOPS ( 32 runs) | Q4_1    69.8 GFLOPS ( 33 runs)
1024 x 1024: Q5_0    54.1 GFLOPS ( 26 runs) | Q5_1    51.2 GFLOPS ( 24 runs) | Q8_0    86.0 GFLOPS ( 41 runs)
1024 x 1024: F16     93.6 GFLOPS ( 44 runs) | F32     49.0 GFLOPS ( 23 runs)
2048 x 2048: Q4_0    70.8 GFLOPS (  5 runs) | Q4_1    72.8 GFLOPS (  5 runs)
2048 x 2048: Q5_0    56.1 GFLOPS (  4 runs) | Q5_1    53.0 GFLOPS (  4 runs) | Q8_0    88.1 GFLOPS (  6 runs)
2048 x 2048: F16     93.7 GFLOPS (  6 runs) | F32     44.4 GFLOPS (  3 runs)
4096 x 4096: Q4_0    72.6 GFLOPS (  3 runs) | Q4_1    74.8 GFLOPS (  3 runs)
4096 x 4096: Q5_0    56.2 GFLOPS (  3 runs) | Q5_1    53.3 GFLOPS (  3 runs) | Q8_0    88.4 GFLOPS (  3 runs)
4096 x 4096: F16     86.7 GFLOPS (  3 runs) | F32     39.7 GFLOPS (  3 runs)

Running benchmark for all models
This can take a while!

|    CPU |     OS |           Config |       Model |  Th |    Enc. |    Dec. |      PP |  Commit |
|    --- |    --- |              --- |         --- | --- |     --- |     --- |     --- |     --- |
| <todo> | <todo> |             NEON |        tiny |   4 | 1049.00 |    6.74 |  149.32 | f96e1c5 |
| <todo> | <todo> |             NEON |        base |   4 | 2362.92 |   12.60 |  361.37 | f96e1c5 |
| <todo> | <todo> |             NEON |       small |   4 | 8081.87 |   35.65 | 1283.34 | f96e1c5 |
| <todo> | <todo> |             NEON |      medium |   4 |      ms |  105.77 | 4360.80 | f96e1c5 |
| <todo> | <todo> |             NEON |       large |   4 |      ms |  189.93 | 8158.78 | f96e1c5 |

To be honest I don't know why the GFLOPS are higher on the Pi5 whilst the Enc. (the biggest chunk of the process) is faster on the Opi5; maybe memory bandwidth?
It is a like-for-like comparison with the performance governor, given the preference for running Whisper that way (race-till-idle).

@nickovs

nickovs commented Nov 4, 2023

@StuartIanNaylor Here is a straight-up comparison of the same 54c978c commit between the Pi4 and the Pi5: first running the code compiled on the Pi4 on the Pi5, and then recompiling the same commit on the Pi5.

| Model | Pi4 Enc. | Pi4 Dec. | Pi4 code on Pi5 Enc. | Pi4 code on Pi5 Dec. | Speedup (same build) Enc. | Speedup (same build) Dec. | Recompiled on Pi5 Enc. | Recompiled on Pi5 Dec. | Speedup (recompiled) Enc. |
| --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| tiny | 5246.14 | 510.57 | 2694.38 | 188.38 | 1.95 | 2.71 | 1106.11 | 183.67 | 4.74 |
| tiny.en | 5264.76 | 551.17 | 2744.80 | 203.94 | 1.92 | 2.70 | 1109.66 | 201.3 | 4.74 |
| base.en | 12473.07 | 1004.23 | 6345.28 | 363.15 | 1.97 | 2.77 | 2479.82 | 346.65 | 5.03 |
| base | 12453.04 | 972.29 | 6399.54 | 348.33 | 1.95 | 2.79 | 2465.12 | 363.86 | 5.05 |
| small.en | 48849.9 | 3316.15 | 24127.58 | 961.75 | 2.02 | 3.45 | 8308.3 | 963.24 | 5.88 |
| small | 49671.25 | 2953 | 24134.46 | 1109.70 | 2.06 | 2.66 | 8342.25 | 1119.25 | 5.95 |
| medium.en | 169889.39 | 8451.51 | 79045.66 | 2815.81 | 2.15 | 3.00 | 26407.77 | 2893.55 | 6.43 |
| medium | 173236.92 | 8531.94 | 79075.19 | 2836.38 | 2.19 | 3.01 | 26468.86 | 2919.43 | 6.54 |

This suggests that there is a little better than a 2-fold performance improvement on encode, and more like a 2.8-fold improvement on decode, just from moving the code from the Pi4 to the Pi5. Recompiling on the Pi5 raises the encode performance to between 4.74 and 6.54 times faster than on the Pi4, but the decode performance remains only about 2.8 times faster than the Pi4 and doesn't benefit a great deal from the recompilation.

(Note that this table hits GitHub's 10 column limit, so the decode speedup may not be displayed, but the numbers are in the comment source.)

The key thing here as far as I'm concerned is that on the Pi5 the small model runs in better than real time, whereas on the Pi4 you were stuck using the tiny model for real-time work.
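
(For scale, assuming the encoder's fixed 30 s audio window: the ~8.3 s small.en encode on the Pi5 is a real-time factor of roughly 8.3/30 ≈ 0.28, while the ~48.8 s encode on the Pi4 is an RTF of about 1.6, i.e. slower than real time.)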

@jwinarske

It would be great to have a test-results DB for this. I'm thinking of something similar to what DRM info does.
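
For instance (field names purely illustrative), each submission could be one record pinned to a commit, mirroring the columns already used in this thread:

```cpp
#include <string>

// Hypothetical record for a shared results DB; the names are illustrative,
// the fields mirror the table columns reported in this thread.
struct BenchResult {
    std::string cpu;       // e.g. "Ryzen 9 5950X", "Opi5 Rk3588s"
    std::string os;        // e.g. "Ubuntu 22.04"
    std::string config;    // e.g. "NEON", "AVX2 BLAS"
    std::string model;     // tiny / base / small / medium / large
    int         n_threads; // Th
    double      enc_ms;    // Enc.
    double      dec_ms;    // Dec.
    double      pp_ms;     // PP (256-token prompt)
    std::string commit;    // results are commit-sensitive
};
```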

@StuartIanNaylor

StuartIanNaylor commented Nov 5, 2023

@jwinarske That would be great, maybe as a separate repo pinned to fixed commits, since we are not benching the software but the hardware.
The llama.cpp bench would be a good inclusion, as openLlama3b-q4 manages 20 tokens/s on an RK3588s-4gb.
I also like https://github.com/Tencent/ncnn/tree/master/benchmark as it is a pretty easy install and has a ready-made list of smaller YOLO-type models.

Linux ubuntu 6.6.0 #1 SMP PREEMPT Opi5 4GB performance governor 54c978c

memcpy: 11.18 GB/s (1 thread)
sum:    136902081526.000000

Running ggml_mul_mat benchmark with 4 threads

  64 x   64: Q4_0     3.5 GFLOPS (128 runs) | Q4_1     3.2 GFLOPS (128 runs)
  64 x   64: Q5_0     2.7 GFLOPS (128 runs) | Q5_1     2.8 GFLOPS (128 runs) | Q8_0     3.1 GFLOPS (128 runs)
  64 x   64: F16      3.3 GFLOPS (128 runs) | F32      3.2 GFLOPS (128 runs)
 128 x  128: Q4_0     7.8 GFLOPS (128 runs) | Q4_1     8.0 GFLOPS (128 runs)
 128 x  128: Q5_0     6.2 GFLOPS (128 runs) | Q5_1     6.5 GFLOPS (128 runs) | Q8_0     7.8 GFLOPS (128 runs)
 128 x  128: F16      9.5 GFLOPS (128 runs) | F32      7.5 GFLOPS (128 runs)
 256 x  256: Q4_0    10.6 GFLOPS (128 runs) | Q4_1    11.0 GFLOPS (128 runs)
 256 x  256: Q5_0     7.9 GFLOPS (128 runs) | Q5_1     8.4 GFLOPS (128 runs) | Q8_0    10.3 GFLOPS (128 runs)
 256 x  256: F16     14.8 GFLOPS (128 runs) | F32      9.3 GFLOPS (128 runs)
 512 x  512: Q4_0    11.8 GFLOPS ( 44 runs) | Q4_1    12.4 GFLOPS ( 47 runs)
 512 x  512: Q5_0     8.9 GFLOPS ( 34 runs) | Q5_1     9.7 GFLOPS ( 37 runs) | Q8_0    11.5 GFLOPS ( 43 runs)
 512 x  512: F16     17.8 GFLOPS ( 67 runs) | F32      9.6 GFLOPS ( 36 runs)
1024 x 1024: Q4_0    32.7 GFLOPS ( 16 runs) | Q4_1    33.3 GFLOPS ( 16 runs)
1024 x 1024: Q5_0    25.2 GFLOPS ( 12 runs) | Q5_1    27.0 GFLOPS ( 13 runs) | Q8_0    36.0 GFLOPS ( 17 runs)
1024 x 1024: F16     39.4 GFLOPS ( 19 runs) | F32     28.1 GFLOPS ( 14 runs)
2048 x 2048: Q4_0    58.2 GFLOPS (  4 runs) | Q4_1    60.0 GFLOPS (  4 runs)
2048 x 2048: Q5_0    37.2 GFLOPS (  3 runs) | Q5_1    38.8 GFLOPS (  3 runs) | Q8_0    63.3 GFLOPS (  4 runs)
2048 x 2048: F16     78.3 GFLOPS (  5 runs) | F32     38.0 GFLOPS (  3 runs)
4096 x 4096: Q4_0    63.9 GFLOPS (  3 runs) | Q4_1    64.9 GFLOPS (  3 runs)
4096 x 4096: Q5_0    39.6 GFLOPS (  3 runs) | Q5_1    41.5 GFLOPS (  3 runs) | Q8_0    70.3 GFLOPS (  3 runs)
4096 x 4096: F16     78.6 GFLOPS (  3 runs) | F32     35.1 GFLOPS (  3 runs)

Running benchmark for all models
This can take a while!

|    CPU |     OS |           Config |       Model |  Th |    Enc. |    Dec. |      PP |  Commit |
|    --- |    --- |              --- |         --- | --- |     --- |     --- |     --- |     --- |
| <todo> | <todo> |             NEON |        tiny |   4 |  885.27 |    7.35 |  166.54 | 54c978c |
| <todo> | <todo> |             NEON |        base |   4 | 1888.93 |   12.61 |  347.61 | 54c978c |
| <todo> | <todo> |             NEON |       small |   4 | 6397.88 |   38.49 | 1111.82 | 54c978c |
| <todo> | <todo> |             NEON |      medium |   4 |      ms |   68.98 | 3511.72 | 54c978c |

@nickovs Dunno; as before, the A76 gets vector mat/mul and the code is optimised for Armv8.2+, so the poor Pi4 with OpenBLAS was approx < 5 times slower than an RK3588s.
The above is just the same commit on an Opi5-4gb, so zram and swap come into play with the bigger models, but from audio in to text out I last pegged the Pi4 as approx just less than 5x slower, ignoring models it didn't manage in real time.
I guess further optimisations have happened; the decode is less important to the overall time than the Enc., as that is the biggest part of the process.

(venv) pi@raspberrypi:~/llama.cpp $ ./llama-bench -m  models/3b/open-llama-3b-q4_0.gguf -t 4
| model                          |       size |     params | backend    |    threads | test       |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ---------: | ---------- | ---------------: |
| llama 3B mostly Q4_0           |   1.84 GiB |     3.43 B | CPU        |          4 | pp 512     |      9.77 ± 0.01 |
| llama 3B mostly Q4_0           |   1.84 GiB |     3.43 B | CPU        |          4 | tg 128     |      5.42 ± 0.00 |

build: c41ea36 (1487)

ubuntu@ubuntu:~/llama.cpp$ ./llama-bench -m models/3b/open-llama-3b-q4_0.gguf -t 4
| model                          |       size |     params | backend    |    threads | test       |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ---------: | ---------- | ---------------: |
| llama 3B mostly Q4_0           |   1.84 GiB |     3.43 B | CPU        |          4 | pp 512     |      9.14 ± 0.01 |
| llama 3B mostly Q4_0           |   1.84 GiB |     3.43 B | CPU        |          4 | tg 128     |      7.06 ± 0.05 |

ryanrapp pushed a commit to ryanrapp/attentional-ios that referenced this issue Jan 9, 2024
"lib" is needed for windows.

With this change, you can build whisper.cpp with OpenBLAS's prebuilt DLL.
1. extract a zip from https://github.com/xianyi/OpenBLAS/releases
2. copy the headers in (openblas)/include to the root directory of whisper.cpp
3. invoke cmake with -DCMAKE_LIBRARY_PATH=(openblas)\lib -DWHISPER_SUPPORT_OPENBLAS=ON
4. copy (openblas)/bin/libopenblas.dll to the same directory of whisper.dll after msbuild

ggerganov/whisper.cpp#89 (comment)
@petterreinholdtsen
Contributor

Here is the result for an NVIDIA GeForce GT 755M on Debian GNU/Linux 12 Bookworm, using GCC 12.2.0 and built with -DWHISPER_CLBLAST=ON:

whisper_init_from_file_with_params_no_state: loading model from '../nb-large-ggml-model.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab       = 51866
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 1280
whisper_model_load: n_audio_head  = 20
whisper_model_load: n_audio_layer = 32
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 1280
whisper_model_load: n_text_head   = 20
whisper_model_load: n_text_layer  = 32
whisper_model_load: n_mels        = 128
whisper_model_load: ftype         = 1
whisper_model_load: qntvr         = 0
whisper_model_load: type          = 5 (large v3)
whisper_model_load: adding 1609 extra tokens
whisper_model_load: n_langs       = 100
ggml_opencl: selecting platform: 'NVIDIA CUDA'
ggml_opencl: selecting device: 'NVIDIA GeForce GT 755M'
ggml_opencl: device FP16 support: false
whisper_model_load:      CPU buffer size =  3094.86 MB
whisper_model_load: model size    = 3094.36 MB
whisper_init_state: kv self size  =  220.20 MB
whisper_init_state: kv cross size =  245.76 MB
whisper_init_state: compute buffer (conv)   =   32.42 MB
whisper_init_state: compute buffer (encode) =  212.42 MB
whisper_init_state: compute buffer (cross)  =    9.38 MB
whisper_init_state: compute buffer (decode) =   99.24 MB

system_info: n_threads = 4 / 4 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | METAL = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | CUDA = 0 | COREML = 0 | OPENVINO = 0 | 

whisper_print_timings:     load time =   712.98 ms
whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:      mel time =     0.00 ms
whisper_print_timings:   sample time =     0.00 ms /     1 runs (    0.00 ms per run)
whisper_print_timings:   encode time = 29405.07 ms /     1 runs (29405.07 ms per run)
whisper_print_timings:   decode time = 25138.65 ms /   256 runs (   98.20 ms per run)
whisper_print_timings:   batchd time = 15522.25 ms /   320 runs (   48.51 ms per run)
whisper_print_timings:   prompt time = 120379.20 ms /  4096 runs (   29.39 ms per run)
whisper_print_timings:    total time = 190447.95 ms

@zhouwg
Contributor

zhouwg commented Mar 6, 2024

Benchmark result with an 11th Gen Intel Core(TM) i7-11700F @ 2.50GHz + Ubuntu 20.04 + gcc version 9.4.0:


| CPU | OS | Config | Model | Th | Load [ms] | Encode [ms] |
| --- | --- | --- | --- | --- | ---: | ---: |
| i7-11700F | Ubuntu 20.04 |  | tiny.en | 4 | 46.72 | 4654.39 |
| i7-11700F | Ubuntu 20.04 |  | tiny.en | 8 | 49.85 | 2981.43 |
| i7-11700F | Ubuntu 20.04 |  | small.en | 4 | 175.02 | 51381.51 |
| i7-11700F | Ubuntu 20.04 |  | small.en | 8 | 161.98 | 29662.80 |

./bench  -m ./models/ggml-small.en.bin -t 8 -w 2
  64 x   64: Q4_0     4.3 GFLOPS (128 runs) | Q4_1     4.4 GFLOPS (128 runs)
  64 x   64: Q5_0     4.0 GFLOPS (128 runs) | Q5_1     3.5 GFLOPS (128 runs) | Q8_0     4.7 GFLOPS (128 runs)
  64 x   64: F16      4.2 GFLOPS (128 runs) | F32      2.1 GFLOPS (128 runs)
 128 x  128: Q4_0    15.0 GFLOPS (128 runs) | Q4_1    15.3 GFLOPS (128 runs)
 128 x  128: Q5_0    11.9 GFLOPS (128 runs) | Q5_1    12.3 GFLOPS (128 runs) | Q8_0    21.0 GFLOPS (128 runs)
 128 x  128: F16     11.1 GFLOPS (128 runs) | F32      8.7 GFLOPS (128 runs)
 256 x  256: Q4_0    25.4 GFLOPS (128 runs) | Q4_1    29.1 GFLOPS (128 runs)
 256 x  256: Q5_0    17.4 GFLOPS (128 runs) | Q5_1    18.7 GFLOPS (128 runs) | Q8_0    49.1 GFLOPS (128 runs)
 256 x  256: F16     13.8 GFLOPS (128 runs) | F32     10.4 GFLOPS (128 runs)
 512 x  512: Q4_0    31.1 GFLOPS (116 runs) | Q4_1    33.0 GFLOPS (124 runs)
 512 x  512: Q5_0    17.1 GFLOPS ( 64 runs) | Q5_1    20.5 GFLOPS ( 77 runs) | Q8_0    66.3 GFLOPS (128 runs)
 512 x  512: F16     14.0 GFLOPS ( 53 runs) | F32      9.3 GFLOPS ( 35 runs)
1024 x 1024: Q4_0    31.9 GFLOPS ( 16 runs) | Q4_1    31.0 GFLOPS ( 15 runs)
1024 x 1024: Q5_0    20.0 GFLOPS ( 10 runs) | Q5_1    22.9 GFLOPS ( 11 runs) | Q8_0    80.1 GFLOPS ( 38 runs)
1024 x 1024: F16     14.6 GFLOPS (  7 runs) | F32      8.9 GFLOPS (  5 runs)
2048 x 2048: Q4_0    35.9 GFLOPS (  3 runs) | Q4_1    40.1 GFLOPS (  3 runs)
2048 x 2048: Q5_0    21.2 GFLOPS (  3 runs) | Q5_1    23.6 GFLOPS (  3 runs) | Q8_0    88.0 GFLOPS (  6 runs)
2048 x 2048: F16     14.4 GFLOPS (  3 runs) | F32      8.6 GFLOPS (  3 runs)
4096 x 4096: Q4_0    35.4 GFLOPS (  3 runs) | Q4_1    39.2 GFLOPS (  3 runs)
4096 x 4096: Q5_0    20.0 GFLOPS (  3 runs) | Q5_1    21.2 GFLOPS (  3 runs) | Q8_0    85.0 GFLOPS (  3 runs)
4096 x 4096: F16     13.5 GFLOPS (  3 runs) | F32      8.2 GFLOPS (  3 runs)

./bench  -m ./models/ggml-small.en.bin -t 8 -w 1
memcpy:    9.43 GB/s (heat-up)
memcpy:    9.31 GB/s ( 1 thread)
memcpy:    9.15 GB/s ( 1 thread)
memcpy:    8.74 GB/s ( 2 thread)
memcpy:    8.67 GB/s ( 3 thread)
memcpy:    8.43 GB/s ( 4 thread)
memcpy:    8.42 GB/s ( 5 thread)
memcpy:    8.70 GB/s ( 6 thread)
memcpy:    8.63 GB/s ( 7 thread)
memcpy:    8.32 GB/s ( 8 thread)
sum:    -5119997019.000000
 ./bench-all.sh 
Usage: ./bench.sh [n_threads] [encoder-only]

Running memcpy benchmark

memcpy:    9.38 GB/s (heat-up)
memcpy:    9.39 GB/s ( 1 thread)
memcpy:    9.39 GB/s ( 1 thread)
memcpy:    9.12 GB/s ( 2 thread)
memcpy:    9.05 GB/s ( 3 thread)
memcpy:    8.68 GB/s ( 4 thread)
sum:    -3071998678.000000


Running ggml_mul_mat benchmark with 4 threads

  64 x   64: Q4_0     7.3 GFLOPS (128 runs) | Q4_1     7.8 GFLOPS (128 runs)
  64 x   64: Q5_0     6.3 GFLOPS (128 runs) | Q5_1     6.5 GFLOPS (128 runs) | Q8_0     9.4 GFLOPS (128 runs)
  64 x   64: F16      6.2 GFLOPS (128 runs) | F32      2.4 GFLOPS (128 runs)
 128 x  128: Q4_0    15.4 GFLOPS (128 runs) | Q4_1    16.6 GFLOPS (128 runs)
 128 x  128: Q5_0    10.6 GFLOPS (128 runs) | Q5_1    11.5 GFLOPS (128 runs) | Q8_0    25.9 GFLOPS (128 runs)
 128 x  128: F16      9.0 GFLOPS (128 runs) | F32      5.8 GFLOPS (128 runs)
 256 x  256: Q4_0    19.9 GFLOPS (128 runs) | Q4_1    22.8 GFLOPS (128 runs)
 256 x  256: Q5_0    12.8 GFLOPS (128 runs) | Q5_1    13.9 GFLOPS (128 runs) | Q8_0    44.2 GFLOPS (128 runs)
 256 x  256: F16      9.4 GFLOPS (128 runs) | F32      7.6 GFLOPS (128 runs)
 512 x  512: Q4_0    21.7 GFLOPS ( 81 runs) | Q4_1    23.0 GFLOPS ( 86 runs)
 512 x  512: Q5_0    12.9 GFLOPS ( 48 runs) | Q5_1    13.9 GFLOPS ( 52 runs) | Q8_0    48.6 GFLOPS (128 runs)
 512 x  512: F16      8.9 GFLOPS ( 34 runs) | F32      6.8 GFLOPS ( 26 runs)
1024 x 1024: Q4_0    22.1 GFLOPS ( 11 runs) | Q4_1    24.9 GFLOPS ( 12 runs)
1024 x 1024: Q5_0    13.1 GFLOPS (  7 runs) | Q5_1    14.0 GFLOPS (  7 runs) | Q8_0    53.4 GFLOPS ( 25 runs)
1024 x 1024: F16      8.8 GFLOPS (  5 runs) | F32      6.5 GFLOPS (  4 runs)
2048 x 2048: Q4_0    22.6 GFLOPS (  3 runs) | Q4_1    25.7 GFLOPS (  3 runs)
2048 x 2048: Q5_0    13.1 GFLOPS (  3 runs) | Q5_1    14.7 GFLOPS (  3 runs) | Q8_0    57.1 GFLOPS (  4 runs)
2048 x 2048: F16      8.7 GFLOPS (  3 runs) | F32      6.3 GFLOPS (  3 runs)
4096 x 4096: Q4_0    21.5 GFLOPS (  3 runs) | Q4_1    23.4 GFLOPS (  3 runs)
4096 x 4096: Q5_0    12.2 GFLOPS (  3 runs) | Q5_1    13.5 GFLOPS (  3 runs) | Q8_0    53.9 GFLOPS (  3 runs)
4096 x 4096: F16      8.0 GFLOPS (  3 runs) | F32      5.7 GFLOPS (  3 runs)

Running benchmark for all models
This can take a while!
| CPU | OS | Config | Model | Th | Enc. | Dec. | Bch5 | PP | Commit |
| --- | --- | --- | --- | --- | ---: | ---: | ---: | ---: | --- |
| i7-11700F | Ubuntu 20.04 |  | base | 4 | ms | 15.82 | 15.05 | 15.71 | 31989a5a |

There is an impressive benchmark result (compared to the above result on a PC purchased for RMB 12000, about USD 1700, a few years ago) with the Xiaomi 14's powerful mobile SoC: Qualcomm SM8650-AB Snapdragon 8 Gen 3 (4 nm) + Xiaomi's HyperOS (derived from Android 14) + Android NDK r21e:

(benchmark screenshot attached)

Updated on 03-20-2024: Xiaomi 14 + Android NDK r26c (NDK r26c is required for a special build optimization: https://github.com/cdeos/kantv/blob/master/external/whispercpp/CMakeLists.txt#L60)

(benchmark screenshots attached)

@obeone

obeone commented Apr 24, 2024

| CPU | OS | Config | Model | Th | Enc. | Dec. | Bch5 | PP | Commit |
| --- | --- | --- | --- | --- | ---: | ---: | ---: | ---: | --- |
| MacBook M3 Pro | Sonoma 14.5 | NEON BLAS METAL | tiny | 4 | 34.15 | 1.45 | 0.47 | 0.03 | 858452d |
| MacBook M3 Pro | Sonoma 14.5 | NEON BLAS METAL | base | 4 | 59.32 | 2.27 | 0.79 | 0.05 | 858452d |
| MacBook M3 Pro | Sonoma 14.5 | NEON BLAS METAL | small | 4 | 200.45 | 5.50 | 1.75 | 0.15 | 858452d |
| MacBook M3 Pro | Sonoma 14.5 | NEON BLAS METAL | medium | 4 | 534.54 | 12.88 | 3.90 | 0.37 | 858452d |
| MacBook M3 Pro | Sonoma 14.5 | NEON BLAS METAL | large-v1 | 4 | 989.45 | 22.29 | 6.58 | 0.64 | 858452d |
| MacBook M3 Pro | Sonoma 14.5 | NEON BLAS METAL | large-v2 | 4 | 962.34 | 22.38 | 6.61 | 0.64 | 858452d |
| MacBook M3 Pro | Sonoma 14.5 | NEON BLAS METAL | large-v3 | 4 | 969.27 | 22.23 | 6.59 | 0.64 | 858452d |

@nanocosmos-ol

Different results for different code commits: the older version is much faster!

CPU: AMD Ryzen 9 7950X3D 16-Core

  • commit 858452d Date: Wed Apr 24 14:56:30 2024 +0300

system_info: n_threads = 4 / 32 | AVX = 1 | AVX2 = 1 | AVX512 = 1 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | METAL = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | CUDA = 0 | COREML = 0 | OPENVINO = 0

whisper_print_timings: load time = 64.61 ms
whisper_print_timings: fallbacks = 0 p / 0 h
whisper_print_timings: mel time = 0.00 ms
whisper_print_timings: sample time = 0.00 ms / 1 runs ( 0.00 ms per run)
whisper_print_timings: encode time = 878.59 ms / 1 runs ( 878.59 ms per run)
whisper_print_timings: decode time = 935.20 ms / 256 runs ( 3.65 ms per run)
whisper_print_timings: batchd time = 544.69 ms / 320 runs ( 1.70 ms per run)
whisper_print_timings: prompt time = 3865.51 ms / 4096 runs ( 0.94 ms per run)
whisper_print_timings: total time = 6225.76 ms

  • commit d03c60d Date: Wed Nov 8 04:53:31 2023 +0700

system_info: n_threads = 4 / 32 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | METAL = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | COREML = 0 | OPENVINO = 0 |

whisper_print_timings: load time = 83.24 ms
whisper_print_timings: fallbacks = 0 p / 0 h
whisper_print_timings: mel time = 0.00 ms
whisper_print_timings: sample time = 0.00 ms / 1 runs ( 0.00 ms per run)
whisper_print_timings: encode time = 693.48 ms / 1 runs ( 693.48 ms per run)
whisper_print_timings: decode time = 874.80 ms / 256 runs ( 3.42 ms per run)
whisper_print_timings: prompt time = 2249.08 ms / 16 runs ( 140.57 ms per run)
whisper_print_timings: total time = 3817.54 ms
