CPU Performance Regression? (Older version much faster) #2099

Open · nanocosmos-ol opened this issue Apr 28, 2024 · 13 comments
@nanocosmos-ol commented Apr 28, 2024

I compared an older version from Nov 2023 with the current one from Apr 2024, and the older version is much faster.

total time = 6225.76 ms (new)
vs
total time = 3817.54 ms (old)

Same CPU, same compiler and settings, same test:

  • git clone whisper.cpp
  • git reset --hard $COMMIT (with the commits below)
  • make -j
  • bash ./models/download-ggml-model.sh base.en
  • ./bench -w 0

CPU: AMD Ryzen 9 7950X3D 16-Core

  • commit 858452d Date: Wed Apr 24 14:56:30 2024 +0300

whisper_init_from_file_with_params_no_state: loading model from 'models/ggml-base.en.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab = 51864
whisper_model_load: n_audio_ctx = 1500
whisper_model_load: n_audio_state = 512
whisper_model_load: n_audio_head = 8
whisper_model_load: n_audio_layer = 6
whisper_model_load: n_text_ctx = 448
whisper_model_load: n_text_state = 512
whisper_model_load: n_text_head = 8
whisper_model_load: n_text_layer = 6
whisper_model_load: n_mels = 80
whisper_model_load: ftype = 1
whisper_model_load: qntvr = 0
whisper_model_load: type = 2 (base)
whisper_model_load: adding 1607 extra tokens
whisper_model_load: n_langs = 99
whisper_model_load: CPU total size = 147.37 MB
whisper_model_load: model size = 147.37 MB
whisper_init_state: kv self size = 16.52 MB
whisper_init_state: kv cross size = 18.43 MB
whisper_init_state: compute buffer (conv) = 16.39 MB
whisper_init_state: compute buffer (encode) = 132.07 MB
whisper_init_state: compute buffer (cross) = 4.78 MB
whisper_init_state: compute buffer (decode) = 96.48 MB

system_info: n_threads = 4 / 32 | AVX = 1 | AVX2 = 1 | AVX512 = 1 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | METAL = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | CUDA = 0 | COREML = 0 | OPENVINO = 0

whisper_print_timings: load time = 64.61 ms
whisper_print_timings: fallbacks = 0 p / 0 h
whisper_print_timings: mel time = 0.00 ms
whisper_print_timings: sample time = 0.00 ms / 1 runs ( 0.00 ms per run)
whisper_print_timings: encode time = 878.59 ms / 1 runs ( 878.59 ms per run)
whisper_print_timings: decode time = 935.20 ms / 256 runs ( 3.65 ms per run)
whisper_print_timings: batchd time = 544.69 ms / 320 runs ( 1.70 ms per run)
whisper_print_timings: prompt time = 3865.51 ms / 4096 runs ( 0.94 ms per run)
whisper_print_timings: total time = 6225.76 ms

  • commit d03c60d Date: Wed Nov 8 04:53:31 2023 +0700

whisper_init_from_file_with_params_no_state: loading model from 'models/ggml-base.en.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab = 51864
whisper_model_load: n_audio_ctx = 1500
whisper_model_load: n_audio_state = 512
whisper_model_load: n_audio_head = 8
whisper_model_load: n_audio_layer = 6
whisper_model_load: n_text_ctx = 448
whisper_model_load: n_text_state = 512
whisper_model_load: n_text_head = 8
whisper_model_load: n_text_layer = 6
whisper_model_load: n_mels = 80
whisper_model_load: ftype = 1
whisper_model_load: qntvr = 0
whisper_model_load: type = 2 (base)
whisper_model_load: adding 1607 extra tokens
whisper_model_load: n_langs = 99
whisper_model_load: model ctx = 140.66 MB
whisper_model_load: model size = 140.54 MB
whisper_init_state: kv self size = 5.25 MB
whisper_init_state: kv cross size = 17.58 MB
whisper_init_state: compute buffer (conv) = 18.50 MB
whisper_init_state: compute buffer (encode) = 81.95 MB
whisper_init_state: compute buffer (cross) = 4.49 MB
whisper_init_state: compute buffer (decode) = 24.70 MB

system_info: n_threads = 4 / 32 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | METAL = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | COREML = 0 | OPENVINO = 0 |

whisper_print_timings: load time = 83.24 ms
whisper_print_timings: fallbacks = 0 p / 0 h
whisper_print_timings: mel time = 0.00 ms
whisper_print_timings: sample time = 0.00 ms / 1 runs ( 0.00 ms per run)
whisper_print_timings: encode time = 693.48 ms / 1 runs ( 693.48 ms per run)
whisper_print_timings: decode time = 874.80 ms / 256 runs ( 3.42 ms per run)
whisper_print_timings: prompt time = 2249.08 ms / 16 runs ( 140.57 ms per run)
whisper_print_timings: total time = 3817.54 ms

See #89 (comment)

@przemoc (Contributor) commented Apr 28, 2024

Thank you for the report.

Can you share your current OS and compiler?
Were they the same for the older commit? EDIT: Sorry, I missed that you confirmed it's the same compiler.

Could you try running make with AVX512F_M= AVX512VNNI_M= AVX512VBMI_M= so that AVX-512 is not used?

That could make your new run a bit more comparable to the old one.
(I don't know whether slow AVX-512 is the issue here, but it may be worth trying.)
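For reference, the full sequence would look something like this (a sketch; the empty overrides disable the Makefile's AVX-512 autodetection):

  make clean
  make -j AVX512F_M= AVX512VNNI_M= AVX512VBMI_M=
  ./bench -w 0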

@nanocosmos-ol (Author)

It is Ubuntu 22.04.4. Everything runs on the same machine in different folders, freshly compiled.

Without AVX-512 it is indeed a bit better, but still not on par with the old version; it lands somewhere in the middle:

total time = 5086.27 ms

whisper_init_from_file_with_params_no_state: loading model from 'models/ggml-base.en.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab = 51864
whisper_model_load: n_audio_ctx = 1500
whisper_model_load: n_audio_state = 512
whisper_model_load: n_audio_head = 8
whisper_model_load: n_audio_layer = 6
whisper_model_load: n_text_ctx = 448
whisper_model_load: n_text_state = 512
whisper_model_load: n_text_head = 8
whisper_model_load: n_text_layer = 6
whisper_model_load: n_mels = 80
whisper_model_load: ftype = 1
whisper_model_load: qntvr = 0
whisper_model_load: type = 2 (base)
whisper_model_load: adding 1607 extra tokens
whisper_model_load: n_langs = 99
whisper_model_load: CPU total size = 147.37 MB
whisper_model_load: model size = 147.37 MB
whisper_init_state: kv self size = 16.52 MB
whisper_init_state: kv cross size = 18.43 MB
whisper_init_state: compute buffer (conv) = 16.39 MB
whisper_init_state: compute buffer (encode) = 132.07 MB
whisper_init_state: compute buffer (cross) = 4.78 MB
whisper_init_state: compute buffer (decode) = 96.48 MB

system_info: n_threads = 4 / 32 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | METAL = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | CUDA = 0 | COREML = 0 | OPENVINO = 0

whisper_print_timings: load time = 56.47 ms
whisper_print_timings: fallbacks = 0 p / 0 h
whisper_print_timings: mel time = 0.00 ms
whisper_print_timings: sample time = 0.00 ms / 1 runs ( 0.00 ms per run)
whisper_print_timings: encode time = 852.86 ms / 1 runs ( 852.86 ms per run)
whisper_print_timings: decode time = 622.78 ms / 256 runs ( 2.43 ms per run)
whisper_print_timings: batchd time = 323.46 ms / 320 runs ( 1.01 ms per run)
whisper_print_timings: prompt time = 3286.05 ms / 4096 runs ( 0.80 ms per run)
whisper_print_timings: total time = 5086.27 ms

@przemoc (Contributor) commented Apr 28, 2024

You may also want to try --beam-size 2, as that seems to have been the default in the older commit. It was changed in b6c5f49. As Georgi commented in some other issue:

> The quality with more beams in general should be better, but it's possible that you don't observe much of a difference
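For example, with the stock main example and one of the bundled samples (a sketch):

  ./main -m models/ggml-base.en.bin -f samples/jfk.wav --beam-size 2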

@nanocosmos-ol (Author) commented Apr 28, 2024

It looks like the former default was beam-size = -1?
That value switches the strategy between WHISPER_SAMPLING_BEAM_SEARCH and WHISPER_SAMPLING_GREEDY.

bench doesn't support --beam-size, so I tried a real WAV file instead, and it does improve the speed
(still not as fast as the old version, but closer).

Default branch (new):

  AVX512=1, beam-size 5 (default): total time = 20678.01 ms
  AVX512=1, beam-size 2:           total time = 17052.18 ms
  AVX512=1, beam-size -1:          total time = 15465.98 ms
  AVX512=0, beam-size 5 (default): total time = 19365.01 ms
  AVX512=0, beam-size 2:           total time = 15219.21 ms
  AVX512=0, beam-size -1:          total time = 13869.20 ms

Old version:

  AVX512=0, beam-size 5:            total time = 21862.52 ms
  AVX512=0, beam-size 2:            total time = 14704.33 ms
  AVX512=0, beam-size -1 (default): total time = 12398.81 ms

Interesting results, especially the AVX issue. We'll play around with it a bit.

Thanks for your help!

(Note: the beam-size default seems to have changed from -1 to 2 to 5 over time: https://github.com/ggerganov/whisper.cpp/blob/master/whisper.cpp#L4625 )

@przemoc (Contributor) commented Apr 28, 2024

> It looks like the former default was beam-size = -1?

I was referring to changes in whisper_full_default_params, where beam_search.beam_size changed from 2 to 5, but you're right that whisper_params.beam_size previously did not use whisper_full_default_params() and was set to -1.

So you may want to try --beam-size 1 too, I guess.
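In API terms, restoring the old beam width would look roughly like this (a sketch built from the parameter names above, not a drop-in patch):

  // request beam search explicitly and restore the pre-1.5 beam width
  whisper_full_params wparams = whisper_full_default_params(WHISPER_SAMPLING_BEAM_SEARCH);
  wparams.beam_search.beam_size = 2;   // current default is 5
  // ... then pass wparams to whisper_full(ctx, wparams, pcm, n_samples)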

For AVX-512 on Ryzen, let me mention:
Zen4's AVX512 Teardown

Ubuntu 22.04 ships a relatively old compiler; results from a more recent one could be different.


I'm wondering whether WHISPER_NO_AVX512 should be introduced in the Makefile to make it easier to disable AVX-512 (setting 3 variables is relatively cumbersome). Maybe we should even set WHISPER_NO_AVX512 to 1 by default, but we would need a bigger sample to decide whether more folks are harmed performance-wise by having AVX-512 enabled than by having it disabled. The autodetection done in the Makefile assumes that adding more ISA extensions allows the compiler to do a better job (produce more efficient code), but that may not always be the case, as we can see in this issue.
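A hypothetical sketch of such a switch (names and detection details are illustrative only, not the actual Makefile):

  # hypothetical: skip AVX-512 flags entirely when WHISPER_NO_AVX512 is set
  ifndef WHISPER_NO_AVX512
      AVX512F_M := $(shell grep -m1 avx512f /proc/cpuinfo)
      ifneq (,$(AVX512F_M))
          CFLAGS   += -mavx512f
          CXXFLAGS += -mavx512f
      endif
  endif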

@Linux13524

We are experiencing similar behavior when comparing version 1.4.3 with the latest 1.5.5.
But since we are using CMake for the build, I guess it cannot be related to AVX-512, because WHISPER_NO_AVX512 is set by default, right?

Also, we are not using beam search (we set whisper_full_default_params(WHISPER_SAMPLING_GREEDY)), so this should not affect the performance either, right?

It seems greedy.best_of also changed from 2 to 5, but when I change it back the performance does not change much, so I guess this is also unrelated.
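For reference, the change I tested looks roughly like this (a sketch; field names as referenced above):

  // greedy sampling, restoring the pre-1.5 best_of value
  whisper_full_params wparams = whisper_full_default_params(WHISPER_SAMPLING_GREEDY);
  wparams.greedy.best_of = 2;   // the default changed from 2 to 5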

@przemoc (Contributor) commented Apr 30, 2024

Could you do a git bisect between good (v1.4.3) and bad (v1.5.5) to try to locate the main commit responsible for the performance drop in your environment?
Fewer than 10 steps (experiments) should suffice.
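Roughly (a sketch; rebuild and re-run your benchmark at each step):

  git bisect start
  git bisect bad v1.5.5
  git bisect good v1.4.3
  # at each step: make clean && make -j, run the benchmark,
  # then mark the commit with `git bisect good` or `git bisect bad`
  git bisect reset   # when done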

@Linux13524

Ok, so my git bisect gives me the following result:

3e5c7feeffb86555d63ef592f79ce8365a069174 is the first bad commit
commit 3e5c7feeffb86555d63ef592f79ce8365a069174
Author: Evan Jones <evan.q.jones@gmail.com>
Date:   Mon Nov 13 03:51:34 2023 -0500

    whisper : add grammar-based sampling (#1229)
    
    * whisper : add grammar-based sampling
    
    * build : fix after master merge
    
    * command : fix exception when recognizing the command
    
    * whisper : fine-tuning grammar functionality
    
    * command : grammar-related improvements
    
    - option to read grammar from file
    - add sample grammars for colors and chess moves
    - fine-tune the performance further
    
    * grammars : add assistant + update comments
    
    * command : enable beam-search, add "no_timestamps", add "context", add p
    
    * whisper : remove comment
    
    ---------
    
    Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

Any idea how this commit could influence the performance so badly?

BTW: The performance drop on our side is about 40%.

@Linux13524 commented May 2, 2024

I dropped this commit to test whether it would fix the performance problem, but it didn't.
So I did another git bisect, and the next bad commit is "whisper : add batched decoding (#1486)" (b6c5f49), but I don't think I can easily drop this one..

Is there anything else I can try based on this information?

@przemoc (Contributor) commented May 6, 2024

@ggerganov, do you have any ideas what else @Linux13524 could try or tweak in pursuit of restoring whisper.cpp performance from 1.4.x in 1.5.x?

@ggerganov (Owner)

Hm, not sure. @Linux13524 Is this CPU-only, or using the CUDA / Metal backend?

@Linux13524

We first noticed it while testing the new CUDA performance, but I did the git bisects above CPU-only.
The CUDA performance drop could also be due to something else, though. I cannot test this easily due to the lack of an NVIDIA GPU in my notebook.

@przemoc (Contributor) commented May 7, 2024

Just a side comment and follow-up to my earlier comment:

> I'm wondering whether WHISPER_NO_AVX512 should be introduced in the Makefile to make it easier to disable AVX-512 (setting 3 variables is relatively cumbersome).

I made:
