
RTX 3090 insane low speed #11

Open
davizca opened this issue Dec 6, 2023 · 13 comments


davizca commented Dec 6, 2023

Hi.

I'm using an RTX 3090 GPU with 24 GB of VRAM, and I think something is wrong.

[two screenshots attached]

Theoretically it should take about 3 minutes, but it doesn't.

Also posted on Reddit.

Cheers!


RuoyiDu commented Dec 6, 2023

Hi @davizca, please try setting `view_batch_size` to 16. It should work on a 3090 and will make inference faster.
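Something like this, as a minimal sketch (assuming the repo's `pipeline_demofusion_sdxl` module; the exact class and argument names may differ from your checkout):

```python
# Minimal sketch, not the exact repo invocation -- class and argument
# names are assumptions based on typical DemoFusion usage.
import torch
from pipeline_demofusion_sdxl import DemoFusionSDXLStableDiffusionPipeline

pipe = DemoFusionSDXLStableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
).to("cuda")

images = pipe(
    prompt="a photo of an astronaut riding a horse",
    height=2048,
    width=2048,
    view_batch_size=16,  # batch more views per UNet call to keep the GPU busy
)
```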


davizca commented Dec 6, 2023

Hi @RuoyiDu, thanks for the answer.

I set `view_batch_size` to 16, and for a 2048x2048 image, Phase 2 decoding is taking 12+ minutes (and the estimate keeps slowly climbing, so I guess it's the same as before). 1024x1024 runs super fast, though.

Settings and screenshots attached:

[two screenshots attached]

Cheers.


RuoyiDu commented Dec 6, 2023

Hi @davizca, this is very strange. Are you running on a laptop with an RTX 3090? The GPU's power limit also affects inference time -- I'm using an RTX 3090 in a local server with a 350W power limit. You can check the power draw with `nvidia-smi`.
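For example (a small helper sketch; the `nvidia-smi` query flags are standard, but treat the printed values below as illustrative, not real readings):

```python
# Print current GPU power draw vs. the board power limit while inference runs.
# Equivalent to: nvidia-smi --query-gpu=power.draw,power.limit --format=csv
import subprocess

result = subprocess.run(
    ["nvidia-smi", "--query-gpu=power.draw,power.limit", "--format=csv"],
    capture_output=True, text=True, check=True,
)
print(result.stdout)
# e.g. -> power.draw [W], power.limit [W]
#         347.12 W, 350.00 W   (illustrative values only)
```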


davizca commented Dec 6, 2023

Hi @RuoyiDu
No, I'm using a desktop PC with Windows and an RTX 3090.

nvidia-smi reports around 190W on average, and the board power draw shows the same. VRAM peaks at 23.6 GB while inferencing a 2048x2048 image. The part that takes forever is Phase 2 decoding (the earlier phases are fast). I don't know if this is down to some dependency, but it would be awesome if other users with an RTX 3090 could test it. I've never seen this pipeline hold a constant 350W of board power draw.



RuoyiDu commented Dec 7, 2023

Hi @davizca, on my server, it takes about 80s under full load.
[screenshot attached]

I'll try to optimise the speed of decoding. But it looks like there's some other reason it's especially slow on your end. Let's see if anyone else in the community is experiencing similar issues.


siraxe commented Dec 7, 2023

3090 on a desktop PC.
During Phase 2 decoding at 2K resolution, it spills work into shared GPU memory and slows down to an unusable point.
[two screenshots attached: terminal output and Task Manager]


RuoyiDu commented Dec 7, 2023

Hi @siraxe @davizca. Can you try generating at 2048x2048 with `multi_decoder=False`? For generating 2048x2048 images on a 3090, we don't need the tiled decoder. Then we can see whether the problem is with the tiled decoder.
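For example, something like this (a rough sketch reusing the pipeline call from above; argument names are assumptions):

```python
# Hypothetical call with the tiled decoder disabled: at 2048x2048 the full
# latent should fit in a 3090's 24 GB, so the VAE can decode it in one pass.
images = pipe(
    prompt="a photo of an astronaut riding a horse",
    height=2048,
    width=2048,
    view_batch_size=16,
    multi_decoder=False,  # decode the whole latent at once, no tiling
)
```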


siraxe commented Dec 8, 2023

> Hi @siraxe @davizca. Can you try generating at 2048x2048 with `multi_decoder=False`? For generating 2048x2048 images on a 3090, we don't need the tiled decoder. Then we can see whether the problem is with the tiled decoder.

[two screenshots attached: PyCharm and the generated image]
Okay, that helped: about 328 seconds for 50 steps 👍


RuoyiDu commented Dec 8, 2023

Thanks @siraxe! But that's still much slower than on my machine... It seems the decoder is quite slow on your PC, which makes it ridiculously slow when using the tiled decoder. I'll try to figure out the reason -- though it may be a little hard for me, since I can't reproduce this issue on my end.

BTW, I like your generation! Hope you can enjoy it!

@Yggdrasil-Engineering

I was also seeing super slow times on my 4090. Setting `multi_decoder=False` dramatically improved the speed! It's amazing what the parameters being piped in can do to generation times.

With a low batch size of 4 and multi-decoding set to True, I was seeing hour-long generation times. I'm down to 6 minutes now that I've fixed those settings! Hope this information is helpful.


davizca commented Dec 8, 2023

Hi. Thanks, everyone, for looking into this. I'm not at home right now, but I'll try the fix on Monday. The difference in inference times between @RuoyiDu and the others is weird... we'll see what's happening here ;)


RuoyiDu commented Dec 10, 2023

Hi guys @davizca @siraxe @Yggdrasil-Engineering, I found a small mistake at line #607:

```python
pad_size = self.unet.config.sample_size // 4 * 3
```

should be

```python
pad_size = self.unet.config.sample_size // 8 * 3
```

This should bring the VRAM cost in line with the paper (about 17GB) and also make decoding faster when `multi_decoder=True`.
But this bug doesn't affect the result with `multi_decoder=False`, so there might be other reasons, like GPU power (I'm using a 350W RTX 3090 rather than a 280W one).
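As a quick illustration of what the fix changes (assuming SDXL's default `unet.config.sample_size` of 128, i.e. a 1024-pixel base resolution over an 8x VAE downscale factor):

```python
# Illustrative arithmetic only -- sample_size=128 is SDXL's default.
sample_size = 128

pad_old = sample_size // 4 * 3  # 96 latent pixels of padding per tile
pad_new = sample_size // 8 * 3  # 48 latent pixels of padding per tile

print(pad_old, pad_new)  # 96 48 -- the fix halves the per-tile padding,
                         # cutting redundant decoding work and VRAM use
```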


davizca commented Dec 11, 2023

@RuoyiDu
With `multi_decoder=True` (normal settings, 2048x2048):

```
### Phase 1 Denoising ###
100%|██████████████████████████████████████████████████████████████████████████████████| 50/50 [00:39<00:00, 1.27it/s]
Loading pipeline components...: 100%|████████████████████████████████████████████████████| 7/7 [00:22<00:00, 3.21s/it]

### Phase 1 Denoising ###
100%|██████████████████████████████████████████████████████████████████████████████████| 50/50 [00:13<00:00, 3.58it/s]

### Phase 2 Denoising ###
100%|██████████████████████████████████████████████████████████████████████████████████| 50/50 [02:56<00:00, 3.42s/it]

### Phase 2 Decoding ###
100%|██████████████████████████████████████████████████████████████████████████████████| 64/64 [00:23<00:00, 2.70it/s]
100%|██████████████████████████████████████████████████████████████████████████████████| 50/50 [03:24<00:00, 4.09s/it]
```

With `multi_decoder=False` (same settings):
About 3:30, more or less the same. (I'll add a screenshot later.)
