Add multi-gpu support #5997
base: main
Conversation
Lincoln, please stop tempting me to buy another RTX 4090.
Should be waiting on the resume event instead of checking it in a loop
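A sketch of the suggested change, assuming the resume flag is a `threading.Event` (the names here are illustrative, not the PR's actual code):

```python
import threading
import time

resume_event = threading.Event()  # hypothetical resume flag

def wait_polling() -> None:
    # current approach: wakes up repeatedly to re-check the flag
    while not resume_event.is_set():
        time.sleep(0.1)

def wait_blocking() -> None:
    # suggested approach: sleeps until set() is called, wakes immediately
    resume_event.wait()
```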
Prefer an early return/continue to reduce the indentation of the processor loop. Easier to read. There are other ways to improve its structure but at first glance, they seem to involve changing the logic in scarier ways.
I just noticed that the changes to the way VRAM loading is handled are consuming more memory than they should. I'm going to revert to the current method and work on this in a separate PR. (These changes are not related to the multi-GPU support.)
While the code changes are not huge, this is still a very substantial change without a way to strictly feature-flag the multi-GPU handling. Properly testing this will require carefully monitored testing on a staging environment. We don't have the capacity to do that, and I can't give a solid timeline for when we will.
@lstein You're my hero! Can you hide it behind a checkbox, a setting, or an env variable? Just to merge this feature and prevent @psychedelicious from worrying too much.
@makemefeelgr8 Sorry, but it's not that simple. This change needs to wait until we can allocate resources to do thorough testing.
Summary
This adds support for systems that have multiple GPUs. On CUDA systems, it will automatically detect when a system has more than one GPU and configure the model cache and the session processor to take advantage of them, keeping track of which GPUs are busy and which are available, and rendering batches of images in parallel. It works at the session processor level by placing each session into a thread-safe queue that is monitored by multiple threads. Each thread reserves a GPU at entry, processes the entire invocation, and then releases the GPU to be used by other pending requests.
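A minimal sketch of that reserve/process/release pattern (class and function names here are illustrative, not the PR's actual code):

```python
import queue
import threading

class GPUPool:
    """Tracks which GPUs are busy; acquire() blocks until one is free."""

    def __init__(self, device_ids: list[str]):
        self._free: queue.Queue[str] = queue.Queue()
        for device_id in device_ids:
            self._free.put(device_id)

    def acquire(self) -> str:
        return self._free.get()  # blocks while all GPUs are busy

    def release(self, device_id: str) -> None:
        self._free.put(device_id)

def process_sessions(sessions: queue.Queue, pool: GPUPool) -> None:
    """Worker loop: one of these runs per processing thread."""
    while True:
        session = sessions.get()
        if session is None:  # sentinel value shuts the worker down
            break
        device = pool.acquire()  # reserve a GPU at entry
        try:
            run_session(session, device)  # process the entire invocation on that GPU
        finally:
            pool.release(device)  # hand the GPU back to other pending requests

def run_session(session, device: str) -> None:
    ...  # stand-in for the real session execution

# e.g. two GPUs served by two worker threads:
pool = GPUPool(["cuda:0", "cuda:1"])
sessions: queue.Queue = queue.Queue()
for _ in range(2):
    threading.Thread(target=process_sessions, args=(sessions, pool), daemon=True).start()
```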
Demo
cinnamon-2024-04-16T152651-0400.webm
How it works
In addition to changes in the session processor, this PR adds a few calls to the model manager's RAM cache to reserve and release GPUs in a thread-safe way, and extends the TorchDevice class to support dynamic device selection without changing its API. The PR also improves how models are moved from RAM to VRAM to modestly increase load speed. During debugging, I discovered that `uuid.uuid4()` does not appear to be thread-safe on Windows platforms (https://stackoverflow.com/questions/2759644/python-multiprocessing-doesnt-play-nicely-with-uuid-uuid4), and this was borking the latent caching system. I worked around this by adding the current thread ID to the cache object's name.
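A sketch of that workaround (the cache-key format and function name are illustrative, not the PR's exact naming scheme):

```python
import threading
import uuid

def make_cache_name() -> str:
    # uuid4 alone proved unreliable across threads on Windows, so the
    # current thread ID is mixed into the cache object's name to
    # guarantee uniqueness.
    return f"{uuid.uuid4()}-{threading.get_ident()}"
```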
There are two new options for the config file:

- `max_threads` -- the maximum number of session processing threads that can run at the same time. If not defined, this is set to the number of GPU devices.
- `devices` -- a list of devices to use for acceleration. If not defined, this is calculated dynamically to use all CUDA GPUs found.

Example:
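(The original example did not survive; a plausible `invokeai.yaml` snippet based on the option descriptions above, whose exact syntax may differ from the PR:)

```yaml
max_threads: 2
devices:
  - cuda:0
  - cuda:1
```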
Note that there is no problem if `max_threads` does not match the number of GPU devices (even on single-GPU systems), but there won't be any benefit to defining more threads than GPUs.

The code is currently tested and working using multiple threads on a 6-GPU Windows machine.
To test
First, buy yourself two RTX 4090s :-).
Seriously, though, the best thing to do is to ensure that this doesn't crash single-GPU systems. Exercise the linear and graph workflows. Try different models, LoRAs, IP adapters, upscalers, etc. Run a couple of large batches and make sure that they can be paused, resumed, and cancelled as usual.
If you have access to a system that has an integrated GPU as well as a discrete one, you can test out the multi-GPU processing simply by queueing up a series of 2 or more generation jobs.
QA Instructions
Merge Plan
Squash merge when approved.
Checklist