Can it run on multi-GPU? #15

Thanks for your contributions. I would like to know whether the model can be deployed across multiple GPUs to make use of more VRAM.
@dvmazur @lavawolfiee Could you please address this question? I'd be happy to implement this myself if it's not already possible (which I don't think it is), if you could point me to where I'd need to make the changes.
Hi! Sorry for the late reply. Running the model on multiple GPUs is not currently supported: all active experts are sent to cuda:0. You can send an expert to a different GPU by simply specifying a different device when initializing it. Keep in mind that you would need to balance the number of active experts between your GPUs. This logic could be added to the expert cache.
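A minimal sketch of what that balancing could look like. It assumes a hypothetical `make_expert` factory rather than the repo's actual API, and only illustrates round-robin placement so each GPU ends up holding roughly the same number of experts:

```python
import torch

# Use every visible GPU; fall back to CPU if none are available.
devices = [torch.device(f"cuda:{i}") for i in range(torch.cuda.device_count())] \
    or [torch.device("cpu")]

def device_for_expert(layer_idx: int, expert_idx: int, num_experts: int = 8) -> torch.device:
    """Assign experts to devices round-robin so the load stays balanced."""
    return devices[(layer_idx * num_experts + expert_idx) % len(devices)]

# Hypothetical usage -- instead of hard-coding cuda:0, move each expert
# to its assigned device when it is materialized:
# expert = make_expert(layer_idx, expert_idx).to(device_for_expert(layer_idx, expert_idx))
```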
By the way, one of our quantization setups compressed the model to 17 GB. This would fit into the VRAM of two T4 GPUs, which you can get for free on Kaggle. Have you looked into running a quantized version (possibly ours) of the model using tensor_parallel?
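For reference, the tensor_parallel library advertises a one-line wrapper that shards a model's weights across devices. A rough sketch (the Mixtral checkpoint name is an assumption, and loading it this way needs enough CPU RAM to hold the full model first):

```python
import tensor_parallel as tp
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "mistralai/Mixtral-8x7B-Instruct-v0.1"  # assumed checkpoint
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

# Shard the model's weights across two GPUs.
model = tp.tensor_parallel(model, ["cuda:0", "cuda:1"])
```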
Hi @dvmazur! Thank you for your reply. Unfortunately (or fortunately) I have 8 GTX 1080 Ti GPUs in my machine, which individually cannot seem to handle the model even with quantization and offloading. Thank you for your suggestions, I'll have a look at the quantized version with tensor_parallel.
The following screenshot shows the GPU utilization right before the kernel dies.
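As a text-based alternative to a screenshot, a short loop like this (plain PyTorch, no project-specific code) prints per-GPU memory right before the crash point:

```python
import torch

# Log allocated/reserved memory for every visible GPU.
for i in range(torch.cuda.device_count()):
    alloc = torch.cuda.memory_allocated(i) / 2**30
    reserved = torch.cuda.memory_reserved(i) / 2**30
    print(f"cuda:{i}: {alloc:.2f} GiB allocated, {reserved:.2f} GiB reserved")
```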
It's the 4-bit attention and 2-bit expert setup from our tech report. I suppose the weights can be found here. Let's summon @lavawolfiee just in case I'm mistaken.
Could you provide a bit more detail? I'll look into it as soon as I have the time.
Yes, you're right.
This seems to be the same setup I used in the code I provided, which occupies ~11 GB of VRAM and ~23 GB of CPU RAM and then crashes the kernel at inference.
Absolutely, what information are you looking for?
A stack trace would be helpful.
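When a Jupyter kernel dies outright, the traceback is often lost. One way to capture it is to run the same code as a plain script with the standard-library faulthandler enabled, which dumps a traceback on fatal signals such as a segfault (note that an OS out-of-memory kill sends SIGKILL and leaves no trace; check dmesg in that case):

```python
import faulthandler

# Dump the Python stack on fatal signals (SIGSEGV, SIGABRT, ...).
faulthandler.enable()

# ... run the inference code that crashes the kernel ...
```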