Fallback to CPU with OOM even though GPU *should* have more than enough #1543
This is because you don't have enough VRAM available to load the model. Yes, I know your GPU has a lot of VRAM, but you probably have this GPU set in your BIOS as the primary GPU, which means Windows is using some of it for the desktop. I believe the issue is that although you have a lot of shared memory available, it isn't contiguous, because of fragmentation caused by Windows.
Absolutely not the case. I have tried loading a model that takes at most 5-6 GB on my RTX 3090, and it doesn't work. I can load up other machine learning applications and use 20 GB. There is definitely a problem here. Sitting on the desktop DOES NOT take 20+ GB of VRAM.
I believe what manyoso is saying is that our Vulkan backend currently requires a single contiguous chunk of memory to be available, because it allocates one big chunk instead of many smaller chunks the way other machine learning frameworks do. This means it would probably work fine if other things weren't holding small allocations in the middle of your VRAM. We still intend to fix this issue :)
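To make the contiguity point concrete, here is a small simulation (not GPT4All code; the heap size and resident allocations are made-up numbers) showing how a heap can have far more total free memory than its largest contiguous free span:

```python
# Sketch: why "enough total VRAM" can still fail when a backend asks the
# driver for one large contiguous allocation. The heap layout below is an
# invented example, not real driver data.
def largest_contiguous_free(heap_size, allocations):
    """Largest contiguous free span in a heap, given a sorted-or-not list of
    (offset, size) allocations already resident (e.g. desktop compositor)."""
    allocations = sorted(allocations)
    largest, cursor = 0, 0
    for offset, size in allocations:
        largest = max(largest, offset - cursor)   # gap before this allocation
        cursor = max(cursor, offset + size)
    return max(largest, heap_size - cursor)       # gap after the last one

GIB = 1024 ** 3
heap = 24 * GIB                                   # e.g. an RTX 3090
# A few small buffers scattered through the heap by other processes:
resident = [(5 * GIB, GIB // 4), (11 * GIB, GIB // 4), (17 * GIB, GIB // 4)]

total_free = heap - sum(size for _, size in resident)
contiguous = largest_contiguous_free(heap, resident)
print(f"free total:         {total_free / GIB:.2f} GiB")  # plenty in total
print(f"largest contiguous: {contiguous / GIB:.2f} GiB")  # much smaller
```

With only 0.75 GiB actually in use, a single ~7 GiB allocation still fails here, while a backend that allocates in smaller chunks would succeed.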
It seems that there is no way around this? I have dual 3090s, and specifically selecting either of them will throw this error. I'm not sure the information about "contiguous blocks" in memory is useful, as there is generally no way to enable specific use of GPUs in the BIOS, and as I understand it this really shouldn't be an issue. Has anyone found a workaround?
On my 16 GB RTX, only models smaller than 4 GB run on the GPU. Another model, 8 GB in size, uses ~9 GB of VRAM and runs only on the CPU (it always says "out of VRAM"). So my conclusion is that it's a simple programming error, as the model doesn't use that much more VRAM than its actual size.
Models that run with the GPU on my 16 GB RTX (how well, I can't say ;)): nearly all TinyLlama models and one German model. Of the built-in download versions, often only the Q4 models work.
We only support GPU acceleration of Q4_0 and Q4_1 quantizations at the moment. |
sauerkrautlm-7b-hero.Q5_K_M.gguf |
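The exchange above (only Q4_0/Q4_1 are GPU-accelerated, and the failing file is a Q5_K_M) suggests a simple pre-flight check. This is a heuristic sketch of my own, not GPT4All code: it reads the quantization tag from the conventional `...<quant>.gguf` filename suffix, whereas the authoritative answer lives in the GGUF metadata.

```python
import re

# Quantizations the maintainer says the Vulkan backend can accelerate:
GPU_QUANTS = {"Q4_0", "Q4_1"}

def gpu_supported(filename: str) -> bool:
    """Heuristic: does the filename's quant suffix name a GPU-supported quant?"""
    m = re.search(r"\.([A-Za-z0-9_]+)\.gguf$", filename)
    return bool(m) and m.group(1).upper() in GPU_QUANTS

print(gpu_supported("orca-2-13b.Q4_0.gguf"))              # True
print(gpu_supported("sauerkrautlm-7b-hero.Q5_K_M.gguf"))  # False -> CPU fallback
```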
I can't load a Q4_0 into VRAM on either of my 4090s, each with 24 GB. I can literally open the exact model downloaded by GPT4All, orca-2-13b.Q4_0.gguf, in textgen-webui, offload ALL layers to the GPU, and see a speed increase. Why can't we use GPTQ? I don't understand why so many LLM apps are so limited and so dead-set on slow CPU generation. Why not just include the GPU option by default and fall back to CPU for those who don't have one? Let's face it: not many people on PC are trying out local LLMs without GPUs. Anyway, you say you support Q4_0 and Q4_1 on the GPU, but that model WILL NOT load into VRAM when I have 24 GB, and a different LLM app WILL load the exact same file into VRAM. So the problem is clearly with GPT4All.
Just so you're aware, GPT4All uses a completely different GPU backend than the other LLM apps you're familiar with - it's an original implementation based on Vulkan. It's still in its early stages (because bugs like this need to be fixed before it can be considered mature), but the main benefit is that it's easy to support NVIDIA, AMD, and Intel all with the same code. exllama2 is great if you have two 4090s - GPT4All in its current state probably isn't for you, as it definitely doesn't take full advantage of your hardware. But many of our users do not have access to such impressive GPUs (myself included) and benefit from features that llama.cpp makes it relatively easy to support, such as partial GPU offload - which we haven't implemented yet, but plan to. |
Thanks for the explanation! I appreciate that you want to support Mac, AMD, and NVIDIA alike; that's a great goal. But I agree I would be more likely to make this my main app once full GPU support comes in. The main reason I am looking at tools like GPT4All is that more basic tools like textgen-webui or LM Studio don't have pipelines for RAG. GPT4All got a few recommendations in a Reddit post where I asked about various LLM+RAG pipelines, so I wanted to test it out. I've tested a few now, and similar to GPT4All, I find they're all CPU-bound with rough or no GPU support. Honestly, the speed of CPU generation is incredibly painful and I can't live with it! :)
What about adding ability to connect to a different API? textgen-webui supports OpenAI API standard, and in fact other LLM apps can connect to textgen-webui for GPU support. Would you consider adding that ability as a stopgap, until Vulkan improves? It would keep your existing compatibilities with Mac/AMD but open up a whole new world to other GPU users. |
Since we already support connecting to ChatGPT, that would be a reasonable feature request - you should open an issue for it. |
Thanks, I'll look into opening one. Glad to hear you're receptive to this! I've found in other LLM apps that it takes a dedicated GUI option somewhere, since locally the API doesn't require authentication the way OpenAI does; it's simply an endpoint. Or something like that, anyway! Cheers, and thanks again.
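The stopgap discussed above can be sketched with the standard library alone. Everything here is an assumption for illustration (the base URL, port, and model name are placeholders for whatever the local server exposes); as noted in the thread, local OpenAI-compatible servers typically need no API key, just the endpoint.

```python
import json
import urllib.request

def build_chat_request(base_url: str, model: str, prompt: str) -> urllib.request.Request:
    """Build (but don't send) an OpenAI-style chat completion request for a
    local server such as textgen-webui. No Authorization header is attached,
    since local endpoints usually don't require one."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_chat_request("http://localhost:5000", "local-model", "Hello")
# urllib.request.urlopen(req) would actually send it; omitted so the sketch
# stays offline.
print(req.full_url)
```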
I had these issues and switched over to using transformers (HuggingFacePipeline), and now I can take advantage of dual 3090s.
Possible GPU usage, tested with version 1.6.1 today: my VRAM is 16 GB (RTX 4060), so try models like those above, or with a model size lower than 1/4 of your max VRAM. :)
I am in the same boat: I can load 8 GB models but cannot load 16 GB ones. There is 8 + 16 = 24 GB total available. When loading 8 GB models, I can see about half goes to shared RAM and half to dedicated, so after loading an 8 GB model I still have about 4 GB free in dedicated VRAM. I can even run GPT4All twice and load two 8 GB models, filling the dedicated memory to 8 GB; yet with one 16 GB model I get "out of VRAM?".
It's an error related to Vulkan.
You should now be able to use partial offloading to load some number of the model's layers into VRAM, even for 16 GB models. I'm going to wait a bit for those who have experienced issues to comment and verify that partial offloading works for them; in the absence of new comments, this issue will be closed as fixed, since partial offloading is now supported.
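The core arithmetic behind partial offloading can be sketched in a few lines. This is an illustration under simplifying assumptions (a flat per-layer cost and made-up sizes), not GPT4All's actual loader, which also has to budget for the KV cache and scratch buffers:

```python
def layers_that_fit(vram_bytes: int, model_bytes: int, n_layers: int,
                    reserve_bytes: int = 0) -> int:
    """How many transformer layers fit in VRAM, assuming the model's weights
    are split evenly across layers and reserve_bytes is held back for
    non-weight buffers. Real loaders refine both assumptions."""
    per_layer = model_bytes / n_layers
    usable = max(0, vram_bytes - reserve_bytes)
    return min(n_layers, int(usable // per_layer))

GIB = 1024 ** 3
# A 16 GiB model with 40 layers on an 8 GiB card, keeping 1 GiB in reserve:
print(layers_that_fit(8 * GIB, 16 * GIB, 40, reserve_bytes=1 * GIB))
# The same model on a 24 GiB card fits fully:
print(layers_that_fit(24 * GIB, 16 * GIB, 40))
```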
Although it may be possible for some users to mitigate this issue with partial offloading, it is still an issue - people should be able to fully offload models with GPT4All that they can fully offload with every other LLM app using the CUDA backend. |
If that's the case, and the issue isn't one of contiguous regions, then we must be requiring some flag on the memory that the others do not.
The problem is that partial offloading is still difficult: if you work at a higher quality than GPT4All's standard Q4 (Q6 is recommended for professional work), you need about 48 GB or more (a 34B model needs even more).
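Rough model-size arithmetic gives a feel for the numbers quoted above: file size ≈ parameters × bits per weight / 8. The bits-per-weight figures below are approximate values for llama.cpp-style quantizations, and the result ignores the KV cache and runtime buffers, so treat it as a lower bound on VRAM needed:

```python
# Approximate bits per weight for common GGUF quantizations (illustrative):
BITS_PER_WEIGHT = {"Q4_0": 4.5, "Q6_K": 6.56, "F16": 16.0}

def approx_size_gb(params_billions: float, quant: str) -> float:
    """Lower-bound estimate of a quantized model's size in GB."""
    return params_billions * 1e9 * BITS_PER_WEIGHT[quant] / 8 / 1e9

for params, quant in [(7, "Q4_0"), (7, "Q6_K"), (34, "Q6_K")]:
    print(f"{params}B {quant}: ~{approx_size_gb(params, quant):.1f} GB")
```

This matches the thread's observations that 7B Q4 models come in around 4 GB, while higher-precision quantizations of larger models quickly outgrow a single consumer GPU.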
System Info
version: 1.0.12
platform: windows
python: 3.11.4
graphics card: nvidia rtx 4090 24gb
Information
Reproduction
run the following code
Expected behavior