
Fallback to CPU with OOM even though GPU *should* have more than enough #1543

Open
kasperske opened this issue Oct 20, 2023 · 24 comments
Labels
backend (gpt4all-backend issues), vulkan

Comments

@kasperske

System Info

version: 1.0.12
platform: Windows
python: 3.11.4
graphics card: NVIDIA RTX 4090 (24 GB)

Information

  • The official example notebooks/scripts
  • My own modified scripts

Reproduction

Run the following code:

from gpt4all import GPT4All

model = GPT4All("wizardlm-13b-v1.1-superhot-8k.ggmlv3.q4_0", device='gpu')  # device='amd', device='intel'
output = model.generate("Write a Tetris game in python scripts", max_tokens=4096)
print(output)

Expected behavior

Found model file at C:\Users\earne\.cache\gpt4all\wizardlm-13b-v1.1-superhot-8k.ggmlv3.q4_0.bin
llama.cpp: loading model from C:\Users\earne\.cache\gpt4all\wizardlm-13b-v1.1-superhot-8k.ggmlv3.q4_0.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32001
llama_model_load_internal: n_ctx      = 2048
llama_model_load_internal: n_embd     = 5120
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 40
llama_model_load_internal: n_head_kv  = 40
llama_model_load_internal: n_layer    = 40
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: n_gqa      = 1
llama_model_load_internal: rnorm_eps  = 5.0e-06
llama_model_load_internal: n_ff       = 13824
llama_model_load_internal: freq_base  = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size = 6983.73 MB
Error allocating memory ErrorOutOfDeviceMemory
error loading model: Error allocating vulkan memory.
llama_load_model_from_file: failed to load model
LLAMA ERROR: failed to load model from C:\Users\earne\.cache\gpt4all\wizardlm-13b-v1.1-superhot-8k.ggmlv3.q4_0.bin
LLaMA ERROR: prompt won't work with an unloaded model!
@cebtenzzre cebtenzzre mentioned this issue Oct 23, 2023
@cebtenzzre cebtenzzre changed the title from "LLAMA ERROR: failed to load model from ..." to "Error allocating memory ErrorOutOfDeviceMemory" on Oct 23, 2023
@cebtenzzre cebtenzzre added the backend (gpt4all-backend issues) and vulkan labels on Oct 24, 2023
@manyoso
Collaborator

manyoso commented Oct 28, 2023

This is because you don't have enough VRAM available to load the model. Yes, I know your GPU has a lot of VRAM, but you probably have this GPU set as the primary GPU in your BIOS, which means Windows is using some of it for the desktop. I believe that although you have a lot of shared memory available, it isn't contiguous because of fragmentation caused by Windows.
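
For anyone who wants to sanity-check this before loading, here is a minimal sketch that shells out to nvidia-smi (an assumption: NVIDIA card with the driver tools on PATH). Note it only reports total free dedicated VRAM, not whether that memory is contiguous, so it is a rough proxy at best.

# Rough check of free dedicated VRAM before choosing a device.
# Assumes an NVIDIA GPU and that nvidia-smi is available on PATH.
import subprocess

def free_vram_mib(gpu_index: int = 0) -> int:
    out = subprocess.check_output(
        ["nvidia-smi", "-i", str(gpu_index),
         "--query-gpu=memory.free", "--format=csv,noheader,nounits"],
        text=True,
    )
    return int(out.strip())

# A 13B Q4_0 model needs roughly 7 GiB in one piece on the current backend,
# so only request the GPU when there is comfortably more than that free.
device = "gpu" if free_vram_mib() > 8192 else "cpu"
print(device)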

@manyoso manyoso mentioned this issue Oct 28, 2023
@manyoso manyoso changed the title from "Error allocating memory ErrorOutOfDeviceMemory" to "Fallback to CPU with OOM even though GPU *should* have more than enough" on Oct 28, 2023
@PHIL-GIBSON-1990

This is because you don't have enough VRAM available to load the model. Yes, I know your GPU has a lot of VRAM, but you probably have this GPU set as the primary GPU in your BIOS, which means Windows is using some of it for the desktop. I believe that although you have a lot of shared memory available, it isn't contiguous because of fragmentation caused by Windows.

Absolutely not the case. I have tried loading a model that will take at most 5-6 GB on my RTX 3090 and it doesn't work. I can load up other machine learning applications and use 20 GB. There is definitely a problem here. Sitting on the desktop does NOT take 20+ GB of VRAM.

@cebtenzzre
Member

I believe what manyoso is saying is that our Vulkan backend currently requires a contiguous chunk of memory to be available, as it allocates one big chunk instead of smaller chunks like other machine learning frameworks do. This means it would probably work fine if you didn't have other things using small chunks in the middle of your VRAM. We still intend to fix this issue :)
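
To make the "contiguous chunk" point concrete, here is a toy illustration with made-up numbers (this is not the real allocator, just the arithmetic of the failure mode):

# Toy numbers: three free gaps in VRAM totalling 10.5 GiB, but the largest
# single gap is only 5 GiB. A single big allocation fails even though the
# total free memory would be enough if the model could be split into chunks.
free_gaps_gib = [5.0, 3.0, 2.5]
model_size_gib = 7.0

fits_as_one_chunk = max(free_gaps_gib) >= model_size_gib     # False: what our Vulkan backend needs today
fits_as_small_chunks = sum(free_gaps_gib) >= model_size_gib  # True: what chunked allocators can do
print(fits_as_one_chunk, fits_as_small_chunks)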

@BryceDrechselSmith

It seems that there is no way around this? I have dual 3090s, and specifically selecting either of them will throw this error. I'm not sure that the information about "contiguous blocks" in memory is useful, as there is generally no way to enable specific use of GPUs in the BIOS, and this really shouldn't be an issue as I understand it. Has anyone found a workaround?

@kalle07

kalle07 commented Nov 26, 2023

On my 16 GB RTX, only models smaller than 4 GB run on the GPU.
Such a model uses about 5 GB of VRAM whether it is generating or not ... I can log it with GPU-Z.

Another model, 8 GB in size, uses ~9 GB of VRAM and runs only on the CPU (it always says "out of VRAM").

-> So my conclusion is that this is a simple programming error, since the model doesn't use that much more VRAM than its actual size.

@kalle07

kalle07 commented Nov 28, 2023

Models that run on my 16 GB RTX with the GPU (how well, I cannot say) ;)

nearly all TinyLlama models

and one German model:
sauerkrautlm-3b-v1.Q4_1

and the built-in download versions of:
orca-2-7b.Q4_0.gguf
gpt4all falcon

Often only the Q4 models are working.

@cebtenzzre
Member

Often only the Q4 models are working.

We only support GPU acceleration of Q4_0 and Q4_1 quantizations at the moment.
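
If you are on the Python bindings and just want generation to keep working, a minimal sketch of an explicit GPU-then-CPU fallback is below. Assumptions: a failed GPU load surfaces as an exception (behaviour can vary between versions), and the model name is simply one of the Q4_0 downloads mentioned in this thread.

from gpt4all import GPT4All

MODEL = "orca-2-7b.Q4_0.gguf"  # Q4_0/Q4_1 are the only quantizations with GPU acceleration right now

try:
    model = GPT4All(MODEL, device="gpu")
except Exception as err:  # assumption: a failed Vulkan allocation raises instead of silently falling back
    print(f"GPU load failed ({err}); retrying on CPU")
    model = GPT4All(MODEL, device="cpu")

print(model.generate("Hello", max_tokens=32))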

@kalle07

kalle07 commented Dec 1, 2023

sauerkrautlm-7b-hero.Q5_K_M.gguf
a German model that runs on the CPU, but runs very well, including with LocalDocs

@ewebgh33

We only support GPU acceleration of Q4_0 and Q4_1 quantizations at the moment.

I can't load a Q4_0 into VRAM on either of my 4090s, each with 24 GB.
I came to this issue after my duplicate issue was closed.

I can literally open the exact model downloaded by GPT4All, orca-2-13b.Q4_0.gguf, in textgen-webui, offload ALL layers to the GPU, and see a speed increase.
I can use the exact same model as GPTQ and see a HUGE speed increase even over the GGUF when it is fully in VRAM.

Why can't we use GPTQ? I don't understand why so many LLM apps are so limited and so dead-set on slow CPU generation. Why not just include the option for GPU by default and fall back to CPU for those who don't have one? Let's face it, not many people on PC are trying out local LLMs without GPUs.

Anyway, you say you support Q4_0 and Q4_1 on the GPU, but that model WILL NOT load into VRAM when I have 24 GB, while a different LLM app WILL load the exact same file into VRAM. So the problem is clearly with GPT4All.

@cebtenzzre
Member

I can't load a Q4_0 into VRAM on either of my 4090s, each with 24 GB.

Just so you're aware, GPT4All uses a completely different GPU backend than the other LLM apps you're familiar with - it's an original implementation based on Vulkan. It's still in its early stages (because bugs like this need to be fixed before it can be considered mature), but the main benefit is that it's easy to support NVIDIA, AMD, and Intel all with the same code.

exllama2 is great if you have two 4090s - GPT4All in its current state probably isn't for you, as it definitely doesn't take full advantage of your hardware. But many of our users do not have access to such impressive GPUs (myself included) and benefit from features that llama.cpp makes it relatively easy to support, such as partial GPU offload - which we haven't implemented yet, but plan to.

@ewebgh33

ewebgh33 commented Dec 21, 2023

Thanks for the explanation!
So basically I need to wait until this Vulkan thing is... better?

I appreciate that you want to support Mac, AMD, and NVIDIA alike; that's a great goal. But I would be more likely to make this my main app once full GPU support comes in.

The main reason I am looking at tools like GPT4All is that the more basic tools like textgen-webui or LMStudio don't have pipelines for RAG. GPT4All got a few recommendations in a Reddit post where I asked about various LLM+RAG pipelines, so I wanted to test it out.

I've tested a few now, and similar to GPT4All, I keep finding they're all CPU-bound with rough or no GPU support. Honestly, the speed of CPU generation is incredibly painful and I can't live with it! :)

@ewebgh33

What about adding the ability to connect to a different API? textgen-webui supports the OpenAI API standard, and in fact other LLM apps can connect to textgen-webui for GPU support. Would you consider adding that ability as a stopgap until Vulkan improves? It would keep your existing compatibility with Mac/AMD but open up a whole new world to other GPU users.
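
For reference, this is the kind of stopgap I mean: pointing the standard openai Python client at a local textgen-webui endpoint. The URL, port, and model name below are placeholders for whatever your local server exposes, and the API key is ignored by a local endpoint.

from openai import OpenAI

# Placeholder endpoint: textgen-webui's OpenAI-compatible server running locally.
client = OpenAI(base_url="http://127.0.0.1:5000/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="local-model",  # the local server decides which loaded model this maps to
    messages=[{"role": "user", "content": "Write a Tetris game in Python"}],
)
print(resp.choices[0].message.content)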

@cebtenzzre
Member

What about adding the ability to connect to a different API?

Since we already support connecting to ChatGPT, that would be a reasonable feature request - you should open an issue for it.

@ewebgh33

Thanks, I'll look into opening one. Glad to hear you are receptive to this!

I have found in other LLM apps that it takes a dedicated GUI option somewhere, since a local API doesn't require authentication the way OpenAI does; it's simply an endpoint. Or something like that, anyway!

Cheers and thanks again,
Em

@BryceDrechselSmith

I had these issues and switched over to using Transformers (HuggingFacePipeline), and now I can take advantage of dual 3090s.

@kalle07

kalle07 commented Jan 20, 2024

Possible GPU usage ... tested with version 1.6.1 today!

My VRAM is 16 GB (RTX 4060).
Only models smaller than 3.8 GB work on my GPU (without LocalDocs, which always runs on the CPU).

So try models like:
wizardlm-7b-v1.0-uncensored.Q3_K_M.gguf
open_llama_3b_code_instruct_0.1.q4_k_m.gguf
syzymon-long_llama_3b_instruct-Q4_K_M.gguf

or a model size lower than about 1/4 of your max VRAM (on a 16 GB card that works out to ~4 GB, which matches what I see).

THY me alone :)

cebtenzzre referenced this issue Feb 2, 2024
@cebtenzzre cebtenzzre mentioned this issue Feb 7, 2024
@fanoush

fanoush commented Feb 15, 2024

I am in the same boat. I can load 8 GB models but cannot load 16 GB ones.

GPU processor:		NVIDIA RTX A2000 8GB Laptop GPU
Driver version:		532.09
Total available graphics memory:	24411 MB
Dedicated video memory:	8192 MB GDDR6
System video memory:	0 MB
Shared system memory:	16219 MB

There is 8 + 16 = 24 GB available in total. When loading an 8 GB model I can see that about half goes to shared RAM and half to dedicated VRAM, so after loading an 8 GB model I still have about 4 GB free in dedicated VRAM. I can even run gpt4all twice and load two 8 GB models, filling the dedicated memory to 8 GB; however, with one 16 GB model I get "out of VRAM?".

@kalle07

kalle07 commented Feb 15, 2024

It's an error related to Vulkan.
They've been believing in it for 5 months now ^^

@manyoso
Collaborator

manyoso commented Mar 6, 2024

You should now be able to use partial offloading to load some number of the model's layers into VRAM, even for 16 GB models. I'm going to wait a bit for those who have experienced issues to comment and verify that they can use partial offloading, but in the absence of new comments this issue will be closed as fixed, since partial offloading is now supported.
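
For the Python bindings, a minimal sketch of what partial offloading looks like is below. It assumes a recent enough version where the constructor accepts an ngl (GPU layer count) argument; 36 is only an example value, and the model name is one of the Q4_0 downloads from this thread.

from gpt4all import GPT4All

# ngl = number of model layers offloaded to the GPU; the remaining layers run on the CPU.
model = GPT4All(
    "orca-2-13b.Q4_0.gguf",
    device="gpu",
    ngl=36,  # example value; lower it until the model loads within your VRAM
)
print(model.generate("Hello", max_tokens=32))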

@cebtenzzre
Member

in the absence of new comments this issue will be closed as fixed, since partial offloading is now supported.

Although it may be possible for some users to mitigate this issue with partial offloading, it is still an issue - people should be able to fully offload models with GPT4All that they can fully offload with every other LLM app using the CUDA backend.

@manyoso
Collaborator

manyoso commented Mar 6, 2024

Although it may be possible for some users to mitigate this issue with partial offloading, it is still an issue - people should be able to fully offload models with GPT4All that they can fully offload with every other LLM app using the CUDA backend.

If this is the case, and the issue is not one of contiguous regions, then we must be requiring some flag on the memory that others do not.

@fanoush

fanoush commented Mar 10, 2024

Thanks for the tip. I can confirm I can load the 16 GB Wizard 1.2 model when reducing GPU layers to 36 on the NVIDIA RTX A2000 8GB; the allocation then looks like this:
[screenshot of the resulting GPU memory allocation]

@gtbu

gtbu commented Apr 19, 2024

The problem with partial offloading is that it gets difficult: if you work at a higher precision than GPT4All's standard Q4 (Q6 is recommended for professional work), you need about 48 GB or more (34B models need even more).
P.S. Some AMD cards like the R5 430 see (in the Windows Task Manager) the rest of the computer's RAM as a reserve and include it.
