
Fallback to CPU with OOM even though GPU *should* have more than enough #1543

Open
kasperske opened this issue Oct 20, 2023 · 24 comments
Labels
backend (gpt4all-backend issues), vulkan

Comments

@kasperske

System Info

version: 1.0.12
platform: Windows
python: 3.11.4
graphics card: NVIDIA RTX 4090 (24 GB)

Information

  • The official example notebooks/scripts
  • My own modified scripts

Reproduction

Run the following code:

from gpt4all import GPT4All

model = GPT4All("wizardlm-13b-v1.1-superhot-8k.ggmlv3.q4_0", device='gpu')  # device='amd', device='intel'
output = model.generate("Write a Tetris game in python scripts", max_tokens=4096)
print(output)

Expected behavior

Found model file at C:\Users\earne\.cache\gpt4all\wizardlm-13b-v1.1-superhot-8k.ggmlv3.q4_0.bin
llama.cpp: loading model from C:\Users\earne\.cache\gpt4all\wizardlm-13b-v1.1-superhot-8k.ggmlv3.q4_0.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32001
llama_model_load_internal: n_ctx      = 2048
llama_model_load_internal: n_embd     = 5120
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 40
llama_model_load_internal: n_head_kv  = 40
llama_model_load_internal: n_layer    = 40
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: n_gqa      = 1
llama_model_load_internal: rnorm_eps  = 5.0e-06
llama_model_load_internal: n_ff       = 13824
llama_model_load_internal: freq_base  = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size = 6983.73 MB
Error allocating memory ErrorOutOfDeviceMemory
error loading model: Error allocating vulkan memory.
llama_load_model_from_file: failed to load model
LLAMA ERROR: failed to load model from C:\Users\earne\.cache\gpt4all\wizardlm-13b-v1.1-superhot-8k.ggmlv3.q4_0.bin
LLaMA ERROR: prompt won't work with an unloaded model!
@cebtenzzre cebtenzzre mentioned this issue Oct 23, 2023
@cebtenzzre cebtenzzre changed the title from "LLAMA ERROR: failed to load model from ..." to "Error allocating memory ErrorOutOfDeviceMemory" on Oct 23, 2023
@cebtenzzre cebtenzzre added the backend (gpt4all-backend issues) and vulkan labels on Oct 24, 2023
@manyoso
Collaborator

manyoso commented Oct 28, 2023

This is because you don't have enough VRAM available to load the model. Yes, I know your GPU has a lot of VRAM, but you probably have this GPU set as the primary GPU in your BIOS, which means Windows is using some of it for the desktop. I believe that although you have a lot of shared memory available, it isn't contiguous because of fragmentation caused by Windows.
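
For anyone who wants to sanity-check this before loading, here is a minimal sketch that shells out to nvidia-smi (an assumption: NVIDIA card with the driver tools on PATH). Note it only reports total free dedicated VRAM, not whether that memory is contiguous, so it is a rough proxy at best.

# Rough check of free dedicated VRAM before choosing a device.
# Assumes an NVIDIA GPU and that nvidia-smi is available on PATH.
import subprocess

def free_vram_mib(gpu_index: int = 0) -> int:
    out = subprocess.check_output(
        ["nvidia-smi", "-i", str(gpu_index),
         "--query-gpu=memory.free", "--format=csv,noheader,nounits"],
        text=True,
    )
    return int(out.strip())

# A 13B Q4_0 model needs roughly 7 GiB in one piece on the current backend,
# so only request the GPU when there is comfortably more than that free.
device = "gpu" if free_vram_mib() > 8192 else "cpu"
print(device)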

@manyoso manyoso mentioned this issue Oct 28, 2023
@manyoso manyoso changed the title from "Error allocating memory ErrorOutOfDeviceMemory" to "Fallback to CPU with OOM even though GPU *should* have more than enough" on Oct 28, 2023
@PHIL-GIBSON-1990

This is because you don't have enough VRAM available to load the model. Yes, I know your GPU has a lot of VRAM, but you probably have this GPU set as the primary GPU in your BIOS, which means Windows is using some of it for the desktop. I believe that although you have a lot of shared memory available, it isn't contiguous because of fragmentation caused by Windows.

Absolutely not the case. I have tried loading a model that will take at most 5-6 GB on my RTX 3090 and it doesn't work. I can load up other machine learning applications and use 20 GB. There is definitely a problem here. Sitting on the desktop does NOT take 20+ GB of VRAM.

@cebtenzzre
Member

I believe what manyoso is saying is that our Vulkan backend currently requires a contiguous chunk of memory to be available, as it allocates one big chunk instead of smaller chunks like other machine learning frameworks do. This means it would probably work fine if you didn't have other things using small chunks in the middle of your VRAM. We still intend to fix this issue :)
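
To make the "contiguous chunk" point concrete, here is a toy illustration with made-up numbers (this is not the real allocator, just the arithmetic of the failure mode):

# Toy numbers: three free gaps in VRAM totalling 10.5 GiB, but the largest
# single gap is only 5 GiB. A single big allocation fails even though the
# total free memory would be enough if the model could be split into chunks.
free_gaps_gib = [5.0, 3.0, 2.5]
model_size_gib = 7.0

fits_as_one_chunk = max(free_gaps_gib) >= model_size_gib     # False: what our Vulkan backend needs today
fits_as_small_chunks = sum(free_gaps_gib) >= model_size_gib  # True: what chunked allocators can do
print(fits_as_one_chunk, fits_as_small_chunks)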

@BryceDrechselSmith

It seems that there is no way around this? I have dual 3090s, and specifically selecting either of them will throw this error. I'm not sure that the information about "contiguous blocks" in memory is useful, as there is generally no way to enable specific use of GPUs in the BIOS, and this really shouldn't be an issue as I understand it. Has anyone found a workaround?

@kalle07

kalle07 commented Nov 26, 2023

On my 16 GB RTX, only models smaller than 4 GB run on the GPU.
Such a model uses about 5 GB of VRAM whether it is generating or not ... I can log it with GPU-Z.

Another model, 8 GB in size, uses ~9 GB of VRAM and runs only on the CPU (it always says "out of VRAM").

-> So my conclusion is that this is a simple programming error, since the model doesn't use that much more VRAM than its actual size.

@kalle07

kalle07 commented Nov 28, 2023

Models that run on my 16 GB RTX with the GPU (how well, I cannot say) ;)

nearly all TinyLlama models

and one German model:
sauerkrautlm-3b-v1.Q4_1

and the built-in download versions of:
orca-2-7b.Q4_0.gguf
gpt4all falcon

Often only the Q4 models are working.

@cebtenzzre
Member

Often only the Q4 models are working.

We only support GPU acceleration of Q4_0 and Q4_1 quantizations at the moment.
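
If you are on the Python bindings and just want generation to keep working, a minimal sketch of an explicit GPU-then-CPU fallback is below. Assumptions: a failed GPU load surfaces as an exception (behaviour can vary between versions), and the model name is simply one of the Q4_0 downloads mentioned in this thread.

from gpt4all import GPT4All

MODEL = "orca-2-7b.Q4_0.gguf"  # Q4_0/Q4_1 are the only quantizations with GPU acceleration right now

try:
    model = GPT4All(MODEL, device="gpu")
except Exception as err:  # assumption: a failed Vulkan allocation raises instead of silently falling back
    print(f"GPU load failed ({err}); retrying on CPU")
    model = GPT4All(MODEL, device="cpu")

print(model.generate("Hello", max_tokens=32))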

@kalle07

kalle07 commented Dec 1, 2023

sauerkrautlm-7b-hero.Q5_K_M.gguf
a German model that runs on the CPU, but runs very well, including with LocalDocs

@ewebgh33

We only support GPU acceleration of Q4_0 and Q4_1 quantizations at the moment.

I can't load a Q4_0 into VRAM on either of my 4090s, each with 24 GB.
I came to this issue after my duplicate issue was closed.

I can literally open the exact model downloaded by GPT4All, orca-2-13b.Q4_0.gguf, in textgen-webui, offload ALL layers to the GPU, and see a speed increase.
I can use the exact same model as GPTQ and see a HUGE speed increase even over the GGUF when it is fully in VRAM.

Why can't we use GPTQ? I don't understand why so many LLM apps are so limited and so dead-set on slow CPU generation. Why not just include the option for GPU by default and fall back to CPU for those who don't have one? Let's face it, not many people on PC are trying out local LLMs without GPUs.

Anyway, you say you support Q4_0 and Q4_1 on the GPU, but that model WILL NOT load into VRAM when I have 24 GB, while a different LLM app WILL load the exact same file into VRAM. So the problem is clearly with GPT4All.

@cebtenzzre
Member

I can't load a Q4_0 into VRAM on either of my 4090s, each with 24 GB.

Just so you're aware, GPT4All uses a completely different GPU backend than the other LLM apps you're familiar with - it's an original implementation based on Vulkan. It's still in its early stages (because bugs like this need to be fixed before it can be considered mature), but the main benefit is that it's easy to support NVIDIA, AMD, and Intel all with the same code.

exllama2 is great if you have two 4090s - GPT4All in its current state probably isn't for you, as it definitely doesn't take full advantage of your hardware. But many of our users do not have access to such impressive GPUs (myself included) and benefit from features that llama.cpp makes it relatively easy to support, such as partial GPU offload - which we haven't implemented yet, but plan to.

@ewebgh33

ewebgh33 commented Dec 21, 2023

Thanks for the explanation!
So basically I need to wait until this Vulkan thing is... better?

I appreciate that you want to support Mac, AMD, and NVIDIA alike; that's a great goal. But I would be more likely to make this my main app once full GPU support comes in.

The main reason I am looking at tools like GPT4All is that the more basic tools like textgen-webui or LMStudio don't have pipelines for RAG. GPT4All got a few recommendations in a Reddit post where I asked about various LLM+RAG pipelines, so I wanted to test it out.

I've tested a few now, and similar to GPT4All, I keep finding they're all CPU-bound with rough or no GPU support. Honestly, the speed of CPU generation is incredibly painful and I can't live with it! :)

@ewebgh33

What about adding the ability to connect to a different API? textgen-webui supports the OpenAI API standard, and in fact other LLM apps can connect to textgen-webui for GPU support. Would you consider adding that ability as a stopgap until Vulkan improves? It would keep your existing compatibility with Mac/AMD but open up a whole new world to other GPU users.
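
For reference, this is the kind of stopgap I mean: pointing the standard openai Python client at a local textgen-webui endpoint. The URL, port, and model name below are placeholders for whatever your local server exposes, and the API key is ignored by a local endpoint.

from openai import OpenAI

# Placeholder endpoint: textgen-webui's OpenAI-compatible server running locally.
client = OpenAI(base_url="http://127.0.0.1:5000/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="local-model",  # the local server decides which loaded model this maps to
    messages=[{"role": "user", "content": "Write a Tetris game in Python"}],
)
print(resp.choices[0].message.content)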

@cebtenzzre
Member

What about adding the ability to connect to a different API?

Since we already support connecting to ChatGPT, that would be a reasonable feature request - you should open an issue for it.

@ewebgh33

Thanks, I'll look into opening one. Glad to hear you are receptive to this!

I have found in other LLM apps that it takes a dedicated GUI option somewhere, since a local API doesn't require authentication the way OpenAI does; it's simply an endpoint. Or something like that, anyway!

Cheers and thanks again,
Em

@BryceDrechselSmith

I had these issues and switched over to using Transformers (HuggingFacePipeline), and now I can take advantage of dual 3090s.

@kalle07

kalle07 commented Jan 20, 2024

Possible GPU usage ... tested with version 1.6.1 today!

My VRAM is 16 GB (RTX 4060).
Only models smaller than 3.8 GB work on my GPU (without LocalDocs, which always runs on the CPU).

So try models like:
wizardlm-7b-v1.0-uncensored.Q3_K_M.gguf
open_llama_3b_code_instruct_0.1.q4_k_m.gguf
syzymon-long_llama_3b_instruct-Q4_K_M.gguf

or a model size lower than about 1/4 of your max VRAM (on a 16 GB card that works out to ~4 GB, which matches what I see).

THY me alone :)

cebtenzzre referenced this issue Feb 2, 2024
@cebtenzzre cebtenzzre mentioned this issue Feb 7, 2024
@fanoush

fanoush commented Feb 15, 2024

I am in the same boat. I can load 8 GB models but cannot load 16 GB ones.

GPU processor:		NVIDIA RTX A2000 8GB Laptop GPU
Driver version:		532.09
Total available graphics memory:	24411 MB
Dedicated video memory:	8192 MB GDDR6
System video memory:	0 MB
Shared system memory:	16219 MB

There is 8 + 16 = 24 GB available in total. When loading an 8 GB model I can see that about half goes to shared RAM and half to dedicated VRAM, so after loading an 8 GB model I still have about 4 GB free in dedicated VRAM. I can even run gpt4all twice and load two 8 GB models, filling the dedicated memory to 8 GB; however, with one 16 GB model I get "out of VRAM?".

@kalle07

kalle07 commented Feb 15, 2024

It's an error related to Vulkan.
They've been believing in it for 5 months now ^^

@manyoso
Collaborator

manyoso commented Mar 6, 2024

You should now be able to use partial offloading to load some number of the model's layers into VRAM, even for 16 GB models. I'm going to wait a bit for those who have experienced issues to comment and verify that they can use partial offloading, but in the absence of new comments this issue will be closed as fixed, since partial offloading is now supported.
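
For the Python bindings, a minimal sketch of what partial offloading looks like is below. It assumes a recent enough version where the constructor accepts an ngl (GPU layer count) argument; 36 is only an example value, and the model name is one of the Q4_0 downloads from this thread.

from gpt4all import GPT4All

# ngl = number of model layers offloaded to the GPU; the remaining layers run on the CPU.
model = GPT4All(
    "orca-2-13b.Q4_0.gguf",
    device="gpu",
    ngl=36,  # example value; lower it until the model loads within your VRAM
)
print(model.generate("Hello", max_tokens=32))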

@cebtenzzre
Member

in the absence of new comments this issue will be closed as fixed, since partial offloading is now supported.

Although it may be possible for some users to mitigate this issue with partial offloading, it is still an issue - people should be able to fully offload models with GPT4All that they can fully offload with every other LLM app using the CUDA backend.

@manyoso
Collaborator

manyoso commented Mar 6, 2024

Although it may be possible for some users to mitigate this issue with partial offloading, it is still an issue - people should be able to fully offload models with GPT4All that they can fully offload with every other LLM app using the CUDA backend.

If this is the case, and the issue is not one of contiguous regions, then we must be requiring some flag on the memory that others do not.

@fanoush

fanoush commented Mar 10, 2024

Thanks for the tip. I can confirm I can load the 16 GB Wizard 1.2 model when reducing GPU layers to 36 on the NVIDIA RTX A2000 8GB; the allocation then looks like this:
[screenshot of the resulting GPU memory allocation]

@gtbu

gtbu commented Apr 19, 2024

The problem with partial offloading is that it gets difficult: if you work at a higher precision than GPT4All's standard Q4 (Q6 is recommended for professional work), you need about 48 GB or more (34B models need even more).
P.S. Some AMD cards like the R5 430 see (in the Windows Task Manager) the rest of the computer's RAM as a reserve and include it.
