Lately I've run into models that have trouble loading in Ollama; they usually fail with "out of memory". For example, 7B models like Gemma on an RTX 3070 Ti.

The way to get 7B models loaded is to restrict the number of layers that the llama.cpp backend offloads to the GPU. This is done with a parameter in the Modelfile:

```
PARAMETER num_gpu 25
```

When Ollama runs the model, the log then shows:

```
llm_load_tensors: offloading 25 repeating layers to GPU
```
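For reference, a complete Modelfile built around this parameter could look like the sketch below; the `gemma:7b` tag is an assumption on my part (the thread only says "7B models like Gemma"), and the layer count is the example value from above:

```
# Modelfile sketch: offload only 25 layers to the GPU
FROM gemma:7b
PARAMETER num_gpu 25
```

It would then be registered with the usual create call:

```
ollama create gemma-7b-gpu25 -f ./Modelfile
```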
As a feature request, I'd like to see a "Set layers for model X" option in Open-WebUI that writes a new Modelfile with this parameter, runs `ollama create` on it, and lists the resulting model. This would help benchmark older cards with limited VRAM, rather than just falling back to the CPU with `num_gpu 0`.
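Until something like that exists in the UI, a rough version of the benchmark can be scripted against the Ollama CLI. This is a minimal sketch under some assumptions: `gemma:7b` as the base model, a throwaway prompt, `ollama` on the PATH, and arbitrary example layer counts:

```bash
#!/usr/bin/env bash
# Sketch: time one prompt at several num_gpu values (model, prompt, and
# layer counts are placeholders, not values confirmed in this thread).
for layers in 0 10 20 25 30; do
  printf 'FROM gemma:7b\nPARAMETER num_gpu %s\n' "$layers" > Modelfile.bench
  ollama create "gemma-7b-gpu${layers}" -f Modelfile.bench
  echo "=== num_gpu=${layers} ==="
  time ollama run "gemma-7b-gpu${layers}" "Explain GPU layer offloading in one sentence."
done
```

With `num_gpu 0` as the CPU-only baseline, a loop like this gives a quick feel for how many layers a given card can hold before it runs out of memory.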
Replies: 2 comments 2 replies

- To clarify, you mean adding the field here in the Modelfile generator?
  - Yeah, if it's possible to add num_gpu.