
Gibberish output with Llama-2-7b-chat-hf-q4f32_1 #356

Open
beaufortfrancois opened this issue Apr 3, 2024 · 17 comments

@beaufortfrancois (Contributor) commented Apr 3, 2024

Chrome Version: 125.0.6283.3
OS: ChromeOS
GPU: Intel(R) Graphics (ADL GT2) - Intel open-source Mesa driver: Mesa 23.3.0 (git-5cb3f1e4fa)
Dawn Backend: Vulkan

What steps will reproduce the problem?

  1. Go to https://webllm.mlc.ai/#chat-demo
  2. Select Llama-2-7b-chat-hf-q4f32_1
  3. Enter "What color is the dress?"

What is the expected result?
Some text that at least makes sense.

What happens instead?
Some gibberish text appears.
DevTools JavaScript console contains the following logs:

llm_chat.ts:150 Using prefillChunkSize:  1024
llm_chat.ts:180 Using maxWindowLength:  4096
llm_chat.ts:202 Using Paged KVCache
15vkAllocateMemory failed with VK_ERROR_OUT_OF_DEVICE_MEMORY
    at CheckVkOOMThenSuccessImpl (..<URL>)

15vkAllocateMemory failed with VK_ERROR_OUT_OF_DEVICE_MEMORY
    at CheckVkOOMThenSuccessImpl (..<URL>)

Then I enter "What color is the dress?"

97[Invalid Buffer (unlabeled)] is invalid.
 - While validating entries[0] as a Buffer.
Expected entry layout: { type: BufferBindingType::Storage, hasDynamicOffset: 0, minBindingSize: 0 }
 - While validating [BindGroupDescriptor] against [BindGroupLayout (unlabeled)]
 - While calling [Device].CreateBindGroup([BindGroupDescriptor]).

162[Invalid BindGroup (unlabeled)] is invalid.
 - While encoding [ComputePassEncoder (unlabeled)].SetBindGroup(0, [Invalid BindGroup (unlabeled)], 0, ...).

161[Invalid CommandBuffer] is invalid.
 - While calling [Queue].Submit([[Invalid CommandBuffer]])

97[Invalid Buffer (unlabeled)] is invalid.
 - While validating entries[0] as a Buffer.
Expected entry layout: { type: BufferBindingType::Storage, hasDynamicOffset: 0, minBindingSize: 0 }
 - While validating [BindGroupDescriptor] against [BindGroupLayout (unlabeled)]
 - While calling [Device].CreateBindGroup([BindGroupDescriptor]).

65[Invalid Buffer (unlabeled)] is invalid.
 - While validating entries[0] as a Buffer.
Expected entry layout: { type: BufferBindingType::ReadOnlyStorage, hasDynamicOffset: 0, minBindingSize: 0 }
 - While validating [BindGroupDescriptor] against [BindGroupLayout (unlabeled)]
 - While calling [Device].CreateBindGroup([BindGroupDescriptor]).

162[Invalid BindGroup (unlabeled)] is invalid.
 - While encoding [ComputePassEncoder (unlabeled)].SetBindGroup(0, [Invalid BindGroup (unlabeled)], 0, ...).

161[Invalid CommandBuffer] is invalid.
 - While calling [Queue].Submit([[Invalid CommandBuffer]])

65[Invalid Buffer (unlabeled)] is invalid.
 - While validating entries[0] as a Buffer.
Expected entry layout: { type: BufferBindingType::ReadOnlyStorage, hasDynamicOffset: 0, minBindingSize: 0 }
 - While validating [BindGroupDescriptor] against [BindGroupLayout (unlabeled)]
 - While calling [Device].CreateBindGroup([BindGroupDescriptor]).

WebGPU: too many warnings, no more warnings will be reported to the console for this GPUDevice.
/#chat-demo:1 WebGPU: too many warnings, no more warnings will be reported to the console for this GPUDevice.

Note
It does work properly with the following f16 variants: Llama-2-7b-chat-hf-q4f16_1 and Llama-2-7b-chat-hf-q4f16_1-1k
I can reproduce with Llama-2-13b-chat-hf-q4f16_1

(screenshot: gibberish output from the model)

@CharlieFRuan (Contributor)

Thanks for reporting the issue. This seems to be an out-of-memory problem (an f32 KV cache in one case, 13b params in the other); llama-2-7b-q4f32_1 requires roughly 9 GB, while 13b-q4f16_1 requires roughly 10 GB.

How much RAM does the Intel(R) Graphics (ADL GT2) have? Is it 16 GB? It might be a bit hard to catch the OOM error, as we saw earlier in #209.
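
For a rough sense of where the f32/f16 gap comes from, here is a back-of-the-envelope KV-cache estimate (a sketch only; the shape constants are Llama-2-7B's published hyperparameters, not values read from the engine):

// Rough KV-cache size estimate for Llama-2-7B (32 layers, 32 KV heads, head dim 128).
// These are published model hyperparameters; the engine's actual allocation may differ.
const layers = 32;
const kvHeads = 32;
const headDim = 128;
const contextLen = 4096;

function kvCacheBytes(bytesPerElement: number): number {
  // K and V tensors per layer, one vector per token per head.
  return 2 * layers * kvHeads * headDim * contextLen * bytesPerElement;
}

console.log((kvCacheBytes(4) / 2 ** 30).toFixed(1), "GiB (f32)"); // ~4.0 GiB
console.log((kvCacheBytes(2) / 2 ** 30).toFixed(1), "GiB (f16)"); // ~2.0 GiB
// Added to ~3.5 GiB of 4-bit weights plus activations, the f32 cache is what
// pushes llama-2-7b-q4f32_1 toward the ~9 GB figure above.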

@CharlieFRuan (Contributor) commented Apr 3, 2024

A similar VK_ERROR_OUT_OF_DEVICE_MEMORY issue was reported in mlc-llm: mlc-ai/mlc-llm#974

@beaufortfrancois (Contributor, Author) commented Apr 12, 2024

I think we should catch GPU out-of-memory errors, as we previously tried in #209 (comment).

FYI, I was not able to catch them with https://chromewebstore.google.com/detail/webgpu-dev-extension/gkeaijopdhfempknmaggbjbedicopjgm either, @greggman.

EDIT: The reason is that the extension doesn't support workers.

@beaufortfrancois (Contributor, Author)

FYI, https://webgpureport.org says the integrated GPU memoryHeaps is [ size: 8269717504, properties: DEVICE_LOCAL | HOST_VISIBLE | HOST_COHERENT | HOST_UNCACHED | HOST_CACHED ], which suggests, I believe, that this ChromeOS device can use up to about 7.7 GiB of GPU memory.
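
WebGPU itself does not expose heap sizes, but the adapter limits at least give a per-buffer ceiling to sanity-check against before allocating (a minimal sketch):

// Sketch: inspect adapter limits before attempting large allocations.
// WebGPU exposes per-resource limits, not total VRAM.
const adapter = await navigator.gpu.requestAdapter();
if (!adapter) throw new Error("WebGPU not available");
console.log("maxBufferSize:", adapter.limits.maxBufferSize);
console.log("maxStorageBufferBindingSize:", adapter.limits.maxStorageBufferBindingSize);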

@beaufortfrancois (Contributor, Author)

@CharlieFRuan Are out-of-memory errors captured somewhere? In WebLLM or Apache TVM?

@CharlieFRuan (Contributor)

I know TVM can capture OOM for other backends (e.g. for Vulkan here). I'm not sure what the situation is for WebGPU. I'll make another attempt this week; thanks for the pointers!

@tqchen (Contributor) commented Apr 16, 2024

I think webllm would need its own mechanism. There are a few things to look at:

  • First of all, check whether WebGPU buffer creation can throw an error or be caught by an error scope (based on our previous trial, it seems this is not always the case).
  • One approach might be to have TVM's GPU adapter track the buffers allocated/deallocated and throw once a cap is reached (see the sketch below).
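
A minimal sketch of that second idea, wrapping createBuffer()/destroy() with a running byte counter (names and the cap are illustrative, not TVM's actual adapter code):

// Sketch: track allocated bytes and fail fast once a configurable cap is exceeded.
// Illustrative only; not TVM's actual GPU adapter.
class TrackedAllocator {
  private allocatedBytes = 0;

  constructor(private device: GPUDevice, private capBytes: number) {}

  createBuffer(desc: GPUBufferDescriptor): GPUBuffer {
    if (this.allocatedBytes + desc.size > this.capBytes) {
      throw new Error(
        `GPU allocation cap exceeded: ${this.allocatedBytes + desc.size} > ${this.capBytes} bytes`
      );
    }
    const buffer = this.device.createBuffer(desc);
    this.allocatedBytes += desc.size;
    return buffer;
  }

  destroyBuffer(buffer: GPUBuffer): void {
    this.allocatedBytes -= buffer.size;
    buffer.destroy();
  }
}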

@beaufortfrancois (Contributor, Author)

@CharlieFRuan Did you have a chance to look at this?

@CharlieFRuan (Contributor) commented Apr 22, 2024

@beaufortfrancois I tried to catch the error from createBuffer() by adding popErrorScope() at the three places it is called in https://github.com/apache/tvm/blob/main/web/src/webgpu.ts -- no luck with that.
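
For reference, the error-scope pattern attempted there looks roughly like this (a sketch; device and byteLength are assumed to be in scope at the createBuffer() call site):

// Sketch of wrapping a buffer allocation in an out-of-memory error scope.
// In tvm/web this would surround the createBuffer() call sites.
device.pushErrorScope("out-of-memory");
const buffer = device.createBuffer({
  size: byteLength,
  usage: GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_SRC | GPUBufferUsage.COPY_DST,
});
const error = await device.popErrorScope();
if (error) {
  throw new Error(`createBuffer failed: ${error.message}`);
}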

So I instead added a 1024-context-length version of llama-2-7b-q4f32_1 and made the -1k models the default choices on the demo page. This lowers the VRAM requirement by ~3 GB for llama-2 q4f32. I also added a note about the -1k models (bottom of screenshot) via #377.

(screenshot: demo page defaulting to the -1k models, with the added note)

@beaufortfrancois (Contributor, Author) commented Apr 23, 2024

@beaufortfrancois I tried to catch the error from createBuffer() by adding popErrorScope() at the three places it is called in https://github.com/apache/tvm/blob/main/web/src/webgpu.ts -- no luck with that.

According to the #356 (comment) logs, it looks like the errors happen when validating entries in createBindGroup(), not after createBuffer(). Does that help?

Did you try the uncapturederror event as well?

device.onuncapturederror = ({ error }) => {
  console.log(error);
};

So I instead added a 1024-context-length version of llama-2-7b-q4f32_1 and made the -1k models the default choices on the demo page. This lowers the VRAM requirement by ~3 GB for llama-2 q4f32. I also added a note about the -1k models (bottom of screenshot) via #377.

That's useful. Thanks!

@beaufortfrancois (Contributor, Author)

(gentle ping)

@beaufortfrancois (Contributor, Author)

@CharlieFRuan Did you have a chance to look at this?

@CharlieFRuan (Contributor)

Sorry for the delay, will take a look tonight

@CharlieFRuan (Contributor)

Quick update: it does seem that the error can be caught! Not sure if I did something wrong earlier or if there have been updates on the WebGPU side.

Since my laptop does not run into OOM for most models, to reproduce the error I set maxTotalSeqLen to an arbitrarily large number (909600) instead of default values like 4k or 1k; this forces the engine to allocate a very large KVCache. I'm not sure whether this is exactly equivalent to loading a model that is too large for the device, but it should be quite similar.

Upon finishing loading the model, the engine will allocate the KVCache, and I see:
(screenshot: the OOM error reported in the console)

This log corresponds to the push and pop of ErrorScope I added here in tvm/web:
(screenshot: the pushErrorScope/popErrorScope calls added in tvm/web)

Then, after ignoring the error and starting to chat, we hit the uncaptured error you suggested:
(screenshot: the uncapturederror handler firing during generation)

I will refine the handling and upstream the changes after verifying that the errors can indeed be reliably caught. I should have another update by the end of this week. Thank you so much for the help!

CharlieFRuan added a commit that referenced this issue May 21, 2024
Prior to this PR, when users `createEngine()` or call `reload()` with a
model that is too large for the device, the engine would likely keep
generating, ignoring the OOM issue and correctness. See
#356 and
#209.

This PR catches such errors with `device.lost.then()`, depending on tvmjs
to call `device.destroy()` upon detecting an error in `createBuffer()` via
apache/tvm#17005.

We have only observed `createBuffer()` errors and hence only handle that
kind of error for now. Besides, since most OOM errors occur in `reload()`,
we make the error handling synchronous despite using `.then()`, by throwing
the error at the end of `reload()` if there is one.
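
The device-lost pattern described in the commit is roughly the following (a sketch; the actual wiring lives in webllm and tvmjs):

// Sketch: record a device loss triggered by tvmjs calling device.destroy()
// after a failed createBuffer(), then rethrow it synchronously from reload().
let deviceLostError: Error | undefined;

device.lost.then((info) => {
  deviceLostError = new Error(`WebGPU device lost (${info.reason}): ${info.message}`);
});

// ...at the end of reload(), surface the failure to the caller:
if (deviceLostError) {
  throw deviceLostError;
}
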
@CharlieFRuan (Contributor)

OOM errors in createBuffer() can now be caught in webllm npm 0.2.36 via #402. If catching createBuffer() errors does not suffice, I will follow up with more error catching in tvmjs. Thanks!

Redeployed https://webllm.mlc.ai/ as well.
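
On the application side, a too-large model can now be handled with an ordinary try/catch around reload() (a sketch; engine construction is elided and the error message is illustrative):

// Sketch: handling the OOM error that reload() now throws (webllm >= 0.2.36).
try {
  await engine.reload("Llama-2-7b-chat-hf-q4f32_1");
} catch (err) {
  // Previously the engine would keep generating gibberish; now the
  // createBuffer() OOM surfaces here, so the app can fall back to a -1k model.
  console.error("Model does not fit on this GPU:", err);
}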

@beaufortfrancois (Contributor, Author)

This improvement is so much better!
(screenshot: the out-of-memory error now surfaced to the user)
