
Gibberish output with Llama-2-7b-chat-hf-q4f32_1 #356

Open
beaufortfrancois opened this issue Apr 3, 2024 · 17 comments

@beaufortfrancois (Contributor) commented Apr 3, 2024

Chrome Version: 125.0.6283.3
OS: ChromeOS
GPU: Intel(R) Graphics (ADL GT2) - Intel open-source Mesa driver: Mesa 23.3.0 (git-5cb3f1e4fa)
Dawn Backend: Vulkan

What steps will reproduce the problem?

  1. Go to https://webllm.mlc.ai/#chat-demo
  2. Select Llama-2-7b-chat-hf-q4f32_1
  3. Enter "What color is the dress?"

What is the expected result?
Some text that at least makes sense.

What happens instead?
Some gibberish text appears.
DevTools JavaScript console contains the following logs:

llm_chat.ts:150 Using prefillChunkSize:  1024
llm_chat.ts:180 Using maxWindowLength:  4096
llm_chat.ts:202 Using Paged KVCache
15vkAllocateMemory failed with VK_ERROR_OUT_OF_DEVICE_MEMORY
    at CheckVkOOMThenSuccessImpl (..<URL>)

15vkAllocateMemory failed with VK_ERROR_OUT_OF_DEVICE_MEMORY
    at CheckVkOOMThenSuccessImpl (..<URL>)

Then I enter "What color is the dress?"

97[Invalid Buffer (unlabeled)] is invalid.
 - While validating entries[0] as a Buffer.
Expected entry layout: { type: BufferBindingType::Storage, hasDynamicOffset: 0, minBindingSize: 0 }
 - While validating [BindGroupDescriptor] against [BindGroupLayout (unlabeled)]
 - While calling [Device].CreateBindGroup([BindGroupDescriptor]).

162[Invalid BindGroup (unlabeled)] is invalid.
 - While encoding [ComputePassEncoder (unlabeled)].SetBindGroup(0, [Invalid BindGroup (unlabeled)], 0, ...).

161[Invalid CommandBuffer] is invalid.
 - While calling [Queue].Submit([[Invalid CommandBuffer]])

97[Invalid Buffer (unlabeled)] is invalid.
 - While validating entries[0] as a Buffer.
Expected entry layout: { type: BufferBindingType::Storage, hasDynamicOffset: 0, minBindingSize: 0 }
 - While validating [BindGroupDescriptor] against [BindGroupLayout (unlabeled)]
 - While calling [Device].CreateBindGroup([BindGroupDescriptor]).

65[Invalid Buffer (unlabeled)] is invalid.
 - While validating entries[0] as a Buffer.
Expected entry layout: { type: BufferBindingType::ReadOnlyStorage, hasDynamicOffset: 0, minBindingSize: 0 }
 - While validating [BindGroupDescriptor] against [BindGroupLayout (unlabeled)]
 - While calling [Device].CreateBindGroup([BindGroupDescriptor]).

162[Invalid BindGroup (unlabeled)] is invalid.
 - While encoding [ComputePassEncoder (unlabeled)].SetBindGroup(0, [Invalid BindGroup (unlabeled)], 0, ...).

161[Invalid CommandBuffer] is invalid.
 - While calling [Queue].Submit([[Invalid CommandBuffer]])

65[Invalid Buffer (unlabeled)] is invalid.
 - While validating entries[0] as a Buffer.
Expected entry layout: { type: BufferBindingType::ReadOnlyStorage, hasDynamicOffset: 0, minBindingSize: 0 }
 - While validating [BindGroupDescriptor] against [BindGroupLayout (unlabeled)]
 - While calling [Device].CreateBindGroup([BindGroupDescriptor]).

WebGPU: too many warnings, no more warnings will be reported to the console for this GPUDevice.
/#chat-demo:1 WebGPU: too many warnings, no more warnings will be reported to the console for this GPUDevice.

Note
It does work properly with the following f16 variants: Llama-2-7b-chat-hf-q4f16_1 and Llama-2-7b-chat-hf-q4f16_1-1k
I can reproduce with Llama-2-13b-chat-hf-q4f16_1

(screenshot: gibberish output from the model)

@CharlieFRuan (Contributor)

Thanks for reporting the issue. This seems to be an out-of-memory problem (an f32 KV cache in one case, 13b params in the other); llama-2-7b-q4f32_1 requires roughly 9 GB, while 13b-q4f16_1 requires roughly 10 GB.

How much RAM does the Intel(R) Graphics (ADL GT2) have? Is it 16 GB? It might be a bit hard to catch the OOM error, as we saw earlier in #209.
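
For a rough sense of where the f32/f16 gap comes from, here is a back-of-the-envelope KV-cache estimate (a sketch only; the shape constants are Llama-2-7B's published hyperparameters, not values read from the engine):

// Rough KV-cache size estimate for Llama-2-7B (32 layers, 32 KV heads, head dim 128).
// These are published model hyperparameters; the engine's actual allocation may differ.
const layers = 32;
const kvHeads = 32;
const headDim = 128;
const contextLen = 4096;

function kvCacheBytes(bytesPerElement: number): number {
  // K and V tensors per layer, one vector per token per head.
  return 2 * layers * kvHeads * headDim * contextLen * bytesPerElement;
}

console.log((kvCacheBytes(4) / 2 ** 30).toFixed(1), "GiB (f32)"); // ~4.0 GiB
console.log((kvCacheBytes(2) / 2 ** 30).toFixed(1), "GiB (f16)"); // ~2.0 GiB
// Added to ~3.5 GiB of 4-bit weights plus activations, the f32 cache is what
// pushes llama-2-7b-q4f32_1 toward the ~9 GB figure above.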

@CharlieFRuan (Contributor) commented Apr 3, 2024

A similar VK_ERROR_OUT_OF_DEVICE_MEMORY issue was reported in mlc-llm: mlc-ai/mlc-llm#974

@beaufortfrancois (Contributor, Author) commented Apr 12, 2024

I think we should catch GPU out-of-memory errors, as we previously tried in #209 (comment).

FYI, I was not able to catch them with https://chromewebstore.google.com/detail/webgpu-dev-extension/gkeaijopdhfempknmaggbjbedicopjgm either, @greggman.

EDIT: The reason is that the extension doesn't support workers.

@beaufortfrancois (Contributor, Author)

FYI, https://webgpureport.org says the integrated GPU memoryHeaps is [ size: 8269717504, properties: DEVICE_LOCAL | HOST_VISIBLE | HOST_COHERENT | HOST_UNCACHED | HOST_CACHED ], which suggests, I believe, that this ChromeOS device can use up to about 7.7 GiB of GPU memory.
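
WebGPU itself does not expose heap sizes, but the adapter limits at least give a per-buffer ceiling to sanity-check against before allocating (a minimal sketch):

// Sketch: inspect adapter limits before attempting large allocations.
// WebGPU exposes per-resource limits, not total VRAM.
const adapter = await navigator.gpu.requestAdapter();
if (!adapter) throw new Error("WebGPU not available");
console.log("maxBufferSize:", adapter.limits.maxBufferSize);
console.log("maxStorageBufferBindingSize:", adapter.limits.maxStorageBufferBindingSize);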

@beaufortfrancois (Contributor, Author)

@CharlieFRuan Are out-of-memory errors captured somewhere? In WebLLM or Apache TVM?

@CharlieFRuan (Contributor)

I know TVM can capture OOM for other backends (e.g. for Vulkan here). I'm not sure what the situation is for WebGPU. I'll make another attempt this week; thanks for the pointers!

@tqchen (Contributor) commented Apr 16, 2024

I think webllm would need its own mechanism. There are a few things to look at:

  • First of all, check whether WebGPU buffer creation can throw an error or be caught by an error scope (based on our previous trial, it seems this is not always the case).
  • One approach might be to have TVM's GPU adapter track the buffers allocated/deallocated and throw once a cap is reached (see the sketch below).
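
A minimal sketch of that second idea, wrapping createBuffer()/destroy() with a running byte counter (names and the cap are illustrative, not TVM's actual adapter code):

// Sketch: track allocated bytes and fail fast once a configurable cap is exceeded.
// Illustrative only; not TVM's actual GPU adapter.
class TrackedAllocator {
  private allocatedBytes = 0;

  constructor(private device: GPUDevice, private capBytes: number) {}

  createBuffer(desc: GPUBufferDescriptor): GPUBuffer {
    if (this.allocatedBytes + desc.size > this.capBytes) {
      throw new Error(
        `GPU allocation cap exceeded: ${this.allocatedBytes + desc.size} > ${this.capBytes} bytes`
      );
    }
    const buffer = this.device.createBuffer(desc);
    this.allocatedBytes += desc.size;
    return buffer;
  }

  destroyBuffer(buffer: GPUBuffer): void {
    this.allocatedBytes -= buffer.size;
    buffer.destroy();
  }
}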

@beaufortfrancois (Contributor, Author)

@CharlieFRuan Did you have a chance to look at this?

@CharlieFRuan (Contributor) commented Apr 22, 2024

@beaufortfrancois I tried to catch the error from createBuffer() by adding popErrorScope() at the three places it is called in https://github.com/apache/tvm/blob/main/web/src/webgpu.ts -- no luck with that.
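
For reference, the error-scope pattern attempted there looks roughly like this (a sketch; device and byteLength are assumed to be in scope at the createBuffer() call site):

// Sketch of wrapping a buffer allocation in an out-of-memory error scope.
// In tvm/web this would surround the createBuffer() call sites.
device.pushErrorScope("out-of-memory");
const buffer = device.createBuffer({
  size: byteLength,
  usage: GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_SRC | GPUBufferUsage.COPY_DST,
});
const error = await device.popErrorScope();
if (error) {
  throw new Error(`createBuffer failed: ${error.message}`);
}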

So I instead added a 1024-context-length version of llama-2-7b-q4f32_1 and made the -1k models the default choices on the demo page. This lowers the VRAM requirement by ~3 GB for llama-2 q4f32. I also added a note about the -1k models (bottom of screenshot) via #377.

(screenshot: demo page defaulting to the -1k models, with the added note)

@beaufortfrancois (Contributor, Author) commented Apr 23, 2024

@beaufortfrancois I tried to catch the error from createBuffer() by adding popErrorScope() at the three places it is called in https://github.com/apache/tvm/blob/main/web/src/webgpu.ts -- no luck with that.

According to the #356 (comment) logs, it looks like the errors happen when validating entries in createBindGroup(), not after createBuffer(). Does that help?

Did you try the uncapturederror event as well?

device.onuncapturederror = ({ error }) => {
  console.log(error);
};

So I instead added a 1024-context-length version of llama-2-7b-q4f32_1 and made the -1k models the default choices on the demo page. This lowers the VRAM requirement by ~3 GB for llama-2 q4f32. I also added a note about the -1k models (bottom of screenshot) via #377.

That's useful. Thanks!

@beaufortfrancois (Contributor, Author)

(gentle ping)

@beaufortfrancois (Contributor, Author)

@CharlieFRuan Did you have a chance to look at this?

@CharlieFRuan (Contributor)

Sorry for the delay, will take a look tonight

@CharlieFRuan (Contributor)

Quick update: it does seem that the error can be caught! Not sure if I did something wrong earlier or if there have been updates on the WebGPU side.

Since my laptop does not run into OOM for most models, to reproduce the error I set maxTotalSeqLen to an arbitrarily large number (909600) instead of default values like 4k or 1k; this forces the engine to allocate a very large KVCache. I'm not sure whether this is exactly equivalent to loading a model that is too large for the device, but it should be quite similar.

Upon finishing loading the model, the engine will allocate the KVCache, and I see:
(screenshot: the OOM error reported in the console)

This log corresponds to the push and pop of ErrorScope I added here in tvm/web:
(screenshot: the pushErrorScope/popErrorScope calls added in tvm/web)

Then, after ignoring the error and starting to chat, we hit the uncaptured error you suggested:
(screenshot: the uncapturederror handler firing during generation)

I will refine the handling and upstream the changes after verifying that the errors can indeed be reliably caught. I should have another update by the end of this week. Thank you so much for the help!

CharlieFRuan added a commit that referenced this issue May 21, 2024
Prior to this PR, when users `createEngine()` or call `reload()` with a
model that is too large for the device, the engine would likely keep
generating, ignoring the OOM issue and correctness. See
#356 and
#209.

This PR catches such errors with `device.lost.then()`, depending on tvmjs
to call `device.destroy()` upon detecting an error in `createBuffer()` via
apache/tvm#17005.

We have only observed `createBuffer()` errors and hence only handle that
kind of error for now. Besides, since most OOM errors occur in `reload()`,
we make the error handling synchronous despite using `.then()`, by throwing
the error at the end of `reload()` if there is one.
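
The device-lost pattern described in the commit is roughly the following (a sketch; the actual wiring lives in webllm and tvmjs):

// Sketch: record a device loss triggered by tvmjs calling device.destroy()
// after a failed createBuffer(), then rethrow it synchronously from reload().
let deviceLostError: Error | undefined;

device.lost.then((info) => {
  deviceLostError = new Error(`WebGPU device lost (${info.reason}): ${info.message}`);
});

// ...at the end of reload(), surface the failure to the caller:
if (deviceLostError) {
  throw deviceLostError;
}
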
@CharlieFRuan (Contributor)

OOM errors in createBuffer() can now be caught in webllm npm 0.2.36 via #402. If catching createBuffer() errors does not suffice, I will follow up with more error catching in tvmjs. Thanks!

Redeployed https://webllm.mlc.ai/ as well.
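
On the application side, a too-large model can now be handled with an ordinary try/catch around reload() (a sketch; engine construction is elided and the error message is illustrative):

// Sketch: handling the OOM error that reload() now throws (webllm >= 0.2.36).
try {
  await engine.reload("Llama-2-7b-chat-hf-q4f32_1");
} catch (err) {
  // Previously the engine would keep generating gibberish; now the
  // createBuffer() OOM surfaces here, so the app can fall back to a -1k model.
  console.error("Model does not fit on this GPU:", err);
}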

@beaufortfrancois (Contributor, Author)

This improvement is so much better!
(screenshot: the out-of-memory error now surfaced to the user)
