chatglm3-6b with fp8, 1k input, 512 output, and batch 64 fails with the all-in-one benchmark tool #10818

Fred-cell opened this issue Apr 20, 2024 · 0 comments
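The failing run can be approximated outside the all-in-one harness with a short script along the following lines. This is only a sketch: the ipex-llm transformers-style `AutoModel.from_pretrained(..., load_in_low_bit="fp8_e5m2")` API is assumed from the project's examples, and the model path and prompt are placeholders rather than values taken from this report.

```python
# Sketch of the failing configuration outside the benchmark harness.
# Assumptions (not taken from this report): ipex-llm's transformers-style
# AutoModel API, an Intel GPU ("xpu") device, and a placeholder model path.
import torch
from transformers import AutoTokenizer
from ipex_llm.transformers import AutoModel  # auto-imports intel_extension_for_pytorch

model_path = "THUDM/chatglm3-6b"  # placeholder; point this at the local checkpoint

tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModel.from_pretrained(model_path,
                                  load_in_low_bit="fp8_e5m2",  # fp8, as in the log below
                                  trust_remote_code=True)
model = model.half().to("xpu")

# 1k-token prompt replicated to a batch of 64, 512 new tokens,
# matching the settings in the issue title.
prompt = "test " * 1024
inputs = tokenizer([prompt] * 64, return_tensors="pt",
                   truncation=True, max_length=1024).to("xpu")

with torch.inference_mode():
    output_ids = model.generate(inputs.input_ids,
                                do_sample=False,
                                max_new_tokens=512)
```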
2024-04-20 02:23:18,386 - INFO - intel_extension_for_pytorch auto imported
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7/7 [00:04<00:00, 1.50it/s]
2024-04-20 02:23:23,282 - INFO - Converting the current model to fp8_e5m2 format......
Convert model to half precision

loading of model costs 9.180109353968874s and 6.50390625GB
<class 'transformers_modules.chatglm3-6b.modeling_chatglm.ChatGLMForConditionalGeneration'>
/home/intel/anaconda3/envs/bigdl-llm/lib/python3.10/site-packages/ipex_llm/transformers/models/utils.py:80: UserWarning: BIGDL_QUANTIZE_KV_CACHE is deprecated and will be removed in future releases. Please use IPEX_LLM_QUANTIZE_KV_CACHE instead.
warnings.warn(
Exception in thread Thread-4 (run_model_in_thread):
Traceback (most recent call last):
File "/home/intel/anaconda3/envs/bigdl-llm/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
self.run()
File "/home/intel/anaconda3/envs/bigdl-llm/lib/python3.10/threading.py", line 953, in run
self._target(*self._args, **self._kwargs)
File "/home/intel/LLM/ipex-llm/python/llm/dev/benchmark/all-in-one/run.py", line 52, in run_model_in_thread
output_ids = model.generate(input_ids, do_sample=False, max_new_tokens=out_len,
File "/home/intel/anaconda3/envs/bigdl-llm/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/intel/LLM/ipex-llm/python/llm/dev/benchmark/all-in-one/../benchmark_util.py", line 1563, in generate
return self.greedy_search(
File "/home/intel/LLM/ipex-llm/python/llm/dev/benchmark/all-in-one/../benchmark_util.py", line 2385, in greedy_search
outputs = self(
File "/home/intel/LLM/ipex-llm/python/llm/dev/benchmark/all-in-one/../benchmark_util.py", line 533, in call
return self.model(*args, **kwargs)
File "/home/intel/anaconda3/envs/bigdl-llm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/intel/anaconda3/envs/bigdl-llm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/home/intel/.cache/huggingface/modules/transformers_modules/chatglm3-6b/modeling_chatglm.py", line 937, in forward
transformer_outputs = self.transformer(
File "/home/intel/anaconda3/envs/bigdl-llm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/intel/anaconda3/envs/bigdl-llm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/home/intel/anaconda3/envs/bigdl-llm/lib/python3.10/site-packages/ipex_llm/transformers/models/chatglm2.py", line 169, in chatglm2_model_forward
hidden_states, presents, all_hidden_states, all_self_attentions = self.encoder(
File "/home/intel/anaconda3/envs/bigdl-llm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/intel/anaconda3/envs/bigdl-llm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/home/intel/.cache/huggingface/modules/transformers_modules/chatglm3-6b/modeling_chatglm.py", line 640, in forward
layer_ret = layer(
File "/home/intel/anaconda3/envs/bigdl-llm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/intel/anaconda3/envs/bigdl-llm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/home/intel/.cache/huggingface/modules/transformers_modules/chatglm3-6b/modeling_chatglm.py", line 544, in forward
attention_output, kv_cache = self.self_attention(
File "/home/intel/anaconda3/envs/bigdl-llm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/intel/anaconda3/envs/bigdl-llm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/home/intel/anaconda3/envs/bigdl-llm/lib/python3.10/site-packages/ipex_llm/transformers/models/chatglm2.py", line 193, in chatglm2_attention_forward
return forward_function(
File "/home/intel/anaconda3/envs/bigdl-llm/lib/python3.10/site-packages/ipex_llm/transformers/models/chatglm2.py", line 275, in chatglm2_quantized_attention_forward_8eb45c
context_layer = F.scaled_dot_product_attention(query_layer, key,
RuntimeError: Current platform can NOT allocate memory block with size larger than 4GB! Tried to allocate 8.13 GiB (GPU 0; 15.59 GiB total capacity; 14.11 GiB already allocated; 14.67 GiB reserved in total by PyTorch)
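
The failing allocation comes from F.scaled_dot_product_attention during the 1k-token prefill, where the attention-score tensor for the whole batch is materialized at once. A back-of-the-envelope estimate (a sketch only: the 32-head count is assumed from the public chatglm3-6b config, and fp32 score accumulation is assumed for the quantized-KV path) lands close to the reported 8.13 GiB, which exceeds the platform's 4 GB per-allocation limit regardless of how much total memory is free:

```python
# Rough size of the attention-score tensor built during prefill
# (batch, heads, q_len, kv_len). Assumptions not taken from the log:
# 32 attention heads (public chatglm3-6b config) and fp32 scores.
batch_size = 64        # from the benchmark settings
num_heads = 32         # assumed
seq_len = 1024         # 1k-token prompt
bytes_per_elem = 4     # assumed fp32

scores_bytes = batch_size * num_heads * seq_len * seq_len * bytes_per_elem
print(f"{scores_bytes / 2**30:.2f} GiB")  # 8.00 GiB, close to the reported 8.13 GiB
```

If that estimate is right, shrinking the batch size or the prompt length until the per-step score tensor stays under 4 GB (for example, batch 32 at 1k input) should sidestep the allocation limit on this device.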
