2024-04-20 02:23:18,386 - INFO - intel_extension_for_pytorch auto imported
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7/7 [00:04<00:00, 1.50it/s]
2024-04-20 02:23:23,282 - INFO - Converting the current model to fp8_e5m2 format......
Convert model to half precision
loading of model costs 9.180109353968874s and 6.50390625GB
<class 'transformers_modules.chatglm3-6b.modeling_chatglm.ChatGLMForConditionalGeneration'>
/home/intel/anaconda3/envs/bigdl-llm/lib/python3.10/site-packages/ipex_llm/transformers/models/utils.py:80: UserWarning: BIGDL_QUANTIZE_KV_CACHE is deprecated and will be removed in future releases. Please use IPEX_LLM_QUANTIZE_KV_CACHE instead.
warnings.warn(
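The UserWarning above can be addressed by switching to the new variable name it suggests. A minimal sketch, assuming the variable is read at import time (so it must be set before `ipex_llm` is imported); the value `"1"` to enable KV-cache quantization is an assumption, not taken from the log:

```python
import os

# Replace the deprecated BIGDL_* variable with the new IPEX_LLM_* name,
# per the deprecation warning. Must run before `import ipex_llm`.
os.environ.pop("BIGDL_QUANTIZE_KV_CACHE", None)
os.environ["IPEX_LLM_QUANTIZE_KV_CACHE"] = "1"
```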
Exception in thread Thread-4 (run_model_in_thread):
Traceback (most recent call last):
  File "/home/intel/anaconda3/envs/bigdl-llm/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/home/intel/anaconda3/envs/bigdl-llm/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/home/intel/LLM/ipex-llm/python/llm/dev/benchmark/all-in-one/run.py", line 52, in run_model_in_thread
    output_ids = model.generate(input_ids, do_sample=False, max_new_tokens=out_len,
  File "/home/intel/anaconda3/envs/bigdl-llm/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/intel/LLM/ipex-llm/python/llm/dev/benchmark/all-in-one/../benchmark_util.py", line 1563, in generate
    return self.greedy_search(
  File "/home/intel/LLM/ipex-llm/python/llm/dev/benchmark/all-in-one/../benchmark_util.py", line 2385, in greedy_search
    outputs = self(
  File "/home/intel/LLM/ipex-llm/python/llm/dev/benchmark/all-in-one/../benchmark_util.py", line 533, in __call__
    return self.model(*args, **kwargs)
  File "/home/intel/anaconda3/envs/bigdl-llm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/intel/anaconda3/envs/bigdl-llm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/intel/.cache/huggingface/modules/transformers_modules/chatglm3-6b/modeling_chatglm.py", line 937, in forward
    transformer_outputs = self.transformer(
  File "/home/intel/anaconda3/envs/bigdl-llm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/intel/anaconda3/envs/bigdl-llm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/intel/anaconda3/envs/bigdl-llm/lib/python3.10/site-packages/ipex_llm/transformers/models/chatglm2.py", line 169, in chatglm2_model_forward
    hidden_states, presents, all_hidden_states, all_self_attentions = self.encoder(
  File "/home/intel/anaconda3/envs/bigdl-llm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/intel/anaconda3/envs/bigdl-llm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/intel/.cache/huggingface/modules/transformers_modules/chatglm3-6b/modeling_chatglm.py", line 640, in forward
    layer_ret = layer(
  File "/home/intel/anaconda3/envs/bigdl-llm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/intel/anaconda3/envs/bigdl-llm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/intel/.cache/huggingface/modules/transformers_modules/chatglm3-6b/modeling_chatglm.py", line 544, in forward
    attention_output, kv_cache = self.self_attention(
  File "/home/intel/anaconda3/envs/bigdl-llm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/intel/anaconda3/envs/bigdl-llm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/intel/anaconda3/envs/bigdl-llm/lib/python3.10/site-packages/ipex_llm/transformers/models/chatglm2.py", line 193, in chatglm2_attention_forward
    return forward_function(
  File "/home/intel/anaconda3/envs/bigdl-llm/lib/python3.10/site-packages/ipex_llm/transformers/models/chatglm2.py", line 275, in chatglm2_quantized_attention_forward_8eb45c
    context_layer = F.scaled_dot_product_attention(query_layer, key,
RuntimeError: Current platform can NOT allocate memory block with size larger than 4GB! Tried to allocate 8.13 GiB (GPU 0; 15.59 GiB total capacity; 14.11 GiB already allocated; 14.67 GiB reserved in total by PyTorch)
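The RuntimeError says the platform cannot hand out a single memory block larger than 4 GB, and `F.scaled_dot_product_attention` materializes an attention-score tensor that grows quadratically with sequence length during prefill. A back-of-the-envelope sketch of why long prompts trip this limit; the head count (32, from ChatGLM3-6B's published config), batch size, and fp16 score dtype are assumptions, not values read from the log:

```python
# Estimate the size of the attention-score matrix (batch x heads x q_len x k_len)
# that scaled_dot_product_attention may materialize during prefill.
def attn_score_bytes(batch, num_heads, q_len, k_len, dtype_bytes=2):
    """Bytes for one attention-score tensor; dtype_bytes=2 assumes fp16 scores."""
    return batch * num_heads * q_len * k_len * dtype_bytes

# At an assumed 8192-token prompt, a single score tensor already hits 4 GiB,
# the per-allocation ceiling this platform enforces.
gib = attn_score_bytes(batch=1, num_heads=32, q_len=8192, k_len=8192) / 2**30
print(f"{gib:.2f} GiB")  # → 4.00 GiB
```

Shortening the prompt or decomposing the attention into chunks keeps each individual allocation under the 4 GB ceiling, since the tensor scales with the product of query and key lengths.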