2024-04-20 02:23:18,386 - INFO - intel_extension_for_pytorch auto imported
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7/7 [00:04<00:00, 1.50it/s]
2024-04-20 02:23:23,282 - INFO - Converting the current model to fp8_e5m2 format......
Convert model to half precision
loading of model costs 9.180109353968874s and 6.50390625GB
<class 'transformers_modules.chatglm3-6b.modeling_chatglm.ChatGLMForConditionalGeneration'>
/home/intel/anaconda3/envs/bigdl-llm/lib/python3.10/site-packages/ipex_llm/transformers/models/utils.py:80: UserWarning: BIGDL_QUANTIZE_KV_CACHE is deprecated and will be removed in future releases. Please use IPEX_LLM_QUANTIZE_KV_CACHE instead.
warnings.warn(
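The UserWarning above can be addressed by switching to the new variable name it suggests. A minimal sketch, assuming the variable is read at import time (so it must be set before `ipex_llm` is imported); the value `"1"` to enable KV-cache quantization is an assumption, not taken from the log:

```python
import os

# Replace the deprecated BIGDL_* variable with the new IPEX_LLM_* name,
# per the deprecation warning. Must run before `import ipex_llm`.
os.environ.pop("BIGDL_QUANTIZE_KV_CACHE", None)
os.environ["IPEX_LLM_QUANTIZE_KV_CACHE"] = "1"
```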
Exception in thread Thread-4 (run_model_in_thread):
Traceback (most recent call last):
  File "/home/intel/anaconda3/envs/bigdl-llm/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/home/intel/anaconda3/envs/bigdl-llm/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/home/intel/LLM/ipex-llm/python/llm/dev/benchmark/all-in-one/run.py", line 52, in run_model_in_thread
    output_ids = model.generate(input_ids, do_sample=False, max_new_tokens=out_len,
  File "/home/intel/anaconda3/envs/bigdl-llm/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/intel/LLM/ipex-llm/python/llm/dev/benchmark/all-in-one/../benchmark_util.py", line 1563, in generate
    return self.greedy_search(
  File "/home/intel/LLM/ipex-llm/python/llm/dev/benchmark/all-in-one/../benchmark_util.py", line 2385, in greedy_search
    outputs = self(
  File "/home/intel/LLM/ipex-llm/python/llm/dev/benchmark/all-in-one/../benchmark_util.py", line 533, in __call__
    return self.model(*args, **kwargs)
  File "/home/intel/anaconda3/envs/bigdl-llm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/intel/anaconda3/envs/bigdl-llm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/intel/.cache/huggingface/modules/transformers_modules/chatglm3-6b/modeling_chatglm.py", line 937, in forward
    transformer_outputs = self.transformer(
  File "/home/intel/anaconda3/envs/bigdl-llm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/intel/anaconda3/envs/bigdl-llm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/intel/anaconda3/envs/bigdl-llm/lib/python3.10/site-packages/ipex_llm/transformers/models/chatglm2.py", line 169, in chatglm2_model_forward
    hidden_states, presents, all_hidden_states, all_self_attentions = self.encoder(
  File "/home/intel/anaconda3/envs/bigdl-llm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/intel/anaconda3/envs/bigdl-llm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/intel/.cache/huggingface/modules/transformers_modules/chatglm3-6b/modeling_chatglm.py", line 640, in forward
    layer_ret = layer(
  File "/home/intel/anaconda3/envs/bigdl-llm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/intel/anaconda3/envs/bigdl-llm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/intel/.cache/huggingface/modules/transformers_modules/chatglm3-6b/modeling_chatglm.py", line 544, in forward
    attention_output, kv_cache = self.self_attention(
  File "/home/intel/anaconda3/envs/bigdl-llm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/intel/anaconda3/envs/bigdl-llm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/intel/anaconda3/envs/bigdl-llm/lib/python3.10/site-packages/ipex_llm/transformers/models/chatglm2.py", line 193, in chatglm2_attention_forward
    return forward_function(
  File "/home/intel/anaconda3/envs/bigdl-llm/lib/python3.10/site-packages/ipex_llm/transformers/models/chatglm2.py", line 275, in chatglm2_quantized_attention_forward_8eb45c
    context_layer = F.scaled_dot_product_attention(query_layer, key,
RuntimeError: Current platform can NOT allocate memory block with size larger than 4GB! Tried to allocate 8.13 GiB (GPU 0; 15.59 GiB total capacity; 14.11 GiB already allocated; 14.67 GiB reserved in total by PyTorch)
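The RuntimeError says the platform cannot hand out a single memory block larger than 4 GB, and `F.scaled_dot_product_attention` materializes an attention-score tensor that grows quadratically with sequence length during prefill. A back-of-the-envelope sketch of why long prompts trip this limit; the head count (32, from ChatGLM3-6B's published config), batch size, and fp16 score dtype are assumptions, not values read from the log:

```python
# Estimate the size of the attention-score matrix (batch x heads x q_len x k_len)
# that scaled_dot_product_attention may materialize during prefill.
def attn_score_bytes(batch, num_heads, q_len, k_len, dtype_bytes=2):
    """Bytes for one attention-score tensor; dtype_bytes=2 assumes fp16 scores."""
    return batch * num_heads * q_len * k_len * dtype_bytes

# At an assumed 8192-token prompt, a single score tensor already hits 4 GiB,
# the per-allocation ceiling this platform enforces.
gib = attn_score_bytes(batch=1, num_heads=32, q_len=8192, k_len=8192) / 2**30
print(f"{gib:.2f} GiB")  # → 4.00 GiB
```

Shortening the prompt or decomposing the attention into chunks keeps each individual allocation under the 4 GB ceiling, since the tensor scales with the product of query and key lengths.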