About the accelerate problem with xpu #66

Open
K-Alex13 opened this issue Dec 20, 2023 · 2 comments

@K-Alex13

[screenshot attached]
After putting the model and inputs on xpu, the model now runs on my Intel laptop, but the inference time is about 588 seconds, which is far too long for me. I suspect the GPU is not actually being used. May I ask what the problem is here? Thank you very much for any response.

Following is the code:
```python
import torch
import intel_extension_for_pytorch as ipex

from bigdl.llm.transformers import AutoModelForCausalLM, AutoModel
from transformers import AutoTokenizer

import time
import numpy as np

from gpu_benchmark_util import BenchmarkWrapper

# model_path = r"D:\rag\test_api\Baichuan2-7B-Chat"        # alternative: Baichuan2-7B-Chat
model_path = r"C:\Users\Administrator\yishuo\chatglm2-6b"  # chatglm2-6b is the model used below

prompt = """ 你是human_prime2,你是一个高级智能实体,你融合了最先进的算法和深度学习网络,专为跨越星际的知识探索与智慧 收集而设计。
你回答以下问题时必须跟哲学相结合,必须在15字内回答完,你会尽量参考知识库来回答。
以下是问题:请介绍钱.
以下是知识库:[{'对话': '什么是"帮费"?', '回复': '"帮费"是为中央各库采买物料时,为护送官员以及送部的饭食 银拨配的额外款项。'}, {'对话': '怎么说?', '回复': '如果技术能够复制我们的外貌,它也许能够复制我们的思想和感受。'}, {'对话': '你好。', '回复': '嘿,你好!你看起来长得和我可真像啊!'}].
"""
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Load the model with BigDL-LLM 4-bit optimization.
# model = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True, optimize_model=True, load_in_4bit=True).bfloat16().eval()  # alternative for Baichuan2
model = AutoModel.from_pretrained(model_path, trust_remote_code=True, optimize_model=True, load_in_4bit=True).eval()

input_ids = tokenizer.encode(prompt, return_tensors="pt")
print("finish to load")

# Move the model to the Intel GPU, keeping the embedding layer on CPU.
model = model.to('xpu')
# model.model.embed_tokens.to('cpu')   # alternative for Baichuan2
model.transformer.embedding.to('cpu')  # for chatglm2
input_ids = input_ids.to('xpu')

print("finish to xpu")

model = BenchmarkWrapper(model)

with torch.inference_mode():
    # warm up a few times since ipex is used
    for i in range(7):
        st = time.time()
        output = model.generate(input_ids, num_beams=1, do_sample=False, max_new_tokens=32)
        end = time.time()
        print(f'Inference time: {end-st} s')
        output_str = tokenizer.decode(output[0], skip_special_tokens=True)
        print(output_str)
```

@K-Alex13 (Author)

It seems that the model is not running on xpu right now. Can you please help me deal with this problem? Thank you very much for any response.
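
For reference, a quick sanity check of whether PyTorch can see the XPU device at all might look like the sketch below (assuming intel_extension_for_pytorch is installed; `torch.xpu` only becomes available after importing it):

```python
import torch
import intel_extension_for_pytorch as ipex  # registers the 'xpu' device with PyTorch

# If this prints False, the model is silently falling back to CPU (or failing to move).
print("XPU available:", torch.xpu.is_available())
if torch.xpu.is_available():
    print("Device count:", torch.xpu.device_count())
    print("Device name:", torch.xpu.get_device_name(0))
```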

@MeouSker77

> After putting the model and inputs on xpu, the model now runs on my Intel laptop, but the inference time is about 588 seconds, which is far too long for me. I suspect the GPU is not actually being used. May I ask what the problem is here? Thank you very much for any response.

See the "Best Known Configurations" section at https://bigdl.readthedocs.io/en/latest/doc/LLM/Overview/install_gpu.html#best-known-configurations and follow it.
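
For illustration, the kind of environment settings that page recommends might be applied as in the sketch below. The exact variable names and values are an assumption based on that doc and may vary with your GPU and driver, so verify against the linked page; they need to be set before importing torch/ipex.

```python
import os

# Assumed settings based on the "best known configurations" page linked above;
# check the doc for the exact variables for your GPU and driver.
os.environ["SYCL_CACHE_PERSISTENT"] = "1"   # persist the SYCL kernel cache between runs
os.environ["BIGDL_LLM_XMX_DISABLED"] = "1"  # often suggested for integrated GPUs

import torch
import intel_extension_for_pytorch as ipex  # import only after the variables are set
```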

> `model = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True, optimize_model=True, load_in_4bit=True).bfloat16().eval()`

Remove `.bfloat16()` for better performance, and pass `cpu_embedding=True` to `from_pretrained` instead of calling `model.transformer.embedding.to('cpu')`.

Also see https://github.com/intel-analytics/BigDL/blob/main/python/llm/example/GPU/HF-Transformers-AutoModels/Model/baichuan2/generate.py. While that example targets Linux, `cpu_embedding=True` is also required on Windows.
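
Putting the two suggestions together, the loading part of the original script might look roughly like the sketch below. This reuses the chatglm2-6b path from the issue; `cpu_embedding=True` replaces the manual `.to('cpu')` call on the embedding layer, and `.bfloat16()` is dropped.

```python
import torch
import intel_extension_for_pytorch as ipex  # needed so the 'xpu' device is available

from bigdl.llm.transformers import AutoModel
from transformers import AutoTokenizer

model_path = r"C:\Users\Administrator\yishuo\chatglm2-6b"

tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# 4-bit load without .bfloat16(); cpu_embedding=True keeps the embedding layer on CPU
model = AutoModel.from_pretrained(model_path,
                                  trust_remote_code=True,
                                  optimize_model=True,
                                  load_in_4bit=True,
                                  cpu_embedding=True).eval()
model = model.to('xpu')

input_ids = tokenizer.encode("你好", return_tensors="pt").to('xpu')

with torch.inference_mode():
    output = model.generate(input_ids, num_beams=1, do_sample=False, max_new_tokens=32)
    print(tokenizer.decode(output[0], skip_special_tokens=True))
```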
