
about the memory problem #64

Open
K-Alex13 opened this issue Dec 19, 2023 · 3 comments

Comments

@K-Alex13

[screenshot attached]
Each time I interact with the model, the memory it occupies increases and is not released afterwards. As a result, after many conversations the model easily crashes. How can I solve this problem?

@hkvision

You mean when you chat with the model, the memory keeps increasing but doesn't decrease after the chat finishes?

Could you provide more details? e.g. which model you are using, and any specific code for us to reproduce this?

@K-Alex13
Author

The details I can provide are that I do not put the embedding on the CPU and that I use the Baichuan2 model. The main problem is that the memory is not released.
The code is as follows.

Model initialization code:

model = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True, optimize_model=True,
                                             load_in_4bit=True).bfloat16().eval()
model = model.to('xpu')
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

Chat code (just using the original Baichuan chat API):

response = model.chat(tokenizer, content, stream=True)
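For reference, a minimal sketch of what one full streaming turn could look like with an explicit cache release afterwards. This is a hedged example, not code from this issue: it assumes the Baichuan2 remote-code chat API yields incrementally longer strings when stream=True, and that torch.xpu.empty_cache() is available once intel_extension_for_pytorch is imported (as in the reproduction below).

import torch
import intel_extension_for_pytorch as ipex  # registers the 'xpu' backend

# Reuses `model` and `tokenizer` initialized above.
messages = [{"role": "user", "content": "What can you do?"}]
position = 0
for response in model.chat(tokenizer, messages, stream=True):
    print(response[position:], end='', flush=True)
    position = len(response)

torch.xpu.empty_cache()  # release cached device memory once the turn is done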

@Ariadne330
Contributor

Ariadne330 commented Dec 26, 2023

I cannot reproduce your problem on a Windows 11 system. The memory used by the CPU stays quite stable as the chat stream proceeds. Here are my steps:
HW & OS: 13th Gen Intel(R) Core(TM) i9-13900K; Intel(R) Arc(TM) A770 Graphics; Windows 11
Test env: bigdl-llm 2.5.0b20231222
Note: All the results were tested without CPU embedding (which may cause more CPU usage).
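As a side note, if one did want to keep the embedding layer on the CPU instead, bigdl-llm takes a flag at load time; the exact flag name below (cpu_embedding) is an assumption to verify against your installed bigdl-llm version.

from bigdl.llm.transformers import AutoModelForCausalLM

# Assumption: from_pretrained accepts cpu_embedding=True to keep the embedding
# layer on the CPU (trading GPU memory for extra CPU usage); verify the flag
# against your bigdl-llm version.
model = AutoModelForCausalLM.from_pretrained(model_path,
                                             load_in_4bit=True,
                                             trust_remote_code=True,
                                             optimize_model=True,
                                             cpu_embedding=True).bfloat16().eval()
model = model.to('xpu')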

Test code

I verified the issue based on the code provided in the Baichuan2-13B-Chat repo.

from bigdl.llm.transformers import AutoModelForCausalLM
import torch
import intel_extension_for_pytorch as ipex

import os
import platform
import subprocess
from colorama import Fore, Style
from tempfile import NamedTemporaryFile


model_path = r"D:\llm-models\Baichuan2-13B-Chat"
model = AutoModelForCausalLM.from_pretrained(model_path,
                                             load_in_4bit=True,
                                             trust_remote_code=True,
                                             optimize_model=True).bfloat16().eval()

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(model_path,
                                          trust_remote_code=True)

model = model.to('xpu')

messages = []

while True:
    prompt = input(Fore.GREEN + Style.BRIGHT + "\n用户:" + Style.NORMAL)
    if prompt.strip() == "exit":
        break
    print(Fore.CYAN + Style.BRIGHT + "\nBaichuan 2:" + Style.NORMAL, end='')

    messages.append({"role": "user", "content": prompt})
    position = 0
    try:
        for response in model.chat(tokenizer, messages, stream=True):
            print(response[position:], end='', flush=True)
            position = len(response)
            torch.xpu.empty_cache()
    except KeyboardInterrupt:
        pass
    print()
    messages.append({"role": "assistant", "content": response})

Test results

I chatted ten rounds with the model, appending the history passed to the chat API, and didn't notice the allocated memory increasing. The memory increases fast while loading the model but stays at a relatively stable level (from ~40 s onward) during the chatting stage.
[plot: memory_usage_plot_load.png, total memory usage over time]

Here's my PowerShell script for memory capture and the Python script for plotting memory usage.

PowerShell script for memory capture

while ($true) {
    Get-Process | Measure-Object -Property WS -Sum | ForEach-Object { "Total Memory Usage: $($_.Sum / 1MB) MB" } | Out-File test.log -Append
    Start-Sleep -Milliseconds 10
}
Python script for plotting the results

import matplotlib.pyplot as plt

data = []

with open('./test.log', 'r', encoding="utf-16") as file:
    for line in file.readlines()[:-2]:
        mem = line.split()[3]
        data.append(float(mem))

x = [i for i in range(len(data))]
plt.plot(x, data, linestyle='-')
plt.xlabel('Time')
plt.ylabel('Used/MB')
plt.title('Used Memory Over Time')
plt.grid(True)
plt.ylim(min(data) - 100, max(data) + 2000)

plt.savefig('memory_usage_plot_load.png')

And GPU memory is at a stable level too.

[screenshot: GPU memory usage over time]
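To double-check this from Python rather than from a screenshot, here is a small sketch for logging device memory per turn; it assumes the IPEX XPU backend exposes the same memory-stat helpers as CUDA (torch.xpu.memory_allocated and torch.xpu.memory_reserved).

import torch
import intel_extension_for_pytorch as ipex  # enables the torch.xpu backend

# Assumption: these helpers mirror their CUDA counterparts; check your IPEX
# version if they are missing.
def log_xpu_memory(tag):
    allocated = torch.xpu.memory_allocated() / 1024 ** 2
    reserved = torch.xpu.memory_reserved() / 1024 ** 2
    print(f"[{tag}] XPU allocated: {allocated:.1f} MB, reserved: {reserved:.1f} MB")

# Call once after each chat turn to confirm device memory stays flat.
log_xpu_memory("after turn")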
