GPU Usage dropping before completion ends #669

Open
jeanromainroy opened this issue Apr 9, 2024 · 10 comments

@jeanromainroy

I have been using the new Command-R+ model in 4-bit mode and consistently observe a drop in GPU utilization immediately after prompt evaluation, as it begins generation/prediction. This leads to significantly reduced performance.

During evaluation:
[Screenshot: GPU usage during prompt evaluation]

During generation – drop occurs right before the first token is predicted (i.e. "<PAD>"):
[Screenshot: GPU usage during generation]

Here's my setup:
Machine: Apple M2 Ultra (8E + 16P CPU cores, 60-core GPU), 192 GB RAM
ProductName: macOS
ProductVersion: 14.3
BuildVersion: 23D56

I have tried with and without setting my memory limit:
sudo sysctl iogpu.wired_lwm_mb=150000

I have tried with and without disabling the cache:
mx.metal.set_cache_limit(0)
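
For reference, this is roughly how I apply the cache setting and watch GPU memory around a generation call. A minimal sketch only: log_gpu_memory is my own helper name, and the memory-reporting calls assume an MLX version that exposes mx.metal.get_active_memory / get_peak_memory.

# Minimal sketch: disable the Metal buffer cache and log GPU memory around a call.
# Assumes the installed MLX version exposes mx.metal.get_active_memory / get_peak_memory.
import mlx.core as mx

mx.metal.set_cache_limit(0)  # disable the Metal buffer cache entirely

def log_gpu_memory(tag):
    # Hypothetical helper: report active and peak GPU memory in GB
    gb = 1024 ** 3
    print(f"[{tag}] active={mx.metal.get_active_memory() / gb:.1f} GB, "
          f"peak={mx.metal.get_peak_memory() / gb:.1f} GB")

log_gpu_memory("before generation")
# ... mlx_lm.generate(...) ...
log_gpu_memory("after generation")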

Any help would be welcome, because at the moment I am only able to use the llama.cpp implementation of Command-R+, which works without any issues.

@awni
Member

awni commented Apr 9, 2024

How long is that prompt? Do you mind copying it here in text form so I can try it directly?

@jeanromainroy
Author

jeanromainroy commented Apr 9, 2024

I have tried it with long (many thousands of tokens) and short (~300 tokens) prompts; both produce the same issue. If you want to try my exact prompt, here it is:

<BOS_TOKEN><|START_OF_TURN_TOKEN|><|USER_TOKEN|>The following is a conversation with Andrew Bustamante, former CIA covert intelligence officer and U.S. Air Force combat veteran, including the job of operational targeting, encrypted communications, and launch operations for 200 nuclear intercontinental ballistic missiles. Andrews over seven years as a CIA spy have given him a skill set and a perspective on the world that is fascinating to explore. a quick few second mention of each sponsor. Check them out in the description. It's the best way to support this podcast. We've got wealth front for savings, element for hydration, better help for mental health, ExpressVPN for privacy and masterclass for intellectual inspiration. Choose wise and my friends. And now onto the full ad reads, never any ads in the middle. I hate those. And the ads I do hear up front, I do try to make interesting. you must skip them to your friends, please still check out the sponsors. I enjoy their stuff. Maybe you will too. This show is brought to you by a new sponsor called Wealthfront. They do savings and automated investing accounts to help you build wealth and save for the future. It's a beautifully designed and streamlined interface. It's honestly really surprising to me how many financial institutions, of all kinds, on the internet. internet don't have a good interface. It's clunky. I don't understand it. They're taking your money. Obviously, it should be frictionless to move your money around. Anyway, I think I have a lot of trouble with companies that don't do a good job with the interface and wealth front does a good job with that.\n\nSummarize the text above in one short paragraph.<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>

Or,

text_325_tokens = """The following is a conversation with Andrew Bustamante, former CIA covert intelligence officer and U.S. Air Force combat veteran, including the job of operational targeting, encrypted communications, and launch operations for 200 nuclear intercontinental ballistic missiles. Andrews over seven years as a CIA spy have given him a skill set and a perspective on the world that is fascinating to explore. a quick few second mention of each sponsor. Check them out in the description. It's the best way to support this podcast. We've got wealth front for savings, element for hydration, better help for mental health, ExpressVPN for privacy and masterclass for intellectual inspiration. Choose wise and my friends. And now onto the full ad reads, never any ads in the middle. I hate those. And the ads I do hear up front, I do try to make interesting. you must skip them to your friends, please still check out the sponsors. I enjoy their stuff. Maybe you will too. This show is brought to you by a new sponsor called Wealthfront. They do savings and automated investing accounts to help you build wealth and save for the future. It's a beautifully designed and streamlined interface. It's honestly really surprising to me how many financial institutions, of all kinds, on the internet. internet don't have a good interface. It's clunky. I don't understand it. They're taking your money. Obviously, it should be frictionless to move your money around. Anyway, I think I have a lot of trouble with companies that don't do a good job with the interface and wealth front does a good job with that."""

text = text_325_tokens

messages = [
    {"role": "user", "content": f"{text}\n\nSummarize the text above in one short paragraph."}
]

# Apply chat template
prompt_decorated = tokenizer.apply_chat_template(
    messages, 
    tokenize=False, 
    add_generation_prompt=True
)
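
To double-check the prompt length, I count tokens with the same tokenizer (a quick sketch using the Hugging Face tokenizer API):

num_tokens = len(tokenizer.encode(prompt_decorated))
print(f"Prompt length: {num_tokens} tokens")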

(sorry for all the edits)

@awni
Member

awni commented Apr 9, 2024

Can confirm... it's really slow with the longer prompt.

@awni
Copy link
Member

awni commented Apr 9, 2024

@jeanromainroy if possible, can you try rebooting your machine? That seems to resolve the speed issue on my end; I can generate quite quickly with the prompt you provided.

@jeanromainroy
Author

Rebooting sometimes works, but not always. I tried rebooting approximately 10 times, and it worked about half of the time.

I serve the mlx_lm.generate function through a Flask server. After a fresh reboot that produces a working session, launching my server works, but if I then restart the server, it stops working again.

@awni
Member

awni commented Apr 9, 2024

Ok, let me see if I can reproduce the bad state. Just starting and killing the Flask server is enough to make it slow down? That's pretty wild.

@jeanromainroy
Author

jeanromainroy commented Apr 9, 2024

Here's my code, in case it saves time:

# Import Libraries
from transformers import AutoTokenizer
import mlx.core as mx
import mlx_lm
from mlx_lm.utils import load_model, get_model_path

PATH_MODEL = "/Users/admin/Models/CohereForAI/c4ai-command-r-plus-4bit/"

# Load the model & tokenizer
tokenizer = AutoTokenizer.from_pretrained(PATH_MODEL)
model_path_actual = get_model_path(PATH_MODEL)
model = load_model(model_path_actual)


def complete(prompt, temperature=0.1, n_predict=512):

    # Log
    print(f'INFO: Launching a completion prompt with temperature={temperature} and n_predict={n_predict}')

    # Generate
    response = mlx_lm.generate(
        model,
        tokenizer,
        prompt,
        temp=temperature,
        max_tokens=n_predict,
        verbose=True
    )

    return response

# Import Libraries
from flask import Flask, request, jsonify

# Create a Flask app
app = Flask(__name__)

@app.route('/health', methods=['GET'])
def func0():
    return jsonify({ "status": 'OK' }), 200


@app.route('/completion', methods=['POST'])
def func1():

    # Validate the request
    if not request.json:
        return jsonify({"message": "Request must be in JSON format"}), 400
    
    # Ensure the request contains the required field
    if "prompt" not in request.json:
        return jsonify({"message": "Request must contain a 'prompt' field"}), 400
    
    # Extract the request parameters
    data = request.json
    prompt = data["prompt"]
    n_predict = data.get("n_predict", 512)
    temperature = data.get("temperature", 0.1)

    # Extract the dialogue from the transcript
    content = complete(prompt, temperature, n_predict)

    return jsonify({ "content": content }), 200

if __name__ == '__main__':
    app.run(port=8080, debug=True)
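
Once the server is up, I call it roughly like this. A sketch using the requests library against the endpoint defined above; the placeholder prompt stands in for the chat-templated prompt.

# Import Libraries
import requests

# Sketch of a client call against the Flask server above (assumes it listens on localhost:8080)
payload = {
    "prompt": "Hello, world.",  # in practice: the prompt produced by apply_chat_template
    "n_predict": 512,
    "temperature": 0.1,
}
response = requests.post("http://localhost:8080/completion", json=payload, timeout=600)
response.raise_for_status()
print(response.json()["content"])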

@awni
Member

awni commented Apr 10, 2024

I ran the server / Flask app you posted, then ctrl+c'd it. Then I ran the model regularly and it was the same speed (generating reasonably fast, e.g. about 7.5 tps).

I'm not sure how to reproduce this yet. Maybe you could share the exact sequence of commands you use or some non-sensitive version?

@armbues

armbues commented Apr 19, 2024

I'm running into a similar slow-down with an M2 Ultra Mac Studio (60-core GPU, 192 GB memory) running Llama-2 70B Q8.

This usually happens after loading and running different large models. For this model, throughput dropped from ~8 tokens per second to under 1.

Rebooting seems to resolve the problem in my case.
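
For reference, the model-switching pattern looks roughly like this. A sketch only: the model paths are placeholders, and mx.metal.clear_cache() is assumed to exist in the installed MLX version (hence the hasattr guard).

# Sketch: switch between two large models, trying to release GPU memory in between
import gc
import mlx.core as mx
from mlx_lm import load

model, tokenizer = load("path/to/first-model")    # placeholder path
# ... generate with the first model ...

model, tokenizer = None, None                     # drop the only references to the weights
gc.collect()                                      # let Python free the underlying arrays
if hasattr(mx.metal, "clear_cache"):
    mx.metal.clear_cache()                        # return cached Metal buffers

model, tokenizer = load("path/to/second-model")   # placeholder path
# ... generate with the second model ...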

@awni
Member

awni commented Apr 19, 2024

Thanks for the data point. Still looking into a better solution for that.
