GPU Usage dropping before completion ends #669

Open
jeanromainroy opened this issue Apr 9, 2024 · 10 comments

@jeanromainroy

I have been using the new Command-R+ model in 4-bit mode and consistently observe a drop in GPU utilization immediately after prompt evaluation, as it begins generation/prediction. This leads to significantly reduced performance.

During evaluation:
[Screenshot: GPU usage during prompt evaluation]

During generation – drop occurs right before the first token is predicted (i.e. "<PAD>"):
[Screenshot: GPU usage during generation]

Here's my setup:
Machine: Apple M2 Ultra (8E + 16P CPU cores, 60-core GPU), 192 GB RAM
ProductName: macOS
ProductVersion: 14.3
BuildVersion: 23D56

I have tried with and without setting my memory limit:
sudo sysctl iogpu.wired_lwm_mb=150000

I have tried with and without disabling the cache:
mx.metal.set_cache_limit(0)
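
For reference, this is roughly how I apply the cache setting and watch GPU memory around a generation call. A minimal sketch only: log_gpu_memory is my own helper name, and the memory-reporting calls assume an MLX version that exposes mx.metal.get_active_memory / get_peak_memory.

# Minimal sketch: disable the Metal buffer cache and log GPU memory around a call.
# Assumes the installed MLX version exposes mx.metal.get_active_memory / get_peak_memory.
import mlx.core as mx

mx.metal.set_cache_limit(0)  # disable the Metal buffer cache entirely

def log_gpu_memory(tag):
    # Hypothetical helper: report active and peak GPU memory in GB
    gb = 1024 ** 3
    print(f"[{tag}] active={mx.metal.get_active_memory() / gb:.1f} GB, "
          f"peak={mx.metal.get_peak_memory() / gb:.1f} GB")

log_gpu_memory("before generation")
# ... mlx_lm.generate(...) ...
log_gpu_memory("after generation")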

Any help would be welcome, because at the moment I am only able to use the llama.cpp implementation of Command-R+, which works without any issues.

@awni
Member

awni commented Apr 9, 2024

How long is that prompt? Do you mind copying it here in text form so I can try it directly?

@jeanromainroy
Author

jeanromainroy commented Apr 9, 2024

I have tried it with long (many thousands of tokens) and short (~300 tokens) prompts; both produce the same issue. If you want to try my exact prompt, here it is:

<BOS_TOKEN><|START_OF_TURN_TOKEN|><|USER_TOKEN|>The following is a conversation with Andrew Bustamante, former CIA covert intelligence officer and U.S. Air Force combat veteran, including the job of operational targeting, encrypted communications, and launch operations for 200 nuclear intercontinental ballistic missiles. Andrews over seven years as a CIA spy have given him a skill set and a perspective on the world that is fascinating to explore. a quick few second mention of each sponsor. Check them out in the description. It's the best way to support this podcast. We've got wealth front for savings, element for hydration, better help for mental health, ExpressVPN for privacy and masterclass for intellectual inspiration. Choose wise and my friends. And now onto the full ad reads, never any ads in the middle. I hate those. And the ads I do hear up front, I do try to make interesting. you must skip them to your friends, please still check out the sponsors. I enjoy their stuff. Maybe you will too. This show is brought to you by a new sponsor called Wealthfront. They do savings and automated investing accounts to help you build wealth and save for the future. It's a beautifully designed and streamlined interface. It's honestly really surprising to me how many financial institutions, of all kinds, on the internet. internet don't have a good interface. It's clunky. I don't understand it. They're taking your money. Obviously, it should be frictionless to move your money around. Anyway, I think I have a lot of trouble with companies that don't do a good job with the interface and wealth front does a good job with that.\n\nSummarize the text above in one short paragraph.<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>

Or,

text_325_tokens = """The following is a conversation with Andrew Bustamante, former CIA covert intelligence officer and U.S. Air Force combat veteran, including the job of operational targeting, encrypted communications, and launch operations for 200 nuclear intercontinental ballistic missiles. Andrews over seven years as a CIA spy have given him a skill set and a perspective on the world that is fascinating to explore. a quick few second mention of each sponsor. Check them out in the description. It's the best way to support this podcast. We've got wealth front for savings, element for hydration, better help for mental health, ExpressVPN for privacy and masterclass for intellectual inspiration. Choose wise and my friends. And now onto the full ad reads, never any ads in the middle. I hate those. And the ads I do hear up front, I do try to make interesting. you must skip them to your friends, please still check out the sponsors. I enjoy their stuff. Maybe you will too. This show is brought to you by a new sponsor called Wealthfront. They do savings and automated investing accounts to help you build wealth and save for the future. It's a beautifully designed and streamlined interface. It's honestly really surprising to me how many financial institutions, of all kinds, on the internet. internet don't have a good interface. It's clunky. I don't understand it. They're taking your money. Obviously, it should be frictionless to move your money around. Anyway, I think I have a lot of trouble with companies that don't do a good job with the interface and wealth front does a good job with that."""

text = text_325_tokens

messages = [
    {"role": "user", "content": f"{text}\n\nSummarize the text above in one short paragraph."}
]

# Apply chat template
prompt_decorated = tokenizer.apply_chat_template(
    messages, 
    tokenize=False, 
    add_generation_prompt=True
)
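
To double-check the prompt length, I count tokens with the same tokenizer (a quick sketch using the Hugging Face tokenizer API):

num_tokens = len(tokenizer.encode(prompt_decorated))
print(f"Prompt length: {num_tokens} tokens")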

(sorry for all the edits)

@awni
Member

awni commented Apr 9, 2024

Can confirm... it's really slow with the longer prompt.

@awni
Copy link
Member

awni commented Apr 9, 2024

@jeanromainroy if possible, can you try rebooting your machine? That seems to resolve the speed issue on my end; I can generate quite quickly with the prompt you provided.

@jeanromainroy
Author

Rebooting sometimes works, but not always. I tried rebooting approximately 10 times, and it worked about half of the time.

I serve the mlx_lm.generate function through a Flask server. After a fresh reboot that produces a working session, launching my server works, but if I then restart the server, it stops working again.

@awni
Member

awni commented Apr 9, 2024

Ok, let me see if I can reproduce the bad state. Just starting and killing the Flask server is enough to make it slow down? That's pretty wild.

@jeanromainroy
Author

jeanromainroy commented Apr 9, 2024

Here's my code, in case it saves time:

# Import Libraries
from transformers import AutoTokenizer
import mlx.core as mx
import mlx_lm
from mlx_lm.utils import load_model, get_model_path

PATH_MODEL = "/Users/admin/Models/CohereForAI/c4ai-command-r-plus-4bit/"

# Load the model & tokenizer
tokenizer = AutoTokenizer.from_pretrained(PATH_MODEL)
model_path_actual = get_model_path(PATH_MODEL)
model = load_model(model_path_actual)


def complete(prompt, temperature=0.1, n_predict=512):

    # Log
    print(f'INFO: Launching a completion prompt with temperature={temperature} and n_predict={n_predict}')

    # Generate
    response = mlx_lm.generate(
        model,
        tokenizer,
        prompt,
        temp=temperature,
        max_tokens=n_predict,
        verbose=True
    )

    return response

# Import Libraries
from flask import Flask, request, jsonify

# Create a Flask app
app = Flask(__name__)

@app.route('/health', methods=['GET'])
def func0():
    return jsonify({ "status": 'OK' }), 200


@app.route('/completion', methods=['POST'])
def func1():

    # Validate the request
    if not request.json:
        return jsonify({"message": "Request must be in JSON format"}), 400
    
    # Ensure the request contains the required field
    if "prompt" not in request.json:
        return jsonify({"message": "Request must contain a 'prompt' field"}), 400
    
    # Extract the request parameters
    data = request.json
    prompt = data["prompt"]
    n_predict = data.get("n_predict", 512)
    temperature = data.get("temperature", 0.1)

    # Extract the dialogue from the transcript
    content = complete(prompt, temperature, n_predict)

    return jsonify({ "content": content }), 200

if __name__ == '__main__':
    app.run(port=8080, debug=True)
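
Once the server is up, I call it roughly like this. A sketch using the requests library against the endpoint defined above; the placeholder prompt stands in for the chat-templated prompt.

# Import Libraries
import requests

# Sketch of a client call against the Flask server above (assumes it listens on localhost:8080)
payload = {
    "prompt": "Hello, world.",  # in practice: the prompt produced by apply_chat_template
    "n_predict": 512,
    "temperature": 0.1,
}
response = requests.post("http://localhost:8080/completion", json=payload, timeout=600)
response.raise_for_status()
print(response.json()["content"])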

@awni
Member

awni commented Apr 10, 2024

I ran the server / Flask app you posted, then ctrl+c'd it. Then I ran the model regularly and it was the same speed (generating reasonably fast, e.g. about 7.5 tps).

I'm not sure how to reproduce this yet. Maybe you could share the exact sequence of commands you use or some non-sensitive version?

@armbues

armbues commented Apr 19, 2024

I'm running into a similar slow-down with an M2 Ultra Mac Studio (60-core GPU, 192 GB memory) running Llama-2 70B Q8.

This usually happens after loading and running different large models. For this model, throughput dropped from ~8 tokens per second to under 1.

Rebooting seems to resolve the problem in my case.
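
For reference, the model-switching pattern looks roughly like this. A sketch only: the model paths are placeholders, and mx.metal.clear_cache() is assumed to exist in the installed MLX version (hence the hasattr guard).

# Sketch: switch between two large models, trying to release GPU memory in between
import gc
import mlx.core as mx
from mlx_lm import load

model, tokenizer = load("path/to/first-model")    # placeholder path
# ... generate with the first model ...

model, tokenizer = None, None                     # drop the only references to the weights
gc.collect()                                      # let Python free the underlying arrays
if hasattr(mx.metal, "clear_cache"):
    mx.metal.clear_cache()                        # return cached Metal buffers

model, tokenizer = load("path/to/second-model")   # placeholder path
# ... generate with the second model ...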

@awni
Member

awni commented Apr 19, 2024

Thanks for the data point. Still looking into a better solution for that.
