Very slow generation #83

Open
jaslatendresse opened this issue Dec 13, 2023 · 1 comment

Comments

@jaslatendresse

I am running this on a Mac M1 with 16 GB of RAM, using app.py for simple text generation. Running llama.cpp from the terminal is much faster, but when I use the backend through app.py, generation is very slow. Any ideas?

@arnaudberenbaum

Hello! I was in the same situation and found the solution:

  1. First, check whether your Python env is configured for arm64 and not x86:
    python -c "import platform; print(platform.platform())"
    It should return:
    macOS-14.2.1-arm64-arm-64bit

  2. If it's not, create a new env (I'm using Conda):
    CONDA_SUBDIR=osx-arm64 conda create -n your_env python=the_version_you_want

  3. Clone the GitHub repo and install the llama2-wrapper package:
    python -m pip install llama2-wrapper

  4. Then reinstall the llama-cpp-python package for arm64 with Metal enabled:
    python -m pip uninstall llama-cpp-python -y
    CMAKE_ARGS="-DLLAMA_METAL=on" pip install -U llama-cpp-python --no-cache-dir
    python -m pip install 'llama-cpp-python[server]'

  5. When that's done, modify the file "~/llama2-webui/llama2_wrapper/model.py":

  • in the create_llama2_model function (line 118), add the parameter "n_gpu_layers=-1" (a standalone sanity check is sketched after this list):
    @classmethod
    def create_llama2_model(
        cls, model_path, backend_type, max_tokens, load_in_8bit, verbose
    ):
        if backend_type is BackendType.LLAMA_CPP:
            from llama_cpp import Llama

            model = Llama(
                model_path=model_path,
                n_ctx=max_tokens,
                n_batch=max_tokens,
                verbose=verbose,
                n_gpu_layers=-1,  # added: force all layers onto the Apple Silicon GPU (Metal)
            )
        # ... rest of the method unchanged
  6. Profit! It should now be fast to generate content (on a MacBook Pro M1 Pro with 16 GB of memory, it went from 1 token every 2 seconds to about 10 tokens per second).
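For reference, here is a minimal standalone sketch (independent of llama2-webui) to confirm that Metal offload is actually active after step 4. The model path and n_ctx value are placeholders; point them at whatever GGUF file and context size you already use with app.py. With verbose=True, the load log should mention Metal when the GPU is being used.

import time
from llama_cpp import Llama

MODEL_PATH = "./models/llama-2-7b-chat.Q4_K_M.gguf"  # placeholder, point at your own GGUF file

llm = Llama(
    model_path=MODEL_PATH,
    n_ctx=2048,           # placeholder context size
    n_gpu_layers=-1,      # offload all layers; on a CPU-only/x86 build this has no effect
    verbose=True,         # load log should mention Metal when offload works
)

start = time.time()
out = llm("Q: Name the planets in the solar system. A:", max_tokens=64)
elapsed = time.time() - start

tokens = out["usage"]["completion_tokens"]
print(out["choices"][0]["text"])
print(f"{tokens} tokens in {elapsed:.1f}s ({tokens / elapsed:.1f} tok/s)")

If this runs fast but app.py is still slow, the remaining bottleneck is the wrapper's own Llama(...) call, which is exactly what step 5 patches.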

Hope it helped! :)
