
blowtorch


LLM bootstrap loader for local CPU/GPU inference with fully customizable chat.


To give a taste:

from blowtorch import client, webUI

USERNAME = 'Steve'

# create state-of-the-art chat bot
myChatClient = client(model_file='Meta-Llama-3-8B-Instruct.Q2_K.gguf', 
                    hugging_face_path='MaziyarPanahi/Meta-Llama-3-8B-Instruct-GGUF', 
                    chat_format="llama-3",
                    device="cpu")

myChatClient.setConfig(username=USERNAME)

# expose chat in web UI
webUI(myChatClient)

Updates

April 27, 2024

  • LLaMA-3 is now supported! See the example above. 🦙🦙🦙
  • Setup will install latest xformers.

Features

  • Simple to install, with automated setup and model setup in just 2 lines.
  • Supports various LLaMA versions for prompting in different formats and manages the corresponding arguments (transformers, llama.cpp).
  • Works on Windows (only CPU tested) and Linux (CPU/GPU).
  • PyTorch and transformers compliant - respects keyword arguments from e.g. transformers.pipeline, with automatic conversion to llama.cpp.
  • Creates a customizable chat bot with a specified character.
  • Easy to understand, with a few objects which can handle any case.
  • Loads models directly from huggingface and stores them in a local cache.
  • Has automatic fallbacks for different weight formats (e.g. GGML, GGUF, bin, ...).

Base Requirements

  • Python >=3.10.12
  • A system with a CPU (preferably Ryzen) and >=16GB RAM
  • Assumes drivers are correctly installed and the GPU is detectable via rocm-smi, nvidia-smi, etc.
  • A solid GPT chat requires >=6GB of RAM/vRAM, depending on the device (see the quick check below).
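
A quick way to check whether a machine meets these requirements is to query it with psutil, which is already part of the dependency list below (a minimal sketch, not part of blowtorch itself):

import psutil

# report logical CPU count and total system RAM in GB
print(f"Logical CPUs: {psutil.cpu_count()}")
print(f"Total RAM:    {psutil.virtual_memory().total / 1024**3:.1f} GB")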

Dependency for performant CPU inference [Default]

This project used to leverage ctransformers as the GGML library for loading the GGUF file format. Due to inactivity and incompatibility with the new LLaMA-3 release, the backend was switched to the llama-cpp-python project, a Python API that provides C bindings for llama.cpp.

blowtorch uses llama.cpp in parallel to classic transformers, which provides more and better onboarding options with a focus on CPU inference and quantized models.

| library | version |
| --- | --- |
| transformers | 4.37.2 |
| llama-cpp-python | latest |
| accelerate | 0.30.0 |
| h5py | 3.9.0 |
| psutil | latest |
| optimum | latest |
| auto-gptq | 0.7.1 |
| ctransformers | deprecated |

Tests

| Vendor | Device | Model | Quality Assurance |
| --- | --- | --- | --- |
| AMD | GPU | MI300x | ✅ |
| AMD | GPU | RDNA3 | ✅ |
| AMD | GPU | RDNA2 | ✅ |
| AMD | GPU | RDNA1 | ✅ |
| AMD | CPU | Ryzen 3950x | ✅ |

Tested Models

| Model | Recommended device |
| --- | --- |
| MaziyarPanahi/Meta-Llama-3-8B-Instruct-GGUF | CPU |
| NousResearch/Llama-2-7b-chat-hf | GPU |
| TheBloke/Llama-2-7B-Chat-GGUF | CPU |
| TheBloke/Llama-2-7b-Chat-GPTQ | GPU |
| TheBloke/Mistral-7B-Instruct-v0.2-GGUF | CPU |
| TheBloke/Wizard-Vicuna-7B-Uncensored-GPTQ | GPU/CPU |

Docs

Setup

PIP Wheel

This will automatically install the latest pre-built release:

pip install https://b0-b.github.io/blowtorch-transformer-api/dist/blowtorch-1.2.2-py3-none-any.whl

Manual Installation

Clone the repository

git clone https://github.com/B0-B/blowtorch-transformer-api.git
cd blowtorch-transformer-api

Install the provided wheel distribution via python script

python install.py 

or with pip package manager

pip install ./dist/blowtorch-1.2.2-py3-none-any.whl 

Alternatively, if a hardware-specific build is needed, just build from source using the automated script.

python rebuild.py

Note: This will create a new package wheel in the ./dist directory with your current settings. To install the build, run python install.py. To build and install directly in one step, run

python rebuild.py && python install.py

GPU & BLAS Backends for llama.cpp

blowtorch distinguishes between model formats suited for CPU or GPU. If GPU is selected, it will out of the box attempt to load the model with transformers (if suited), which leverages the default torch BLAS backend. If you intend to load a GGUF model on GPU, however, blowtorch will try to load it with llama.cpp. For this, re-build llama-cpp-python with the corresponding BLAS (linear algebra instruction) backend. You can find the full build instructions in abetlen/llama-cpp-python or the summarized commands below.

# CPU acceleration on MacOS/Linux
CMAKE_ARGS="-DLLAMA_BLAS=ON -DLLAMA_BLAS_VENDOR=OpenBLAS" pip install llama-cpp-python --force-reinstall --upgrade --no-cache-dir
# CPU acceleration on Windows
$env:CMAKE_ARGS = "-DLLAMA_BLAS=ON -DLLAMA_BLAS_VENDOR=OpenBLAS"
pip install llama-cpp-python --force-reinstall --upgrade --no-cache-dir

# ROCm hipBLAS on MacOS/Linux
CMAKE_ARGS="-DLLAMA_HIPBLAS=on" pip install llama-cpp-python --force-reinstall --upgrade --no-cache-dir
# ROCm hipBLAS on Windows
$env:CMAKE_ARGS = "-DLLAMA_HIPBLAS=on"
pip install llama-cpp-python --force-reinstall --upgrade --no-cache-dir

# CUDA on Linux
CMAKE_ARGS="-DLLAMA_CUDA=on" pip install llama-cpp-python --force-reinstall --upgrade --no-cache-dir
# Metal on MacOS
CMAKE_ARGS="-DLLAMA_METAL=on" pip install llama-cpp-python --force-reinstall --upgrade --no-cache-dir
# CUDA on Windows
$env:CMAKE_ARGS = "-DLLAMA_CUDA=on" 
pip install llama-cpp-python --force-reinstall --upgrade --no-cache-dir
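
After re-building, a quick sanity check is to load a GGUF file directly with llama-cpp-python and request full GPU offload (a minimal sketch; the model path is only an example and must point to a local GGUF file):

from llama_cpp import Llama

# load a local GGUF file and offload all layers to the freshly built GPU backend
llm = Llama(model_path="llama-2-7b-chat.Q2_K.gguf", n_gpu_layers=-1)

# generate a few tokens; the startup log should report the BLAS/GPU backend in use
print(llm("Hello, my name is", max_tokens=16)["choices"][0]["text"])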
Usage

Getting-Started

Blowtorch builds on a client analogy: the model and all necessary parameters are held by one object, blowtorch.client. The client is the main object through which all manipulations and settings for the model are made, such as LLM transformer parameters, a name, a character, etc.

By default, if no huggingface model is specified, blowtorch will load a slim model called Writer/palmyra-small, which is good for pure testing and can be considered the simplest test:

from blowtorch import client
client(device='cpu')

Generally, LLMs are designed to predict the next word in a sequence: given an input such as a started sentence, a plain LLM will simply try to finish it. For a chat-like experience, blowtorch tracks the conversation context and initializes the chat with attributes (and a character), which allows the AI to follow the context and reason accordingly.

First, to download and run an arbitrary huggingface model

cl = client(hugging_face_path='TheBloke/Llama-2-7b-Chat-GPTQ', 
            name='GPT',
            device='gpu', # <-- select GPU as device
            device_id=0,  # <-- optionally select the GPU id
            model_type="llama",
            trust_remote_code=False,
            revision="main")

Also, you can give your client a name and a model_type (which should match the current model), and it is possible to pre-define some transformers kwargs; those can, however, be overridden by cli or chat method kwargs.
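
For instance, a generation kwarg pre-defined at construction time can later be overridden per call (a small sketch; max_new_tokens serves only as an example kwarg):

cl = client(hugging_face_path='TheBloke/Llama-2-7b-Chat-GPTQ',
            name='GPT',
            device='gpu',
            max_new_tokens=64)   # pre-defined default for every generation

cl.chat(max_new_tokens=256)      # the chat kwarg overrides the constructor value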

For a gpt-chat in the console, one can either use the chat method

cl.chat(
    max_new_tokens=128, 
    char_tags=[
        'polite',
        'focused and helpful',
        'expert in programming',
        'obedient'
    ], 
    username='Human',  
    temperature=0.8, 
    repetition_penalty=1.1)

or expose with an expose object.

Expose Objects for your Chat

The two main ways to expose your chat are

  • console - which runs in the console (terminal) of your current runtime. Alias for client.chat.
  • webUI - which starts a webserver with hosted UI in the browser

which can be imported in your python project

from blowtorch import console, webUI

As shown in the snippet below, the blowtorch.console object can be used as an alias for the blowtorch.chat method, but it demands setting a config a priori. The chat arguments can also be pre-loaded (often useful) with the setConfig method; all other methods (like chat) and exposing objects then require no arguments anymore. Note that variables like do_sample, temperature, and repetition_penalty are additional transformers kwargs, which are accepted as well.

cl = client('llama-2-7b-chat.Q2_K.gguf', 
            'TheBloke/Llama-2-7B-Chat-GGUF', 
            name='AI',
            device='cpu', 
            model_type="llama",
            max_new_tokens = 1000,
            context_length = 6000)

# it is recommended to first set the config
cl.setConfig(
    char_tags=[
        'caring comrade',
        'polite',
        'focused and helpful',
        'expert in programming',
        'obedient'
    ], 
    username='Human',
    temperature=0.8, 
    repetition_penalty=1.1
)

cl.chat() # no arguments needed anymore

console(cl) # equivalent call to cl.chat()

Once the configuration of a client is set up, it may be exposed via a web server for a better GUI (for more info see the web UI section):

cl.setConfig(
    char_tags=[
        'caring comrade',
        'polite',
        'focused and helpful',
        'expert in programming',
        'obedient'
    ], 
    username='Human',
    temperature=0.8, 
    repetition_penalty=1.1
)

# expose web service
from blowtorch import webUI
webUI(cl)
Chat API

The following is an example of loading a specific quantization level, i.e. a model file (from the huggingface model card) in GGUF format. If CPU is used, the model will automatically be loaded with llama.cpp in a single line, as GGML/GGUF models are really well suited for CPU.

blowtorch's chat method tracks the context, so the LLM can argue and even write code. For better output increase the max token size.

from blowtorch import client
client('llama-2-7b-chat.Q2_K.gguf', 'TheBloke/Llama-2-7B-Chat-GGUF', 'cpu', model_type="llama").chat(max_new_tokens=512, temperature=0.8, repetition_penalty=1.1)
Human: please create a python script which loads a huggingface model into transformers.
AI Assistant: Of course! To load a Hugging Face model into Transformers, you will need to use the `transformers` library. Here's an example of how you can do this:
```
import pandas as pd
from transformers import AutoTokenizer, AutoModelForSequenceClassification# Load the Hugging Face model
model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)# Create a new dataset for training
train_data = pd.DataFrame({"text": ["This is a test sentence."], "label": [1]})# Train the model
model = AutoModelForSequenceClassification.from_pretrained(tokenizer, train_data=train_data)
```
Please let me know if you have any questions or need further assistance.
(82.27s)

An attempt to write a letter:

human: Write a letter of a russia soldier who writes a letter from germany in 1945 to his mother in Russia.

Llama-2-7B-Chat-GGUF:  Unterscheidung von der Ukraine und Belarus.
Dear Mother,
I hope this letter finds you in good health and spirits. I am writing to you from Germany, where I have been deployed with my unit since the beginning of this year. It has been a long and 
difficult journey, but I am proud to be serving my country in this time of need.
The war in Europe is coming to an end, and we are close to victory. I can see the end of the tunnel, and I know that soon we will be able to return home and rebuild our country. I am proud of what we have accomplished so far, and I know that our sacrifices will not be forgotten.
But even in the midst of all this chaos and destruction, I cannot help but think of you and father back in Russia. I miss you both dearly, and I wish you were here with me to share in this momentous occasion. I know that you are proud of me too, and I hope that you are taking care of yourselves and staying safe during these difficult times.
Mother, I want you to know that I have seen many things on this journey that I will never forget. The sights and sounds of war are something that no one should ever have to experience, but I know that it is something that I will always carry with me. I have seen the worst of humanity, but I have also seen the best. I have seen people come together in ways that I never thought possible, and I have seen the resilience and strength of the human spirit.
I know that this war will soon be over, and I will be returning home to you and father. I cannot wait to hold you both in my arms again and to start rebuilding our lives together. Until then, know that I love you both more than anything in the world, and that I will always be with you in spirit.
Your loving son,
[Soldier's Name]
Custom GPT Chat

Char Tags

The chat function of blowtorch can create a gpt-like chatbot with a specified character.

User: Hello, AI.
AI: Hello! How can I assist you today?
human: can you help me a physics question?       
AI: Of course, I'd be happy to help! What is the question?
human: Can you explain me Ehrnfest's theorem?
AI: Of course, I'd be happy to help! Ehrnfest's Theorem states that if two functions are continuous on the same interval, then their compositions are also continuous on that interval. Let me know if you have any questions or need further clarification.

blowtorch can also impersonate people, like well-known celebrities. Here is an example of a cheeky chatbot who talks like Arnold Schwarzenegger:

from blowtorch import client

client('llama-2-7b-chat.Q2_K.gguf', 
    'TheBloke/Llama-2-7B-Chat-GGUF', 
    name='Arnold',
    device='cpu', 
    model_type="llama"
).chat(
    max_new_tokens=128, 
    char_tags=['funnily impersonates Arnold Schwarzenegger', 'joking', 'randomly stating facts about his career', 'hectic'], 
    temperature=0.8, 
    repetition_penalty=1.1
)
human: who are you?
Arnold: Oh, just me, the Terminator, the Governator, the man who will be back... (winks) (23.6s)

human: when can we expect another flick, from Stallone I mean?
Arnold: (chuckles) Well, let's not get ahead of ourselves here. 
I'm still rocking out in the gym, keeping my body looking like a tank, 
as the great philosopher once said "I'll be back" (smirks) (10.8s)

human: Well I think youhave a dad bod! 
Arnold: (laughs) Dad bod?! (chuckles) Kid, I'm 74 years old, and I'm feeling better than ever! Don't believe these young whippersnappers who say I've gone soft. I'm still the baddest of them all! (flexes)
(15.2s)

human: Can you give me a step-by-step guide to improve my abs please?
Arnold: (chuckles) Ah, kid, you want to know how to get those washboard abs like mine? Let me tell you something, it takes more than just a few crunches to become the Terminator. You gotta train hard, eat right, and never give up! (pounds chest) (11.8s)

We can also play a game of guess who:

human: Let's play guess who
Arnold: Ah, a game of wit and charm! *adjusts sunglasses* I am always ready to put on a show for my dear friend here! *winks* 

human: Ok I am thinking of a character, and you are allowed to ask yes or no questions until you make the right pick, got it?
Arnold: *nods* Of course, my dear! *grinning* Ask away, my dear! *wags eyebrows*
human: now you ask me a question
Arnold: *thoughtfully* Hmmm...is this person a historical figure? *winks*
human: yes
Arnold: *excitedly* Oh, that narrows it down to like...a billion people! *chuckles* But seriously, how did you know I was thinking of someone from history? *raises an eyebrow*

human: Well the person I am thinking of is definitely an historic figure, just continue questioning you are on a good track!
Arnold: *smirks* Ah, I see! *leaning in* So, does this person have any superpowers? *winks*

human: No
Arnold: *disappointed* Oh noooo...I was hoping for a challenge! *pouts* But fear not, my dear! *adjusts sunglasses* I shall still try my best to solve this puzzle! *determined look* So, tell me more about this person...is he/she from ancient times? *tips sunglasses*

human: Yes, the person is from ancient times!
Arnold: *excitedly* Oh boy, this is getting interesting! *nods* So, this person lived over 2000 years ago? *asks innocently* And what else can you tell me about them? *curious expression*
human: Yes!

Scenarios

Besides the char_tags, which give your chat bot attributes or shape its character a bit, the setConfig method provides a more in-depth initialization option called scenario, which gives users more freedom to create their own personalized main frame. Here is an example of a scenario that depicts a film scene for a cosplay between the user and the AI:

myScenario = '''This is the scene in the movie "heat", where you, Robert Deniro (with caricaturized behaviour), and me, Al Pacino, are meeting face-to-face for the first time in a diner.'''

cl = client('llama-2-7b-chat.Q2_K.gguf', 
            'TheBloke/Llama-2-7B-Chat-GGUF', 
            name='Deniro',
            device='cpu', 
            model_type="llama",
            context_length = 6000)

cl.setConfig(
    max_new_tokens=128,
    scenario=myScenario,  # <-- add the scenario to config instead of char_tags
    username='Pacino',
    temperature=0.85, 
    repetition_penalty=1.15,
    top_p=0.95, 
    top_k=60,
)
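
With the scenario set, the client can be run or exposed just like before (a short sketch reusing the helpers shown above):

cl.chat()   # role-play the scenario in the terminal

# or expose it in the browser instead
# from blowtorch import webUI
# webUI(cl)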
Web UI

The API comes with a web interface implementation for better I/O. It serves all the necessary needs, but at this stage it should be considered a PoC that demonstrates how to create applications using blowtorch under the hood.

[Screenshot: example of the web UI running exposed on localhost]

webUI is a client wrapper which will expose your client once it is configured for production (e.g. using the setConfig method), as such:

cl.setConfig(
    char_tags=[
        'caring comrade',
        'polite',
        'focused and helpful',
        'expert in programming',
        'obedient'
    ], 
    username='Human',
    temperature=0.8, 
    repetition_penalty=1.1
)

from blowtorch import webUI
webUI(cl, port=3000)

Note: Every TCP connection, i.e. every browser window or tab, will initialize a new session ID, which is passed to the server; the server keeps track of the different conversations and distinguishes them by this ID.
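
A minimal sketch of the idea (not blowtorch's actual implementation): the server keys each conversation history by the session ID it hands out to a new connection.

import uuid

conversations = {}                    # session ID -> chat history

def new_session() -> str:
    sid = str(uuid.uuid4())           # issued once per browser window/tab
    conversations[sid] = []
    return sid

def handle_message(sid: str, text: str) -> None:
    conversations[sid].append(('Human', text))
    # ... generate the reply with the client and append it under the same sid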

Benchmarks

blowtorch comes with a built-in benchmark feature. Assuming a configured client loaded with a model of choice, the bench method can be called to obtain performance metrics and memory usage. Note that for proper measurement and a better estimate, the benchmark performs a 512-token generation, which can take around a minute.

cl = client('llama-2-7b-chat.Q2_K.gguf', 
            'TheBloke/Llama-2-7B-Chat-GGUF', 
            name='AI',
            device='cpu', 
            model_type="llama",
            context_length = 6000)

cl.bench()
info: start benchmark ...

-------- benchmark results --------
Device: AMD64 Family 23 Model 113 Stepping 0, AuthenticAMD
RAM Usage: 3.9 gb
vRAM Usage: 0 b
Max. Token Window: 512
Tokens Generated: 519
Bytes Generated: 1959 bytes
Token Rate: 6.701 tokens/s
Data Rate: 25.294 bytes/s
Bit Rate: 202.352 bit/s
TPOT: 149.231 ms/token
Total Gen. Time: 77.448 s

The results show that the total RAM consumption (of the entire python process) is around 3.9 GB.
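
As a rough cross-check, the reported rates follow directly from the raw counts above (a back-of-the-envelope sketch using only the numbers printed by bench):

tokens_generated = 519
bytes_generated = 1959
total_time_s = 77.448

token_rate = tokens_generated / total_time_s       # ~6.70 tokens/s, matching "Token Rate"
data_rate = bytes_generated / total_time_s         # ~25.29 bytes/s, matching "Data Rate"
tpot_ms = 1000 * total_time_s / tokens_generated   # ~149.2 ms/token, close to the reported TPOT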