This repository has been archived by the owner on Apr 11, 2024. It is now read-only.

KompleteAI/xllm

🦖 X—LLM: Simple & Cutting Edge LLM Finetuning

Easy & cutting edge LLM finetuning using the most advanced methods (QLoRA, DeepSpeed, GPTQ, Flash Attention 2, FSDP, etc)

Developed by @BobaZooba | CV | LinkedIn | bobazooba@gmail.com

Why you should use X—LLM 🪄

Are you using Large Language Models (LLMs) for your work and want to train them more efficiently with advanced methods? Wish to focus on the data and improvements rather than repetitive and time-consuming coding for LLM training?

X—LLM is your solution. It's a user-friendly library that streamlines training optimization, so you can focus on enhancing your models and data. Equipped with cutting-edge training techniques, X—LLM is engineered for efficiency by engineers who understand your needs.

X—LLM is ideal whether you're gearing up for production or need a fast prototyping tool.

Features

  • Hassle-free training for Large Language Models
  • Seamless integration of new data and data processing
  • Effortless expansion of the library
  • Speed up your training, while simultaneously reducing model sizes
  • Each checkpoint is saved to the 🤗 HuggingFace Hub
  • Easy-to-use integration with your existing project
  • Customize almost any part of your training with ease
  • Track your training progress using W&B
  • Supports many 🤗 Transformers models like Yi-34B, Mistral AI, Llama 2, Zephyr, OpenChat, Falcon, Phi, Qwen, MPT and many more
  • Benefit from cutting-edge advancements in LLM training optimization
    • QLoRA and fusing
    • Flash Attention 2
    • Gradient checkpointing
    • bitsandbytes
    • GPTQ (including post-training quantization)
    • DeepSpeed
    • FSDP
    • And many more

Quickstart 🦖

Installation

X—LLM is tested on Python 3.8+, PyTorch 2.0.1+ and CUDA 11.8.

pip install xllm

Version which includes deepspeed, flash-attn and auto-gptq:

pip install xllm[train]

The default xllm version is recommended for local development; xllm[train] is recommended for training.

Training recommended environment

CUDA version: 11.8
Docker: huggingface/transformers-pytorch-gpu:latest
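
To compare a machine against the versions listed above, a small script along these lines may help. It is only a sketch: the helper names `installed_version` and `check_environment` are made up for this example, and the script reports what it finds rather than enforcing anything.

```python
import sys
from importlib import metadata


def installed_version(package: str):
    """Return the installed version string of a package, or None if missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def check_environment() -> dict:
    """Collect the facts relevant to the requirements listed above."""
    report = {
        "python_ok": sys.version_info >= (3, 8),
        "torch": installed_version("torch"),
        "xllm": installed_version("xllm"),
        "cuda": None,
    }
    if report["torch"] is not None:
        try:
            import torch  # imported lazily so the script also works without PyTorch

            report["cuda"] = torch.version.cuda  # None on CPU-only builds
        except ImportError:
            pass
    return report


if __name__ == "__main__":
    for key, value in check_environment().items():
        print(f"{key}: {value}")
```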

Fast prototyping ⚡

from xllm import Config
from xllm.datasets import GeneralDataset
from xllm.experiments import Experiment

# Init Config which controls the internal logic of xllm
config = Config(model_name_or_path="HuggingFaceH4/zephyr-7b-beta")

# Prepare the data
train_data = ["Hello!"] * 100
train_dataset = GeneralDataset.from_list(data=train_data)

# Build Experiment from Config: init tokenizer and model, apply LoRA and so on
experiment = Experiment(config=config, train_dataset=train_dataset)
experiment.build()

# Run Experiment (training)
experiment.run()

# [Optional] Fuse LoRA layers
experiment.fuse_lora()

# [Optional] Push fused model (or just LoRA weight) to the HuggingFace Hub
experiment.push_to_hub(repo_id="YOUR_NAME/MODEL_NAME")
LoRA

Simple

config = Config(
    model_name_or_path="openchat/openchat_3.5",
    apply_lora=True,
)

Advanced

config = Config(
    model_name_or_path="openchat/openchat_3.5",
    apply_lora=True,
    lora_rank=8,
    lora_alpha=32,
    lora_dropout=0.05,
    raw_lora_target_modules="all",
    # Names of modules to apply LoRA. A comma-separated string, for example: "k,q,v" or "all".
)
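
To get a feel for what `lora_rank` and `lora_alpha` control, here is a back-of-the-envelope sketch. It relies on the standard LoRA formulation (a frozen weight matrix plus a low-rank update scaled by alpha / rank), not on xllm internals, and the 4096 x 4096 layer size is an illustrative assumption.

```python
def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters in the two LoRA matrices: A (rank x d_in) and B (d_out x rank)."""
    return rank * d_in + d_out * rank


def lora_scaling(alpha: float, rank: int) -> float:
    """Factor applied to the low-rank update B @ A in the standard formulation."""
    return alpha / rank


# Assumed example: one 4096 x 4096 projection with the settings shown above.
d_in, d_out = 4096, 4096
full = d_in * d_out
adapter = lora_trainable_params(d_in, d_out, rank=8)

print(f"full matrix:  {full:,} parameters")
print(f"LoRA adapter: {adapter:,} parameters ({adapter / full:.2%} of the full matrix)")
print(f"scaling:      {lora_scaling(alpha=32, rank=8)}")
```

Under these assumptions, the adapter trains well under 1% of the parameters of the layer it is attached to, which is why only the LoRA weights need to be saved and pushed.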
QLoRA

To train a QLoRA model, we need to load the backbone model using the bitsandbytes library with int4 (or int8) weights.

Simple

config = Config(
    model_name_or_path="01-ai/Yi-34B",
    apply_lora=True,
    load_in_4bit=True,
    prepare_model_for_kbit_training=True,
)

Advanced

config = Config(
    model_name_or_path="01-ai/Yi-34B",
    apply_lora=True,
    load_in_4bit=True,
    prepare_model_for_kbit_training=True,
    llm_int8_threshold=6.0,
    llm_int8_has_fp16_weight=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
)
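
The motivation for loading the backbone in 4-bit can be sketched with simple arithmetic. The estimate below covers the storage of the weights only; it deliberately ignores quantization constants, layers that are kept in higher precision, LoRA weights, activations and optimizer state, so treat the numbers as rough lower bounds.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate storage for the weights alone, in gigabytes (10**9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


params_34b = 34e9  # nominal size of a 34B-parameter model

for label, bits in [("fp16/bf16", 16), ("int8", 8), ("4-bit", 4)]:
    print(f"{label:>9}: ~{weight_memory_gb(params_34b, bits):.0f} GB")
```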
Stabilize training

This technique casts some operations to fp32 to improve training stability. It is also useful in combination with LoRA and on GPUs that support bfloat16.

config = Config(
    model_name_or_path="HuggingFaceH4/zephyr-7b-beta",
    stabilize=True,
)
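
Why keeping some operations in fp32 can matter is easy to show with the standard library alone. The sketch below does not reproduce what `stabilize` does inside xllm; it only illustrates the general problem, using fp16 as a stand-in for low precision (bfloat16 has even fewer mantissa bits). Small increments added to a larger value are rounded away in half precision but survive in single precision.

```python
import struct


def round_to(fmt: str, value: float) -> float:
    """Round a Python float to IEEE half ('e') or single ('f') precision."""
    return struct.unpack(fmt, struct.pack(fmt, value))[0]


def accumulate(fmt: str, start: float, increment: float, steps: int) -> float:
    """Add `increment` to `start` repeatedly, rounding after every addition."""
    total = round_to(fmt, start)
    inc = round_to(fmt, increment)
    for _ in range(steps):
        total = round_to(fmt, total + inc)
    return total


half = accumulate("e", 1.0, 1e-4, 1000)
single = accumulate("f", 1.0, 1e-4, 1000)
print(f"fp16 accumulator: {half}")        # the small updates are rounded away
print(f"fp32 accumulator: {single:.4f}")  # close to the exact result of 1.1
```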
Push checkpoints to the HuggingFace Hub

Before that, you must log in to the HuggingFace Hub or add an API token to the environment variables.

config = Config(
    model_name_or_path="HuggingFaceH4/zephyr-7b-beta",
    push_to_hub=True,
    hub_private_repo=True,
    hub_model_id="BobaZooba/AntModel-7B-XLLM-Demo-LoRA",
    save_steps=25,
)
  • Checkpoints will be saved locally and to the HuggingFace Hub every save_steps steps
  • If you train a model with LoRA, then only LoRA weights will be saved
Report to W&B

Before that, you must log in to W&B or add an API Token to the environment variables.

config = Config(
    model_name_or_path="HuggingFaceH4/zephyr-7b-beta",
    report_to_wandb=True,
    wandb_project="xllm-demo",
    wandb_entity="bobazooba",
)
Gradient checkpointing

This reduces GPU memory usage during training, so you can train larger models or use larger batches than you could without it. The drawback is that activations are recomputed during the backward pass, which slows down training.

You will be able to train larger models (for example, a 7B model in Colab), but at the expense of training speed.

config = Config(
    model_name_or_path="HuggingFaceH4/zephyr-7b-beta",
    use_gradient_checkpointing=True,
)
Flash Attention 2

This speeds up training and reduces GPU memory consumption, but it does not work with all models and GPUs. You also need to install flash-attn for this. This can be done using:

pip install xllm[train]

config = Config(
    model_name_or_path="meta-llama/Llama-2-7b-hf",
    use_flash_attention_2=True,
)
Combine all

Features

  • QLoRA
  • Gradient checkpointing
  • Flash Attention 2
  • Stabilize training
  • Push checkpoints to HuggingFace Hub
  • W&B report
config = Config(
    model_name_or_path="meta-llama/Llama-2-7b-hf",
    use_gradient_checkpointing=True,
    stabilize=True,
    use_flash_attention_2=True,
    load_in_4bit=True,
    prepare_model_for_kbit_training=True,
    apply_lora=True,
    warmup_steps=1000,
    max_steps=10000,
    logging_steps=1,
    save_steps=1000,

    per_device_train_batch_size=2,
    gradient_accumulation_steps=2,
    max_length=2048,

    tokenizer_padding_side="right",  # good for llama2

    push_to_hub=False,
    hub_private_repo=True,
    hub_model_id="BobaZooba/SupaDupaLlama-7B-LoRA",

    report_to_wandb=False,
    wandb_project="xllm-demo",
    wandb_entity="bobazooba",
)
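
The batch-related settings above interact, and a quick calculation makes the result explicit. This is a sketch of the usual arithmetic (per-device batch x accumulation steps x number of GPUs), with a single GPU assumed; the helper names are made up for this example.

```python
def effective_batch_size(per_device: int, grad_accum: int, num_gpus: int = 1) -> int:
    """Sequences that contribute to one optimizer step."""
    return per_device * grad_accum * num_gpus


def max_tokens_per_step(batch: int, max_length: int) -> int:
    """Upper bound on tokens per optimizer step (every sequence at max_length)."""
    return batch * max_length


batch = effective_batch_size(per_device=2, grad_accum=2, num_gpus=1)
print(batch)                              # 4
print(max_tokens_per_step(batch, 2048))   # 8192
print(batch * 10000)                      # sequences consumed over max_steps=10000
```

With `warmup_steps=1000` and `max_steps=10000`, the warmup covers the first 10% of training.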
Fuse

This operation is only for models with a LoRA adapter.

You can explicitly specify to fuse the model after training.

config = Config(
    model_name_or_path="HuggingFaceH4/zephyr-7b-beta",
    apply_lora=True,
    fuse_after_training=True,
)

Even when you are using QLoRA:

config = Config(
    model_name_or_path="HuggingFaceH4/zephyr-7b-beta",
    apply_lora=True,
    load_in_4bit=True,
    prepare_model_for_kbit_training=True,
    fuse_after_training=True,
)

Or you can fuse the model yourself after training.

experiment.fuse_lora()
DeepSpeed

DeepSpeed is needed for training models on multiple GPUs. DeepSpeed allows you to efficiently manage the resources of several GPUs during training. For example, you can distribute the gradients and the state of the optimizer across several GPUs, rather than storing a complete set of gradients and the state of the optimizer on each GPU. Training with DeepSpeed can only be started from the command line.
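
The effect of distributing gradients and optimizer state can be estimated with the accounting commonly used for ZeRO with mixed-precision Adam. The sketch below assumes full fine-tuning (every parameter trainable) and counts model states only, not activations; with LoRA the gradient and optimizer terms shrink to the adapter size, so the real numbers would be much lower.

```python
def zero_memory_gb(num_params: float, num_gpus: int, stage: int) -> float:
    """Per-GPU memory for model states under mixed-precision Adam, in GB.

    Per parameter: 2 bytes fp16 weights, 2 bytes fp16 gradients and 12 bytes
    of optimizer state (fp32 weights, momentum and variance).
    """
    weights, grads, optim = 2.0, 2.0, 12.0
    if stage >= 1:
        optim /= num_gpus      # stage 1 partitions the optimizer state
    if stage >= 2:
        grads /= num_gpus      # stage 2 also partitions the gradients
    if stage >= 3:
        weights /= num_gpus    # stage 3 also partitions the weights
    return num_params * (weights + grads + optim) / 1e9


for stage in range(4):
    print(f"stage {stage}: ~{zero_memory_gb(7e9, num_gpus=8, stage=stage):.1f} GB per GPU")
```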

train.py

from xllm import Config
from xllm.datasets import GeneralDataset
from xllm.cli import cli_run_train

if __name__ == '__main__':
    train_data = ["Hello!"] * 100
    train_dataset = GeneralDataset.from_list(data=train_data)
    cli_run_train(config_cls=Config, train_dataset=train_dataset)

Run training (set the num_gpus parameter to the number of GPUs you have)

deepspeed --num_gpus=8 train.py --deepspeed_stage 2

You can also pass other parameters

deepspeed --num_gpus=8 train.py \
  --deepspeed_stage 2 \
  --apply_lora True \
  --stabilize True \
  --use_gradient_checkpointing True

Colab notebooks

Name | Comment | Link
X—LLM Prototyping | In this notebook you will learn the basics of the library | xllm_prototyping
Llama2 & Mistral AI efficient fine-tuning | 7B model training in colab using QLoRA, bnb int4, gradient checkpointing and X—LLM | Llama2MistalAI

Production solution 🚀

X—LLM not only lets you prototype models, but also facilitates the development of production-ready solutions through built-in capabilities and customization.

Using X—LLM to train a model is simple and involves these few steps:

  1. Download — Get the data and the model ready by downloading and preparing them. Saves data locally to config.train_local_path_to_data and config.eval_local_path_to_data if you are using eval dataset.
  2. Train — Use the data prepared in the previous step to train the model.
  3. Fuse — If you used LoRA during the training, fuse LoRA.
  4. GPTQ Quantization — Make your model take less space by quantizing it.

Remember, these tasks in X—LLM start from the command line. So, when you're all set to go, launching your full project will look something like this:

Example of how to run your project
  1. Downloading and preparing data and model

    python3 MY_PROJECT/cli/download.py \
      --dataset_key MY_DATASET \
      --model_name_or_path mistralai/Mistral-7B-v0.1 \
      --path_to_env_file ./.env
  2. Run train using DeepSpeed on multiple GPUs

    deepspeed --num_gpus=8 MY_PROJECT/cli/train.py \
      --use_gradient_checkpointing True \
      --deepspeed_stage 2 \
      --stabilize True \
      --model_name_or_path mistralai/Mistral-7B-v0.1 \
      --use_flash_attention_2 False \
      --load_in_4bit True \
      --apply_lora True \
      --raw_lora_target_modules all \
      --per_device_train_batch_size 8 \
      --warmup_steps 1000 \
      --save_total_limit 0 \
      --push_to_hub True \
      --hub_model_id MY_HF_HUB_NAME/LORA_MODEL_NAME \
      --hub_private_repo True \
      --report_to_wandb True \
      --path_to_env_file ./.env
  3. Fuse LoRA

    python3 MY_PROJECT/cli/fuse.py \
      --model_name_or_path mistralai/Mistral-7B-v0.1 \
      --lora_hub_model_id MY_HF_HUB_NAME/LORA_MODEL_NAME \
      --hub_model_id MY_HF_HUB_NAME/MODEL_NAME \
      --hub_private_repo True \
      --force_fp16 True \
      --fused_model_local_path ./fused_model/ \
      --path_to_env_file ./.env
  4. [Optional] GPTQ quantization of the trained model with fused LoRA

     python3 MY_PROJECT/cli/gptq_quantize.py \
       --model_name_or_path ./fused_model/ \
       --apply_lora False \
       --stabilize False \
       --quantization_max_samples 128 \
       --quantized_model_path ./quantized_model/ \
       --prepare_model_for_kbit_training False \
       --quantized_hub_model_id MY_HF_HUB_NAME/MODEL_NAME_GPTQ \
       --quantized_hub_private_repo True \
       --path_to_env_file ./.env
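
If you prefer to drive the four steps above from one place, a thin wrapper such as the following is one option. It is a sketch, not part of xllm: `build_steps` and `run_pipeline` are made-up names, the script paths and flags are copied from the commands above (only a subset of the flags is repeated here), and by default nothing is executed.

```python
import subprocess

MODEL = "mistralai/Mistral-7B-v0.1"
ENV_FILE = "./.env"


def build_steps(num_gpus: int = 8) -> list:
    """Return the four commands as argv lists, in execution order."""
    download = ["python3", "MY_PROJECT/cli/download.py",
                "--dataset_key", "MY_DATASET",
                "--model_name_or_path", MODEL,
                "--path_to_env_file", ENV_FILE]
    train = ["deepspeed", f"--num_gpus={num_gpus}", "MY_PROJECT/cli/train.py",
             "--deepspeed_stage", "2",
             "--model_name_or_path", MODEL,
             "--load_in_4bit", "True",
             "--apply_lora", "True",
             "--push_to_hub", "True",
             "--hub_model_id", "MY_HF_HUB_NAME/LORA_MODEL_NAME",
             "--path_to_env_file", ENV_FILE]
    fuse = ["python3", "MY_PROJECT/cli/fuse.py",
            "--model_name_or_path", MODEL,
            "--lora_hub_model_id", "MY_HF_HUB_NAME/LORA_MODEL_NAME",
            "--hub_model_id", "MY_HF_HUB_NAME/MODEL_NAME",
            "--fused_model_local_path", "./fused_model/",
            "--path_to_env_file", ENV_FILE]
    quantize = ["python3", "MY_PROJECT/cli/gptq_quantize.py",
                "--model_name_or_path", "./fused_model/",
                "--apply_lora", "False",
                "--quantized_model_path", "./quantized_model/",
                "--path_to_env_file", ENV_FILE]
    return [download, train, fuse, quantize]


def run_pipeline(steps: list, dry_run: bool = True) -> None:
    """Print each command; execute it only when dry_run is False."""
    for argv in steps:
        print(" ".join(argv))
        if not dry_run:
            subprocess.run(argv, check=True)  # stop at the first failing step


run_pipeline(build_steps(), dry_run=True)
```

Using `check=True` means a failed download or training run stops the pipeline instead of fusing or quantizing stale artifacts.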

Right now, the X—LLM library lets you use only the SODA dataset. We've set it up this way for demo purposes, but we're planning to add more datasets soon. You'll need to figure out how to download and handle your dataset. Simply put, you take care of your data, and X—LLM handles the rest. We've done it this way on purpose, to give you plenty of room to get creative and customize to your heart's content.

You can customize your dataset in detail, adding additional fields. All of this will enable you to implement virtually any task in the areas of Supervised Learning and Offline Reinforcement Learning.

At the same time, you always have an easy way to submit data for language modeling.

Example
from xllm import Config
from xllm.datasets import GeneralDataset
from xllm.cli import cli_run_train

if __name__ == '__main__':
    train_data = ["Hello!"] * 100
    train_dataset = GeneralDataset.from_list(data=train_data)
    cli_run_train(config_cls=Config, train_dataset=train_dataset)
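
Since `GeneralDataset.from_list` takes a plain list of strings, preparing your own data for language modeling mostly means turning records into text. The sketch below shows one way to do that; the record layout (a `turns` field holding a list of utterances) and the helper name `to_training_texts` are assumptions made for this example.

```python
def to_training_texts(records: list, separator: str = "\n") -> list:
    """Flatten dialogue-like records into one training string each.

    Each record is assumed to look like {"turns": ["Hi!", "Hello!"]};
    the field name is made up for this example.
    """
    texts = []
    for record in records:
        turns = [turn.strip() for turn in record.get("turns", []) if turn.strip()]
        if turns:  # skip records with no usable text
            texts.append(separator.join(turns))
    return texts


records = [
    {"turns": ["Hi!", "Hello! How can I help?"]},
    {"turns": ["  ", ""]},  # dropped: nothing left after stripping
    {"turns": ["What is LoRA?"]},
]
train_data = to_training_texts(records)
print(train_data)
# train_data can then be passed to GeneralDataset.from_list(data=train_data)
```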

Build your own project

To set up your own project using X—LLM, you need to do two things:

  1. Implement your dataset (figure out how to download and handle it)
  2. Add X—LLM's command-line tools into your project

Once that's done, your project will be good to go, and you can start running the steps you need (like download, train, and so on).

To get a handle on building your project with X—LLM, check out the materials below.

Useful materials

Config 🔧

The X—LLM library uses a single config for all steps, such as downloading and training. It is designed to make it easy to see the available features and what you can adjust. The config controls almost every part of each step. With it, you can pick your dataset, set your collator, choose the type of quantization used during training, decide whether to use LoRA, decide whether to push checkpoints to the HuggingFace Hub, and a lot more.

Config path: src.xllm.core.config.Config

Or

from xllm import Config

Useful materials

Customization options 🛠

You have the flexibility to tweak many aspects of your model's training: data, how data is processed, trainer, config, how the model is loaded, what happens before and after training, and so much more.

We've got ready-to-use components for every part of the xllm pipeline. You can entirely switch out some components like the dataset, collator, trainer, and experiment. For some components like experiment and config, you have the option to just build on what's already there.

Useful materials

Projects using X—LLM 🏆

Building something cool with X—LLM? Kindly reach out to me at bobazooba@gmail.com. I'd love to hear from you.

Hall of Fame

Write to us so that we can add your project.

Badge

Consider adding a badge to your model card.

[<img src="https://github.com/BobaZooba/xllm/blob/main/static/images/xllm-badge.png" alt="Powered by X—LLM" width="175" height="32"/>](https://github.com/KompleteAI/xllm)

Powered by X—LLM

Testing 🧪

At the moment, we don't have Continuous Integration tests that utilize a GPU. However, we might develop these kinds of tests in the future. It's important to note, though, that this would require investing time into their development, as well as funding for machine maintenance.

Future Work 🔮

  • Add more tests
  • GPU CI using RunPod
  • Add RunPod deploy function
  • Add multipacking
  • Add adaptive batch size
  • Fix caching in CI
  • Add sequence bucketing
  • Add more datasets
  • Maybe add tensor_parallel

Tale Quest

Tale Quest is my personal project which was built using xllm and Shurale. It's an interactive text-based game in Telegram with dynamic AI characters, offering infinite scenarios.

You will go on exciting journeys and complete fascinating quests. Chat with George Orwell, Tech Entrepreneur, Young Wizard, Noir Detective, Femme Fatale and many more.

Try it now: https://t.me/talequestbot

Please support me here

Buy Me A Coffee