Training / Fine-tuning Pipeline

Training pipeline that:

  • loads a proprietary Q&A dataset
  • fine-tunes an open-source LLM using QLoRA
  • logs the training experiments on Comet ML's experiment tracker & the inference results on Comet ML's LLMOps dashboard
  • stores the best model on Comet ML's model registry

The training pipeline is deployed to Beam, which provides serverless GPU infrastructure.

Table of Contents

  • 1. Motivation
  • 2. Install
    • 2.1. Dependencies
    • 2.2. Beam
  • 3. Usage
    • 3.1. Train
    • 3.2. Inference
    • 3.3. Linting & Formatting

1. Motivation

The best way to specialize an LLM for your task is to fine-tune it on a small Q&A dataset (~100-1,000 samples) coupled with your business use case.

In this case, we use the finance dataset generated by the q_and_a_dataset_generator module to specialize the LLM in answering investing questions.


[Training pipeline architecture diagram]

2. Install

2.1. Dependencies

Main dependencies you have to install yourself:

  • Python 3.10
  • Poetry 1.5.1
  • GNU Make 4.3

Installing all the other dependencies is as easy as running:

make install

When developing, run:

make install_dev

Prepare credentials:

cp .env.example .env

Then complete the .env file with the credentials for your external services.
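
If you're curious how these credentials end up being used from Python, a minimal sketch using python-dotenv is shown below; the variable names are assumptions based on the services involved (Comet ML, Beam), not the exact contents of .env.example.

import os

from dotenv import load_dotenv

# Load the key-value pairs from .env into the process environment.
load_dotenv()

# Hypothetical variable names -- check .env.example for the real ones.
comet_api_key = os.environ["COMET_API_KEY"]
comet_workspace = os.environ["COMET_WORKSPACE"]
comet_project_name = os.environ["COMET_PROJECT_NAME"]

print(f"Logging to Comet workspace '{comet_workspace}', project '{comet_project_name}'")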

2.2. Beam

deploy the training pipeline to Beam [optional]

First, you must set up Beam, as explained in the Setup External Services section.

In addition to setting up Beam, you have to go to your Beam account and create a volume, as follows:

  1. go to the Volumes section
  2. click create New Volume (in the top right corner)
  3. choose Volume Name = qa_dataset and Volume Type = Shared

Afterward, run the following command to upload the Q&A dataset to the Beam volume you created:

make upload_dataset_to_beam

Finally, check that your qa_dataset Beam volume contains the uploaded data.
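
Before uploading, you can also sanity-check the dataset file locally. A minimal sketch, assuming the Q&A data lives in a single JSON file (the path is hypothetical; adjust it to wherever the q_and_a_dataset_generator wrote its output):

import json
from pathlib import Path

# Hypothetical location of the generated Q&A dataset -- adjust to your setup.
dataset_path = Path("dataset/qa_dataset.json")

with dataset_path.open("r", encoding="utf-8") as f:
    samples = json.load(f)

print(f"Loaded {len(samples)} Q&A samples from {dataset_path}")
print(f"First sample keys: {list(samples[0].keys())}")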

IMPORTANT NOTE: The training pipeline only works on a CUDA-enabled Nvidia GPU with ~16 GB of VRAM. If you don't have one and still want to run the training pipeline, you must deploy it to Beam.

3. Usage

3.1. Train

run the training, log the experiment and model to Comet ML

Local

For debugging, or to test that everything is working correctly, run the following to train the model on a smaller number of samples:

make dev_train_local

For training on the production configuration, run the following:

make train_local
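
For context, a QLoRA fine-tuning setup along the lines of what these targets run looks roughly like the sketch below, built with Hugging Face transformers, peft, and bitsandbytes. The base model name and hyperparameters are illustrative assumptions, not the repository's actual configuration.

import torch
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

base_model = "tiiuae/falcon-7b-instruct"  # illustrative choice of open-source LLM

# Load the base model quantized to 4-bit (the "Q" in QLoRA).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    base_model, quantization_config=bnb_config, device_map="auto"
)
model = prepare_model_for_kbit_training(model)

# Attach low-rank adapters (the "LoRA" part); only these weights are trained.
lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05, bias="none", task_type="CAUSAL_LM"
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

tokenizer = AutoTokenizer.from_pretrained(base_model)

From here, a standard supervised fine-tuning loop over the tokenized Q&A samples updates only the adapter weights, which is what makes training fit on a single ~16 GB GPU.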

On Beam

As with training on your local machine, run the following for debugging or testing:

make dev_train_beam

For training on the production configuration, run the following:

make train_beam
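
The "log the experiment and model to Comet ML" part boils down to a handful of Comet ML SDK calls. A minimal sketch is shown below; the workspace/project names, metric values, and checkpoint path are placeholders, and the API key is expected in the COMET_API_KEY environment variable.

from comet_ml import Experiment

# Workspace and project names are illustrative -- use the ones from your .env file.
experiment = Experiment(workspace="my-workspace", project_name="finance-qa-training")

# Log the training hyperparameters and metrics.
experiment.log_parameters({"lora_r": 16, "learning_rate": 2e-4, "epochs": 1})
experiment.log_metric("train_loss", 0.42, step=100)

# Upload the fine-tuned adapter weights and push them to the model registry.
experiment.log_model("finance-qa-llm", "./output/best_checkpoint")
experiment.register_model("finance-qa-llm")
experiment.end()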

3.2. Inference

run the inference & log the prompts and answers to Comet ML

Local

For testing or debugging the inference on a small subset of the dataset, run:

make dev_infer_local

To run the inference on the whole dataset, run the following:

make infer_local

On Beam

As with running inference on your local machine, run the following for debugging or testing:

make dev_infer_beam

To run the inference on the whole dataset, run the following:

make infer_beam
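
One way to get the prompts and answers onto Comet ML's LLMOps dashboard is the comet_llm package; a minimal sketch is shown below, with placeholder prompt text, project name, and metadata (whether the repository uses exactly these calls is an assumption).

import comet_llm

# Reads COMET_API_KEY from the environment; the project name is illustrative.
comet_llm.init(project="finance-qa-monitoring")

prompt = "Is dollar-cost averaging a good strategy for long-term investing?"
answer = "Dollar-cost averaging spreads purchases over time, which reduces timing risk ..."

# Each call creates one prompt/response record on the LLMOps dashboard.
comet_llm.log_prompt(
    prompt=prompt,
    output=answer,
    metadata={"model": "finance-qa-llm", "temperature": 0.7},
)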

3.3. Linting & Formatting

Check the code for linting issues:

make lint_check

Fix the linting issues (note that some issues can't be fixed automatically, so you might need to resolve them manually):

make lint_fix

Check the code for formatting issues:

make format_check

Fix the formatting issues:

make format_fix