
LLaMA-MoE: Building Mixture-of-Experts from LLaMA with Continual Pre-training

📢 A SMALLER AFFORDABLE MoE MODEL FOR EVERYONE!! 📃 Technical Report

🎉 Introduction

LLaMA-MoE is a series of open-source Mixture-of-Experts (MoE) models based on LLaMA and SlimPajama. We build LLaMA-MoE in two steps:

  1. Partition LLaMA's FFNs into sparse experts and insert a top-K gate for the experts in each layer (see the sketch below).
  2. Continually pre-train the initialized MoE model with optimized data sampling weights from Sheared LLaMA and filtered datasets from SlimPajama.
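As a rough illustration of step 1, the sketch below splits one LLaMA FFN's intermediate neurons into disjoint groups (a random, neuron-independent split) and slices the original gate/up/down projections accordingly, so the experts jointly keep all of the original parameters. Names and sizes are illustrative only; the actual construction code lives under scripts/expert_construction (see the Expert Construction section).

import torch
import torch.nn as nn

hidden_size, intermediate_size, num_experts = 4096, 11008, 8

# Original (dense) LLaMA FFN projections.
gate_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
up_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
down_proj = nn.Linear(intermediate_size, hidden_size, bias=False)

# Neuron-independent random split: each expert owns a disjoint set of neurons.
neuron_groups = torch.randperm(intermediate_size).chunk(num_experts)

class ExpertFFN(nn.Module):
    def __init__(self, idx):
        super().__init__()
        self.gate = nn.Parameter(gate_proj.weight[idx].clone())     # (d_e, hidden)
        self.up = nn.Parameter(up_proj.weight[idx].clone())         # (d_e, hidden)
        self.down = nn.Parameter(down_proj.weight[:, idx].clone())  # (hidden, d_e)

    def forward(self, x):
        h = nn.functional.silu(x @ self.gate.T) * (x @ self.up.T)
        return h @ self.down.T

experts = nn.ModuleList(ExpertFFN(idx) for idx in neuron_groups)

A top-K gate (sketched in the Features section below) then decides which of these expert slices each token is routed to.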

(Figure: MoE routing)

🔥 Features

  1. Lightweight Models: The number of activated parameters is only 3.0~3.5B, which makes the models friendly for deployment and research.
  2. Multiple Expert Construction Methods:
    1. Neuron-Independent: Random, Clustering, Co-activation Graph, Gradient (Zhang et al., 2022, Zuo et al., 2022)
    2. Neuron-Sharing: Inner, Inter (residual)
  3. Multiple MoE Gating Strategies (see the gate sketch after this list):
    1. TopK Noisy Gate (Shazeer et al., 2017)
    2. Switch Gating (Fedus et al., 2022)
  4. Fast Continual Pre-training:
    1. FlashAttention-v2 integrated (Dao, 2023)
    2. Fast streaming dataset loading
  5. Abundant Monitor Items:
    1. Gate load, gate importance
    2. Loss on steps, loss on tokens, balance loss
    3. TGS (tokens/GPU/second), MFU (model FLOPs utilization)
    4. Other visualization utilities
  6. Dynamic Weight Sampling:
    1. Self-defined static sampling weights
    2. Sheared LLaMA's dynamic batch loading (Xia et al., 2023)
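As a quick illustration of the first gating strategy, here is a hedged sketch of the TopK noisy gate from Shazeer et al. (2017); the class and variable names are illustrative and not the repo's actual modules.

import torch
import torch.nn as nn
import torch.nn.functional as F

class NoisyTopKGate(nn.Module):
    def __init__(self, hidden_size, num_experts, k=2):
        super().__init__()
        self.k = k
        self.w_gate = nn.Linear(hidden_size, num_experts, bias=False)
        self.w_noise = nn.Linear(hidden_size, num_experts, bias=False)

    def forward(self, x):                      # x: (tokens, hidden_size)
        clean = self.w_gate(x)
        noise_std = F.softplus(self.w_noise(x))
        logits = clean + torch.randn_like(clean) * noise_std if self.training else clean
        topv, topi = logits.topk(self.k, dim=-1)
        # Keep only the top-k logits, softmax over them, scatter back to all experts.
        gates = torch.zeros_like(logits).scatter(-1, topi, topv.softmax(-1))
        return gates                           # (tokens, num_experts), rows sum to 1

During training, the injected noise encourages more balanced expert loads; at inference only the clean logits are used.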

🚀 QuickStart

# python>=3.10

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_dir = "llama-moe/LLaMA-MoE-v1-3_5B-2_8"
tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_dir, torch_dtype=torch.bfloat16, trust_remote_code=True)
model.eval()
model.to("cuda:0")

input_text = "Suzhou is famous of"
inputs = tokenizer(input_text, return_tensors="pt")
inputs = inputs.to("cuda:0")

pred = model.generate(**inputs, max_length=50, temperature=0.0)
print(tokenizer.decode(pred.cpu()[0], skip_special_tokens=True))
# Suzhou is famous of its beautiful gardens. The most famous one is the Humble Administrator's Garden. It is a classical Chinese garden with a history of more than 600 years. The garden is divided into three

⚙️ Installation

  1. Prepare the conda environment: conda create -n smoe python=3.11 (if your environment name is not smoe, you may need to change the environment name in the launching scripts)
  2. Add the correct environment variables to ~/.bashrc (gcc is set to a newer version for installing flash-attn), e.g.:
    export PATH=/mnt/petrelfs/share/cuda-11.8/bin:$PATH
    export LD_LIBRARY_PATH=/mnt/petrelfs/share/cuda-11.8/lib64:$LD_LIBRARY_PATH
    export PATH=/mnt/petrelfs/share/gcc-10.1.0/bin:$PATH
    export LD_LIBRARY_PATH=/mnt/petrelfs/share/gcc-10.1.0/lib64:$LD_LIBRARY_PATH
  3. Apply the variables: source ~/.bashrc
  4. Install PyTorch (CUDA-11.8): pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
  5. Install dependencies: pip install -r requirements.txt
  6. Install flash-attn: pip install flash-attn==2.0.1 --no-build-isolation. You may need to follow the flash-attn installation instructions to avoid some errors.
  7. Install the latest Git: conda install git
  8. Clone the repo: git clone git@github.com:pjlab-sys4nlp/llama-moe.git (if you haven't set up an SSH key for GitHub, you may not be able to clone via SSH; check the GitHub docs on SSH keys.)
  9. Change to the repo directory: cd llama-moe
  10. Install smoe in editable mode: pip install -e .[dev]
  11. Setup pre-commit hooks: pre-commit install
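After these steps, a quick sanity check can confirm the environment works. This is a hedged sketch; it assumes the editable install exposes the smoe package and that flash-attn built successfully.

import torch
import smoe          # noqa: F401  -- should import without error after `pip install -e .[dev]`
import flash_attn

print("torch:", torch.__version__, "| CUDA:", torch.version.cuda)
print("CUDA available:", torch.cuda.is_available())
print("flash-attn:", flash_attn.__version__)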

📊 Model Performance

| Model | #Activated Experts | #Experts | #Activated Params | Foundation Model | SFT Model |
| --- | --- | --- | --- | --- | --- |
| LLaMA-MoE-3.0B | 2 | 16 | 3.0B | 🤗 base | 🤗 SFT |
| LLaMA-MoE-3.5B (4/16) | 4 | 16 | 3.5B | 🤗 base | 🤗 SFT |
| LLaMA-MoE-3.5B (2/8) | 2 | 8 | 3.5B | 🤗 base | 🤗 SFT |
  • Foundation models

| Model | Average | SciQ | PIQA | WinoGrande | ARC-e | ARC-c (25) | HellaSwag (10) | LogiQA | BoolQ (32) | LAMBADA | NQ (32) | MMLU (5) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| OPT-2.7B | 50.3 | 78.9 | 74.8 | 60.8 | 54.4 | 34.0 | 61.4 | 25.8 | 63.3 | 63.6 | 10.7 | 25.8 |
| Pythia-2.8B | 51.5 | 83.2 | 73.6 | 59.6 | 58.8 | 36.7 | 60.7 | 28.1 | 65.9 | 64.6 | 8.7 | 26.8 |
| INCITE-BASE-3B | 53.7 | 85.6 | 73.9 | 63.5 | 61.7 | 40.3 | 64.7 | 27.5 | 65.8 | 65.4 | 15.2 | 27.2 |
| Open-LLaMA-3B-v2 | 55.6 | 88.0 | 77.9 | 63.1 | 63.3 | 40.1 | 71.4 | 28.1 | 69.2 | 67.4 | 16.0 | 26.8 |
| Sheared-LLaMA-2.7B | 56.4 | 87.5 | 76.9 | 65.0 | 63.3 | 41.6 | 71.0 | 28.3 | 73.6 | 68.3 | 17.6 | 27.3 |
| LLaMA-MoE-3.0B | 55.5 | 84.2 | 77.5 | 63.6 | 60.2 | 40.9 | 70.8 | 30.6 | 71.9 | 66.6 | 17.0 | 26.8 |
| LLaMA-MoE-3.5B (4/16) | 57.7 | 87.6 | 77.9 | 65.5 | 65.6 | 44.2 | 73.3 | 29.7 | 75.0 | 69.5 | 20.3 | 26.8 |
| LLaMA-MoE-3.5B (2/8) | 57.6 | 88.4 | 77.6 | 66.7 | 65.3 | 43.1 | 73.3 | 29.6 | 73.9 | 69.4 | 19.8 | 27.0 |
  • SFT models

| Model | MMLU | ARC-c | HellaSwag | TruthfulQA | MT-Bench |
| --- | --- | --- | --- | --- | --- |
| Sheared LLaMA-2.7B ShareGPT | 28.41 | 41.04 | 71.21 | 47.65 | 3.79 |
| Sheared LLaMA-2.7B Deita6K (Our Impl.) | 25.24 | 43.69 | 71.70 | 49.00 | 4.06 |
| LLaMA-MoE-v1-3.0B (2/16) | 23.61 | 43.43 | 72.28 | 44.24 | 4.15 |
| LLaMA-MoE-v1-3.5B (4/16) | 26.49 | 48.29 | 75.10 | 45.91 | 4.60 |
| LLaMA-MoE-v1-3.5B (2/8) | 25.53 | 45.99 | 74.95 | 44.39 | 4.72 |

🚧 Expert Construction

  • Neuron-Independent
    • IndependentRandom: bash ./scripts/expert_construction/split/run_split_random.sh
    • IndependentClustering: bash ./scripts/expert_construction/split/run_split_clustering.sh
  • Neuron-Sharing
    • SharingInner: bash ./scripts/expert_construction/split/run_split_gradient.sh
    • SharingInter: bash ./scripts/expert_construction/split/run_split_gradient_residual.sh

For more information, please refer to the Expert Construction docs.

🚅 Continual Pre-training

Tokenization

Download SlimPajama into /path_to_data and put data from different domains into separate folders:

  • /path_to_data/en_arxiv
  • /path_to_data/en_book
  • /path_to_data/en_c4
  • /path_to_data/en_cc
  • /path_to_data/en_stack
  • /path_to_data/en_wikipedia
  • /path_to_data/github

Each file should end with *.jsonl, and each line should look like:

{"id": "id-info", "content": "raw text to be tokenized"}

Run the following command to tokenize the data in each folder:

python -m smoe.utils.tokenize \
  -f jsonl \
  -t /path_to_tokenizer \
  -i /path_to_data/en_arxiv \
  -o /path_to_data_tokenized/en_arxiv
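To tokenize every domain in one go, you can loop over the folders and invoke the same module per folder. This is a hedged sketch; the paths are placeholders matching the layout above.

import subprocess

domains = [
    "en_arxiv", "en_book", "en_c4", "en_cc",
    "en_stack", "en_wikipedia", "github",
]

for domain in domains:
    # Same command as above, once per domain folder.
    subprocess.run(
        [
            "python", "-m", "smoe.utils.tokenize",
            "-f", "jsonl",
            "-t", "/path_to_tokenizer",
            "-i", f"/path_to_data/{domain}",
            "-o", f"/path_to_data_tokenized/{domain}",
        ],
        check=True,
    )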

Continual Pre-training (CPT)

  • NOTICE: Please create the logs/ folder manually first: mkdir -p logs
  • To run the continual pre-training, please check the CPT docs.

💎 Evaluation

💬 Supervised Fine-Tuning (SFT)

We provide simple examples of SFT for building chatbots. Please refer to the SFT docs and scripts/sft for more details.
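For a quick taste of an SFT checkpoint, here is a hedged sketch of a single chat turn. It assumes the released SFT tokenizer ships a chat template, and the model id below is only an illustrative placeholder, not a confirmed checkpoint name.

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_dir = "llama-moe/LLaMA-MoE-v1-3_5B-2_8-sft"  # hypothetical SFT model id
tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_dir, torch_dtype=torch.bfloat16, trust_remote_code=True
).to("cuda:0").eval()

messages = [{"role": "user", "content": "What is a Mixture-of-Experts model?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to("cuda:0")

output = model.generate(inputs, max_new_tokens=128)
# Decode only the newly generated tokens.
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))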

📑 Citation

@misc{llama-moe-2023,
  title={LLaMA-MoE: Building Mixture-of-Experts from LLaMA with Continual Pre-training},
  author={LLaMA-MoE Team},
  year={2023},
  month={Dec},
  url={https://github.com/pjlab-sys4nlp/llama-moe}
}

LLaMA-MoE Team w/ ❤️