
LLaMA-MoE: Building Mixture-of-Experts from LLaMA with Continual Pre-training

📢 A SMALLER AFFORDABLE MoE MODEL FOR EVERYONE!! 📃 Technical Report

🎉 Introduction

LLaMA-MoE is a series of open-source Mixture-of-Experts (MoE) models based on LLaMA and SlimPajama. We build LLaMA-MoE in two steps:

  1. Partition LLaMA's FFNs into sparse experts and insert a top-K gate for the experts in each layer (see the sketch below).
  2. Continually pre-train the initialized MoE model with optimized data sampling weights from Sheared LLaMA and filtered datasets from SlimPajama.
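As a rough illustration of step 1, the sketch below splits one LLaMA FFN's intermediate neurons into disjoint groups (a random, neuron-independent split) and slices the original gate/up/down projections accordingly, so the experts jointly keep all of the original parameters. Names and sizes are illustrative only; the actual construction code lives under scripts/expert_construction (see the Expert Construction section).

import torch
import torch.nn as nn

hidden_size, intermediate_size, num_experts = 4096, 11008, 8

# Original (dense) LLaMA FFN projections.
gate_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
up_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
down_proj = nn.Linear(intermediate_size, hidden_size, bias=False)

# Neuron-independent random split: each expert owns a disjoint set of neurons.
neuron_groups = torch.randperm(intermediate_size).chunk(num_experts)

class ExpertFFN(nn.Module):
    def __init__(self, idx):
        super().__init__()
        self.gate = nn.Parameter(gate_proj.weight[idx].clone())     # (d_e, hidden)
        self.up = nn.Parameter(up_proj.weight[idx].clone())         # (d_e, hidden)
        self.down = nn.Parameter(down_proj.weight[:, idx].clone())  # (hidden, d_e)

    def forward(self, x):
        h = nn.functional.silu(x @ self.gate.T) * (x @ self.up.T)
        return h @ self.down.T

experts = nn.ModuleList(ExpertFFN(idx) for idx in neuron_groups)

A top-K gate (sketched in the Features section below) then decides which of these expert slices each token is routed to.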

(Figure: MoE routing)

🔥 Features

  1. Lightweight Models: The number of activated parameters is only 3.0~3.5B, which makes the models friendly for deployment and research.
  2. Multiple Expert Construction Methods:
    1. Neuron-Independent: Random, Clustering, Co-activation Graph, Gradient (Zhang et al., 2022, Zuo et al., 2022)
    2. Neuron-Sharing: Inner, Inter (residual)
  3. Multiple MoE Gating Strategies (see the gate sketch after this list):
    1. TopK Noisy Gate (Shazeer et al., 2017)
    2. Switch Gating (Fedus et al., 2022)
  4. Fast Continual Pre-training:
    1. FlashAttention-v2 integrated (Dao, 2023)
    2. Fast streaming dataset loading
  5. Abundant Monitor Items:
    1. Gate load, gate importance
    2. Loss on steps, loss on tokens, balance loss
    3. TGS (tokens/GPU/second), MFU (model FLOPs utilization)
    4. Other visualization utilities
  6. Dynamic Weight Sampling:
    1. Self-defined static sampling weights
    2. Sheared LLaMA's dynamic batch loading (Xia et al., 2023)
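As a quick illustration of the first gating strategy, here is a hedged sketch of the TopK noisy gate from Shazeer et al. (2017); the class and variable names are illustrative and not the repo's actual modules.

import torch
import torch.nn as nn
import torch.nn.functional as F

class NoisyTopKGate(nn.Module):
    def __init__(self, hidden_size, num_experts, k=2):
        super().__init__()
        self.k = k
        self.w_gate = nn.Linear(hidden_size, num_experts, bias=False)
        self.w_noise = nn.Linear(hidden_size, num_experts, bias=False)

    def forward(self, x):                      # x: (tokens, hidden_size)
        clean = self.w_gate(x)
        noise_std = F.softplus(self.w_noise(x))
        logits = clean + torch.randn_like(clean) * noise_std if self.training else clean
        topv, topi = logits.topk(self.k, dim=-1)
        # Keep only the top-k logits, softmax over them, scatter back to all experts.
        gates = torch.zeros_like(logits).scatter(-1, topi, topv.softmax(-1))
        return gates                           # (tokens, num_experts), rows sum to 1

During training, the injected noise encourages more balanced expert loads; at inference only the clean logits are used.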

🚀 QuickStart

# python>=3.10

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_dir = "llama-moe/LLaMA-MoE-v1-3_5B-2_8"
tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_dir, torch_dtype=torch.bfloat16, trust_remote_code=True)
model.eval()
model.to("cuda:0")

input_text = "Suzhou is famous of"
inputs = tokenizer(input_text, return_tensors="pt")
inputs = inputs.to("cuda:0")

pred = model.generate(**inputs, max_length=50, temperature=0.0)
print(tokenizer.decode(pred.cpu()[0], skip_special_tokens=True))
# Suzhou is famous of its beautiful gardens. The most famous one is the Humble Administrator's Garden. It is a classical Chinese garden with a history of more than 600 years. The garden is divided into three

⚙️ Installation

  1. Prepare the conda environment: conda create -n smoe python=3.11 (if your environment name is not smoe, you may need to change the environment name in the launching scripts)
  2. Add the correct environment variables to ~/.bashrc (gcc is set to a newer version for installing flash-attn), e.g.:
    export PATH=/mnt/petrelfs/share/cuda-11.8/bin:$PATH
    export LD_LIBRARY_PATH=/mnt/petrelfs/share/cuda-11.8/lib64:$LD_LIBRARY_PATH
    export PATH=/mnt/petrelfs/share/gcc-10.1.0/bin:$PATH
    export LD_LIBRARY_PATH=/mnt/petrelfs/share/gcc-10.1.0/lib64:$LD_LIBRARY_PATH
  3. Apply the variables: source ~/.bashrc
  4. Install PyTorch (CUDA-11.8): pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
  5. Install dependencies: pip install -r requirements.txt
  6. Install flash-attn: pip install flash-attn==2.0.1 --no-build-isolation. You may need to follow the flash-attn installation instructions to avoid some errors.
  7. Install the latest Git: conda install git
  8. Clone the repo: git clone git@github.com:pjlab-sys4nlp/llama-moe.git (if you haven't set up an SSH key for GitHub, you may not be able to clone via SSH; check the GitHub docs on SSH keys.)
  9. Change to the repo directory: cd llama-moe
  10. Install smoe in editable mode: pip install -e .[dev]
  11. Setup pre-commit hooks: pre-commit install
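After these steps, a quick sanity check can confirm the environment works. This is a hedged sketch; it assumes the editable install exposes the smoe package and that flash-attn built successfully.

import torch
import smoe          # noqa: F401  -- should import without error after `pip install -e .[dev]`
import flash_attn

print("torch:", torch.__version__, "| CUDA:", torch.version.cuda)
print("CUDA available:", torch.cuda.is_available())
print("flash-attn:", flash_attn.__version__)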

📊 Model Performance

| Model | #Activated Experts | #Experts | #Activated Params | Foundation Model | SFT Model |
| --- | --- | --- | --- | --- | --- |
| LLaMA-MoE-3.0B | 2 | 16 | 3.0B | 🤗 base | 🤗 SFT |
| LLaMA-MoE-3.5B (4/16) | 4 | 16 | 3.5B | 🤗 base | 🤗 SFT |
| LLaMA-MoE-3.5B (2/8) | 2 | 8 | 3.5B | 🤗 base | 🤗 SFT |
  • Foundation models

| Model | Average | SciQ | PIQA | WinoGrande | ARC-e | ARC-c (25) | HellaSwag (10) | LogiQA | BoolQ (32) | LAMBADA | NQ (32) | MMLU (5) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| OPT-2.7B | 50.3 | 78.9 | 74.8 | 60.8 | 54.4 | 34.0 | 61.4 | 25.8 | 63.3 | 63.6 | 10.7 | 25.8 |
| Pythia-2.8B | 51.5 | 83.2 | 73.6 | 59.6 | 58.8 | 36.7 | 60.7 | 28.1 | 65.9 | 64.6 | 8.7 | 26.8 |
| INCITE-BASE-3B | 53.7 | 85.6 | 73.9 | 63.5 | 61.7 | 40.3 | 64.7 | 27.5 | 65.8 | 65.4 | 15.2 | 27.2 |
| Open-LLaMA-3B-v2 | 55.6 | 88.0 | 77.9 | 63.1 | 63.3 | 40.1 | 71.4 | 28.1 | 69.2 | 67.4 | 16.0 | 26.8 |
| Sheared-LLaMA-2.7B | 56.4 | 87.5 | 76.9 | 65.0 | 63.3 | 41.6 | 71.0 | 28.3 | 73.6 | 68.3 | 17.6 | 27.3 |
| LLaMA-MoE-3.0B | 55.5 | 84.2 | 77.5 | 63.6 | 60.2 | 40.9 | 70.8 | 30.6 | 71.9 | 66.6 | 17.0 | 26.8 |
| LLaMA-MoE-3.5B (4/16) | 57.7 | 87.6 | 77.9 | 65.5 | 65.6 | 44.2 | 73.3 | 29.7 | 75.0 | 69.5 | 20.3 | 26.8 |
| LLaMA-MoE-3.5B (2/8) | 57.6 | 88.4 | 77.6 | 66.7 | 65.3 | 43.1 | 73.3 | 29.6 | 73.9 | 69.4 | 19.8 | 27.0 |
  • SFT models

| Model | MMLU | ARC-c | HellaSwag | TruthfulQA | MT-Bench |
| --- | --- | --- | --- | --- | --- |
| Sheared LLaMA-2.7B ShareGPT | 28.41 | 41.04 | 71.21 | 47.65 | 3.79 |
| Sheared LLaMA-2.7B Deita6K (Our Impl.) | 25.24 | 43.69 | 71.70 | 49.00 | 4.06 |
| LLaMA-MoE-v1-3.0B (2/16) | 23.61 | 43.43 | 72.28 | 44.24 | 4.15 |
| LLaMA-MoE-v1-3.5B (4/16) | 26.49 | 48.29 | 75.10 | 45.91 | 4.60 |
| LLaMA-MoE-v1-3.5B (2/8) | 25.53 | 45.99 | 74.95 | 44.39 | 4.72 |

🚧 Expert Construction

  • Neuron-Independent
    • IndependentRandom: bash ./scripts/expert_construction/split/run_split_random.sh
    • IndependentClustering: bash ./scripts/expert_construction/split/run_split_clustering.sh
  • Neuron-Sharing
    • SharingInner: bash ./scripts/expert_construction/split/run_split_gradient.sh
    • SharingInter: bash ./scripts/expert_construction/split/run_split_gradient_residual.sh

For more information, please refer to the Expert Construction docs.

🚅 Continual Pre-training

Tokenization

Download SlimPajama into /path_to_data and put data from different domains into separate folders:

  • /path_to_data/en_arxiv
  • /path_to_data/en_book
  • /path_to_data/en_c4
  • /path_to_data/en_cc
  • /path_to_data/en_stack
  • /path_to_data/en_wikipedia
  • /path_to_data/github

Each file should end with *.jsonl, and each line should look like:

{"id": "id-info", "content": "raw text to be tokenized"}

Run the following command to tokenize the data in each folder:

python -m smoe.utils.tokenize \
  -f jsonl \
  -t /path_to_tokenizer \
  -i /path_to_data/en_arxiv \
  -o /path_to_data_tokenized/en_arxiv
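To tokenize every domain in one go, you can loop over the folders and invoke the same module per folder. This is a hedged sketch; the paths are placeholders matching the layout above.

import subprocess

domains = [
    "en_arxiv", "en_book", "en_c4", "en_cc",
    "en_stack", "en_wikipedia", "github",
]

for domain in domains:
    # Same command as above, once per domain folder.
    subprocess.run(
        [
            "python", "-m", "smoe.utils.tokenize",
            "-f", "jsonl",
            "-t", "/path_to_tokenizer",
            "-i", f"/path_to_data/{domain}",
            "-o", f"/path_to_data_tokenized/{domain}",
        ],
        check=True,
    )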

Continual Pre-training (CPT)

  • NOTICE: Please create the logs/ folder manually first: mkdir -p logs
  • To run the continual pre-training, please check the CPT docs.

💎 Evaluation

💬 Supervised Fine-Tuning (SFT)

We provide simple examples of SFT for building chatbots. Please refer to the SFT docs and scripts/sft for more details.
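For a quick taste of an SFT checkpoint, here is a hedged sketch of a single chat turn. It assumes the released SFT tokenizer ships a chat template, and the model id below is only an illustrative placeholder, not a confirmed checkpoint name.

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_dir = "llama-moe/LLaMA-MoE-v1-3_5B-2_8-sft"  # hypothetical SFT model id
tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_dir, torch_dtype=torch.bfloat16, trust_remote_code=True
).to("cuda:0").eval()

messages = [{"role": "user", "content": "What is a Mixture-of-Experts model?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to("cuda:0")

output = model.generate(inputs, max_new_tokens=128)
# Decode only the newly generated tokens.
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))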

📑 Citation

@misc{llama-moe-2023,
  title={LLaMA-MoE: Building Mixture-of-Experts from LLaMA with Continual Pre-training},
  author={LLaMA-MoE Team},
  year={2023},
  month={Dec},
  url={https://github.com/pjlab-sys4nlp/llama-moe}
}

LLaMA-MoE Team w/ ❤️