
# 🚅 Training Guide

## Tokenization

Download SlimPajama into `/path_to_data` and put data from different domains into separate folders:

- `/path_to_data/en_arxiv`
- `/path_to_data/en_book`
- `/path_to_data/en_c4`
- `/path_to_data/en_cc`
- `/path_to_data/en_stack`
- `/path_to_data/en_wikipedia`
- `/path_to_data/github`

Each file should end with `.jsonl`, and each line should look like:

```json
{"id": "id-info", "content": "raw text to be tokenized"}
```

Run the following command to tokenize the data in each folder:

```bash
python -m smoe.utils.tokenize \
  -f jsonl \
  -t /path_to_tokenizer \
  -i /path_to_data/en_arxiv \
  -o /path_to_data_tokenized/en_arxiv
```
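The same command applies to each domain folder. A minimal loop sketch, reusing the placeholder paths above:

```bash
# Tokenize every domain folder in turn (paths are the placeholders from above).
for domain in en_arxiv en_book en_c4 en_cc en_stack en_wikipedia github; do
  python -m smoe.utils.tokenize \
    -f jsonl \
    -t /path_to_tokenizer \
    -i /path_to_data/${domain} \
    -o /path_to_data_tokenized/${domain}
done
```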

## 🗞️ Executive Scripts

| Description | Path |
| :-- | :-- |
| LLaMA-MoE 2/16 Experts | `scripts/cpt/16_2/baseline_112gpus_sheared_llama_portion_fluency_sf8.sh` |
| LLaMA-MoE 4/16 Experts | `scripts/cpt/dynamic_data_selection/baseline_112gpus_sheared_llama_portion_fluency.sh` |
| DynamicSheared | `scripts/cpt/dynamic_data_selection/sheared_llama_112gpus.sh` |

## 🌴 Other Arguments in Executive Scripts

| Argument Name | Description |
| :-- | :-- |
| `--dynamic_data_selection` | Dynamic data sampling strategy, one of: `sheared_llama` or `none` (static). Default: `none` |
| `--moe_calculator_score_scale_factor` | Scale factor multiplied onto the hidden states after they are processed by the experts. Should be $\frac{\#\text{total experts}}{\#\text{selected experts}}$. Default: `4.0` |
| `--num_selects` | The number of selected experts. Default: `4` |
| `--gate_balance_loss_weight` | The weight of the balance loss for the gate. Default: `1e-2` |
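These flags are consumed by the training entrypoint invoked inside the executive scripts. A minimal sketch of how they might be combined, using the defaults above (`<train_script>` is a placeholder, not the actual entrypoint):

```bash
# Hypothetical invocation; replace <train_script> with the real training
# entrypoint used by the executive scripts above. 4.0 = 16 total / 4 selected.
python <train_script> \
  --dynamic_data_selection sheared_llama \
  --moe_calculator_score_scale_factor 4.0 \
  --num_selects 4 \
  --gate_balance_loss_weight 1e-2
```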

## 📋 Checklist before Starting an Experiment

- balance loss weight
- scale factor
- learning rate
- warmup steps
- evaluation steps
- logging steps
- global batch size
- number of selected experts
- pretrained model
- data path
- GPUs
- comment

## ⚙️ Configuration Instructions for Slurm Users

For the `scripts/cpt/lora.sh` and `scripts/cpt/fpt.sh` files, you can launch an experiment via `sbatch`, e.g. `sbatch scripts/cpt/lora.sh`.

The `sbatch` command is similar to `nohup`: the submitted job runs in the background.

Here are some instructions for the slurm configuration items (a header sketch follows the list):

- `--job-name`: the name of the job; change it at your convenience.
- `--partition`: the slurm partition. For the MoE group, the default partition is `MoE`.
- `--output`: the logging file for stdout. `%x` expands to the job name and `%j` to the job ID.
- `--error`: the logging file for stderr.
- `--ntasks-per-node`: always set to `1`.
- `--cpus-per-task`: may change according to usage. The maximum number of CPUs on a node is 128 (according to `cinfo -p MoE`).
- `--nodes`: the number of nodes to use. NOTICE: the value of `num_nodes` must match `--nodes`.
- `--gres`: the number of GPUs for each node (in the format `gpu:<num>`, e.g. `gpu:8`). NOTICE: the value of `num_gpu_per_node` must agree with `--gres`.
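Putting these items together, a hypothetical sbatch header might look like the sketch below. The values are illustrative only (6 nodes × 8 GPUs would match a 48-GPU job); adjust them to your own run.

```bash
#!/usr/bin/env bash
#SBATCH --job-name=cpt-moe-fpt-bs16-48gpus
#SBATCH --partition=MoE
#SBATCH --output=logs/%x-%j.log
#SBATCH --error=logs/%x-%j.err
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=16
#SBATCH --nodes=6
#SBATCH --gres=gpu:8

num_nodes=6          # NOTICE: must match --nodes
num_gpu_per_node=8   # NOTICE: must agree with --gres
```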

## 📦 Training Different Models

`model_type` and `pretrained_model` in `scripts/cpt/(lora|fpt).sh` specify the foundation model.

For vanilla LLaMA, use the following settings:

model_type="llama"
pretrained_model=/mnt/petrelfs/share_data/quxiaoye/models/llama_7B

For LLaMA-MoE, use the following settings:

model_type="llama_moe"
pretrained_model=/mnt/petrelfs/share_data/quxiaoye/models/llama_7B_MoE_16Select4-l2_norm

LLaMA-1 7B with 16 experts, 4 selected: 3.49B (activated) parameters.

LLaMA-1 13B: 13,015,864,320 total parameters, of which 8,493,465,600 are MLP parameters.

## 🧮 Estimation of Training Speed and Tokens

For convenient estimation of the training speed, we provide some useful information at the very beginning of the log files:

```
max_steps: 63578
global batch size: 768
#tokens/batch: 1572864
```

- global batch size: `per_device_train_batch_size * gradient_accumulation_steps * num_nodes * num_gpu_per_node`
- max_steps: the number of steps needed to reach 100B tokens: `10^11 / (block_size * per_device_train_batch_size * gradient_accumulation_steps * num_nodes * num_gpu_per_node)`
- #tokens/batch: the number of tokens trained on in one global batch (a worked check follows the list)
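The numbers above can be reproduced with a little shell arithmetic. A sketch with assumed per-device settings (only the products are pinned down by the log: any combination multiplying to a global batch size of 768 works, and `block_size=2048` follows from 1572864 ÷ 768):

```bash
# Assumed values; adjust to your own run.
block_size=2048
per_device_train_batch_size=8
gradient_accumulation_steps=12
num_nodes=1
num_gpu_per_node=8

global_batch_size=$((per_device_train_batch_size * gradient_accumulation_steps * num_nodes * num_gpu_per_node))
tokens_per_batch=$((global_batch_size * block_size))
max_steps=$((100 * 10**9 / tokens_per_batch))   # steps to reach 100B tokens

echo "global batch size: ${global_batch_size}"  # 768
echo "#tokens/batch: ${tokens_per_batch}"       # 1572864
echo "max_steps: ${max_steps}"                  # 63578
```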

To estimate the expected training time, check the running time per step via tensorboard or in the logging file; multiplying the time per step by `max_steps` gives the expected total time.

## 🛹 Tracking Loss with Tensorboard

The tensorboard `logging_dir` can be found at `outputs/<job-name>-<job-id>/runs/<logging-dir>`.

For example, if the job name in the sbatch file is `cpt-moe-fpt-bs16-48gpus`, tensorboard can be started via `tensorboard --logdir outputs/cpt-moe-fpt-bs16-48gpus-1535835/runs/Jul31_14-12-00`.

For multiple tasks with different logging directories, you can run the following command:

```bash
$ tensorboard --logdir_spec short_name:dir1,short_name2:dir2 --port 8001
```

Here, `short_name` is an abbreviation for your task, and the port number can be changed manually if there is a port conflict, e.g.:

```bash
$ tensorboard --logdir_spec moe_from_scratch:outputs/cpt-llama-moe-scratch-lora-bs16-1476932/runs/Jul26_21-53-42,moe_lora:outputs/cpt-llama-lora-bs16-1476918/runs/Jul26_21-31-09 --port 8001
```