YuanGenerativeLM

A generative language model pretrained on Inspur's Yuan Dataset, and the codebase for the ASC22 supercomputing competition.

Project Structure

To simplify experiments with different distributed training frameworks, we decoupled the training code into config, data, model and trainer modules.

This decoupling is inspired by pytorch-lightning; however, we took it even further to make integration with other frameworks more flexible.

config Module

We put all hyperparameters and configurations into the config module for better tracing and logging.
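
As an illustration only (the actual classes and field names live in the config module and may differ), a config can be a plain dataclass so that every run can serialize and log its full set of hyperparameters:

from dataclasses import dataclass, asdict

@dataclass
class ModelConfig:
    # hypothetical fields for illustration; see the config module for the real ones
    vocab_size: int = 50000
    hidden_size: int = 1024
    num_layers: int = 24
    num_heads: int = 16
    lr: float = 1e-4

@dataclass
class TrainConfig:
    batch_size: int = 8
    max_steps: int = 10000
    seed: int = 42

# dumping the configs as plain dicts makes every run easy to trace and log
print(asdict(ModelConfig()), asdict(TrainConfig()))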

data Module

We directly use pytorch-lightning.LightningDataModule since its interface is well designed and easy to use.
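
For reference, a data module following that interface has roughly this shape (random tokens stand in for the actual Yuan data pipeline, which this sketch does not reproduce):

import torch
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl

class ToyYuanDataModule(pl.LightningDataModule):
    # minimal sketch; the real module loads and tokenizes the Yuan corpus instead
    def __init__(self, seq_len: int = 128, batch_size: int = 8):
        super().__init__()
        self.seq_len = seq_len
        self.batch_size = batch_size

    def setup(self, stage=None):
        tokens = torch.randint(0, 50000, (512, self.seq_len))
        # next-token prediction: inputs and shifted labels
        self.train_set = TensorDataset(tokens[:, :-1], tokens[:, 1:])

    def train_dataloader(self):
        return DataLoader(self.train_set, batch_size=self.batch_size, shuffle=True)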

model Module

Since most distributed training frameworks need to wrap the model before or after initialization, and pytorch-lightning.LightningModule has already exposed some problems when integrating multiple frameworks simultaneously, we decided to decouple this module further into a BaseModel class.

The BaseModel class directly inherits from nn.Module, which is compatible with most distributed training frameworks. All implementations of the language model derive from BaseModel and maintain only the model config, the model structure, the forward method, the loss function and the optimizer.

Currently, implemented models include:

  • native model: written in native PyTorch
  • huggingface model: written with HuggingFace's transformers
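
A minimal sketch of the idea (class and method names here are illustrative, not necessarily the ones used in the model module):

import torch
import torch.nn as nn

class BaseModel(nn.Module):
    # framework-agnostic base: config + structure + forward + loss + optimizer
    def __init__(self, config):
        super().__init__()
        self.config = config

    def get_loss(self, batch):
        raise NotImplementedError

    def configure_optimizer(self):
        return torch.optim.AdamW(self.parameters(), lr=self.config.lr)

class ToyNativeModel(BaseModel):
    # toy stand-in for the native PyTorch implementation
    def __init__(self, config):
        super().__init__(config)
        self.embed = nn.Embedding(config.vocab_size, config.hidden_size)
        self.head = nn.Linear(config.hidden_size, config.vocab_size)

    def forward(self, input_ids):
        return self.head(self.embed(input_ids))

    def get_loss(self, batch):
        input_ids, labels = batch
        logits = self(input_ids)
        return nn.functional.cross_entropy(
            logits.reshape(-1, logits.size(-1)), labels.reshape(-1)
        )

Because nothing Lightning- or DeepSpeed-specific leaks into these classes, each trainer can wrap them however its framework requires.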

trainer Module

Everything else, such as model initialization, training, validation and testing, goes into the trainer module. All training preparation and iterations are done here.

Currently, implemented trainers include:

  • PyTorch Lightning trainer: distributed training with pytorch-lightning, with DeepSpeed integration provided by the Lightning team
  • PatrickStar trainer
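
For example, the PyTorch Lightning trainer only needs a thin adapter around a BaseModel before handing it to pl.Trainer; the sketch below shows the idea (the actual wrapper in the trainer module may differ):

import pytorch_lightning as pl

class LitWrapper(pl.LightningModule):
    # thin adapter: delegates everything to a framework-agnostic BaseModel
    def __init__(self, base_model):
        super().__init__()
        self.model = base_model

    def training_step(self, batch, batch_idx):
        loss = self.model.get_loss(batch)
        self.log("train_loss", loss)
        return loss

    def configure_optimizers(self):
        return self.model.configure_optimizer()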

Distributed Launch

Below are examples of how to launch the training job with different distributed frameworks.

DDP in PyTorch-Lightning

num_nodes must be set to the number of GPUs across all nodes; otherwise it will use only the number of GPUs on the master node.

torchrun --nnodes=2 --nproc_per_node=2 --master_addr GPU04 --master_port 9001 --node_rank 1 train.ddp_pl.py

DeepSpeed in PyTorch-Lightning

OMP_NUM_THREADS=32 torchrun --nnodes=2 --nproc_per_node=2 --master_addr GPU04 --master_port 9001 --node_rank 1 train.ds_pl.py

Note that OMP_NUM_THREADS is required when offload is used, since the optimizer then runs on the CPU.
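
For reference, a sketch of enabling CPU offload through Lightning's DeepSpeed strategy (assuming a pytorch-lightning version that ships DeepSpeedStrategy; the stage and flags shown here are assumptions, not necessarily this repo's settings). With offload_optimizer=True the optimizer step runs on the host CPU, which is why OMP_NUM_THREADS matters:

import pytorch_lightning as pl
from pytorch_lightning.strategies import DeepSpeedStrategy

trainer = pl.Trainer(
    accelerator="gpu",
    devices=2,
    num_nodes=2,
    precision=16,
    strategy=DeepSpeedStrategy(
        stage=3,                 # ZeRO stage 3: partition params, grads and optimizer states
        offload_optimizer=True,  # optimizer step moves to the CPU
        offload_parameters=True,
    ),
)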

Horovod in PyTorch-Lightning

horovodrun -np 2 python train.hvd_pl.py

We still prefer to use torchrun.

PatrickStar

torchrun --nnodes=1 --nproc_per_node=2 train.pstar.py

Colossal AI

GLOO_SOCKET_IFNAME=ibs5 OMP_NUM_THREADS=32 torchrun --master_addr="172.25.2.105" --master_port=29500 --nnodes=2 --node_rank=1 --nproc_per_node=2 train.col_ai.py --config=trainer/colossal_ai/strategy.py

Run Profile

OMP_NUM_THREADS=32 nsys profile -o cpu_adam torchrun --nnodes=2 --nproc_per_node=2 --master_addr GPU04 --master_port 9001 --node_rank 0 train.ds_pl.py

OMP_NUM_THREADS=32 nsys profile --gpu-metrics-device=all --gpuctxsw=true --nic-metrics=true --cuda-memory-usage=true --cudabacktrace=all torchrun  --nnodes=2 --nproc_per_node=2 train.col_ai.py --config=trainer/colossal_ai/strategy.py

Docker Environment

docker run -it --name pytorch --gpus all --privileged --cap-add=SYS_ADMIN --ipc=host --network=host --ulimit memlock=-1 --ulimit stack=67108864 --device=/dev/infiniband -v $(pwd):/workspace registry.cn-hangzhou.aliyuncs.com/ncj/pytorch bash

Check the details in the Dockerfile.
