Inference on MoE models #5497

Open
meenakshi-mittal opened this issue May 3, 2024 · 0 comments
The inference command provided for MoE models fails both on the released pre-trained MoE models and on ones I have trained myself.

This is the command that I am trying to use:

```bash
DATA_PATH=/path/to/data-bin
MODEL_PATH=/path/to/model.pt
python -m fairseq_cli.eval_lm \
  $DATA_PATH \
  --path $MODEL_PATH \
  --gen-subset valid \
  --sample-break-mode none \
  --tokens-per-sample 2048 \
  --batch-size 1 \
  --fp16 \
  --output-word-probs \
  --is-moe \
  --distributed-world-size 8 \
  --model-overrides "{'world_size': 8, 'moe_eval_capacity_token_fraction': 0.05}"
```

(Note: the command as originally given referenced `$DATA_DIR`, but only `DATA_PATH` is defined, so I use `$DATA_PATH` here.)
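As I understand it (an assumption based on fairseq's checkpoint loading, not verified against the moe branch specifically), the `--model-overrides` string is parsed as a Python literal and merged over the saved model config, so it has to be valid input for `ast.literal_eval`:

```python
# Assumed handling of --model-overrides (hedged: based on fairseq_cli parsing
# the string with ast.literal_eval; not confirmed on the moe branch).
import ast

overrides = ast.literal_eval(
    "{'world_size': 8, 'moe_eval_capacity_token_fraction': 0.05}"
)
print(overrides)  # {'world_size': 8, 'moe_eval_capacity_token_fraction': 0.05}
```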

I downloaded the pre-trained moe_15b model from the moe_lm README onto my machine, and it unzips into a directory that looks like this:

```
en_moe_lm_15b/
  model-rank-0.pt
  model-rank-1.pt
  ...
  model-rank-63.pt
  model-shared.pt
```
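Note that no file named `model.pt` actually exists in this directory. My understanding (an assumption about how `--is-moe` loading works, not confirmed from the source) is that the path is treated as a prefix and expanded per rank, roughly like this:

```python
# Hypothetical sketch of the per-rank path expansion I assume --is-moe does;
# the real logic lives in fairseq's checkpoint_utils.
import os

def moe_checkpoint_paths(model_path: str, rank: int):
    base, ext = os.path.splitext(model_path)  # "/path/to/model", ".pt"
    shared = f"{base}-shared{ext}"            # expert-agnostic weights
    expert = f"{base}-rank-{rank}{ext}"       # this rank's expert weights
    return shared, expert

print(moe_checkpoint_paths("/path/to/en_moe_lm_15b/model.pt", 0))
# ('/path/to/en_moe_lm_15b/model-shared.pt',
#  '/path/to/en_moe_lm_15b/model-rank-0.pt')
```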

I try running the given command, setting MODEL_PATH=/path/to/en_moe_lm_15b/model.pt and DATA_PATH=/path/to/data-bin/wikitext-103. I get the following error:

```
Traceback (most recent call last):
  File "/data/meenakshi/miniconda3/envs/fairseq_moe/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "/data/meenakshi/MoE/fairseq/fairseq/distributed/utils.py", line 335, in distributed_main
    main(cfg, **kwargs)
  File "/data/meenakshi/MoE/fairseq/fairseq_cli/eval_lm.py", line 384, in main
    models, model_args, task = checkpoint_utils.load_model_ensemble_and_task(
  File "/data/meenakshi/MoE/fairseq/fairseq/checkpoint_utils.py", line 478, in load_model_ensemble_and_task
    model.load_state_dict(state["model"], strict=strict, model_cfg=cfg.model)
  File "/data/meenakshi/MoE/fairseq/fairseq/models/fairseq_model.py", line 126, in load_state_dict
    return super().load_state_dict(new_state_dict, strict)
  File "/data/meenakshi/miniconda3/envs/fairseq_moe/lib/python3.9/site-packages/torch/nn/modules/module.py", line 2041, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for TransformerLanguageModel:
    size mismatch for decoder.embed_tokens.weight: copying a param with shape torch.Size([51200, 768]) from checkpoint, the shape in current model is torch.Size([267744, 768]).
    size mismatch for decoder.output_projection.weight: copying a param with shape torch.Size([51200, 768]) from checkpoint, the shape in current model is torch.Size([267744, 768]).
```

I understand that this is because the dict.txt from my wikitext-103 data-bin is a different size from the one used to train the moe_15b model, but how do I fix this? I cannot find any information about the dict.txt used for the moe_15b model.
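For what it's worth, the vocabulary size the released checkpoint expects can be read directly off the file (a quick diagnostic sketch; I am assuming the standard fairseq layout with weights under the "model" key also holds for the MoE shared checkpoint):

```python
# Inspect what vocab size the released checkpoint expects, to compare against
# len(dict.txt) plus fairseq's special symbols/padding. Paths are examples.
import torch

state = torch.load("/path/to/en_moe_lm_15b/model-shared.pt", map_location="cpu")
print(state["model"]["decoder.embed_tokens.weight"].shape)
# Per the error above this should be torch.Size([51200, 768]), i.e. a
# 51200-entry vocabulary, vs. 267744 for the wikitext-103 dict I binarized.
```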

---

I have also tried training my own MoE models with the following command:

```bash
NUM_EXPERTS=8
TOKENS_PER_SAMPLE=1024

fairseq-train --task language_modeling \
  data-bin/wikitext-103 \
  --save-dir checkpoints/moe_wikitext-103 \
  --tokens-per-sample $TOKENS_PER_SAMPLE \
  --ddp-backend fully_sharded --memory-efficient-fp16 --checkpoint-activations \
  --arch transformer_lm_gpt --share-decoder-input-output-embed \
  --decoder-layers 24 --decoder-embed-dim 1024 --decoder-ffn-embed-dim 4096 \
  --decoder-attention-heads 16 \
  --moe-expert-count $NUM_EXPERTS --moe-freq 2 \
  --moe-gating-use-fp32 --moe-second-expert-policy all \
  --moe-normalize-expert-grad sqrt_world_size \
  --moe-eval-capacity-token-fraction -1.0 \
  --max-sentences-valid 1 --num-workers-valid 0 \
  --criterion moe_cross_entropy --moe-gate-loss-wt 0.01 --moe-gate-loss-combine-method sum \
  --optimizer adam --fp16 --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
  --lr 0.0005 --warmup-updates 750 \
  --dropout 0.2 --attention-dropout 0.2 \
  --batch-size 2 --update-freq 2 \
  --max-update 250 --disable-validation \
  --log-format json --log-interval 10
```
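(As a sanity check on these settings, the effective batch size they imply, assuming the job runs data-parallel across all 8 GPUs, works out as follows:)

```python
# Effective tokens per optimizer step implied by the flags above
# (assumes all 8 GPUs participate; world size is my assumption here).
batch_size = 2            # --batch-size
update_freq = 2           # --update-freq
world_size = 8            # number of GPUs
tokens_per_sample = 1024  # --tokens-per-sample

sequences_per_update = batch_size * update_freq * world_size  # 32
tokens_per_update = sequences_per_update * tokens_per_sample  # 32768
print(sequences_per_update, tokens_per_update)
```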

And after training I get a model directory that looks like this:

```
moe_wikitext-103/
  checkpoint_last-rank-0-shard0.pt
  checkpoint_last-rank-1-shard1.pt
  ...
  checkpoint_last-rank-7-shard7.pt
  checkpoint_last-shared-shard0.pt
  ...
  checkpoint_last-shared-shard7.pt
```

Running inference on this model with a command like the one above initially fails with errors like:

```
Model file not found: checkpoints/moe_wikitext-103/checkpoint_last-rank-7.pt
```
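One hypothetical workaround would be to symlink the sharded filenames to the names the loader expects (an untested sketch that only covers the per-rank expert files; the `checkpoint_last-shared-shard{r}.pt` files would still need handling):

```python
# Untested sketch: alias checkpoint_last-rank-{r}-shard{r}.pt to the
# checkpoint_last-rank-{r}.pt names the loader was asking for.
import os

ckpt_dir = "checkpoints/moe_wikitext-103"
for r in range(8):
    src = os.path.join(ckpt_dir, f"checkpoint_last-rank-{r}-shard{r}.pt")
    dst = os.path.join(ckpt_dir, f"checkpoint_last-rank-{r}.pt")
    if os.path.exists(src) and not os.path.lexists(dst):
        os.symlink(os.path.abspath(src), dst)
```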

Instead, I edited eval_lm.py to append "-shard{rank}" to the expected filenames. After trying that, I get this error:

```
Traceback (most recent call last):
  File "/data/meenakshi/miniconda3/envs/fairseq_moe/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "/data/meenakshi/MoE/fairseq/fairseq/distributed/utils.py", line 335, in distributed_main
    main(cfg, **kwargs)
  File "/data/meenakshi/MoE/fairseq/fairseq_cli/eval_lm.py", line 384, in main
    models, model_args, task = checkpoint_utils.load_model_ensemble_and_task(
  File "/data/meenakshi/MoE/fairseq/fairseq/checkpoint_utils.py", line 478, in load_model_ensemble_and_task
    model.load_state_dict(state["model"], strict=strict, model_cfg=cfg.model)
  File "/data/meenakshi/MoE/fairseq/fairseq/models/fairseq_model.py", line 126, in load_state_dict
    return super().load_state_dict(new_state_dict, strict)
  File "/data/meenakshi/miniconda3/envs/fairseq_moe/lib/python3.9/site-packages/torch/nn/modules/module.py", line 2041, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for TransformerLanguageModel:
    size mismatch for decoder.embed_tokens.weight: copying a param with shape torch.Size([50000, 1024]) from checkpoint, the shape in current model is torch.Size([267744, 1024]).
    size mismatch for decoder.output_projection.weight: copying a param with shape torch.Size([50000, 1024]) from checkpoint, the shape in current model is torch.Size([267744, 1024]).
```

This is similar to the previous error, but it makes no sense to me: I trained this model on the exact same data-bin/wikitext-103 dataset that I am now evaluating it on, so the dictionaries should match.
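To narrow this down, I would compare the shapes stored in the trained checkpoint against the dictionary the eval task builds (a hedged sketch; with --ddp-backend fully_sharded the shards may store flattened parameters rather than named ones, in which case the first check won't apply as written):

```python
# Diagnostic sketch: does the saved embedding match the data-bin dictionary?
import torch

state = torch.load(
    "checkpoints/moe_wikitext-103/checkpoint_last-shared-shard0.pt",
    map_location="cpu",
)
weight = state["model"].get("decoder.embed_tokens.weight")
if weight is not None:
    print("checkpoint vocab:", weight.shape[0])

# fairseq's Dictionary adds special symbols (<s>, <pad>, </s>, <unk>) and may
# pad the vocab to a multiple of 8, so this raw line count will sit slightly
# below the model's embedding row count.
with open("data-bin/wikitext-103/dict.txt") as f:
    print("dict.txt entries:", sum(1 for _ in f))
```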

Environment details:

  • fairseq Version: moe branch
  • PyTorch Version: 2.0.1
  • OS: Linux
  • How you installed fairseq: source
  • Build command you used: pip install --editable ./
  • Python version: 3.9.19
  • CUDA version: 3.7
  • GPU models and configuration: 8 Tesla P100-PCIE-16GB GPUs