Environment details:
fairseq Version: moe branch
PyTorch Version: 2.0.1
OS: Linux
How you installed fairseq: source
Build command you used: pip install --editable ./
Python version: 3.9.19
CUDA version: 3.7
GPU models and configuration: 8 Tesla P100-PCIE-16GB GPUs
The inference command provided for MoE models gives errors on both the provided pre-trained MoE models and on the ones I have trained myself.
This is the command that I am trying to use:
DATA_PATH=/path/to/data-bin
MODEL_PATH=/path/to/model.pt
python -m fairseq_cli.eval_lm \
  $DATA_PATH \
  --path $MODEL_PATH \
  --gen-subset valid \
  --sample-break-mode none \
  --tokens-per-sample 2048 \
  --batch-size 1 \
  --fp16 \
  --output-word-probs \
  --is-moe \
  --distributed-world-size 8 \
  --model-overrides "{'world_size': 8, 'moe_eval_capacity_token_fraction': 0.05}"
I downloaded the pre-trained moe_15b model from the moe_lm README onto my machine, and it unzips into a directory that looks like this:
en_moe_lm_15b:
  model-rank-0.pt
  model-rank-1.pt
  ...
  model-rank-63.pt
  model-shared.pt
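The 64 rank files suggest the checkpoint was trained with 64 expert ranks. A quick way to confirm the count before choosing an eval world size (a hypothetical helper, not part of fairseq):

```python
from pathlib import Path
import re

def count_expert_ranks(model_dir):
    """Count files named model-rank-N.pt and return the number of expert ranks."""
    pat = re.compile(r"model-rank-(\d+)\.pt")
    ranks = [int(m.group(1))
             for p in Path(model_dir).iterdir()
             if (m := pat.fullmatch(p.name))]
    return max(ranks) + 1 if ranks else 0
```

For the directory above this returns 64.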
I try running the given command, setting MODEL_PATH=/path/to/en_moe_lm_15b/model.pt and DATA_PATH=/path/to/data-bin/wikitext-103. I get the following error:
Traceback (most recent call last):
File "/data/meenakshi/miniconda3/envs/fairseq_moe/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
fn(i, *args)
File "/data/meenakshi/MoE/fairseq/fairseq/distributed/utils.py", line 335, in distributed_main
main(cfg, **kwargs)
File "/data/meenakshi/MoE/fairseq/fairseq_cli/eval_lm.py", line 384, in main
models, model_args, task = checkpoint_utils.load_model_ensemble_and_task(
File "/data/meenakshi/MoE/fairseq/fairseq/checkpoint_utils.py", line 478, in load_model_ensemble_and_task
model.load_state_dict(state["model"], strict=strict, model_cfg=cfg.model)
File "/data/meenakshi/MoE/fairseq/fairseq/models/fairseq_model.py", line 126, in load_state_dict
return super().load_state_dict(new_state_dict, strict)
File "/data/meenakshi/miniconda3/envs/fairseq_moe/lib/python3.9/site-packages/torch/nn/modules/module.py", line 2041, in load_state_dict
raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for TransformerLanguageModel:
size mismatch for decoder.embed_tokens.weight: copying a param with shape torch.Size([51200, 768]) from checkpoint, the shape in current model is torch.Size([267744, 768]).
size mismatch for decoder.output_projection.weight: copying a param with shape torch.Size([51200, 768]) from checkpoint, the shape in current model is torch.Size([267744, 768]).
I understand that this is because the wikitext-103 dict.txt is a different size from the one used to train the en_moe_lm_15b model, but how do I fix it? I cannot find any information about the dict.txt that the en_moe_lm_15b model was trained with.
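For reference, the embedding rows in the error come from the task dictionary, not the checkpoint: fairseq's Dictionary prepends 4 special symbols (bos, pad, eos, unk) to the dict.txt entries and pads the total to a multiple of 8 by default. A minimal sketch of that arithmetic, assuming those defaults:

```python
def embed_rows(dict_lines, n_special=4, padding_factor=8):
    """Embedding rows implied by a dict.txt with `dict_lines` entries,
    assuming fairseq's 4 special symbols and default padding to a multiple of 8."""
    size = dict_lines + n_special
    return ((size + padding_factor - 1) // padding_factor) * padding_factor
```

For example, `embed_rows(267740)` gives 267744, consistent with the "current model" shape in the traceback, while a much smaller dict would give the 51200 stored in the checkpoint.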
——————
I have also tried to train my own moe models using the following command:
NUM_EXPERTS=8
TOKENS_PER_SAMPLE=1024
fairseq-train --task language_modeling \
  data-bin/wikitext-103 \
  --save-dir checkpoints/moe_wikitext-103 \
  --tokens-per-sample $TOKENS_PER_SAMPLE \
  --ddp-backend fully_sharded --memory-efficient-fp16 --checkpoint-activations \
  --arch transformer_lm_gpt --share-decoder-input-output-embed \
  --decoder-layers 24 --decoder-embed-dim 1024 --decoder-ffn-embed-dim 4096 \
  --decoder-attention-heads 16 \
  --moe-expert-count $NUM_EXPERTS --moe-freq 2 \
  --moe-gating-use-fp32 --moe-second-expert-policy all \
  --moe-normalize-expert-grad sqrt_world_size \
  --moe-eval-capacity-token-fraction -1.0 \
  --max-sentences-valid 1 --num-workers-valid 0 \
  --criterion moe_cross_entropy --moe-gate-loss-wt 0.01 --moe-gate-loss-combine-method sum \
  --optimizer adam --fp16 --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
  --lr 0.0005 --warmup-updates 750 \
  --dropout 0.2 --attention-dropout 0.2 \
  --batch-size 2 --update-freq 2 \
  --max-update 250 --disable-validation \
  --log-format json --log-interval 10
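As a sanity check on these flags, the effective tokens per optimizer step work out as below (a back-of-the-envelope sketch; the factor of 8 assumes one process per GPU on the 8-GPU machine from the environment details):

```python
def tokens_per_update(tokens_per_sample, batch_size, update_freq, world_size):
    """Effective tokens consumed per optimizer step across all workers."""
    return tokens_per_sample * batch_size * update_freq * world_size

# 1024 * 2 * 2 * 8 = 32768 tokens per update
print(tokens_per_update(1024, 2, 2, 8))
```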
And after training I get a model directory that looks like this:
moe_wikitext-103:
  checkpoint_last-rank-0-shard0.pt
  checkpoint_last-rank-1-shard1.pt
  ...
  checkpoint_last-rank-7-shard7.pt
  checkpoint_last-shared-shard0.pt
  ...
  checkpoint_last-shared-shard7.pt
Running inference on this model with a command similar to the one above initially results in errors like this:
Model file not found: checkpoints/moe_wikitext-103/checkpoint_last-rank-7.pt
So I edited eval_lm.py to append "-shard{rank}" to the checkpoint filenames it looks for. After trying that I get this error:
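The edit I made amounts to this filename mapping (a hypothetical helper that mirrors the change, not actual fairseq code):

```python
def sharded_name(path, rank):
    """Map an unsharded checkpoint name to its fully_sharded counterpart,
    e.g. 'checkpoint_last-rank-7.pt' -> 'checkpoint_last-rank-7-shard7.pt'."""
    base, ext = path.rsplit(".", 1)
    return f"{base}-shard{rank}.{ext}"
```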
Traceback (most recent call last):
File "/data/meenakshi/miniconda3/envs/fairseq_moe/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
fn(i, *args)
File "/data/meenakshi/MoE/fairseq/fairseq/distributed/utils.py", line 335, in distributed_main
main(cfg, **kwargs)
File "/data/meenakshi/MoE/fairseq/fairseq_cli/eval_lm.py", line 384, in main
models, model_args, task = checkpoint_utils.load_model_ensemble_and_task(
File "/data/meenakshi/MoE/fairseq/fairseq/checkpoint_utils.py", line 478, in load_model_ensemble_and_task
model.load_state_dict(state["model"], strict=strict, model_cfg=cfg.model)
File "/data/meenakshi/MoE/fairseq/fairseq/models/fairseq_model.py", line 126, in load_state_dict
return super().load_state_dict(new_state_dict, strict)
File "/data/meenakshi/miniconda3/envs/fairseq_moe/lib/python3.9/site-packages/torch/nn/modules/module.py", line 2041, in load_state_dict
raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for TransformerLanguageModel:
size mismatch for decoder.embed_tokens.weight: copying a param with shape torch.Size([50000, 1024]) from checkpoint, the shape in current model is torch.Size([267744, 1024]).
size mismatch for decoder.output_projection.weight: copying a param with shape torch.Size([50000, 1024]) from checkpoint, the shape in current model is torch.Size([267744, 1024]).
This is similar to the previous error, but it makes no sense to me, since I trained the model on the same dataset that I am now trying to evaluate it on.
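One way to track down where the 50000 comes from is to compare the checkpoint's stored embedding rows against the dict.txt the eval run actually loads (a diagnostic sketch; it assumes the fairseq checkpoint layout visible in the traceback, with weights under state["model"], and ignores dictionary padding):

```python
import torch

def checkpoint_embed_rows(ckpt_path):
    """Rows of decoder.embed_tokens.weight stored in a fairseq checkpoint."""
    state = torch.load(ckpt_path, map_location="cpu")
    return state["model"]["decoder.embed_tokens.weight"].shape[0]

def dict_embed_rows(dict_path, n_special=4):
    """Rows implied by a dict.txt, assuming fairseq's 4 special symbols."""
    with open(dict_path) as f:
        return sum(1 for _ in f) + n_special
```

If the two numbers disagree, the eval run is building its model from a different dictionary than the one the checkpoint was trained with.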