
Some update to tr10 config #20

Open
wants to merge 5 commits into base: master

Conversation

@thomasw21 (Member) commented Nov 23, 2021

This PR is for sorting out the tr10-104B config.

@thomasw21 (Member Author) left a comment:

Some thoughts on the config; I'll update the README at the end so that we leave a trail of the conclusions we come up with.

@@ -46,7 +46,7 @@ GLOBAL_BATCH_SIZE=2048

NLAYERS=40
NHIDDEN=5120
NHEADS=32
@thomasw21 (Member Author):

I don't know why we chose 32. We seem to have updated the NHIDDEN value to 5120 because it is divisible by 128, and 5120 // 128 = 40.

https://huggingface.slack.com/archives/C01NHER1JLS/p1627034738272600?thread_ts=1626827659.189400&cid=C01NHER1JLS

cc @VictorSanh @stas00 @mryab (People who were involved in the original post)
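(For reference, a quick shell check of the divisibility argument above, using the NHIDDEN value from the diff:)

echo $(( 5120 % 128 ))   # 0  -> NHIDDEN=5120 is divisible by 128
echo $(( 5120 / 128 ))   # 40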

Member:

FWIW, 530B training used:

NLAYERS=105
NHIDDEN=20480
NHEADS=128

So it's the same proportion as 32 and 5120.
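(A quick check, using the values from the two configs quoted above, that both choices imply the same per-head dimension:)

echo $(( 5120 / 32 ))     # 160 = NHIDDEN / NHEADS for tr10-104B
echo $(( 20480 / 128 ))   # 160 = NHIDDEN / NHEADS for the 530B training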

@stas00 (Member), Nov 23, 2021:

Also, @TevenLeScao shared elsewhere a research paper showing that many heads were found to be quite redundant anyway.

I'm not sure whether there is research comparing head size vs. number of heads in terms of performance.

Comment on lines +66 to +67
--hidden-dropout 0.0 \
--attention-dropout 0.0 \
@thomasw21 (Member Author):

https://arxiv.org/abs/2010.11934 showed a strong performance loss when using dropout (Table 4). Though that was an enc/dec architecture, there is probably no reason it would benefit our dec-only arch. We are currently evaluating this at the 1B3 scale: https://huggingface.co/bigscience/tr3o-1B3-pile-no-dropout-logs

train/tr10-13B-ml/tr10-13B.slurm (outdated)
@@ -57,13 +57,14 @@ OPTIMIZER_ARGS=" \
--adam-beta1 0.9 \
--adam-beta2 0.95 \
--adam-eps 1e-8 \
--lr 6e-5 \
--lr 1e-4 \
@thomasw21 (Member Author):

The GPT-3 paper suggests a higher learning rate. Is there a reason why we would use 6e-5?

--min-lr 6e-6 \
--lr-decay-style cosine \
--lr-decay-samples 126_953_125 \
Member:

you removed this one w/o any commentary?

Member:

The original tr1-13B said:

We need lr-decay in samples, so tokens2samples = 260B / 2048 = 126_953_125
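(The conversion quoted above, checked in the shell:)

echo $(( 260000000000 / 2048 ))   # 126953125 -> the 126_953_125 above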

@thomasw21 (Member Author):

I was looking at setting it by default to the total number of samples we have:
https://github.com/bigscience-workshop/Megatron-DeepSpeed/blob/fd1e1da967c74e598acfc011031474663ef5845e/megatron/training.py#L341

We have been using this in arch/scaling.

However, I've just re-read the GPT-3 paper and they do it for 260B ... so I'm not sure here. cc @TevenLeScao

Member:

Thank you for the note, Thomas - it's crucial that we leave a note trail, otherwise we have no idea why some config was added or removed.

@@ -165,7 +166,7 @@ export CMD=" \
--load $CHECKPOINT_PATH \
--data-path $DATA_PATH \
--data-impl mmap \
--split 900,100,0 \
--split 950,50,0 \
Member:

currently using a small dataset, so I had to give valid a larger chunk. But for the real training this needs to be restored to the above split.
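(For reference, assuming the usual Megatron semantics where --split gives the comma-separated train/validation/test weights, the two settings correspond to:)

--split 900,100,0    # real training: 90% train / 10% valid / 0% test
--split 950,50,0     # temporary while the dataset is small: 95% / 5% / 0%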
