
Some update to tr10 config #20

Open
wants to merge 5 commits into base: master

Conversation

@thomasw21 (Member) commented Nov 23, 2021

This PR is for sorting out the tr10-104B config.

@thomasw21 (Member Author) left a comment:

Some thoughts on the config; I'll update the README at the end so that we leave a trail of the conclusions we come up with.

@@ -46,7 +46,7 @@ GLOBAL_BATCH_SIZE=2048

NLAYERS=40
NHIDDEN=5120
NHEADS=32
@thomasw21 (Member Author):

I don't know why we chose 32. We seem to have updated the NHIDDEN value to 5120 because it is divisible by 128, and 5120 // 128 = 40.

https://huggingface.slack.com/archives/C01NHER1JLS/p1627034738272600?thread_ts=1626827659.189400&cid=C01NHER1JLS

cc @VictorSanh @stas00 @mryab (People who were involved in the original post)
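(For reference, a quick shell check of the divisibility argument above, using the NHIDDEN value from the diff:)

echo $(( 5120 % 128 ))   # 0  -> NHIDDEN=5120 is divisible by 128
echo $(( 5120 / 128 ))   # 40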

Member:

FWIW, 530B training used:

NLAYERS=105
NHIDDEN=20480
NHEADS=128

So it's the same proportion as 32 and 5120.
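(A quick check, using the values from the two configs quoted above, that both choices imply the same per-head dimension:)

echo $(( 5120 / 32 ))     # 160 = NHIDDEN / NHEADS for tr10-104B
echo $(( 20480 / 128 ))   # 160 = NHIDDEN / NHEADS for the 530B training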

@stas00 (Member), Nov 23, 2021:

Also, @TevenLeScao shared elsewhere a research paper showing that many heads were found to be quite redundant anyway.

I'm not sure whether there is research comparing head size vs. number of heads in terms of performance.

Comment on lines +66 to +67
--hidden-dropout 0.0 \
--attention-dropout 0.0 \
@thomasw21 (Member Author):

https://arxiv.org/abs/2010.11934 showed a strong performance loss when using dropout (Table 4). Though that was an enc/dec architecture, there is probably no reason it would benefit our dec-only arch. We are currently evaluating this at the 1B3 scale: https://huggingface.co/bigscience/tr3o-1B3-pile-no-dropout-logs

train/tr10-13B-ml/tr10-13B.slurm (outdated)
@@ -57,13 +57,14 @@ OPTIMIZER_ARGS=" \
--adam-beta1 0.9 \
--adam-beta2 0.95 \
--adam-eps 1e-8 \
--lr 6e-5 \
--lr 1e-4 \
@thomasw21 (Member Author):

The GPT-3 paper suggests a higher learning rate. Is there a reason why we would use 6e-5?

--min-lr 6e-6 \
--lr-decay-style cosine \
--lr-decay-samples 126_953_125 \
Member:

you removed this one w/o any commentary?

Member:

The original tr1-13B said:

We need lr-decay in samples, so tokens2samples = 260B / 2048 = 126_953_125
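(The conversion quoted above, checked in the shell:)

echo $(( 260000000000 / 2048 ))   # 126953125 -> the 126_953_125 above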

@thomasw21 (Member Author):

I was looking at setting it by default to the total number of samples we have:
https://github.com/bigscience-workshop/Megatron-DeepSpeed/blob/fd1e1da967c74e598acfc011031474663ef5845e/megatron/training.py#L341

We have been using this in arch/scaling.

However, I've just re-read the GPT-3 paper and they do it for 260B ... so I'm not sure here. cc @TevenLeScao

Member:

Thank you for the note, Thomas - it's crucial that we leave a note trail, otherwise we have no idea why some config was added or removed.

@@ -165,7 +166,7 @@ export CMD=" \
--load $CHECKPOINT_PATH \
--data-path $DATA_PATH \
--data-impl mmap \
--split 900,100,0 \
--split 950,50,0 \
Member:

currently using a small dataset, so I had to give valid a larger chunk. But for the real training this needs to be restored to the above split.
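(For reference, assuming the usual Megatron semantics where --split gives the comma-separated train/validation/test weights, the two settings correspond to:)

--split 900,100,0    # real training: 90% train / 10% valid / 0% test
--split 950,50,0     # temporary while the dataset is small: 95% / 5% / 0%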
