
How to preserve Pythia's sampling order with a different batch size #984

Open
lintangsutawika opened this issue Jul 3, 2023 · 3 comments
Labels: bug

Comments

@lintangsutawika (Contributor)

Describe the bug

I'd like to observe whether there is any substantial effect from using different batch sizes. For a fair comparison, it makes sense to use the exact same sampling order as the original Pythia runs. To do this, the idea is to keep the total number of tokens fixed for each batch-size variant by increasing or decreasing train-iters accordingly.

Double-checking the sampling order with utils/batch_viewer.py from Pythia, changing train_micro_batch_size_per_gpu while keeping train-iters the same does not affect the sampling order. However, modifying train-iters based on train_micro_batch_size_per_gpu, so that the total number of tokens is the same for each run, results in a different ordering.

These configurations result in the same ordering.

"train_micro_batch_size_per_gpu": 512,
"train-iters": 143000,
"train_micro_batch_size_per_gpu": 1024,
"train-iters": 143000,

This becomes an issue if we want to train on the same amount of data as the original Pythia runs (300B tokens): changing train-iters changes the ordering, while keeping train-iters fixed and only changing train_micro_batch_size_per_gpu does not yield the same number of tokens.
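
For concreteness, the token accounting I have in mind is sketched below (a minimal sketch; the 2048 sequence length is Pythia's usual value and is an assumption here, since it is not listed in the configs above):

```python
# Token accounting behind the issue. Assumes Pythia's sequence length of 2048;
# the batch sizes and iteration counts are the ones from the configs above.
SEQ_LEN = 2048

def total_tokens(batch_size: int, train_iters: int) -> int:
    # Tokens consumed by a run = samples per step * tokens per sample * steps.
    return batch_size * SEQ_LEN * train_iters

def iters_for_same_tokens(target_tokens: int, batch_size: int) -> int:
    # Iterations needed to hit the same token budget at a different batch size.
    return target_tokens // (batch_size * SEQ_LEN)

# Baseline Pythia-style run: batch size 1024, 143,000 iterations -> ~300B tokens.
target = total_tokens(1024, 143_000)
print(f"target tokens: {target:,}")

# Matching that budget at batch size 512 requires 286,000 iterations,
# and it is exactly this change to train-iters that perturbs the ordering.
print("iters needed at bs=512:", iters_for_same_tokens(target, 512))
```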

To Reproduce

  1. Run Pythia's utils/batch_viewer.py with utils/dummy_config.yml adjusted. I only observed the first 2 steps for batch size 512 and the first step for batch size 1024.
  2. Detokenize the resulting .npy files and compare the text directly (a sketch follows this list); you could also skip this step and compare the token IDs directly.
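
A rough sketch of step 2 is below; the .npy filenames and the (batch, seq_len) layout written by batch_viewer.py are assumptions on my end, so adjust them to whatever your run actually produced:

```python
# Decode dumped batches back to text for eyeballing. The file names here are
# hypothetical placeholders; any Pythia checkpoint shares the same tokenizer.
import numpy as np
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pythia-70m")

def dump_batch(path: str, n_samples: int = 4) -> None:
    tokens = np.load(path)  # assumed shape: (batch, seq_len) of token ids
    for row in tokens[:n_samples]:
        print(tokenizer.decode(row.tolist()))
        print("-" * 80)

dump_batch("batch_bs512_step0.npy")   # hypothetical filename
dump_batch("batch_bs1024_step0.npy")  # hypothetical filename
```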

Expected behavior
Runs that change train_micro_batch_size_per_gpu while adjusting train-iters to keep the total token count at 300B should see the data in the same order as the original Pythia runs.

Proposed solution
Not yet sure what the direct solution is; this might be an issue in how the dataset is loaded based on batch size.


@lintangsutawika (Contributor, Author)

@haileyschoelkopf @uSaiPrashanth maybe you both have an idea about this issue?

@uSaiPrashanth (Member)

From what I have observed, as long as you keep the number of epochs and the sequence length the same, your batch size or number of train iters should not matter (ref: https://github.com/EleutherAI/gpt-neox/blob/main/megatron/data/gpt2_dataset.py#L187).

Modifying train-iters based on train_micro_batch_size_per_gpu to keep total number of tokens the same for each run results in different ordering.

Could you check if this changes the number of epochs you're training on?
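
For intuition, here is a toy illustration of that argument (this is not the actual gpt2_dataset.py logic, just a simplified sketch of why regrouping a fixed shuffled order into different batch sizes should not change which sample comes when):

```python
# Toy sketch: the shuffle is drawn over whole epochs of samples, so it depends
# on the seed, the number of epochs, and the samples per epoch (a function of
# sequence length) -- not on how samples are later grouped into batches.
import numpy as np

def build_epoch_order(samples_per_epoch: int, num_epochs: int, seed: int = 1234) -> np.ndarray:
    rng = np.random.RandomState(seed)
    order = []
    for _ in range(num_epochs):
        epoch = np.arange(samples_per_epoch, dtype=np.int64)
        rng.shuffle(epoch)
        order.append(epoch)
    return np.concatenate(order)

order = build_epoch_order(samples_per_epoch=8, num_epochs=2)

# Regrouping the same flat order into different batch sizes does not change
# which sample is seen first, second, third, ...
print(order.reshape(-1, 2))  # "batch size 2"
print(order.reshape(-1, 4))  # "batch size 4"
```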

@StellaAthena (Member)

Here's a hack that should get around this: keep train-iters unchanged but modify lr_decay_iters. This will cause the LR decay schedule to act as if the training run were shorter, and then you can deliberately crash the run once it has trained on the desired number of tokens.
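
As a concrete (hypothetical) example of that hack, with a doubled batch size of 2048 and the usual 2048 sequence length, the original 300B-token budget is reached at iteration 71,500, so the config would look something like the following, with the run manually stopped at that iteration (the numbers here are my own illustration, not from the thread):

"train_micro_batch_size_per_gpu": 2048,
"train-iters": 143000,   # unchanged, so the sampling order is preserved
"lr_decay_iters": 71500, # 1024 * 2048 * 143000 tokens / (2048 batch * 2048 seq len)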
