
Adding replay into GPT-NeoX #1200

Open · wants to merge 8 commits into main

Conversation

@AIproj (Contributor) commented Apr 13, 2024

This PR adds replay to GPT-NeoX. I originally implemented it for the paper Simple and Scalable Strategies to Continually Pre-train Large Language Models, which shows simple ways to efficiently continue pretraining: improving adaptation to new data while mitigating forgetting of previous data. Note that this PR can also serve as a basis for adding the ability to resume training from a given index in a dataset, based on how I implemented that capability for replay datasets.

How to use

I tried to make the descriptions of the replay args informative enough to serve as documentation. An example of a config using replay is also provided in tests/config/example_replay_config.yml.
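For a quick sanity check of the wiring, something along these lines should work (a sketch assuming `NeoXArgs.from_ymls`, which NeoX uses to build args from YAML files; the replay field names are taken from this PR's diff):

```python
# Sketch: load the example replay config and inspect the new replay args.
from megatron.neox_arguments import NeoXArgs

neox_args = NeoXArgs.from_ymls(["tests/config/example_replay_config.yml"])
print(neox_args.is_replay_enabled)  # replay on/off flag used in the data path merging
print(neox_args.replay_data_paths)  # paths of the replay datasets
```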

Unsupported/untested features:

  • (UNTESTED) Using replay AND weighting by number of documents. An assert throws an error if someone tries to use both.
  • (UNSUPPORTED) Using replay AND splitting the datasets automatically instead of providing separate train, val, and test paths. An assert throws an error if someone tries to use both.
  • (UNSUPPORTED) Using replay AND label data. An assert throws an error if someone tries to use both. As indicated in the comments, it might be doable by adding a replay_label_data arg that specifies the prefix to the idx and data path of the replay label data, then generating the specific replay label data path from that prefix and treating it in a similar way as the training data in the block below:
```python
# The concatenate_train_replay_paths bool is necessary to avoid issues when
# this function gets called a second time.
if neox_args.is_replay_enabled and concatenate_train_replay_paths:
    # Merge the replay data paths into the train data paths, keeping track of
    # which paths in train_data_paths came from replay.
    num_replay_data_paths = len(neox_args.replay_data_paths)
    num_non_replay_data_paths = len(neox_args.train_data_paths)
    neox_args.train_data_paths += neox_args.replay_data_paths
```
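For context on why the replay/non-replay counts are tracked, here is a hedged sketch of how they could drive per-dataset sampling weights downstream (illustrative only; the weight formula and the `replay_fraction` name are assumptions, not the PR's exact code):

```python
# Illustrative: split the total sampling weight between non-replay and replay
# datasets according to the replay fraction (0.05 by default per the arg docs).
replay_fraction = neox_args.replay_fraction  # name assumed from the arg description

per_train_weight = (1.0 - replay_fraction) / num_non_replay_data_paths
per_replay_weight = replay_fraction / num_replay_data_paths

# Order matches the concatenation above: train paths first, then replay paths.
train_data_weights = (
    [per_train_weight] * num_non_replay_data_paths
    + [per_replay_weight] * num_replay_data_paths
)
```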

Pending tests

Currently, the tests required are:

  1. Sanity check that the first few batches are the same with and without these changes.
  2. Similarly to the above, check that label data support did not break.
  3. Sanity check that, given two datasets, training without replay but with 0.5 weights for each produces the same batches as setting one as the training dataset and the other as a replay dataset with replay fraction 0.5.

The tests can follow the procedure described in tests/model/test_batch_replicability.py. Tests 1 and 3 passed with the Summit version of NeoX, but I'll need to run them again on the replay implementation based on the current main branch. I'll probably need someone else to verify that label data support (test 2) did not break, as I'm unfamiliar with that feature of NeoX and am currently too busy to take it on.
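As a rough illustration of what test 1 could look like (a sketch; the dump-and-compare flow and file names are hypothetical, not necessarily what test_batch_replicability.py does):

```python
import torch

# Hypothetical flow for test 1: dump the first few batches from a run on main
# and a run on this branch, then compare them token for token.
baseline_batches = torch.load("batches_main.pt")     # hypothetical dump from main
replay_batches = torch.load("batches_replay_pr.pt")  # hypothetical dump from this branch

for i, (a, b) in enumerate(zip(baseline_batches, replay_batches)):
    assert torch.equal(a, b), f"batch {i} diverged between main and the replay branch"
print(f"first {len(baseline_batches)} batches match")
```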

@AIproj self-assigned this Apr 13, 2024
@bentherien (Contributor) commented

Please ignore the above commits. I accidentally pushed to upstream when modifying this branch in my fork.


Default = 0.05

Fraction of a batch dedicated to doing replay. For example, 0.1 means that in a batch of 100, 19 samples will come from the replay buffer.
A reviewer (Member) commented:

> For example, 0.1 means that in a batch of 100, 19 samples will come from the replay buffer.

Is this a typo? Why wouldn't it be 10 samples?
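For what it's worth, the natural reading of the fraction would give (illustrative arithmetic, not code from the PR):

```python
# Expected semantics of the replay fraction: portion of each batch drawn from replay.
batch_size = 100
replay_fraction = 0.1
num_replay_samples = int(replay_fraction * batch_size)  # -> 10, not 19
```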


- **replay_seed**: int

Default = 1234
A reviewer (Member) commented:

From your other comments, it seems important that the replay seed isn't the same as the general data seed. If that's correct, let's use a different default.
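If the seeds indeed must differ, a guard along these lines could enforce it (illustrative sketch, not in the PR; assumes the general data seed lives in neox_args.seed):

```python
# Illustrative guard: keep replay sampling decorrelated from the main data shuffle.
assert neox_args.replay_seed != neox_args.seed, (
    "replay_seed must differ from the general data seed"
)
```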
