OOM error when training on a 220G Memory machine with 8 V100. #867

Open
SefaZeng opened this issue Apr 2, 2023 · 2 comments
Labels
feature request New feature or request

Comments


SefaZeng commented Apr 2, 2023

Is your feature request related to a problem? Please describe.
I tried to train a model on my own data, which is about 300G after binarization, and training fails with an OOM error. Is there a parameter that loads the data in a streaming fashion, one part at a time?

SefaZeng added the "feature request" label Apr 2, 2023
@Quentin-Anthony
Member

There is not. I recommend you shard your dataset like the Pile (https://the-eye.eu/public/AI/pile/train/) and then unpack and train on one shard at a time.
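
As an illustration of the sharding step, here is a minimal sketch that splits a raw corpus into numbered shards before binarization. It assumes the raw data is a line-delimited text file (for example JSONL, one document per line); the function name `shard_lines`, the shard naming scheme, and `lines_per_shard` are hypothetical and not part of gpt-neox.

```python
from pathlib import Path


def shard_lines(src: Path, out_dir: Path, lines_per_shard: int) -> list[Path]:
    """Split a line-delimited file into shards of at most lines_per_shard lines.

    Hypothetical helper, not a gpt-neox API. Lines are streamed one at a
    time, so the source file is never held in memory as a whole.
    """
    if lines_per_shard < 1:
        raise ValueError("lines_per_shard must be at least 1")
    out_dir.mkdir(parents=True, exist_ok=True)
    shards: list[Path] = []
    out = None
    try:
        with src.open("r", encoding="utf-8") as f:
            for i, line in enumerate(f):
                # Start a new shard file every lines_per_shard lines.
                if i % lines_per_shard == 0:
                    if out is not None:
                        out.close()
                    path = out_dir / f"{src.stem}-{len(shards):02d}{src.suffix}"
                    out = path.open("w", encoding="utf-8")
                    shards.append(path)
                out.write(line)
    finally:
        if out is not None:
            out.close()
    return shards
```

Each resulting shard would then be binarized separately, so that every shard gets its own data prefix.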

Author

SefaZeng commented Apr 3, 2023

> There is not. I recommend you shard your dataset like the Pile (https://the-eye.eu/public/AI/pile/train/) and then unpack and train on one shard at a time.

Thank you for your reply. Do you mean I should change the train data paths from

    "train-data-paths": [
        "/data/pile-train_text_document",
        "/data/clue_pretrain_0-train_text_document",
    ],
    "train-data-weights": [40.0, 60.0],

to

    "train-data-paths": [
        "/data/pile-00-train_text_document",
        "/data/pile-01-train_text_document",
        ...
        "/data/clue_pretrain_0-train_text_document",
    ],
    "train-data-weights": [40.0, 40.0, ..., 60.0],

?
I thought gpt-neox loads all the data listed in train-data-paths into memory at once. I can split the data into multiple shards, but I am not sure how to load one shard at a time. Is there an example?
Thanks again for your kind response! :-)
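
One possible reading of "train on one shard at a time", which the maintainers do not confirm in this thread, is to run training once per shard, with each run's config listing a single Pile shard next to the other dataset. The sketch below only generates those config fragments. The helper `per_shard_configs` is hypothetical, the paths and weights are copied from the snippets above, and how each run is launched and resumed from the previous checkpoint is left out because the thread does not cover it.

```python
import json


def per_shard_configs(shard_paths, fixed_paths, shard_weight, fixed_weights):
    """Return one config fragment per shard (hypothetical helper).

    Each fragment lists exactly one shard plus the datasets that stay
    the same across runs.
    """
    if len(fixed_paths) != len(fixed_weights):
        raise ValueError("fixed_paths and fixed_weights must have the same length")
    return [
        {
            "train-data-paths": [shard, *fixed_paths],
            "train-data-weights": [shard_weight, *fixed_weights],
        }
        for shard in shard_paths
    ]


fragments = per_shard_configs(
    shard_paths=[
        "/data/pile-00-train_text_document",
        "/data/pile-01-train_text_document",
    ],
    fixed_paths=["/data/clue_pretrain_0-train_text_document"],
    shard_weight=40.0,
    fixed_weights=[60.0],
)

# Print each fragment; JSON output is also valid YAML.
for fragment in fragments:
    print(json.dumps(fragment, indent=2))
```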
