LLamaTokenizer with use_fast=True / and use_fast=False causing memory leak when used with multiprocessing / dataset.map(num_proc)
#1495

Comments
michaelfeil changed the title (Apr 15, 2024)
Update: the following function does not seem to have such behavior.

```python
import gc

from transformers import LlamaTokenizerFast

def tokenize(example, rank: int = 0):
    gc.collect()
    # Load the tokenizer inside the worker process rather than sharing a
    # global instance across processes.
    tokenizer_tinyllama = LlamaTokenizerFast.from_pretrained(
        "TinyLlama/TinyLlama-1.1B-Chat-v1.0", use_fast=True
    )
    example["input_ids"] = tokenizer_tinyllama(example["content"], max_length=None)["input_ids"]
    example["n_tokens"] = len(example["input_ids"])
    example["content"] = None
    return example
```
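The key pattern above, constructing the tokenizer inside the worker function instead of inheriting a shared instance from the parent process, can be sketched with the standard library alone. This is a minimal illustration, not the issue's actual script: `load_tokenizer` is a dummy whitespace splitter standing in for `LlamaTokenizerFast.from_pretrained`.

```python
import multiprocessing as mp

def load_tokenizer():
    # Dummy stand-in for LlamaTokenizerFast.from_pretrained(...): a trivial
    # whitespace "tokenizer" so the sketch is self-contained.
    return lambda text: text.split()

def tokenize(example):
    # Re-create the tokenizer inside each worker call instead of sharing a
    # global instance across processes (the pattern from the update above).
    tokenizer = load_tokenizer()
    example["input_ids"] = tokenizer(example["content"])
    example["n_tokens"] = len(example["input_ids"])
    example["content"] = None
    return example

if __name__ == "__main__":
    rows = [{"content": "a b c"}, {"content": "d e"}]
    with mp.Pool(2) as pool:
        out = pool.map(tokenize, rows)
    print([r["n_tokens"] for r in out])  # [3, 2]
```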
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

No, not stale!

I also encounter a similar issue with 0.19.1.

Opened a new issue with a more general reproduction; I believe this is a more common problem.
When running a dataset.map with num_proc=16, I am unable to tokenize a ~45GB dataset on a machine with >200GB RAM. The dataset consists of ~30,000 rows, each a string of 120-180k characters. Memory grows linearly until it hits the 200GB maximum, after just ~2000 such iterations / 2000 lines.
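The linear memory growth described above can be measured per iteration. This is a hedged monitoring sketch using only the standard library; `rss_kb` and `check_growth` are hypothetical helpers, not part of the issue's reproduction (note `resource` is Unix-only, and `ru_maxrss` is reported in KB on Linux but bytes on macOS):

```python
import gc
import resource

def rss_kb():
    # Peak resident set size of the current process so far
    # (KB on Linux, bytes on macOS).
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

def check_growth(step_fn, iterations=2000, report_every=500):
    # Run step_fn repeatedly and sample peak RSS, to spot the kind of
    # linear growth reported above. Hypothetical helper for illustration.
    samples = []
    for i in range(iterations):
        step_fn(i)
        if i % report_every == 0:
            gc.collect()
            samples.append((i, rss_kb()))
    return samples
```

If the leak is real, the sampled RSS values keep rising across iterations instead of plateauing after the first few.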
Other things I have tried:

- 16 tokenizers in global scope, accessed via the `rank` parameter.
- `gc.collect`
- `use_fast` makes the script more efficient - it now takes ~10k lines instead of 2k to go OOM.

Reproduction script
Env
OS: Ubuntu 22.04
PIP freeze