
LlamaTokenizer with use_fast=True / and use_fast=False causing memory leak when used with multiprocessing / dataset.map(num_proc) #1495

Open
michaelfeil opened this issue Apr 15, 2024 · 5 comments


michaelfeil commented Apr 15, 2024

When running dataset.map with num_proc=16, I am unable to tokenize a ~45GB dataset on a machine with >200GB of RAM. The dataset consists of ~30,000 rows, each containing a string of 120-180k characters.

Memory usage grows linearly until it hits the 200GB maximum, after just ~2,000 such iterations / 2,000 rows.
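
For anyone reproducing this, a minimal way to confirm the per-worker growth (my own tooling suggestion; psutil is an extra dependency, not part of the repro below) is to log each worker's resident set size from inside the map function:

# Hedged sketch: log each worker's RSS so the leak is visible per process.
# psutil is an assumption here, not used in the original reproduction.
import os
import psutil

def log_rss(rank: int = 0) -> None:
    rss_gb = psutil.Process(os.getpid()).memory_info().rss / 1024**3
    print(f"[rank {rank}] pid={os.getpid()} rss={rss_gb:.2f} GB")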

Other things I have tried:

  • Creating e.g. 16 tokenizers in global scope and accessing them via the rank parameter (see the sketch after this list); the leak is unchanged.
  • Calling gc.collect() manually.
  • Disabling use_fast (i.e., the slow tokenizer), which only makes the script more memory-efficient: it then takes ~10k rows instead of ~2k to go OOM, but still leaks.
  • Using AutoTokenizer instead of LlamaTokenizerFast.
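
For reference, the per-rank attempt looked roughly like this (a reconstruction, not the exact code I ran; TOKENIZERS is a hypothetical module-level cache):

# Hedged reconstruction of the "one tokenizer per rank" attempt.
# Each forked worker only ever touches its own slot, so this simply
# caches one tokenizer per process.
from transformers import AutoTokenizer

N_PROCS = 16
TOKENIZERS = [None] * N_PROCS

def tokenize_per_rank(example, rank: int = 0):
    if TOKENIZERS[rank] is None:
        TOKENIZERS[rank] = AutoTokenizer.from_pretrained(
            "TinyLlama/TinyLlama-1.1B-Chat-v1.0", use_fast=True
        )
    example["input_ids"] = TOKENIZERS[rank](example["content"])["input_ids"]
    example["n_tokens"] = len(example["input_ids"])
    return example

This behaved no differently from the single global tokenizer.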

Reproduction script

import gc

import datasets
from transformers import LlamaTokenizerFast, AutoTokenizer

N_PROCS = 16

# Lazily created once per worker process.
tokenizer_tinyllama = None

def tokenize(example, rank: int = 0):
    global tokenizer_tinyllama

    # gc.collect()  # manual collection was also tried; it does not help
    if tokenizer_tinyllama is None:
        tokenizer_tinyllama = AutoTokenizer.from_pretrained(
            "TinyLlama/TinyLlama-1.1B-Chat-v1.0", use_fast=True
        )

    example["input_ids"] = tokenizer_tinyllama(example["content"], max_length=None)["input_ids"]
    example["n_tokens"] = len(example["input_ids"])
    example["content"] = None  # drop the raw text from the output
    return example

def main():
    # jsonl file, around 45GB on disk
    books3 = datasets.load_dataset("michael/books3_128k", streaming=False, keep_in_memory=False)
    # books3 = books3.shuffle()

    books3_updated = books3["train"].map(
        tokenize,
        num_proc=N_PROCS,
        with_rank=True,
    )
    books3_updated.push_to_hub("michael/books3_128k_tokenized")

if __name__ == "__main__":
    main()

Env

OS: Ubuntu 22.04

pip freeze

aiohttp==3.9.4
aiosignal==1.3.1
async-timeout==4.0.3
attrs==21.2.0
Automat==20.2.0
Babel==2.8.0
bcrypt==3.2.0
blinker==1.4
certifi==2020.6.20
chardet==4.0.0
click==8.0.3
cloud-init==23.4.4
colorama==0.4.4
command-not-found==0.3
configobj==5.0.6
constantly==15.1.0
cryptography==3.4.8
datasets==2.18.0
dbus-python==1.2.18
decorator==4.4.2
devscripts===2.22.1ubuntu1
dill==0.3.8
distro==1.7.0
distro-info==1.1+ubuntu0.2
filelock==3.13.4
frozenlist==1.4.1
fsspec==2024.2.0
gpg==1.16.0
hf_transfer==0.1.6
httplib2==0.20.2
huggingface-hub==0.22.2
hyperlink==21.0.0
idna==3.3
importlib-metadata==4.6.4
incremental==21.3.0
jeepney==0.7.1
Jinja2==3.0.3
jsonpatch==1.32
jsonpointer==2.0
jsonschema==3.2.0
keyring==23.5.0
launchpadlib==1.10.16
lazr.restfulclient==0.14.4
lazr.uri==1.0.6
MarkupSafe==2.0.1
more-itertools==8.10.0
multidict==6.0.5
multiprocess==0.70.16
netifaces==0.11.0
numpy==1.26.4
oauthlib==3.2.0
packaging==24.0
pandas==2.2.2
pexpect==4.8.0
protobuf==5.26.1
ptyprocess==0.7.0
pyarrow==15.0.2
pyarrow-hotfix==0.6
pyasn1==0.4.8
pyasn1-modules==0.2.1
PyGObject==3.42.1
PyHamcrest==2.0.2
PyJWT==2.3.0
pyOpenSSL==21.0.0
pyparsing==2.4.7
pyrsistent==0.18.1
pyserial==3.5
python-apt==2.4.0+ubuntu3
python-dateutil==2.9.0.post0
python-debian==0.1.43+ubuntu1.1
python-linux-procfs==0.6.3
python-magic==0.4.24
pytz==2022.1
pyudev==0.22.0
pyxdg==0.27
PyYAML==5.4.1
regex==2023.12.25
requests==2.25.1
safetensors==0.4.3
screen-resolution-extra==0.0.0
SecretStorage==3.3.1
sentencepiece==0.2.0
service-identity==18.1.0
six==1.16.0
sos==4.5.6
ssh-import-id==5.11
systemd-python==234
tokenizers==0.15.2
tqdm==4.66.2
transformers==4.39.3
Twisted==22.1.0
typing_extensions==4.11.0
tzdata==2024.1
ubuntu-advantage-tools==8001
ufw==0.36.1
unattended-upgrades==0.1
unidiff==0.5.5
urllib3==1.26.5
wadllib==1.3.6
xdg==5
xkit==0.0.0
xxhash==3.4.1
yarl==1.9.4
zipp==1.0.0
zope.interface==5.4.0
michaelfeil (Author) commented:

Update: the following function does not seem to show this behavior. The only difference is that the tokenizer is re-created on every call instead of being cached in a global.

def tokenize(example, rank: int = 0):
    gc.collect()

    # Key difference: no global cache; the tokenizer is re-created on
    # every call, so nothing accumulates inside a long-lived instance.
    tokenizer_tinyllama = LlamaTokenizerFast.from_pretrained(
        "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
    )

    example["input_ids"] = tokenizer_tinyllama(example["content"], max_length=None)["input_ids"]
    example["n_tokens"] = len(example["input_ids"])
    example["content"] = None
    return example
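
Re-creating the tokenizer for every row is slow, so a possible middle ground (my sketch, untested at this scale; batch_size=256 is an arbitrary choice) is a batched map that pays the construction cost once per batch instead of once per row:

# Hedged sketch: batched variant; the tokenizer is rebuilt once per
# batch of rows rather than once per row or once per process.
def tokenize_batched(batch, rank: int = 0):
    tok = LlamaTokenizerFast.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")
    input_ids = tok(batch["content"], max_length=None)["input_ids"]
    return {"input_ids": input_ids, "n_tokens": [len(ids) for ids in input_ids]}

# books3["train"].map(tokenize_batched, batched=True, batch_size=256,
#                     num_proc=N_PROCS, with_rank=True, remove_columns=["content"])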

github-actions bot commented May 16, 2024:

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

github-actions bot added the Stale label May 16, 2024
michaelfeil (Author) commented:

No, not stale!

github-actions bot removed the Stale label May 17, 2024

noamgai21 commented May 22, 2024

I also encounter a similar issue with tokenizers 0.19.1.


noamgai21 commented May 23, 2024

Opened a new issue with a more general reproduction; I believe this is a more common problem.
