OSError: [Errno 24] Too many open files #6877

Closed
loicmagne opened this issue May 7, 2024 · 5 comments · Fixed by #6893

Comments

@loicmagne

Describe the bug

I am trying to load the 'default' subset of the following dataset which contains lots of files (828 per split): https://huggingface.co/datasets/mteb/biblenlp-corpus-mmteb

When trying to load it using the load_dataset function, I get the following error:

>>> from datasets import load_dataset
>>> d = load_dataset('mteb/biblenlp-corpus-mmteb')
Downloading readme: 100%|████████████████████████████████████████████████████████████████████████| 201k/201k [00:00<00:00, 1.07MB/s]
Resolving data files: 100%|█████████████████████████████████████████████████████████████████████| 828/828 [00:00<00:00, 1069.15it/s]
Resolving data files: 100%|███████████████████████████████████████████████████████████████████| 828/828 [00:00<00:00, 436182.33it/s]
Resolving data files: 100%|█████████████████████████████████████████████████████████████████████| 828/828 [00:00<00:00, 2228.75it/s]
Resolving data files: 100%|███████████████████████████████████████████████████████████████████| 828/828 [00:00<00:00, 646478.73it/s]
Resolving data files: 100%|███████████████████████████████████████████████████████████████████| 828/828 [00:00<00:00, 831032.24it/s]
Resolving data files: 100%|███████████████████████████████████████████████████████████████████| 828/828 [00:00<00:00, 517645.51it/s]
Downloading data: 100%|████████████████████████████████████████████████████████████████████████| 828/828 [00:33<00:00, 24.87files/s]
Downloading data: 100%|████████████████████████████████████████████████████████████████████████| 828/828 [00:30<00:00, 27.48files/s]
Downloading data: 100%|████████████████████████████████████████████████████████████████████████| 828/828 [00:30<00:00, 26.94files/s]
Generating train split: 1571592 examples [00:03, 461438.97 examples/s]
Generating test split: 11163 examples [00:00, 118190.72 examples/s]
Traceback (most recent call last):
  File ".env/lib/python3.12/site-packages/datasets/builder.py", line 1995, in _prepare_split_single
    for _, table in generator:
  File ".env/lib/python3.12/site-packages/datasets/packaged_modules/json/json.py", line 99, in _generate_tables
    with open(file, "rb") as f:
         ^^^^^^^^^^^^^^^^
  File ".env/lib/python3.12/site-packages/datasets/streaming.py", line 75, in wrapper
    return function(*args, download_config=download_config, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".env/lib/python3.12/site-packages/datasets/utils/file_utils.py", line 1224, in xopen
    file_obj = fsspec.open(file, mode=mode, *args, **kwargs).open()
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".env/lib/python3.12/site-packages/fsspec/core.py", line 135, in open
    return self.__enter__()
           ^^^^^^^^^^^^^^^^
  File ".env/lib/python3.12/site-packages/fsspec/core.py", line 103, in __enter__
    f = self.fs.open(self.path, mode=mode)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".env/lib/python3.12/site-packages/fsspec/spec.py", line 1293, in open
    f = self._open(
        ^^^^^^^^^^^
  File ".env/lib/python3.12/site-packages/datasets/filesystems/compression.py", line 81, in _open
    return self.file.open()
           ^^^^^^^^^^^^^^^^
  File ".env/lib/python3.12/site-packages/fsspec/core.py", line 135, in open
    return self.__enter__()
           ^^^^^^^^^^^^^^^^
  File ".env/lib/python3.12/site-packages/fsspec/core.py", line 103, in __enter__
    f = self.fs.open(self.path, mode=mode)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".env/lib/python3.12/site-packages/fsspec/spec.py", line 1293, in open
    f = self._open(
        ^^^^^^^^^^^
  File ".env/lib/python3.12/site-packages/fsspec/implementations/local.py", line 197, in _open
    return LocalFileOpener(path, mode, fs=self, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".env/lib/python3.12/site-packages/fsspec/implementations/local.py", line 322, in __init__
    self._open()
  File ".env/lib/python3.12/site-packages/fsspec/implementations/local.py", line 327, in _open
    self.f = open(self.path, mode=self.mode)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
OSError: [Errno 24] Too many open files: '.cache/huggingface/datasets/downloads/3a347186abfc0f9c924dde0221d246db758c7232c0101523f04a87c17d696618'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File ".env/lib/python3.12/site-packages/datasets/builder.py", line 981, in incomplete_dir
    yield tmp_dir
  File ".env/lib/python3.12/site-packages/datasets/builder.py", line 1027, in download_and_prepare
    self._download_and_prepare(
  File ".env/lib/python3.12/site-packages/datasets/builder.py", line 1122, in _download_and_prepare
    self._prepare_split(split_generator, **prepare_split_kwargs)
  File ".env/lib/python3.12/site-packages/datasets/builder.py", line 1882, in _prepare_split
    for job_id, done, content in self._prepare_split_single(
  File ".env/lib/python3.12/site-packages/datasets/builder.py", line 2038, in _prepare_split_single
    raise DatasetGenerationError("An error occurred while generating the dataset") from e
datasets.exceptions.DatasetGenerationError: An error occurred while generating the dataset

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File ".env/lib/python3.12/site-packages/datasets/load.py", line 2609, in load_dataset
    builder_instance.download_and_prepare(
  File ".env/lib/python3.12/site-packages/datasets/builder.py", line 1007, in download_and_prepare
    with incomplete_dir(self._output_dir) as tmp_output_dir:
  File "/usr/lib/python3.12/contextlib.py", line 158, in __exit__
    self.gen.throw(value)
  File ".env/lib/python3.12/site-packages/datasets/builder.py", line 988, in incomplete_dir
    shutil.rmtree(tmp_dir)
  File "/usr/lib/python3.12/shutil.py", line 785, in rmtree
    _rmtree_safe_fd(fd, path, onexc)
  File "/usr/lib/python3.12/shutil.py", line 661, in _rmtree_safe_fd
    onexc(os.scandir, path, err)
  File "/usr/lib/python3.12/shutil.py", line 657, in _rmtree_safe_fd
    with os.scandir(topfd) as scandir_it:
         ^^^^^^^^^^^^^^^^^
OSError: [Errno 24] Too many open files: '.cache/huggingface/datasets/mteb___biblenlp-corpus-mmteb/default/0.0.0/3912ed967b0834547f35b2da9470c4976b357c9a.incomplete'

I checked the maximum number of open files on my machine (Ubuntu 24.04) and it seems to be 1024, but even when I try to load a single split (load_dataset('mteb/biblenlp-corpus-mmteb', split='train')) I get the same error.
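As a quick diagnostic (a minimal sketch using Python's standard resource module, which is Unix-only), the per-process limit the traceback is hitting can be confirmed from inside the interpreter:

```python
import resource

# Inspect the per-process open-file limits (Unix-only).
# The soft limit is what open() is checked against; on many Linux
# distributions it defaults to 1024, while the hard limit is the
# ceiling an unprivileged process may raise the soft limit to.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"soft limit: {soft}, hard limit: {hard}")
```

If the soft limit printed here is 1024, any loader that keeps more file handles open than that will fail with Errno 24.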

Steps to reproduce the bug

from datasets import load_dataset
d = load_dataset('mteb/biblenlp-corpus-mmteb')

Expected behavior

Load the dataset without error

Environment info

  • datasets version: 2.19.0
  • Platform: Linux-6.8.0-31-generic-x86_64-with-glibc2.39
  • Python version: 3.12.3
  • huggingface_hub version: 0.23.0
  • PyArrow version: 16.0.0
  • Pandas version: 2.2.2
  • fsspec version: 2024.3.1
@arthasking123

ulimit -n 8192 can solve this problem
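For reference, the soft limit can be inspected and raised per shell session before launching Python (the value 8192 is the one suggested above; any value up to the hard ceiling works):

```shell
ulimit -Sn        # show the current soft limit (often 1024)
ulimit -Hn        # show the hard ceiling the soft limit may be raised to
ulimit -n 8192    # raise the soft limit for this shell and its children
```

Note this only affects the current shell session; it is not a persistent or programmatic fix.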

@loicmagne
Author

> ulimit -n 8192 can solve this problem

Would there be a systematic way to do this? The data loading is part of the MTEB library.

@arthasking123

> ulimit -n 8192 can solve this problem
>
> Would there be a systematic way to do this? The data loading is part of the MTEB library.

I think we could modify the _prepare_split_single function
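In the meantime, a library could raise its own process's soft limit as a best-effort equivalent of ulimit -n. This is a hypothetical helper (raise_open_file_limit is not part of datasets or MTEB), sketched with Python's Unix-only resource module:

```python
import resource

def raise_open_file_limit(target: int = 8192) -> int:
    """Raise the soft RLIMIT_NOFILE toward `target`, capped at the hard limit.

    An unprivileged process may freely raise its soft limit up to the
    hard limit, so this never requires root. Returns the soft limit in
    effect afterwards.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if hard == resource.RLIM_INFINITY:
        new_soft = max(soft, target)
    else:
        new_soft = min(max(soft, target), hard)
    if new_soft > soft:
        resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))
    return resource.getrlimit(resource.RLIMIT_NOFILE)[0]
```

Raising the limit only hides the symptom, though; the cleaner fix is for the loader not to hold hundreds of file descriptors open at once.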

arthasking123 added a commit to arthasking123/datasets that referenced this issue May 9, 2024
fix bug huggingface#6877 due to f becoming invalid after yield
@lhoestq
Member

lhoestq commented May 13, 2024

I fixed it with #6893, feel free to re-open if you're still having the issue :)

@loicmagne
Author

> I fixed it with #6893, feel free to re-open if you're still having the issue :)

Thanks a lot!
