OSError: [Errno 24] Too many open files #6877

Closed
loicmagne opened this issue May 7, 2024 · 5 comments · Fixed by #6893

Comments

@loicmagne

Describe the bug

I am trying to load the 'default' subset of the following dataset which contains lots of files (828 per split): https://huggingface.co/datasets/mteb/biblenlp-corpus-mmteb

When trying to load it using the load_dataset function, I get the following error:

>>> from datasets import load_dataset
>>> d = load_dataset('mteb/biblenlp-corpus-mmteb')
Downloading readme: 100%|████████████████████████████████████████████████████████████████████████| 201k/201k [00:00<00:00, 1.07MB/s]
Resolving data files: 100%|█████████████████████████████████████████████████████████████████████| 828/828 [00:00<00:00, 1069.15it/s]
Resolving data files: 100%|███████████████████████████████████████████████████████████████████| 828/828 [00:00<00:00, 436182.33it/s]
Resolving data files: 100%|█████████████████████████████████████████████████████████████████████| 828/828 [00:00<00:00, 2228.75it/s]
Resolving data files: 100%|███████████████████████████████████████████████████████████████████| 828/828 [00:00<00:00, 646478.73it/s]
Resolving data files: 100%|███████████████████████████████████████████████████████████████████| 828/828 [00:00<00:00, 831032.24it/s]
Resolving data files: 100%|███████████████████████████████████████████████████████████████████| 828/828 [00:00<00:00, 517645.51it/s]
Downloading data: 100%|████████████████████████████████████████████████████████████████████████| 828/828 [00:33<00:00, 24.87files/s]
Downloading data: 100%|████████████████████████████████████████████████████████████████████████| 828/828 [00:30<00:00, 27.48files/s]
Downloading data: 100%|████████████████████████████████████████████████████████████████████████| 828/828 [00:30<00:00, 26.94files/s]
Generating train split: 1571592 examples [00:03, 461438.97 examples/s]
Generating test split: 11163 examples [00:00, 118190.72 examples/s]
Traceback (most recent call last):
  File ".env/lib/python3.12/site-packages/datasets/builder.py", line 1995, in _prepare_split_single
    for _, table in generator:
  File ".env/lib/python3.12/site-packages/datasets/packaged_modules/json/json.py", line 99, in _generate_tables
    with open(file, "rb") as f:
         ^^^^^^^^^^^^^^^^
  File ".env/lib/python3.12/site-packages/datasets/streaming.py", line 75, in wrapper
    return function(*args, download_config=download_config, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".env/lib/python3.12/site-packages/datasets/utils/file_utils.py", line 1224, in xopen
    file_obj = fsspec.open(file, mode=mode, *args, **kwargs).open()
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".env/lib/python3.12/site-packages/fsspec/core.py", line 135, in open
    return self.__enter__()
           ^^^^^^^^^^^^^^^^
  File ".env/lib/python3.12/site-packages/fsspec/core.py", line 103, in __enter__
    f = self.fs.open(self.path, mode=mode)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".env/lib/python3.12/site-packages/fsspec/spec.py", line 1293, in open
    f = self._open(
        ^^^^^^^^^^^
  File ".env/lib/python3.12/site-packages/datasets/filesystems/compression.py", line 81, in _open
    return self.file.open()
           ^^^^^^^^^^^^^^^^
  File ".env/lib/python3.12/site-packages/fsspec/core.py", line 135, in open
    return self.__enter__()
           ^^^^^^^^^^^^^^^^
  File ".env/lib/python3.12/site-packages/fsspec/core.py", line 103, in __enter__
    f = self.fs.open(self.path, mode=mode)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".env/lib/python3.12/site-packages/fsspec/spec.py", line 1293, in open
    f = self._open(
        ^^^^^^^^^^^
  File ".env/lib/python3.12/site-packages/fsspec/implementations/local.py", line 197, in _open
    return LocalFileOpener(path, mode, fs=self, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".env/lib/python3.12/site-packages/fsspec/implementations/local.py", line 322, in __init__
    self._open()
  File ".env/lib/python3.12/site-packages/fsspec/implementations/local.py", line 327, in _open
    self.f = open(self.path, mode=self.mode)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
OSError: [Errno 24] Too many open files: '.cache/huggingface/datasets/downloads/3a347186abfc0f9c924dde0221d246db758c7232c0101523f04a87c17d696618'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File ".env/lib/python3.12/site-packages/datasets/builder.py", line 981, in incomplete_dir
    yield tmp_dir
  File ".env/lib/python3.12/site-packages/datasets/builder.py", line 1027, in download_and_prepare
    self._download_and_prepare(
  File ".env/lib/python3.12/site-packages/datasets/builder.py", line 1122, in _download_and_prepare
    self._prepare_split(split_generator, **prepare_split_kwargs)
  File ".env/lib/python3.12/site-packages/datasets/builder.py", line 1882, in _prepare_split
    for job_id, done, content in self._prepare_split_single(
  File ".env/lib/python3.12/site-packages/datasets/builder.py", line 2038, in _prepare_split_single
    raise DatasetGenerationError("An error occurred while generating the dataset") from e
datasets.exceptions.DatasetGenerationError: An error occurred while generating the dataset

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File ".env/lib/python3.12/site-packages/datasets/load.py", line 2609, in load_dataset
    builder_instance.download_and_prepare(
  File ".env/lib/python3.12/site-packages/datasets/builder.py", line 1007, in download_and_prepare
    with incomplete_dir(self._output_dir) as tmp_output_dir:
  File "/usr/lib/python3.12/contextlib.py", line 158, in __exit__
    self.gen.throw(value)
  File ".env/lib/python3.12/site-packages/datasets/builder.py", line 988, in incomplete_dir
    shutil.rmtree(tmp_dir)
  File "/usr/lib/python3.12/shutil.py", line 785, in rmtree
    _rmtree_safe_fd(fd, path, onexc)
  File "/usr/lib/python3.12/shutil.py", line 661, in _rmtree_safe_fd
    onexc(os.scandir, path, err)
  File "/usr/lib/python3.12/shutil.py", line 657, in _rmtree_safe_fd
    with os.scandir(topfd) as scandir_it:
         ^^^^^^^^^^^^^^^^^
OSError: [Errno 24] Too many open files: '.cache/huggingface/datasets/mteb___biblenlp-corpus-mmteb/default/0.0.0/3912ed967b0834547f35b2da9470c4976b357c9a.incomplete'

I checked the maximum number of open files on my machine (Ubuntu 24.04) and it seems to be 1024, but even when I try to load a single split (load_dataset('mteb/biblenlp-corpus-mmteb', split='train')) I get the same error.
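As a quick diagnostic (a minimal sketch using Python's standard resource module, which is Unix-only), the per-process limit the traceback is hitting can be confirmed from inside the interpreter:

```python
import resource

# Inspect the per-process open-file limits (Unix-only).
# The soft limit is what open() is checked against; on many Linux
# distributions it defaults to 1024, while the hard limit is the
# ceiling an unprivileged process may raise the soft limit to.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"soft limit: {soft}, hard limit: {hard}")
```

If the soft limit printed here is 1024, any loader that keeps more file handles open than that will fail with Errno 24.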

Steps to reproduce the bug

from datasets import load_dataset
d = load_dataset('mteb/biblenlp-corpus-mmteb')

Expected behavior

Load the dataset without error

Environment info

  • datasets version: 2.19.0
  • Platform: Linux-6.8.0-31-generic-x86_64-with-glibc2.39
  • Python version: 3.12.3
  • huggingface_hub version: 0.23.0
  • PyArrow version: 16.0.0
  • Pandas version: 2.2.2
  • fsspec version: 2024.3.1
@arthasking123

ulimit -n 8192 can solve this problem
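For reference, the soft limit can be inspected and raised per shell session before launching Python (the value 8192 is the one suggested above; any value up to the hard ceiling works):

```shell
ulimit -Sn        # show the current soft limit (often 1024)
ulimit -Hn        # show the hard ceiling the soft limit may be raised to
ulimit -n 8192    # raise the soft limit for this shell and its children
```

Note this only affects the current shell session; it is not a persistent or programmatic fix.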

@loicmagne
Author

> ulimit -n 8192 can solve this problem

Would there be a systematic way to do this? The data loading is part of the MTEB library.

@arthasking123

> ulimit -n 8192 can solve this problem
>
> Would there be a systematic way to do this? The data loading is part of the MTEB library.

I think we could modify the _prepare_split_single function
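In the meantime, a library could raise its own process's soft limit as a best-effort equivalent of ulimit -n. This is a hypothetical helper (raise_open_file_limit is not part of datasets or MTEB), sketched with Python's Unix-only resource module:

```python
import resource

def raise_open_file_limit(target: int = 8192) -> int:
    """Raise the soft RLIMIT_NOFILE toward `target`, capped at the hard limit.

    An unprivileged process may freely raise its soft limit up to the
    hard limit, so this never requires root. Returns the soft limit in
    effect afterwards.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if hard == resource.RLIM_INFINITY:
        new_soft = max(soft, target)
    else:
        new_soft = min(max(soft, target), hard)
    if new_soft > soft:
        resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))
    return resource.getrlimit(resource.RLIMIT_NOFILE)[0]
```

Raising the limit only hides the symptom, though; the cleaner fix is for the loader not to hold hundreds of file descriptors open at once.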

arthasking123 added a commit to arthasking123/datasets that referenced this issue May 9, 2024
fix bug huggingface#6877 due to f becoming invalid after yield
@lhoestq
Member

lhoestq commented May 13, 2024

I fixed it with #6893, feel free to re-open if you're still having the issue :)

@loicmagne
Author

> I fixed it with #6893, feel free to re-open if you're still having the issue :)

Thanks a lot!
