
fix bug #6877 #6889

Closed · wants to merge 2 commits

Conversation

@arthasking123 commented May 9, 2024

Fix bug #6877: f may become invalid after the yield in the generation process.
The results are below:

Resolving data files: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 828/828 [00:01<00:00, 420.41it/s]
Resolving data files: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 828/828 [00:00<00:00, 26148.48it/s]
Resolving data files: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 828/828 [00:00<00:00, 409731.44it/s]
Resolving data files: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 828/828 [00:00<00:00, 289720.84it/s]
Resolving data files: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 828/828 [00:00<00:00, 26663.42it/s]
Resolving data files: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 828/828 [00:00<00:00, 434056.21it/s]
Downloading data: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 828/828 [00:00<00:00, 13222.33files/s]
Downloading data: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 828/828 [00:04<00:00, 180.67files/s]
Downloading data: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 828/828 [01:35<00:00, 8.70files/s]
Generating train split: 1571592 examples [00:08, 176736.09 examples/s]
Generating test split: 85533 examples [00:01, 48224.56 examples/s]
Generating validation split: 86246 examples [00:01, 50164.16 examples/s]

Fix #6877.

CC: @natolambert
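As a minimal sketch of the suspected pattern (a simplified stand-in, not the actual datasets json.py code): a generator that holds f open across yields keeps the handle alive until the generator itself is closed or garbage-collected.

import os
import tempfile

def generate_tables(files):
    # Simplified stand-in for a packaged-module generator: the handle f
    # stays open across every yield inside the with block.
    for path in files:
        with open(path, "rb") as f:
            while chunk := f.read(4096):
                yield chunk

tmpdir = tempfile.mkdtemp()
paths = []
for i in range(3):
    p = os.path.join(tmpdir, f"{i}.jsonl")
    with open(p, "wb") as fp:
        fp.write(b'{"x": 1}\n')
    paths.append(p)

gen = generate_tables(paths)
next(gen)    # the first file is now open, suspended at a yield
gen.close()  # without an explicit close(), the handle is released only
             # when the generator object is garbage-collected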

fix bug huggingface#6877 due to f becoming invalid after yield process
@arthasking123 (Author) commented May 10, 2024

@loicmagne, @KennethEnevoldsen


@lhoestq (Member) commented May 10, 2024

Can you give more details on why this fix works?

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@arthasking123 (Author) commented May 10, 2024

> Can you give more details on why this fix works?

In order to locate this file handle problem, I defined a print_open_files_count() function using the psutil library:

import os
import psutil

def print_open_files_count(markstr):
    # Count the file handles currently held by this process
    pid = os.getpid()
    p = psutil.Process(pid)
    open_files = p.open_files()
    print(f"{markstr}_Open files count: {len(open_files)}")

and inserted calls to it as below:

with open(file, "rb") as f:
    print_open_files_count('Before')
    ...
    ...
        batch_idx += 1
print_open_files_count('After')

The console output when loading the 'mteb/biblenlp-corpus-mmteb' dataset was as below:

Before_Open files count: 1
After_Open files count: 1
Before_Open files count: 2
After_Open files count: 2
Before_Open files count: 3
After_Open files count: 3
...

which indicated a file handle leak in the dataset loading process. So I tried closing the file handle manually using the os library and found that it works, although the root cause has not been identified yet.
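As a minimal sketch of that manual cleanup (hypothetical file path; not the exact PR diff):

import os
import tempfile

fd0, path = tempfile.mkstemp()  # hypothetical stand-in for a data file
os.close(fd0)

with open(path, "rb") as f:
    fd = f.fileno()
    f.read()

# Defensive cleanup: close the descriptor at the OS level if it somehow
# survived the with block; OSError (EBADF) means it was already closed.
try:
    os.close(fd)
except OSError:
    pass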

adjust import sequence of json.py
@lhoestq (Member) commented May 10, 2024

I think it would be better to find the cause and have a cleaner fix: while your suggested fix works for a simple case, it will leave files open if, for example, an error occurs during dataset generation.

Btw I was not able to reproduce locally (MacBook Pro M2) or on Colab, so it might be something related to your environment. Also, open() should close the file at the end of the with block, so I don't really get how you can get this issue :/
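As a quick sanity check of that claim:

with open(__file__, "rb") as f:
    assert not f.closed
assert f.closed  # CPython closes the handle when the with block exits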

@arthasking123 (Author)

> Btw I was not able to reproduce locally (MacBook Pro M2) or on Colab, so it might be something related to your environment. Also, open() should close the file at the end of the with block, so I don't really get how you can get this issue :/

How about setting the open-file limit to 1024?
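For example, a sketch using the standard library resource module (Unix-only) to lower the limit in-process before loading:

import resource

# Lower this process's soft limit on open file descriptors (Unix-only)
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
resource.setrlimit(resource.RLIMIT_NOFILE, (1024, hard))

from datasets import load_dataset
ds = load_dataset("mteb/biblenlp-corpus-mmteb")  # should hit Errno 24 if handles leak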

@lhoestq (Member) commented May 13, 2024

I was able to reproduce on Colab with

!ulimit -n 256 && python -c "from datasets import load_dataset; load_dataset('mteb/biblenlp-corpus-mmteb')"

(also needed to !pip install -qq git+https://github.com/huggingface/huggingface_hub.git@less-paths-info-calls to fix a rate limit for some reason)

which led me to find that the issue came from the GzipFileSystem, which wasn't closing files.

to reproduce:

import gzip
import os

import datasets  # registers the "gzip" protocol (GzipFileSystem) with fsspec
import fsspec

# Run once to create the test files:
# os.mkdir("tmp")
# for i in range(300):
#     with gzip.open(f"tmp/{i}.txt.gz", "wt") as f:
#         f.write("yo")

# Each iteration leaves the underlying .gz handle open,
# eventually exhausting the open-file limit
for i in range(300):
    with fsspec.open(f"gzip://{i}.txt::tmp/{i}.txt.gz", "rb") as f:
        f.read()

I opened #6893 to fix this, can you check if it works on your side?
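For illustration, the kind of fix involved could look like this wrapper (a sketch, not necessarily the actual #6893 patch): stock gzip.GzipFile deliberately leaves a caller-supplied fileobj open, so a filesystem built on top of it has to close the underlying handle itself.

import gzip

class ClosingGzipFile(gzip.GzipFile):
    # GzipFile variant that also closes the fileobj it was handed;
    # gzip.GzipFile only closes files it opened itself (myfileobj).
    def close(self):
        fileobj = self.fileobj
        try:
            super().close()
        finally:
            if fileobj is not None and not fileobj.closed:
                fileobj.close()

raw = open("tmp/0.txt.gz", "rb")  # one of the files from the repro above
with ClosingGzipFile(fileobj=raw) as gz:
    gz.read()
assert raw.closed  # the underlying handle is released too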

@arthasking123 (Author) commented May 13, 2024 via email

@albertvillanova (Member)

Superseded by:

Successfully merging this pull request may close these issues: OSError: [Errno 24] Too many open files