fix bug #6877 #6889
Conversation
fix bug huggingface#6877: `f` becomes invalid after the yield
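The failure mode named in the title can be sketched with a toy generator (this is an illustration of the suspected mechanism, not the actual `datasets` source):

```python
import os
import tempfile

# Sketch of the suspected failure mode: a file opened inside a generator
# stays open while the generator is suspended at `yield`, and is only
# released when the generator is closed or garbage-collected.
def read_batches(path):
    with open(path, "rb") as f:
        while True:
            chunk = f.read(2)
            if not chunk:
                return
            yield chunk

fd, path = tempfile.mkstemp()
os.close(fd)
with open(path, "wb") as f:
    f.write(b"abcd")

gen = read_batches(path)
first = next(gen)  # generator is now suspended inside the `with` block
gen.close()        # GeneratorExit unwinds the `with`, closing the file
```

If many such generators are abandoned without being closed, the handles pile up until garbage collection runs.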
Can you give more details on why this fix works?
In order to locate this file handle problem, I defined a `print_open_files_count()` function using the `psutil` library:

```python
import os
import psutil

def print_open_files_count(markstr):
    pid = os.getpid()
    p = psutil.Process(pid)
    open_files = p.open_files()
    print(f"{markstr}_Open files count: {len(open_files)}")
```

and added calls around the file-reading loop:

```python
with open(file, "rb") as f:
    print_open_files_count('Before')
    ...
    batch_idx += 1
    print_open_files_count('After')
```

The console output when loading the 'mteb/biblenlp-corpus-mmteb' dataset:

```
Before_Open files count: 1
After_Open files count: 1
Before_Open files count: 2
After_Open files count: 2
Before_Open files count: 3
After_Open files count: 3
...
```

which indicated a file handle leak in the dataset loading process. So I tried closing the file handle manually using the `os` library and found that it works, although the root cause has not yet been identified.
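The manual workaround mentioned above can be sketched as follows (illustrative only; the variable names are placeholders, not the exact `datasets` code):

```python
import os
import tempfile

# Create a throwaway file to stand in for a dataset shard.
fd, path = tempfile.mkstemp()
os.close(fd)

f = open(path, "rb")   # a handle that would otherwise leak
os.close(f.fileno())   # release the descriptor directly via the os module

# The Python file object still exists, but the OS descriptor is gone:
descriptor_gone = False
try:
    os.fstat(f.fileno())
except OSError:
    descriptor_gone = True

try:
    f.close()
except OSError:
    pass  # the descriptor was already released
```

Closing the raw descriptor like this frees the OS resource immediately, which is why the open-files count stops growing even though the underlying leak remains.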
adjust import order in json.py
I think it would be better to find the root cause and have a cleaner fix: while your suggested fix works for a simple case, it will leave files open if, for example, an error occurs during dataset generation. Btw I was not able to reproduce locally (MacBook Pro M2) or on Colab, so it might be something related to your environment.
How about setting the open files limit to 1024?
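For reproduction purposes the limit can also be moved in the other direction: lowering the soft limit makes a descriptor leak fail fast with "Too many open files". A sketch using the stdlib `resource` module (Unix only; the value 256 is arbitrary):

```python
import resource

# Read the current soft/hard limits on open file descriptors.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)

# A process may always lower its soft limit (and raise it again later,
# up to the hard limit). With a tight limit, a leak surfaces quickly.
new_soft = min(256, soft)
resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))

current_soft, _ = resource.getrlimit(resource.RLIMIT_NOFILE)
```

Raising the limit to 1024 would only postpone the failure, which is why tightening it is more useful for debugging a leak than loosening it.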
I was able to reproduce on colab with

```
!ulimit -n 256 && python -c "from datasets import load_dataset; load_dataset('mteb/biblenlp-corpus-mmteb')"
```

(also needed to `!pip install -qq ***@***.***` to fix a rate limit for some reason), which led me to find that the issue came from the `GzipFileSystem` that wasn't closing files.

To reproduce:

```python
import gzip
import os

import datasets
import fsspec

# os.mkdir("tmp")
# for i in range(300):
#     with gzip.open(f"tmp/{i}.txt.gz", "wt") as f:
#         f.write("yo")

for i in range(300):
    with fsspec.open(f"gzip://{i}.txt::tmp/{i}.txt.gz", "rb") as f:
        f.read()
```

I opened #6893 to fix this, can you try if it works on your side?
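The `GzipFileSystem` bug described here boils down to a wrapper whose `close()` does not close the raw file it wraps. A stdlib-only sketch of the same pattern (not the actual `fsspec`/`datasets` code; `LeakyGzipReader` is a made-up name):

```python
import gzip
import os
import tempfile

# Minimal illustration of the leak pattern: a wrapper whose close()
# forgets to close the underlying file it opened.
class LeakyGzipReader:
    def __init__(self, path):
        self._raw = open(path, "rb")                 # underlying handle
        self._gz = gzip.GzipFile(fileobj=self._raw)  # decompressing wrapper

    def read(self):
        return self._gz.read()

    def close(self):
        self._gz.close()  # bug: self._raw is never closed

# Write a small gzip file to read back.
fd, path = tempfile.mkstemp(suffix=".gz")
os.close(fd)
with gzip.open(path, "wt") as f:
    f.write("yo")

r = LeakyGzipReader(path)
data = r.read()
r.close()
leaked = not r._raw.closed  # the raw descriptor is still open
```

Note that `gzip.GzipFile` deliberately leaves a caller-supplied `fileobj` open on `close()`, so the wrapper is responsible for closing it; repeating this in a loop of 300 files exhausts a 256-descriptor limit.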
ok
Superseded by:
fix bug #6877: `f` may become invalid after the yield
The results are below:

```
Resolving data files: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 828/828 [00:01<00:00, 420.41it/s]
Resolving data files: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 828/828 [00:00<00:00, 26148.48it/s]
Resolving data files: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 828/828 [00:00<00:00, 409731.44it/s]
Resolving data files: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 828/828 [00:00<00:00, 289720.84it/s]
Resolving data files: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 828/828 [00:00<00:00, 26663.42it/s]
Resolving data files: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 828/828 [00:00<00:00, 434056.21it/s]
Downloading data: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 828/828 [00:00<00:00, 13222.33files/s]
Downloading data: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 828/828 [00:04<00:00, 180.67files/s]
Downloading data: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 828/828 [01:35<00:00, 8.70files/s]
Generating train split: 1571592 examples [00:08, 176736.09 examples/s]
Generating test split: 85533 examples [00:01, 48224.56 examples/s]
Generating validation split: 86246 examples [00:01, 50164.16 examples/s]
```
Fix #6877.
CC: @natolambert