fix bug #6877 #6889
Conversation
fix bug huggingface#6877: `f` becomes invalid after the yield
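The failure mode named in the title can be sketched with a toy generator (this is an illustration of the suspected mechanism, not the actual `datasets` source):

```python
import os
import tempfile

# Sketch of the suspected failure mode: a file opened inside a generator
# stays open while the generator is suspended at `yield`, and is only
# released when the generator is closed or garbage-collected.
def read_batches(path):
    with open(path, "rb") as f:
        while True:
            chunk = f.read(2)
            if not chunk:
                return
            yield chunk

fd, path = tempfile.mkstemp()
os.close(fd)
with open(path, "wb") as f:
    f.write(b"abcd")

gen = read_batches(path)
first = next(gen)  # generator is now suspended inside the `with` block
gen.close()        # GeneratorExit unwinds the `with`, closing the file
```

If many such generators are abandoned without being closed, the handles pile up until garbage collection runs.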
Can you give more details on why this fix works?
In order to locate this file handle problem, I defined a `print_open_files_count()` function using the `psutil` library:

```python
import os
import psutil

def print_open_files_count(markstr):
    pid = os.getpid()
    p = psutil.Process(pid)
    open_files = p.open_files()
    print(f"{markstr}_Open files count: {len(open_files)}")
```

and added calls around the file-reading loop:

```python
with open(file, "rb") as f:
    print_open_files_count('Before')
    ...
    batch_idx += 1
    print_open_files_count('After')
```

The console output when loading the 'mteb/biblenlp-corpus-mmteb' dataset:

```
Before_Open files count: 1
After_Open files count: 1
Before_Open files count: 2
After_Open files count: 2
Before_Open files count: 3
After_Open files count: 3
...
```

which indicated a file handle leak in the dataset loading process. So I tried closing the file handle manually using the `os` library and found that it works, although the root cause has not yet been identified.
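The manual workaround mentioned above can be sketched as follows (illustrative only; the variable names are placeholders, not the exact `datasets` code):

```python
import os
import tempfile

# Create a throwaway file to stand in for a dataset shard.
fd, path = tempfile.mkstemp()
os.close(fd)

f = open(path, "rb")   # a handle that would otherwise leak
os.close(f.fileno())   # release the descriptor directly via the os module

# The Python file object still exists, but the OS descriptor is gone:
descriptor_gone = False
try:
    os.fstat(f.fileno())
except OSError:
    descriptor_gone = True

try:
    f.close()
except OSError:
    pass  # the descriptor was already released
```

Closing the raw descriptor like this frees the OS resource immediately, which is why the open-files count stops growing even though the underlying leak remains.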
adjust import order in json.py
I think it would be better to find the root cause and have a cleaner fix: while your suggested fix works for a simple case, it will leave files open if, for example, an error occurs during dataset generation. Btw I was not able to reproduce locally (MacBook Pro M2) or on Colab, so it might be something related to your environment.
How about setting the open files limit to 1024?
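For reproduction purposes the limit can also be moved in the other direction: lowering the soft limit makes a descriptor leak fail fast with "Too many open files". A sketch using the stdlib `resource` module (Unix only; the value 256 is arbitrary):

```python
import resource

# Read the current soft/hard limits on open file descriptors.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)

# A process may always lower its soft limit (and raise it again later,
# up to the hard limit). With a tight limit, a leak surfaces quickly.
new_soft = min(256, soft)
resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))

current_soft, _ = resource.getrlimit(resource.RLIMIT_NOFILE)
```

Raising the limit to 1024 would only postpone the failure, which is why tightening it is more useful for debugging a leak than loosening it.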
I was able to reproduce on colab with

```
!ulimit -n 256 && python -c "from datasets import load_dataset; load_dataset('mteb/biblenlp-corpus-mmteb')"
```

(also needed to `!pip install -qq ***@***.***` to fix a rate limit for some reason), which led me to find that the issue came from the `GzipFileSystem` that wasn't closing files.

To reproduce:

```python
import gzip
import os

import datasets
import fsspec

# os.mkdir("tmp")
# for i in range(300):
#     with gzip.open(f"tmp/{i}.txt.gz", "wt") as f:
#         f.write("yo")

for i in range(300):
    with fsspec.open(f"gzip://{i}.txt::tmp/{i}.txt.gz", "rb") as f:
        f.read()
```

I opened #6893 to fix this, can you try if it works on your side?
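The `GzipFileSystem` bug described here boils down to a wrapper whose `close()` does not close the raw file it wraps. A stdlib-only sketch of the same pattern (not the actual `fsspec`/`datasets` code; `LeakyGzipReader` is a made-up name):

```python
import gzip
import os
import tempfile

# Minimal illustration of the leak pattern: a wrapper whose close()
# forgets to close the underlying file it opened.
class LeakyGzipReader:
    def __init__(self, path):
        self._raw = open(path, "rb")                 # underlying handle
        self._gz = gzip.GzipFile(fileobj=self._raw)  # decompressing wrapper

    def read(self):
        return self._gz.read()

    def close(self):
        self._gz.close()  # bug: self._raw is never closed

# Write a small gzip file to read back.
fd, path = tempfile.mkstemp(suffix=".gz")
os.close(fd)
with gzip.open(path, "wt") as f:
    f.write("yo")

r = LeakyGzipReader(path)
data = r.read()
r.close()
leaked = not r._raw.closed  # the raw descriptor is still open
```

Note that `gzip.GzipFile` deliberately leaves a caller-supplied `fileobj` open on `close()`, so the wrapper is responsible for closing it; repeating this in a loop of 300 files exhausts a 256-descriptor limit.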
ok
Superseded by:
fix bug #6877: `f` may become invalid after the yield
The results are below:

```
Resolving data files: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 828/828 [00:01<00:00, 420.41it/s]
Resolving data files: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 828/828 [00:00<00:00, 26148.48it/s]
Resolving data files: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 828/828 [00:00<00:00, 409731.44it/s]
Resolving data files: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 828/828 [00:00<00:00, 289720.84it/s]
Resolving data files: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 828/828 [00:00<00:00, 26663.42it/s]
Resolving data files: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 828/828 [00:00<00:00, 434056.21it/s]
Downloading data: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 828/828 [00:00<00:00, 13222.33files/s]
Downloading data: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 828/828 [00:04<00:00, 180.67files/s]
Downloading data: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 828/828 [01:35<00:00, 8.70files/s]
Generating train split: 1571592 examples [00:08, 176736.09 examples/s]
Generating test split: 85533 examples [00:01, 48224.56 examples/s]
Generating validation split: 86246 examples [00:01, 50164.16 examples/s]
```
Fix #6877.
CC: @natolambert