
Cannot use cached dataset without Internet connection (or when servers are down) #6837

Open
DionisMuzenitov opened this issue Apr 25, 2024 · 3 comments


@DionisMuzenitov

Describe the bug

I want to be able to use a cached dataset from HuggingFace even when I have no Internet connection (or when the HuggingFace servers are down, or my company has network issues).
The problem is the following:
the data_files argument of the datasets.load_dataset() function is resolved against the server before the hash used for caching is computed. As a result, running the same code with and without an Internet connection produces different dataset configuration directory names, so the cached copy is never found.

Steps to reproduce the bug

import datasets

c4_dataset = datasets.load_dataset(
    path="allenai/c4",
    data_files={"train": "en/c4-train.00000-of-01024.json.gz"},
    split="train",
    cache_dir="/datesets/cache",
    download_mode="reuse_cache_if_exists",
    token=False,
)
  1. Run this code with the Internet.
  2. Run the same code without the Internet.
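The mismatch in step 2 can be illustrated with a stdlib-only sketch. Here hashlib stands in for the library's internal fingerprinting (config_hash is a hypothetical helper, not the actual datasets code); the resolved hf:// URL is the commit-pinned form the Hub returns when online:

import hashlib
import json

def config_hash(data_files: dict) -> str:
    # Stand-in for the cache fingerprint: hash the data_files mapping.
    payload = json.dumps(data_files, sort_keys=True).encode()
    return hashlib.sha256(payload).hexdigest()[:16]

# Online: data_files is resolved to a commit-pinned hf:// URL before hashing.
online = {"train": ["hf://datasets/allenai/c4@1588ec454efa1a09f29cd18ddd04fe05fc8653a2/en/c4-train.00000-of-01024.json.gz"]}
# Offline: resolution fails, so only the original relative pattern is available.
offline = {"train": ["en/c4-train.00000-of-01024.json.gz"]}

print(config_hash(online) == config_hash(offline))  # False: different hashes -> cache miss

Because the two inputs differ, the derived cache directory names differ, which is why the second run cannot find the data cached by the first.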

Expected behavior

When running without an Internet connection, the loader should be able to load the dataset from the cache.

Environment info

  • datasets version: 2.19.0
  • Platform: Windows-10-10.0.19044-SP0
  • Python version: 3.10.13
  • huggingface_hub version: 0.22.2
  • PyArrow version: 16.0.0
  • Pandas version: 1.5.3
  • fsspec version: 2023.12.2
@DionisMuzenitov
Author

There are two workarounds, though:

  1. Download the dataset files from the web and load them locally.
  2. Construct the resolved metadata directly (a temporary solution, since the metadata can change):
import datasets
from datasets.data_files import DataFilesDict, DataFilesList

data_files_list = DataFilesList(
    [
        "hf://datasets/allenai/c4@1588ec454efa1a09f29cd18ddd04fe05fc8653a2/en/c4-train.00000-of-01024.json.gz"
    ],
    [("allenai/c4", "1588ec454efa1a09f29cd18ddd04fe05fc8653a2")],
)
data_files = DataFilesDict({"train": data_files_list})
c4_dataset = datasets.load_dataset(
    path="allenai/c4",
    data_files=data_files,
    split="train",
    cache_dir="/datesets/cache",
    download_mode="reuse_cache_if_exists",
    token=False,
)

The second workaround also shows where the bug lies. I suggest that the hashing functions always use only the original data_files parameter, not the DataFilesDict that is built after contacting the server.
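The suggestion above can be sketched as follows. This is a hedged illustration of the proposed behavior, not the actual datasets internals: config_id is a hypothetical helper that fingerprints only the user-supplied arguments, before any server-side resolution:

import hashlib
import json

def config_id(path: str, data_files) -> str:
    # Hypothetical fix: fingerprint only what the caller passed in,
    # before any resolution against the Hub into a DataFilesDict.
    payload = json.dumps({"path": path, "data_files": data_files}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()[:16]

user_args = {"train": "en/c4-train.00000-of-01024.json.gz"}
# Nothing here depends on network access, so online and offline runs
# would derive the same cache directory name.
print(config_id("allenai/c4", user_args))

The trade-off is that a fingerprint based only on the original arguments would no longer change when the remote files themselves change, so the library would need another mechanism for cache invalidation.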

@mariosasko
Collaborator

Hi! You need to set the HF_DATASETS_OFFLINE env variable to 1 to load cached datasets offline, as explained in the docs here.
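For reference, a minimal sketch of that setup. The environment variable has to be set before datasets is imported, since the library reads it at import time; the load_dataset call (commented out here) is the one from the issue:

import os

# Must be set before importing datasets, which reads the flag at import time.
os.environ["HF_DATASETS_OFFLINE"] = "1"

# import datasets
# c4_dataset = datasets.load_dataset(
#     path="allenai/c4",
#     data_files={"train": "en/c4-train.00000-of-01024.json.gz"},
#     split="train",
#     cache_dir="/datesets/cache",
# )

Setting the variable in the shell (HF_DATASETS_OFFLINE=1 python script.py) is equivalent and avoids the ordering concern.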

@DionisMuzenitov
Author

DionisMuzenitov commented Apr 26, 2024

Just tested it. It doesn't work, because of the exact problem I described above: the hash of the dataset config is different.
The only difference in the error is the stated reason it cannot connect to HuggingFace (now it's "offline mode is enabled").
(screenshot of the resulting error omitted)
