Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix issue with case sensitivity when loading dataset from local cache #6763

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

Sumsky21
Copy link

@Sumsky21 Sumsky21 commented Mar 28, 2024

When a dataset with upper-cases in its name is first loaded using load_dataset(), the local cache directory is created with all lowercase letters.

However, upon subsequent loads, the current version attempts to locate the cache directory using the dataset's original name, which includes uppercase letters. This discrepancy can lead to confusion and, particularly in offline mode, results in errors.

Reproduce

~$ python            
Python 3.9.19 (main, Mar 21 2024, 17:11:28) 
[GCC 11.2.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from datasets import load_dataset
>>> dataset = load_dataset("locuslab/TOFU", "full")
>>> quit()

~$ export HF_DATASETS_OFFLINE=1 
~$ python                      
Python 3.9.19 (main, Mar 21 2024, 17:11:28) 
[GCC 11.2.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from datasets import load_dataset
>>> dataset = load_dataset("locuslab/TOFU", "full")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "xxxxxx/anaconda3/envs/llm/lib/python3.9/site-packages/datasets/load.py", line 2556, in load_dataset
    builder_instance = load_dataset_builder(
  File "xxxxxx/anaconda3/envs/llm/lib/python3.9/site-packages/datasets/load.py", line 2228, in load_dataset_builder
    dataset_module = dataset_module_factory(
  File "xxxxxx/anaconda3/envs/llm/lib/python3.9/site-packages/datasets/load.py", line 1871, in dataset_module_factory
    raise ConnectionError(f"Couldn't reach the Hugging Face Hub for dataset '{path}': {e1}") from None
ConnectionError: Couldn't reach the Hugging Face Hub for dataset 'locuslab/TOFU': Offline mode is enabled.
>>> 

I fix this issue by lowering the dataset name (.lower()) when generating cache_dir.

@jhauret
Copy link

jhauret commented Apr 17, 2024

I also need this feature for "Cnam-LMSSC/vibravox "

EDIT: Upgrading to 2.19.0 fixed my problem thanks to this PR

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants