Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

load_dataset with data_dir and cache_dir set fail with not supported #6886

Open
fah opened this issue May 8, 2024 · 0 comments
Open

load_dataset with data_dir and cache_dir set fail with not supported #6886

fah opened this issue May 8, 2024 · 0 comments

Comments

@fah
Copy link

fah commented May 8, 2024

Describe the bug

with python 3.11 I execute:

from transformers import Wav2Vec2Processor, Data2VecAudioModel
import torch
from torch import nn
from datasets import load_dataset, concatenate_datasets

# load demo audio and set processor
dataset_clean = load_dataset("librispeech_asr", "clean", split="validation", data_dir="data", cache_dir="cache")

This fails in the last line with

Found cached dataset librispeech_asr (file:///Users/as/Documents/Project/git/audio2vec/cache/librispeech_asr/clean-data_dir=data/2.1.0/cff5df6e7955c80a67f80e27e7e655de71c689e2d2364bece785b972acb37fe7)
Traceback (most recent call last):
  File "/Users/as/Documents/Project/git/audio2vec/src/music2vec-v1.py", line 7, in <module>
    dataset_clean = load_dataset("librispeech_asr", "clean", split="validation", data_dir="data", cache_dir="cache")
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/as/anaconda3/lib/python3.11/site-packages/datasets/load.py", line 1810, in load_dataset
    ds = builder_instance.as_dataset(split=split, verification_mode=verification_mode, in_memory=keep_in_memory)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/as/anaconda3/lib/python3.11/site-packages/datasets/builder.py", line 1113, in as_dataset
    raise NotImplementedError(f"Loading a dataset cached in a {type(self._fs).__name__} is not supported.")
NotImplementedError: Loading a dataset cached in a LocalFileSystem is not supported.

Steps to reproduce the bug

I setup an venv with requirements.txt

transformers==4.40.2
torch==2.2.2
datasets==2.16.0
fsspec==2023.9.2

pip freeze is:

aiohttp==3.9.5
aiosignal==1.3.1
attrs==23.2.0
certifi==2024.2.2
charset-normalizer==3.3.2
datasets==2.16.0
dill==0.3.7
filelock==3.14.0
frozenlist==1.4.1
fsspec==2023.9.2
huggingface-hub==0.23.0
idna==3.7
Jinja2==3.1.4
MarkupSafe==2.1.5
mpmath==1.3.0
multidict==6.0.5
multiprocess==0.70.15
networkx==3.3
numpy==1.26.4
packaging==24.0
pandas==2.2.2
pyarrow==16.0.0
pyarrow-hotfix==0.6
python-dateutil==2.9.0.post0
pytz==2024.1
PyYAML==6.0.1
regex==2024.4.28
requests==2.31.0
safetensors==0.4.3
six==1.16.0
sympy==1.12
tokenizers==0.19.1
torch==2.2.2
tqdm==4.66.4
transformers==4.40.2
typing_extensions==4.11.0
tzdata==2024.1
urllib3==2.2.1
xxhash==3.4.1
yarl==1.9.4

I execute this on a M1 Mac.

Expected behavior

I don't understand the error message. Why is "local" caching not supported. Would it possible to give some additional hint with the error message how to solve this issue?

Environment info

source ....
python -u example.py

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant