Loading a remote dataset fails in the last release (v2.19.0) #6827

Open
zrthxn opened this issue Apr 19, 2024 · 0 comments

zrthxn commented Apr 19, 2024

While loading a dataset with multiple splits, I get an error saying "Couldn't find file at <URL>".

I am loading the dataset as shown here; there is nothing out of the ordinary.
This dataset requires a token for access.

from datasets import load_dataset

token="hf_myhftoken-sdhbdsjgkhbd"
load_dataset("speechcolab/gigaspeech", "test", cache_dir="gigaspeech/test", token=token)

I get the following error:
[Screenshot of the error, 2024-04-19 at 11 03 07 PM]

The URL it is trying to reach has the JSON object of the dataset split appended to the base URL. I think this may be a regression introduced in the new release.
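To make the symptom concrete, here is a minimal sketch of how a URL ends up with a serialized object appended to it. This is not the code from `datasets` or from the dataset script. The base URL, the `split_entry` dict, and the helpers `join_url` and `looks_malformed` are all hypothetical and exist only to illustrate the failure mode described above.

```python
# Hypothetical values for illustration only; not taken from the real dataset.
BASE_URL = "https://example.com/datasets/demo/resolve/main/"
split_entry = {"audio": "data/audio/test.tar.gz", "meta": "data/metadata/test.csv"}


def join_url(base: str, path) -> str:
    """Naively join a base URL and a path, stringifying whatever is passed in."""
    return base + str(path)


def looks_malformed(url: str) -> bool:
    """Heuristic: braces or quotes in a URL suggest a serialized object was appended."""
    return any(ch in url for ch in "{}'\"")


# Intended call: a single relative path from the split entry.
good = join_url(BASE_URL, split_entry["audio"])

# Faulty call: the whole split entry is passed where a path was expected.
bad = join_url(BASE_URL, split_entry)

if __name__ == "__main__":
    print(good, looks_malformed(good))
    print(bad, looks_malformed(bad))
```

A check like `looks_malformed` can help confirm from the error message alone that the requested URL contains a stringified object rather than a file path.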

I did not have this issue with the previous version of datasets. Everything was fine for me yesterday, and it seems to have broken after the release 12 hours ago. Also, the dataset in question runs custom code; I checked, and there have been no commits to the dataset on Hugging Face in 6 months.

Steps to reproduce the bug

Since this happened with one particular dataset for me, I am listing steps to use that dataset.

  1. Open https://huggingface.co/datasets/speechcolab/gigaspeech and fill in the form to get access.
  2. Create a token on your Hugging Face account with read access.
  3. Run the following line, substituting <your_token_here> with your token.
load_dataset("speechcolab/gigaspeech", "test", cache_dir="gigaspeech/test", token="<your_token_here>")

Expected behavior

Be able to load the dataset in question.

Environment info

datasets == 2.19.0
python == 3.10
kernel == Linux 6.1.58+
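
For anyone trying to reproduce this, the environment details above can be collected with a short script. This is a minimal sketch using only the standard library; the function name `environment_info` is made up for this example, and it reports "not installed" if the `datasets` package is missing.

```python
import platform
from importlib.metadata import PackageNotFoundError, version


def environment_info() -> dict:
    """Collect the package, interpreter, and kernel versions relevant to this report."""
    try:
        datasets_version = version("datasets")
    except PackageNotFoundError:
        # The package is not installed in the current environment.
        datasets_version = "not installed"
    return {
        "datasets": datasets_version,
        "python": platform.python_version(),
        "kernel": f"{platform.system()} {platform.release()}",
    }


if __name__ == "__main__":
    for name, value in environment_info().items():
        print(f"{name} == {value}")
```

Comparing the reported `datasets` version before and after upgrading is a quick way to confirm which release is active when the error appears.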
