ExpectedMoreSplits error on load_dataset when upgrading to 2.19.0 #6836

Open
ebsmothers opened this issue Apr 24, 2024 · 3 comments

Comments

@ebsmothers

Describe the bug

Hi there, thanks for the great library! We have been using it a lot in torchtune and it's been a huge help for us.

Regarding the bug: the same call to load_dataset fails with ExpectedMoreSplits in 2.19.0 after working fine in 2.18.0. Full details are given in the repro below.

Steps to reproduce the bug

On 2.18.0, things work fine:

# First clear the locally cached dataset
rm -r ~/.cache/huggingface/datasets/lvwerra___stack-exchange-paired
pip install "datasets==2.18.0"
python3
>>> from datasets import load_dataset
>>> dataset = load_dataset('lvwerra/stack-exchange-paired', split='train', data_dir='data/rl')

On 2.19.0, they do not:

# First clear the locally cached dataset
rm -r ~/.cache/huggingface/datasets/lvwerra___stack-exchange-paired
pip install "datasets==2.19.0"
python3
>>> from datasets import load_dataset
>>> dataset = load_dataset('lvwerra/stack-exchange-paired', split='train', data_dir='data/rl')

The stack trace I see from the 2.19.0 version of load_dataset can be seen here.

Perhaps unsurprisingly, if I do not delete the cache first, I am able to load the dataset successfully on 2.19.0. Based on this, I suspect the cause is somewhere in the download logic.
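Since the error only shows up when the cached copy is gone, it may help to script the cache-clearing step so the repro is identical across versions. The following is a minimal sketch, not part of the library: `cached_dataset_dir` and `clear_cached_dataset` are hypothetical helper names, and the folder naming (repo id with `/` replaced by `___`) is an assumption inferred from the `rm -r` path above.

```python
import shutil
from pathlib import Path


def cached_dataset_dir(repo_id: str, cache_root: Path) -> Path:
    """Return the folder where the dataset is assumed to be cached.

    Assumption: the folder name is the repo id with "/" replaced by "___",
    as in ~/.cache/huggingface/datasets/lvwerra___stack-exchange-paired.
    """
    return cache_root / repo_id.replace("/", "___")


def clear_cached_dataset(repo_id: str, cache_root: Path) -> bool:
    """Delete the cached dataset folder if it exists.

    Returns True if a folder was removed, False if there was nothing to remove.
    """
    target = cached_dataset_dir(repo_id, cache_root)
    if target.is_dir():
        shutil.rmtree(target)
        return True
    return False
```

To mirror the steps above, one would call `clear_cached_dataset('lvwerra/stack-exchange-paired', Path.home() / '.cache' / 'huggingface' / 'datasets')` before each `load_dataset` call. The cache root is passed explicitly because the default location can differ between setups.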

Expected behavior

Download the dataset successfully :)

Environment info

  • datasets version: 2.19.0
  • Platform: Linux-5.12.0-0_fbk16_zion_7661_geb00762ce6d2-x86_64-with-glibc2.34
  • Python version: 3.11.9
  • huggingface_hub version: 0.22.2
  • PyArrow version: 16.0.0
  • Pandas version: 2.2.2
  • fsspec version: 2024.3.1
@relic-yuexi

I get the same error on the same dataset too.

@jxmsML

jxmsML commented May 2, 2024

+1

@whwhwwhh

same error
