
Issue 6598: load_dataset broken for data_files on s3 #6862

Open: matstrand wants to merge 1 commit into main


@matstrand matstrand commented May 3, 2024

Fixes #6598

I've added a new test case and a fix. Before the fix was applied, the test case failed with the same error described in the linked issue. I ran into this issue while following the Hugging Face documentation to fine-tune GPT-2 with run_clm.py on SageMaker, using a data file stored on S3.

MRE:

pip install "datasets[s3]"
python -c "from datasets import load_dataset; load_dataset('csv', data_files={'train': 's3://noaa-gsod-pds/2024/A5125600451.csv'})"
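
For readers unfamiliar with this class of error: the linked issue's title ("Unexpected keyword argument 'hf'") suggests that options intended for one filesystem protocol were forwarded to the S3 filesystem. The sketch below illustrates one way to avoid that, assuming storage options are kept in a dict keyed by protocol. The helper name `select_storage_options` and the option values are hypothetical; this is not the change made in this PR, nor part of the `datasets` API.

```python
from urllib.parse import urlparse


def select_storage_options(url, storage_options):
    """Return only the options that apply to the filesystem behind `url`.

    Hypothetical helper for illustration; not part of the `datasets` API.
    Assumes `storage_options` is keyed by protocol, e.g.
    {"hf": {...}, "s3": {...}}.
    """
    # The URL scheme ("s3", "hf", ...) identifies the protocol;
    # a plain local path has no scheme, so fall back to "file".
    protocol = urlparse(url).scheme or "file"
    # Copy the matching sub-dict so callers cannot mutate the original.
    return dict(storage_options.get(protocol, {}))


# Illustrative values only: options for two different protocols.
options = {"hf": {"token": None}, "s3": {"anon": True}}
s3_kwargs = select_storage_options(
    "s3://noaa-gsod-pds/2024/A5125600451.csv", options
)
print(s3_kwargs)  # {'anon': True}
```

With this kind of filtering, an `hf` entry never reaches the S3 filesystem constructor, so an unexpected-keyword error of the sort named in the issue title would not be triggered by it.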

@matstrand matstrand force-pushed the issue-6598-load-dataset-broken-s3 branch 4 times, most recently from bf3c8c2 to 88cc5a3 Compare May 3, 2024 02:06
@matstrand matstrand force-pushed the issue-6598-load-dataset-broken-s3 branch from c3c0c06 to ded2cac Compare May 3, 2024 02:54

Successfully merging this pull request may close these issues.

Unexpected keyword argument 'hf' when downloading CSV dataset from S3