[Dataset] Dataset Generation Always Returns Cached Version #97

Open
benjaminye opened this issue Mar 27, 2024 · 1 comment
Labels: bug, good first issue

Comments

@benjaminye
Contributor

Describe the bug
When creating a dataset, the toolkit always returns the cached version, even after the source data file has changed.

To Reproduce

  1. Run toolkit.py
  2. Press Ctrl-C to interrupt the run
  3. Add a line to the dataset file
  4. Run toolkit.py again; it does not create a new dataset with the desired changes

Expected behavior

  1. A new dataset is generated that includes the new data

Environment:

  • OS: Ubuntu

@benjaminye added the "bug" and "good first issue" labels on Mar 27, 2024
@benjaminye self-assigned this on Mar 27, 2024

@benjaminye
Contributor Author

This is caused by the Hugging Face Dataset.from_generator() method, which checks whether the dataset is already cached and, if so, reuses the cached copy. See code.

The easiest solution is to pass a cache_dir parameter (such as ./dataset_cache) in each Ingestor class, for example here.

That way, whenever a local file changes, the user can delete the cache directory ./dataset_cache to force the dataset to be regenerated.
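
A minimal sketch of the cache_dir idea, assuming the data lives in a JSON Lines file. The names `read_records` and `build_dataset` are hypothetical and are not the toolkit's actual Ingestor code; the only library call relied on is `Dataset.from_generator` with its `gen_kwargs` and `cache_dir` parameters.

```python
import json
from typing import Dict, Iterator


def read_records(path: str) -> Iterator[Dict]:
    """Yield one dict per non-empty line of a JSON Lines file (assumed format)."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)


def build_dataset(path: str, cache_dir: str = "./dataset_cache"):
    """Hypothetical helper: build a dataset with a known, project-local cache."""
    # Imported lazily so this module can be loaded without `datasets` installed.
    from datasets import Dataset

    # cache_dir pins the cache to a directory the user can find and delete.
    return Dataset.from_generator(
        read_records,
        gen_kwargs={"path": path},
        cache_dir=cache_dir,
    )
```

With the cache pinned to ./dataset_cache, deleting that directory after editing the data file should make the next `build_dataset(...)` call regenerate the dataset instead of loading the stale copy.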


Future Enhancement

  • Perhaps we can add a no_cache option under config.data; when it is set, the toolkit would delete the ./dataset_cache directory itself
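
One way the no_cache enhancement might look, as a rough sketch only. The helper name `prepare_cache_dir` and the way the flag is passed in are assumptions, since the toolkit's config schema is not shown here.

```python
import shutil
from pathlib import Path


def prepare_cache_dir(cache_dir: str = "./dataset_cache", no_cache: bool = False) -> Path:
    """Hypothetical helper: remove the cache directory first when no_cache is set."""
    path = Path(cache_dir)
    if no_cache and path.exists():
        # Remove the whole cache tree so the next dataset build starts fresh.
        shutil.rmtree(path)
    return path
```

The toolkit could call this just before building the dataset, passing whatever value it reads from no_cache under config.data.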
