Configuration-based use of HF hub-hosted datasets for training #701

Open
wants to merge 15 commits into
base: main

@chimezie chimezie commented Apr 20, 2024

Per the title, this adds a structured hf_dataset YAML configuration parameter for specifying an HF hub-hosted dataset (via name) to use for training. It supports the datasets library's local file system caching, named splits, named configurations (via configuration), and split slicing syntax for specifying the train, validation, and test datasets.

The dataset feature names to use can also be specified: either the features that hold the prompt and completion values of each record, or, for datasets with a single plain-text value, the feature that holds that text (via text_feature).

Added YAML parameters. For example, to train on the first 1,000 records of the train split and validate on the last 100 (with no test dataset):

hf_dataset:
  name: "billsum"
  train_split: "train[:1000]"
  valid_split: "train[-100:]"
  prompt_feature: "text"
  completion_feature: "summary"

See: Splits and Configurations, billsum, & HF Dataset API
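
For readers unfamiliar with the datasets API, here is a rough sketch of how a configuration like the one above could map onto load_dataset calls. It is not the code in this PR: load_hf_splits and to_prompt_completion are hypothetical helper names, and the test_split key is assumed by analogy with train_split and valid_split. The loader argument exists only so the sketch can be exercised without network access.

```python
from typing import Any, Callable, Dict, Optional


def load_hf_splits(
    hf_config: Dict[str, Any],
    loader: Optional[Callable[..., Any]] = None,
) -> Dict[str, Any]:
    """Load each split named in an hf_dataset-style config (hypothetical helper)."""
    if loader is None:
        # Imported lazily so the sketch can run with a stand-in loader.
        from datasets import load_dataset as loader

    path = hf_config["name"]
    configuration = hf_config.get("configuration")  # optional named configuration
    splits = {}
    # "test_split" is an assumed key name; the PR text only shows train/valid.
    for key in ("train_split", "valid_split", "test_split"):
        split = hf_config.get(key)
        if split is None:
            continue  # e.g. no test set in the billsum example
        # In datasets.load_dataset, `name` selects the configuration and
        # `split` accepts slicing syntax such as "train[:1000]".
        splits[key] = loader(path, name=configuration, split=split)
    return splits


def to_prompt_completion(record: Dict[str, Any], hf_config: Dict[str, Any]) -> Dict[str, Any]:
    """Map one dataset record to the fields used for training (hypothetical helper)."""
    if "text_feature" in hf_config:
        # Datasets with a single plain-text value per record.
        return {"text": record[hf_config["text_feature"]]}
    return {
        "prompt": record[hf_config["prompt_feature"]],
        "completion": record[hf_config["completion_feature"]],
    }
```

With the billsum configuration above, this would issue one load_dataset call for "train[:1000]" and one for "train[-100:]", and each record's text and summary features would become the prompt and completion.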

@chimezie

Motivated by the need to reproduce #620 with an open dataset.

@chimezie chimezie changed the title Support for configuration-based use of HF hub-hosted datasets for training Configuration-based use of HF hub-hosted datasets for training Apr 21, 2024