Configuration-based use of HF hub-hosted datasets for training #701

chimezie · 2024-04-20T15:10:03Z

Per the title, this adds a structured hf_dataset YAML configuration parameter for specifying an HF hub-hosted dataset (via name) to use for training. It supports the datasets library's local file system caching, named splits, named configurations (via configuration), and split slicing syntax for specifying the train, validation, and test datasets.

The dataset feature names that hold the prompt and completion values can be specified (via prompt_feature and completion_feature). For datasets whose records hold a single plain-text value, specify that feature via text_feature instead.
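To illustrate the feature mapping described above, here is a minimal sketch of how a record might be reduced to the fields used for training. The helper name extract_fields and the output keys ("text", "prompt", "completion") are assumptions made for illustration, not the code in this PR; only the configuration keys text_feature, prompt_feature, and completion_feature come from the description.

```python
from typing import Any, Dict


def extract_fields(record: Dict[str, Any], hf_cfg: Dict[str, Any]) -> Dict[str, Any]:
    """Pick the configured features out of one dataset record.

    Hypothetical helper: the output key names are assumptions.
    """
    text_feature = hf_cfg.get("text_feature")
    if text_feature is not None:
        # Pure-text dataset: a single feature holds the whole example.
        return {"text": record[text_feature]}
    # Prompt/completion dataset: two features hold the pair.
    return {
        "prompt": record[hf_cfg["prompt_feature"]],
        "completion": record[hf_cfg["completion_feature"]],
    }


# Usage with a billsum-like record (the values here are made up).
record = {"text": "SECTION 1. SHORT TITLE...", "summary": "Names the act."}
pair_cfg = {"prompt_feature": "text", "completion_feature": "summary"}
print(extract_fields(record, pair_cfg))
```

Checking text_feature first keeps the two modes mutually exclusive: a config that sets it is treated as pure text regardless of any other feature keys.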

Added YAML parameters. For example, to train on the first 1,000 records of the train split and validate with the last 100 (no test dataset):

hf_dataset:
  name: "billsum"
  train_split: "train[:1000]"
  valid_split: "train[-100:]"
  prompt_feature: "text"
  completion_feature: "summary"

See: Splits and Configurations, billsum, & HF Dataset API
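As a rough sketch of how such a configuration could translate into calls to the Hugging Face datasets library, the snippet below builds the keyword arguments for datasets.load_dataset (path, name, split) from the hf_dataset block. The helper names split_requests and load_splits and the test_split key are assumptions for illustration; the actual implementation in this PR may differ.

```python
from typing import Any, Dict, Optional


def split_requests(hf_cfg: Dict[str, Any]) -> Dict[str, Optional[Dict[str, Any]]]:
    """Map an hf_dataset block to load_dataset keyword arguments per role.

    Hypothetical helper; "test_split" is an assumed key name.
    """
    requests: Dict[str, Optional[Dict[str, Any]]] = {}
    for role in ("train", "valid", "test"):
        split = hf_cfg.get(f"{role}_split")
        if split is None:
            # No split configured for this role (e.g. no test dataset).
            requests[role] = None
            continue
        requests[role] = {
            "path": hf_cfg["name"],               # hub dataset name
            "name": hf_cfg.get("configuration"),  # optional named configuration
            "split": split,                       # e.g. "train[:1000]"
        }
    return requests


def load_splits(hf_cfg: Dict[str, Any]) -> Dict[str, Any]:
    """Load each configured split; needs the datasets package and hub access."""
    from datasets import load_dataset  # imported lazily so the mapping runs offline

    return {
        role: load_dataset(**kwargs) if kwargs else None
        for role, kwargs in split_requests(hf_cfg).items()
    }


cfg = {
    "name": "billsum",
    "train_split": "train[:1000]",
    "valid_split": "train[-100:]",
    "prompt_feature": "text",
    "completion_feature": "summary",
}
for role, kwargs in split_requests(cfg).items():
    print(role, kwargs)
```

The slice strings are passed through unchanged, so the datasets library handles the slicing and its local caching; load_splits(cfg) would perform the actual download.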


chimezie · 2024-04-20T16:35:16Z

Motivated by the need to reproduce #620 with an open dataset.

chimezie added 7 commits April 20, 2024 10:35

Add hf_dataset configuration for using HF hub-hosted datasets for (Q)…

7d3bdc9

…LoRA training

Pre-commit formatting

5bdb061

Merge branch 'ml-explore:main' into hf_datasets

be2271a

Fix YAML config example

3805cb0

Merge remote-tracking branch 'origin/hf_datasets' into hf_datasets

3b008d5

Print DS info

0db11ef

Include name

81fab48

chimezie changed the title ~~Support for configuration-based use of HF hub-hosted datasets for training~~ Configuration-based use of HF hub-hosted datasets for training Apr 21, 2024

chimezie added 8 commits April 22, 2024 10:24

Merge branch 'ml-explore:main' into hf_datasets

7483b50

Add hf_dataset parameter default

ce82b35

Merge remote-tracking branch 'origin/hf_datasets' into hf_datasets

3536b51

Merge branch 'ml-explore:main' into hf_datasets

5f18f58

Merge branch 'ml-explore:main' into hf_datasets

5ee28d1

Merge branch 'ml-explore:main' into hf_datasets

ec00033

Merge branch 'ml-explore:main' into hf_datasets

c6f0407

Merge branch 'ml-explore:main' into hf_datasets

22ff45a
