Questions on creating instruction data #13

Open
henryhungle opened this issue Mar 6, 2023 · 1 comment

@henryhungle

Thanks for the great work!

I have a few questions regarding data creation of xP3 after following the guide here to create instruction data on the code language subset.

  1. I noticed that the public processed data (from here) contains 2707724 samples in total on the code split. However, following the above GitHub guide gives me considerably more than that (more than 3M samples). Was there any additional post-processing to get the final instruction data for tuning?

  2. Following the above GitHub guide, I noticed there were no prompts for one particular dataset, State Changes. I got this warning when running the creation code:
    Tried instantiating `DatasetTemplates` for Fraser/python-state-changes, but no prompts found. Please ignore this warning if you are creating new prompts for this dataset.

Was this dataset simply not assigned any prompts (similar to how HumanEval was treated)? Or is the version of PromptSource I used, cloned and installed as below, incorrect?
    git clone -b tr13 https://github.com/Muennighoff/promptsource.git
    cd promptsource; pip install -e .

@Muennighoff
Collaborator

Muennighoff commented Mar 6, 2023

Hey, thanks for the thorough investigation!

  1. This could be due to the merging of the files. When you load from https://huggingface.co/datasets/bigscience/xP3all, it loads a file called merged.jsonl for each directory, which contains all the individual jsonl files of that directory merged and deduplicated (https://github.com/bigscience-workshop/bigscience/blob/57086158464c4e514e8e9e3d6f77eed4865e20e4/data/xp3/xp3_jsonl_to_meg.slurm#L80). The deduplication would explain why the public sample count is lower than yours.
  2. Good point, python-state-changes did not make it into the dataset; not sure why. You could write some prompts for it and add it to your dataset. Your promptsource version is correct.
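
To check whether merging and deduplication accounts for the gap in point 1, something like the following could be run on a locally generated directory. This is only a minimal sketch: it assumes exact line-level deduplication, which may not match what the linked slurm script does, and the function name `merge_and_dedup` is hypothetical rather than part of the xP3 tooling.

```python
from pathlib import Path


def merge_and_dedup(directory, output_name="merged.jsonl"):
    """Merge all *.jsonl files in `directory` into one file, dropping exact duplicate lines.

    Assumption: two samples count as duplicates only if their JSON lines are
    byte-for-byte identical after stripping surrounding whitespace.
    Returns (total_lines_read, unique_lines_written).
    """
    directory = Path(directory)
    seen = set()
    kept = []
    total = 0
    for path in sorted(directory.glob("*.jsonl")):
        if path.name == output_name:
            continue  # do not re-read a previous merge result
        with path.open(encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if not line:
                    continue  # skip blank lines
                total += 1
                if line not in seen:
                    seen.add(line)
                    kept.append(line)
    output_path = directory / output_name
    output_path.write_text("".join(line + "\n" for line in kept), encoding="utf-8")
    return total, len(kept)
```

Comparing the two returned counts per directory shows how many samples are lost to deduplication alone. If the unique count is still well above the public number, the difference likely comes from somewhere else.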
