Practical scope requirements #1

Open
StoneCypher opened this issue Apr 24, 2023 · 3 comments

Comments

@StoneCypher

How much code in micro-language Foo do you actually need to train one of these?

@minosvasilias
Owner

The dataset used for the provided weights was 60k rows. Each scraped script is split into individual functions, which is an easy and reliable way to chunk the code, so one function = one dataset entry.

In practice, this resulted in 762 repositories being parsed for the training data; see godot_dodo_4x_60k_repos.json.
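
For illustration, a minimal sketch of that function-level splitting in Python might look like the code below. This is not the actual godot_dodo scraper; the regex, the example file path, and the row schema are all assumptions.

```python
import json
import re

# Matches the start of a top-level GDScript function declaration.
# Assumption: functions begin at column 0 with "func" or "static func".
FUNC_RE = re.compile(r"^(?:static\s+)?func\s+\w+", re.MULTILINE)

def split_into_functions(source: str) -> list[str]:
    """Split a script into chunks, one chunk per function definition."""
    starts = [m.start() for m in FUNC_RE.finditer(source)]
    chunks = []
    for i, start in enumerate(starts):
        end = starts[i + 1] if i + 1 < len(starts) else len(source)
        chunks.append(source[start:end].rstrip())
    return chunks

def build_rows(path: str) -> list[dict]:
    """One function = one dataset entry (hypothetical row schema)."""
    with open(path, encoding="utf-8") as f:
        source = f.read()
    return [{"file": path, "code": chunk} for chunk in split_into_functions(source)]

if __name__ == "__main__":
    # "player.gd" is a placeholder path, not a file from the dataset.
    print(json.dumps(build_rows("player.gd"), indent=2))
```

The real dataset rows presumably carry more fields than this (e.g. an instruction paired with each function); the sketch only shows the chunking step described above.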

@StoneCypher
Author

If you were asked to stick your thumb in the air and guess, what would you expect the lower bound for practical success to be?

My language is nowhere near that common.

@minosvasilias
Owner

I would say the lower bound of dataset sizes I've seen for LLaMA finetunes in general (not code-specific) sits around 15-20k rows.

I personally trained a 20k-row 7B model initially to judge whether or not this project was worth pursuing, but don't have any evaluations for that one. Still, it showed good enough results to continue, so that would be the sort of minimum I'd be looking at.
