
Do OpenELM's training datasets contain copyrighted material? #27

Open
jfahrenkrug opened this issue May 1, 2024 · 0 comments
I'm very excited about the release of this model and the efforts the team went through to openly document seemingly every aspect of it. Thank you!

I wonder if any information can be given concerning the selection of training datasets. On https://machinelearning.apple.com/research/openelm it says:

our release includes the complete framework for training and evaluation of the language model on publicly available datasets

More specifically, on https://github.com/apple/corenet/blob/main/projects/openelm/README-pretraining.md it says:

OpenELM was pretrained on public datasets. Specifically, our pre-training dataset contains RefinedWeb, PILE, a subset of RedPajama, and a subset of Dolma v1.6.

Digging into RefinedWeb on https://huggingface.co/datasets/tiiuae/falcon-refinedweb/viewer/default/train?q=nytimes.com, I found that it contains content from sources like nytimes.com and cnn.com.

[Screenshot (2024-05-01): RefinedWeb dataset viewer showing search results for nytimes.com]
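For anyone who wants to reproduce this check programmatically rather than through the dataset viewer, here is a minimal sketch. The `from_domains` helper and the sample records are hypothetical; it only assumes that each record exposes a `url` field, as RefinedWeb's records do.

```python
from urllib.parse import urlparse

def from_domains(records, domains):
    """Yield records whose URL host matches one of the given domains
    (exact match or subdomain)."""
    for rec in records:
        host = urlparse(rec["url"]).netloc.lower()
        if any(host == d or host.endswith("." + d) for d in domains):
            yield rec

# Made-up sample records standing in for RefinedWeb rows.
sample = [
    {"url": "https://www.nytimes.com/2023/01/01/example.html", "content": "..."},
    {"url": "https://edition.cnn.com/example", "content": "..."},
    {"url": "https://example.org/post", "content": "..."},
]

hits = list(from_domains(sample, {"nytimes.com", "cnn.com"}))
# The first two sample records match; the example.org record does not.
```

In practice one would stream the actual dataset (e.g. via the Hugging Face `datasets` library with `streaming=True`) and feed its records through the same filter instead of the sample list above.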

This is not surprising: because of the vast amounts of data needed to train LLMs (essentially a snapshot of the internet), every such training dataset will contain copyrighted material. LLMs are, in a sense, a snapshot of humanity's knowledge. References to copyrighted characters like Superman, Captain Kirk, Donald Duck, or Bugs Bunny are part of that collective knowledge and can pop up just about anywhere in a dataset. Producing a snapshot of humanity's knowledge that is free of such references would be as impossible as removing the sugar from a cake after it has been baked.

So while the project only mentions "publicly available datasets" and never claims to be "free of copyrighted material", can any information be shared about the selection process behind the datasets used to train OpenELM?
