
Need clarification for pre-training #13

Closed
winston-zillow opened this issue Oct 31, 2018 · 5 comments · May be fixed by #1355

Comments

@winston-zillow

In the README.md, it says for the pre-training:

It is important that these be actual sentences 
for the "next sentence prediction" task

and in the example sample_text.txt each line does end with either . or ;.

Whereas in the BERT paper, it says

... we sample two spans of text from the corpus, which we refer to as "sentences" 
even though they are typically much longer than single sentences 
(but can be shorter also)

So it is unclear whether this implementation expects actual sentences per line, or whether documents can just be broken into multiple lines arbitrarily.

@jacobdevlin-google
Contributor

What the paper refers to happens inside create_pretraining_data.py. See here

So the input text file should contain actual sentences, although feel free to add some noise if you want to make things more robust for fine-tuning. For example, if your sentence segmenter always splits on ., this may create a weird bias, so arbitrarily truncating or concatenating 5% of the training data may make fine-tuning more robust to non-sentential data. We don't have any hard numbers on this.

For our sentence segmenter I just used some Google-internal library I found, but anything off the shelf like [spaCy](https://spacy.io/usage/spacy-101) should work.
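
For concreteness, a minimal sketch of preparing input in the format create_pretraining_data.py expects (one sentence per line, blank line between documents). It assumes spaCy's en_core_web_sm pipeline for segmentation; the write_pretraining_text helper and the 5% noise_rate are illustrative, following the rough suggestion above rather than any official recipe.

```python
import random
import spacy

# Assumes `python -m spacy download en_core_web_sm` has been run;
# any off-the-shelf sentence segmenter would do.
nlp = spacy.load("en_core_web_sm")

def write_pretraining_text(documents, out_path, noise_rate=0.05):
    """Write one sentence per line, with a blank line between documents,
    which is the input format create_pretraining_data.py expects.
    A small fraction of sentences is arbitrarily truncated or merged,
    roughly following the 5% suggestion above (no hard numbers)."""
    with open(out_path, "w", encoding="utf-8") as f:
        for doc_text in documents:
            sents = [s.text.strip() for s in nlp(doc_text).sents if s.text.strip()]
            i = 0
            while i < len(sents):
                sent = sents[i]
                if random.random() < noise_rate and len(sents) > 1:
                    if random.random() < 0.5 and i + 1 < len(sents):
                        # Concatenate with the next sentence.
                        sent = sent + " " + sents[i + 1]
                        i += 1
                    else:
                        # Truncate at a random word boundary.
                        words = sent.split()
                        if len(words) > 2:
                            sent = " ".join(words[: random.randint(1, len(words) - 1)])
                f.write(sent + "\n")
                i += 1
            f.write("\n")  # blank line = document boundary
```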

@winston-zillow
Author

I see, especially the explanation in create_pretraining_data.py. It would be nice to mention your comments in the README, or even update the paper.

@jacobdevlin-google
Contributor

I added a paragraph in the README about this, thanks.

@xgk

xgk commented Nov 8, 2018

Hello @jacobdevlin-google
Really appreciate your work :) Would you mind sharing the pre-training set size in terms of instances? You mentioned the corpus is about 3.3 billion words, but when preprocessing the data before feeding it to the model, I noticed the duplication factor (dupe_factor) defaults to 10. Based on my current experiment, preprocessing enlarges the training data to about 20x the original. I'm not sure if I made a mistake; would you share the actual training set size and confirm whether my finding is correct?
Thanks!
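
As a rough, non-authoritative sanity check on instance counts: with the default max_seq_length of 128 and dupe_factor of 10 in create_pretraining_data.py, and assuming roughly 1.3 WordPiece tokens per word, a back-of-envelope estimate looks like the sketch below. estimate_num_instances is a hypothetical helper; the real count also depends on document lengths and short_seq_prob.

```python
# Back-of-envelope estimate of the number of pre-training instances
# produced by create_pretraining_data.py. Illustrative only.

def estimate_num_instances(num_words=3.3e9,
                           wordpieces_per_word=1.3,  # assumption, varies by corpus
                           max_seq_length=128,
                           dupe_factor=10):
    tokens = num_words * wordpieces_per_word
    # Each instance holds roughly max_seq_length tokens, and the whole
    # corpus is duplicated dupe_factor times with different masks.
    return dupe_factor * tokens / max_seq_length

print(f"{estimate_num_instances():.2e}")  # ~3.4e8 under these assumptions
```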

@artemisart
Contributor

@xgk are you using Chinese or another language where one token corresponds to one character rather than one word? That would explain the size increase.
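
A tiny illustration of that point, assuming BERT's BasicTokenizer behavior of splitting CJK text into individual characters before WordPiece, so the token count for Chinese is roughly the character count rather than the whitespace word count:

```python
# Character-level splitting inflates the data size relative to a
# whitespace word count (sentences are illustrative).
en = "the quick brown fox"
zh = "敏捷的棕色狐狸跳过了懒狗"

print(len(en.split()))  # 4 whitespace "words"
print(len(zh))          # 12 characters -> roughly 12 tokens after CJK splitting
```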

nishans08 added a commit to nishans08/bert that referenced this issue Jul 28, 2022
and additionally inclues Thai and Mongolian. -> and additionally includes Thai and Mongolian.

FIX google-research#13