
Need clarification for pre-training #13

Closed
winston-zillow opened this issue Oct 31, 2018 · 5 comments · May be fixed by #1355

Comments

@winston-zillow

In the README.md, it says for the pre-training:

It is important that these be actual sentences 
for the "next sentence prediction" task

and in the example sample_text.txt each line does end with either . or ;.

Whereas in the BERT paper, it says

... we sample two spans of text from the corpus, which we refer to as "sentences" 
even though they are typically much longer than single sentences 
(but can be shorter also)

So it is unclear whether this implementation expects actual sentences per line, or whether documents can just be broken into multiple lines arbitrarily.

@jacobdevlin-google
Contributor

What the paper refers to happens inside create_pretraining_data.py. See here

So the input text file should contain actual sentences, although feel free to add some noise if you want to make things more robust for fine-tuning. For example, if your sentence segmenter always splits on ., this may create a weird bias, so arbitrarily truncating or concatenating 5% of the training data may make fine-tuning more robust to non-sentential data. We don't have any hard numbers on this.

For our sentence segmenter I just used some Google-internal library I found, but anything off the shelf like [spaCy](https://spacy.io/usage/spacy-101) should work.
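
For concreteness, a minimal sketch of preparing input in the format create_pretraining_data.py expects (one sentence per line, blank line between documents). It assumes spaCy's en_core_web_sm pipeline for segmentation; the write_pretraining_text helper and the 5% noise_rate are illustrative, following the rough suggestion above rather than any official recipe.

```python
import random
import spacy

# Assumes `python -m spacy download en_core_web_sm` has been run;
# any off-the-shelf sentence segmenter would do.
nlp = spacy.load("en_core_web_sm")

def write_pretraining_text(documents, out_path, noise_rate=0.05):
    """Write one sentence per line, with a blank line between documents,
    which is the input format create_pretraining_data.py expects.
    A small fraction of sentences is arbitrarily truncated or merged,
    roughly following the 5% suggestion above (no hard numbers)."""
    with open(out_path, "w", encoding="utf-8") as f:
        for doc_text in documents:
            sents = [s.text.strip() for s in nlp(doc_text).sents if s.text.strip()]
            i = 0
            while i < len(sents):
                sent = sents[i]
                if random.random() < noise_rate and len(sents) > 1:
                    if random.random() < 0.5 and i + 1 < len(sents):
                        # Concatenate with the next sentence.
                        sent = sent + " " + sents[i + 1]
                        i += 1
                    else:
                        # Truncate at a random word boundary.
                        words = sent.split()
                        if len(words) > 2:
                            sent = " ".join(words[: random.randint(1, len(words) - 1)])
                f.write(sent + "\n")
                i += 1
            f.write("\n")  # blank line = document boundary
```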

@winston-zillow
Author

I see, especially the explanation in create_pretraining_data.py. It would be nice to mention your comments in the README, or even update the paper.

@jacobdevlin-google
Contributor

I added a paragraph in the README about this, thanks.

@xgk

xgk commented Nov 8, 2018

Hello @jacobdevlin-google
Really appreciate your work :) Would you mind sharing the pre-training set size in terms of instances? You mentioned the corpus is about 3.3 billion words, but when preprocessing the data before feeding it to the model, I noticed the duplication factor (dupe_factor) defaults to 10. Based on my current experiment, preprocessing enlarges the training data to about 20x the original. I'm not sure if I made a mistake; would you share the actual training set size and confirm whether my finding is correct?
Thanks!
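
As a rough, non-authoritative sanity check on instance counts: with the default max_seq_length of 128 and dupe_factor of 10 in create_pretraining_data.py, and assuming roughly 1.3 WordPiece tokens per word, a back-of-envelope estimate looks like the sketch below. estimate_num_instances is a hypothetical helper; the real count also depends on document lengths and short_seq_prob.

```python
# Back-of-envelope estimate of the number of pre-training instances
# produced by create_pretraining_data.py. Illustrative only.

def estimate_num_instances(num_words=3.3e9,
                           wordpieces_per_word=1.3,  # assumption, varies by corpus
                           max_seq_length=128,
                           dupe_factor=10):
    tokens = num_words * wordpieces_per_word
    # Each instance holds roughly max_seq_length tokens, and the whole
    # corpus is duplicated dupe_factor times with different masks.
    return dupe_factor * tokens / max_seq_length

print(f"{estimate_num_instances():.2e}")  # ~3.4e8 under these assumptions
```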

@artemisart
Contributor

@xgk are you using Chinese or another language where one token corresponds to one character rather than one word? That would explain the size increase.
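
A tiny illustration of that point, assuming BERT's BasicTokenizer behavior of splitting CJK text into individual characters before WordPiece, so the token count for Chinese is roughly the character count rather than the whitespace word count:

```python
# Character-level splitting inflates the data size relative to a
# whitespace word count (sentences are illustrative).
en = "the quick brown fox"
zh = "敏捷的棕色狐狸跳过了懒狗"

print(len(en.split()))  # 4 whitespace "words"
print(len(zh))          # 12 characters -> roughly 12 tokens after CJK splitting
```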

nishans08 added a commit to nishans08/bert that referenced this issue Jul 28, 2022
and additionally inclues Thai and Mongolian. -> and additionally includes Thai and Mongolian.

FIX google-research#13