# Data Preparation for BERT Pretraining

The following steps prepare the Wikipedia corpus for pretraining. They can also be used, with little or no modification, to preprocess other datasets:

1. Download the wiki dump file from https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2.
   This is a bzip2-compressed archive and needs to be decompressed before the next step. (An illustrative Python sketch of this step appears after this list.)

2. Clone Wikiextractor and run it:

        git clone https://github.com/attardi/wikiextractor
        python3 wikiextractor/WikiExtractor.py -o out -b 1000M enwiki-latest-pages-articles.xml

   Running time is roughly 5-10 minutes per GB of input.
   Output: `out` directory

3. Run:

        ln -s out out2
        python3 AzureML-BERT/pretrain/PyTorch/dataprep/single_line_doc_file_creation.py

   This script removes HTML tags and empty lines and writes the text to a single file in which each line is one paragraph. (Run `pip install tqdm` first if needed.) A rough illustration of this step appears after this list.
   Output: `wikipedia.txt`

4. Run:

        python3 AzureML-BERT/pretrain/PyTorch/dataprep/sentence_segmentation.py wikipedia.txt wikipedia.segmented.nltk.txt

   This script converts `wikipedia.txt` into a file in which each line is a single sentence. (Run `pip install nltk` first if needed.) A sketch of NLTK-based segmentation appears after this list.
   Output: `wikipedia.segmented.nltk.txt`

5. Split the output of the previous step into roughly 100 files by line count:

        mkdir data_shards
        python3 AzureML-BERT/pretrain/PyTorch/dataprep/split_data_into_files.py

   A sketch of this kind of line-based sharding appears after this list.
   Output: `data_shards` directory

6. Run:

        python3 AzureML-BERT/pretrain/PyTorch/dataprep/create_pretraining.py --input_dir=data_shards --output_dir=pickled_pretrain_data --do_lower_case=true

   This script converts each shard into a pickled `.bin` file.
   Output: `pickled_pretrain_data` directory
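
The sketches below are rough illustrations of individual steps above; they are not the repository's scripts, and the paths, helpers, and shard counts in them are assumptions. First, downloading and decompressing the dump (step 1) can also be done from Python with the standard library, for example:

```python
# Illustrative sketch of step 1: download the compressed dump and
# stream-decompress it to the XML file that WikiExtractor expects.
# The archive is tens of gigabytes, so wget/curl plus bunzip2 work just as well.
import bz2
import shutil
import urllib.request

DUMP_URL = ("https://dumps.wikimedia.org/enwiki/latest/"
            "enwiki-latest-pages-articles.xml.bz2")

# Download the compressed dump into the current directory.
urllib.request.urlretrieve(DUMP_URL, "enwiki-latest-pages-articles.xml.bz2")

# Decompress it without holding the whole file in memory.
with bz2.open("enwiki-latest-pages-articles.xml.bz2", "rb") as src, \
        open("enwiki-latest-pages-articles.xml", "wb") as dst:
    shutil.copyfileobj(src, dst)
```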
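
Collapsing the WikiExtractor output into one paragraph per line (step 3) can be illustrated as follows. This is not `single_line_doc_file_creation.py`; it assumes the usual WikiExtractor layout of `out/*/wiki_*` files whose articles are wrapped in `<doc ...>`/`</doc>` tags.

```python
# Illustrative sketch of step 3: strip the <doc> wrapper tags and empty lines
# from the WikiExtractor output and write one paragraph per line.
import glob
import os

with open("wikipedia.txt", "w", encoding="utf-8") as out_file:
    for path in sorted(glob.glob(os.path.join("out", "*", "wiki_*"))):
        with open(path, encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                # Skip document wrapper tags and blank lines.
                if not line or line.startswith("<doc") or line.startswith("</doc"):
                    continue
                out_file.write(line + "\n")
```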
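
Sentence segmentation with NLTK (step 4) can be sketched like this; it is a simplified stand-in for `sentence_segmentation.py` and only assumes the paragraph-per-line `wikipedia.txt` produced above.

```python
# Illustrative sketch of step 4: split each paragraph of wikipedia.txt into
# sentences with NLTK and write one sentence per line.
import nltk

nltk.download("punkt")  # tokenizer models required by sent_tokenize

with open("wikipedia.txt", encoding="utf-8") as src, \
        open("wikipedia.segmented.nltk.txt", "w", encoding="utf-8") as dst:
    for paragraph in src:
        for sentence in nltk.tokenize.sent_tokenize(paragraph.strip()):
            dst.write(sentence + "\n")
```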
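
Finally, splitting the segmented corpus into roughly 100 line-based shards (step 5) might look like the sketch below; the shard count, file names, and two-pass contiguous split are assumptions rather than the exact behavior of `split_data_into_files.py`.

```python
# Illustrative sketch of step 5: split wikipedia.segmented.nltk.txt into ~100
# contiguous shards under data_shards/ (contiguity keeps sentences from the
# same article together, which pretraining-data creation typically relies on).
import os

NUM_SHARDS = 100
INPUT = "wikipedia.segmented.nltk.txt"
os.makedirs("data_shards", exist_ok=True)

# First pass: count lines so each shard gets a roughly equal share.
with open(INPUT, encoding="utf-8") as f:
    total_lines = sum(1 for _ in f)
lines_per_shard = total_lines // NUM_SHARDS + 1

# Second pass: write contiguous blocks of lines_per_shard lines per shard.
with open(INPUT, encoding="utf-8") as f:
    for shard_id in range(NUM_SHARDS):
        shard_path = os.path.join("data_shards", f"shard_{shard_id:03d}.txt")
        with open(shard_path, "w", encoding="utf-8") as shard:
            for _ in range(lines_per_shard):
                line = f.readline()
                if not line:
                    break
                shard.write(line)
```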