Which enwiki-latest-pages-articles dump does the TinyBERT experiment actually use? #230

Open
ra225 opened this issue Dec 11, 2022 · 0 comments
ra225 commented Dec 11, 2022

Page 6 of the paper says:
For the general distillation, we set the maximum sequence length to 128 and use English Wikipedia (2,500M words)
I downloaded "the latest dump" from the link given at
https://github.com/google-research/bert
After decompression the archive becomes an 86 GB XML file. This repo's preprocessing code kept failing with out-of-disk-space errors and crashed every ten-odd hours. After reading the code, I changed the self.document_shelf_filepath path on line 52 of pregenerate_training_data.py from the /cache/ directory to a 500 GB directory on an external disk. The disk-space errors finally stopped, but processing is very slow: it took 84 hours just to advance from line 367 to line 390.
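For reference, my edit looks roughly like this (a sketch only; the surrounding DocumentDatabase code is paraphrased from memory and the new path is just my own external-disk directory):

```python
# pregenerate_training_data.py -- sketch of my change around line 52
import shelve
from pathlib import Path

class DocumentDatabase:
    def __init__(self, reduce_memory=False):
        if reduce_memory:
            # Original: the shelve file lived under /cache/, which kept
            # filling up the small partition on my machine.
            # self.working_dir = Path('/cache/')
            # Changed to a directory on the 500 GB external disk (my path):
            self.working_dir = Path('/mnt/data500g/tinybert_cache')
            self.document_shelf_filepath = self.working_dir / 'shelf.db'
            # flag='n' always creates a fresh shelf; protocol=-1 uses the
            # highest available pickle protocol.
            self.document_shelf = shelve.open(str(self.document_shelf_filepath),
                                              flag='n', protocol=-1)
            self.documents = None
        else:
            self.documents = []
            self.document_shelf = None
            self.document_shelf_filepath = None
```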
Then came the crushing part: the script still has to generate data for 3 epochs, and after another 2 days it had only finished 5% of the first epoch. At that rate one epoch takes about 40 days, so 3 epochs would take 120 days!
Does the data preprocessing alone really take this long? And even if it finishes, GPU training still comes after that; will that take even longer?
Which dataset did the paper actually use? Do I need to run this on the Huawei Cloud platform to make it faster?
