Hi, a question about the special tokens used in data construction #208

Open
IamRoBota opened this issue Apr 29, 2023 · 2 comments

IamRoBota commented Apr 29, 2023

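I noticed that when TokenTruncation.process() builds input_ids, it concatenates a and b and then appends two special tokens to the end of the sequence.
[Screenshot 2023-04-29 at 23 21 50]

Questions:
1. Why are two needed? What would happen with only one?
2. If I need a special token inside sentence a to separate its two parts (the preceding and following sentences), which one would be the better choice? As far as I can see, the ChatGLM tokenizer's only special tokens are <eop>, <pad>, <sop>, <unk>, and [MASK].

Thanks 🙏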

ssbuild (Owner) commented Apr 30, 2023

Either one or two works. The second one just reinforces the end token.
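For readers following along, the construction being discussed can be sketched roughly as below. This is only an illustration under stated assumptions, not the repository's actual TokenTruncation.process() code: the helper name build_input_ids, the num_eos parameter, and the numeric token ids are all hypothetical, and the real end-token id would come from the tokenizer or model config.

```python
from typing import List


def build_input_ids(
    a_ids: List[int],
    b_ids: List[int],
    eos_id: int,
    num_eos: int = 2,
) -> List[int]:
    """Concatenate prompt ids (a) and answer ids (b), then append end tokens.

    Hypothetical helper for illustration; `num_eos` controls how many
    end tokens are appended (the thread discusses 1 versus 2).
    """
    if num_eos < 1:
        raise ValueError("num_eos must be at least 1")
    return list(a_ids) + list(b_ids) + [eos_id] * num_eos


# Made-up ids for illustration only; real ids come from the tokenizer.
EOS_ID = 9
a_ids = [11, 12, 13]
b_ids = [21, 22]

one = build_input_ids(a_ids, b_ids, EOS_ID, num_eos=1)
two = build_input_ids(a_ids, b_ids, EOS_ID, num_eos=2)
print(one)
print(two)
```

The two variants differ only in the trailing end token, which matches the reply above: either form marks the end of the sequence, and the second token simply repeats that signal.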

IamRoBota (Author) commented

"One"

Thank you! What about my second question? Would it be better not to use a newline character for that?
