Running "yarn run ingest" converts Japanese punctuation marks to Korean characters #460

Closed
mikeyang01 opened this issue Feb 14, 2024 · 1 comment
Labels
kind: bug Related to a bug, vulnerability, unexpected error with an existing feature

mikeyang01 commented Feb 14, 2024

yarn run ingest result:

root@2gb:~/gpt4-pdf-chatbot-langchain/docs# yarn run ingest
yarn run v1.22.21
$ tsx -r dotenv/config scripts/ingest-data.ts
split docs [
Document {
pageContent: '日本人は셈 世界でいちばん寝不足 \n' +
'睡眠不足が慢性化すると셈 睡眠負債に \n' +
'近年셈섰睡眠負債섻という言葉が注目を集めるようになっています셆 睡 \n' +
'眠負債とは셈 どのような状態のことをいうのでしょうか셆 \n' +
'毎日 8 時間の睡眠を必要とする人が셈 ある日셈 何らかの理由で 6 時間 \n' +
'しか眠らなかったとします셆 この 2 時間の睡眠不足は셈 それだけでは \n' +
'睡眠負債とはいいません셆 しかし셈 睡眠不足が何日も重なり셈 数日か \n' +
'ら数週間の単位で睡眠不足が慢性化した状態になると셈 睡眠負債とよ \n' +
'ばれるようになります셆 \n' +

Original document:
日本人は,世界でいちばん寝不足
睡眠不足が慢性化すると,睡眠負債に
近年,「睡眠負債」という言葉が注目を集めるようになっています。睡眠負債とは,どのような状態のことをいうのでしょうか。
毎日8時間の睡眠を必要とする人が,ある日,何らかの理由で6時間しか眠らなかったとします。この2時間の睡眠不足は,それだけでは睡眠負債とはいいません。しかし,睡眠不足が何日も重なり,数日から数週間の単位で睡眠不足が慢性化した状態になると,睡眠負債とよばれるようになります。

Error details:
Japanese punctuation marks were converted to Korean (Hangul) characters. For example, 。 became 셆 and , became 셈.
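
As a stopgap, the four substitutions visible in the log above could be reversed after extraction. The sketch below is only that: the mapping is inferred from this one sample (셈 for ,, 셆 for 。, 섰 for 「, 섻 for 」), other punctuation marks may be affected in ways the log does not show, and `restoreJapanesePunctuation` is a hypothetical helper name, not part of the project.

```typescript
// Mapping inferred from the single log sample in this issue. It is an
// assumption, not a complete table of affected characters.
const OBSERVED_SUBSTITUTIONS = new Map<string, string>([
  ["셈", ","], // fullwidth comma
  ["셆", "。"], // ideographic full stop
  ["섰", "「"], // left corner bracket
  ["섻", "」"], // right corner bracket
]);

// Hypothetical helper: swaps each observed Hangul stand-in for the
// punctuation mark it replaced and leaves every other character alone.
function restoreJapanesePunctuation(text: string): string {
  return Array.from(text)
    .map((ch) => OBSERVED_SUBSTITUTIONS.get(ch) ?? ch)
    .join("");
}

console.log(restoreJapanesePunctuation("近年셈섰睡眠負債섻という言葉"));
```

This kind of blind replacement is only safe for documents known to contain no genuine Korean text, because real occurrences of these syllables would be rewritten too. It treats the symptom and does not explain where the substitution happens.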


dosubot bot commented Feb 14, 2024

🤖

Thank you for providing the detailed error log. It seems like the issue is related to the text processing part of the ingestion process. The yarn run ingest command uses a script that includes a text splitting step, which might be where the punctuation conversion is happening.

The script uses RecursiveCharacterTextSplitter to split the text into chunks. If this splitter is not handling Japanese punctuation correctly, it could be replacing it with Korean characters during the splitting process.
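
This hypothesis can be tested without reading the splitter's source. The sketch below compares the set of characters in the text handed to the splitter with the characters in the resulting chunks. If the function returns an empty list, the splitter introduced nothing new and the substitution must have happened earlier. The function name and the sample strings are hypothetical; in the real script the input and the chunk contents would have to be captured around the split call.

```typescript
// Returns the distinct characters that appear in any chunk but nowhere in
// the text that was handed to the splitter.
function charactersIntroducedBySplitting(input: string, chunks: string[]): string[] {
  const inputChars = new Set(Array.from(input));
  const introduced = new Set<string>();
  for (const chunk of chunks) {
    for (const ch of Array.from(chunk)) {
      if (!inputChars.has(ch)) introduced.add(ch);
    }
  }
  return Array.from(introduced);
}

// Stand-in data (an assumption for illustration): the Hangul characters are
// already present in the input, so the split itself adds nothing.
const sampleInput = "近年셈섰睡眠負債섻という言葉";
const sampleChunks = ["近年셈섰睡眠", "負債섻という言葉"];
console.log(charactersIntroducedBySplitting(sampleInput, sampleChunks));
```

Comparing character sets instead of exact substrings keeps the check valid even if the splitter trims whitespace or overlaps chunks.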

However, without access to the specific implementation of RecursiveCharacterTextSplitter and the rest of the codebase, it's hard to pinpoint the exact cause of the issue. It would be helpful if you could provide more information about the RecursiveCharacterTextSplitter and how it's implemented, or any other part of the code that might be handling text processing.
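
One piece of information that would narrow this down is whether Hangul characters are already present in the text before it reaches the splitter, for example in the raw output of whatever loader reads the source document. A minimal sketch of such a scan follows. `countHangulSyllables` is a hypothetical helper, and the sample string is copied from the log in this issue.

```typescript
// Hangul Syllables block, U+AC00 through U+D7A3.
const HANGUL_SYLLABLE = /[\uAC00-\uD7A3]/;

// Hypothetical helper: counts how often each Hangul syllable occurs in `text`.
function countHangulSyllables(text: string): Map<string, number> {
  const counts = new Map<string, number>();
  for (const ch of Array.from(text)) {
    if (HANGUL_SYLLABLE.test(ch)) {
      counts.set(ch, (counts.get(ch) ?? 0) + 1);
    }
  }
  return counts;
}

// In a Japanese-only document any hit is suspicious.
const sampleLine = "日本人は셈 世界でいちばん寝不足";
console.log(countHangulSyllables(sampleLine));
```

Running the same scan on the loader output and on the split chunks shows at which stage the unexpected characters first appear.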

In the meantime, you might want to check the configuration of the text splitter and see if there's a setting that's causing this issue. If the splitter is using a library or service for text processing, you might also want to check the documentation for that to see if there are any known issues or settings that could cause this.

If you're unable to find a solution, it might be worth considering using a different method for text splitting that can handle Japanese punctuation correctly. There are many libraries and services available that can handle text processing for different languages, so you might be able to find one that works better for your needs.

I hope this helps! If you have any more questions or need further clarification, please don't hesitate to ask.


dosubot bot added the "kind: bug" label on Feb 14, 2024
dosubot bot added the "stale" label on May 15, 2024
dosubot bot closed this as not planned on May 22, 2024
dosubot bot removed the "stale" label on May 22, 2024