Need Mongolian traineddata #85

Open
Skeetfly opened this issue Nov 16, 2017 · 9 comments

Comments

@Skeetfly

I'm thinking about using Tesseract for LPR. How good is it?

@scubess

scubess commented Feb 8, 2018

Does anyone have an update on training the Mongolian language?

@stweil
Contributor

stweil commented Feb 17, 2018

There are some repositories on GitHub: khangaikh/tesseract-mon, dolugen/tesseract-mnc, maybe more.

But there seems to be code missing in Tesseract for Mongolian, see ccmain/pageiterator.cpp.

@Shreeshrii
Contributor

http://www.alanwood.net/unicode/mongolian.html

The Mongolian range was introduced with version 3.0 of the Unicode Standard. Mongolian is the caseless script used for writing Menggu (the language of the Chinese province of Nei Menggu) and for the Manchu, Sibe and Todo languages. It was formerly used for Khalkha, the national language of Mongolia, but is now mainly restricted to religious texts, having been replaced by Cyrillic for other uses. Mongolian is written vertically from left to right.

khangaikh/tesseract-mon, dolugen/tesseract-mnc,

Both of these are for Mongolian-Cyrillic.

The Tesseract repos also have mon.traineddata; I am not sure whether it is Cyrillic or traditional Mongolian script.

https://github.com/tesseract-ocr/tessdata_fast/blob/master/mon.traineddata

https://github.com/tesseract-ocr/tessdata_best/blob/master/mon.traineddata

@Shreeshrii
Contributor

I checked the wordlist from mon.traineddata. Here is a sample from it:

Хи
Хил
Хилчид
Хилчин
Хилчний
Хилээс
Хилээр
Хилэн
Хилэнц
Хилэнцийн
Хилэнцийнхэн
Хилари
Хилл
Хиллари

So it looks like it is Mongolian-Cyrillic.

The most recent Mongolian alphabet is based on the Cyrillic script, more specifically the Russian alphabet plus the letters Өө /ö/ and Үү /ü/. It was introduced in the 1940s and has been in use as the official writing system of Mongolia ever since.
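
The same check can be done mechanically instead of by eye. Below is a minimal sketch, assuming the wordlist has already been extracted into a list of strings; the function names are mine and not part of Tesseract. It counts letters per Unicode block (Cyrillic U+0400–U+04FF, Mongolian U+1800–U+18AF) and reports the dominant one.

```python
from collections import Counter

# Unicode block ranges taken from the Unicode Standard.
CYRILLIC = range(0x0400, 0x0500)    # basic Cyrillic block
MONGOLIAN = range(0x1800, 0x18B0)   # traditional Mongolian block


def classify_script(words):
    """Count letters per script across an iterable of words."""
    counts = Counter()
    for word in words:
        for ch in word:
            cp = ord(ch)
            if cp in CYRILLIC:
                counts["cyrillic"] += 1
            elif cp in MONGOLIAN:
                counts["mongolian"] += 1
            elif ch.isalpha():
                counts["other"] += 1
    return counts


def dominant_script(words):
    """Return the most frequent script label, or None for empty input."""
    counts = classify_script(words)
    return counts.most_common(1)[0][0] if counts else None


# A few entries from the sample above.
sample = ["Хил", "Хилчид", "Хилэнц", "Хиллари"]
print(dominant_script(sample))  # cyrillic
```

A wordlist in traditional Mongolian script would come out as "mongolian" instead, which makes it easy to tell the two kinds of traineddata apart.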

ref: https://en.wikipedia.org/wiki/Mongolian_writing_systems

@Skeetfly @scubess

Were you looking for Mongolian-Cyrillic or the traditional Mongolian traineddata?

@Shreeshrii
Contributor

@stweil

Mongolian written in the traditional Mongolian script runs vertically, with columns ordered from left to right. https://github.com/tesseract-ocr/tesseract/blob/master/ccmain/pageiterator.cpp#L543 seems related to that.

However, mon.traineddata, which is Mongolian in Cyrillic, does not require it.

Here is a sample of a wordlist for Mongolian written in the Mongolian script, taken from http://crubadan.org/languages/mn-Mong

ᠢᠨ 3802
ᠡᠮᠦᠨᠡᠡ 2800
ᠠ 2670
ᠰᠠᠷᠠᠠ 2083
ᠢ 1830
ᠦᠭᠡᠢ 1574
ᠳᠤ 1543
ᠡ 1501
ᠦᠨ 1453
ᠨᠢ 1422
ᠭᠠᠷᠠᠭ 1388
ᠪᠠᠢᠨᠠᠠ 1315
ᠤᠨ 1220
ᠶᠢᠨ 1178
ᠤ 1058
ᠳᠦ 1026

@scubess

scubess commented Feb 18, 2018

@Shreeshrii @stweil Hi guys,

Thanks for your replies! As you mentioned, @Shreeshrii, I am also not sure whether the tessdata_best mon.traineddata file was trained on traditional Mongolian or Cyrillic.
On the other hand, I tried to integrate the mon.traineddata file into the iOS app I am working on. I tried to use the tessdata_best mon.traineddata, but it crashes every time with:

actual_tessdata_num_entries_ <= TESSDATA_NUM_ENTRIES

I found that the traineddata version did not match the Tesseract engine version, which is a different issue from the one we are discussing here. So I got it working on Cyrillic text by training with:
Tesseract 3.03-rc1 (Homepage)
Leptonica 1.71 (Homepage)
Based on the sample training text, I can see that Mongolian Cyrillic is recognised correctly. I put it in a repo for people who are looking for Mongolian Cyrillic trained data: https://github.com/scubess/Tesseract-Mongolian-Training
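
For anyone hitting the same assertion, here is a rough way to check a traineddata file before shipping it in an app. This is only a sketch under assumptions: it assumes the file starts with a 32-bit little-endian entry count (which is what the assertion name suggests), and `engine_max_entries` is a hypothetical parameter that you would set to the TESSDATA_NUM_ENTRIES value of your own Tesseract build.

```python
import struct


def read_num_entries(path):
    """Read the leading 32-bit entry count of a traineddata file.

    Assumption: the file begins with a little-endian int32 entry count.
    """
    with open(path, "rb") as f:
        header = f.read(4)
    if len(header) < 4:
        raise ValueError("file too short to contain an entry count")
    return struct.unpack("<i", header)[0]


def fits_engine(path, engine_max_entries):
    """True if the file declares no more entries than the engine supports.

    engine_max_entries is a hypothetical parameter: use the value of
    TESSDATA_NUM_ENTRIES from the Tesseract version you link against.
    """
    return read_num_entries(path) <= engine_max_entries
```

If `fits_engine` returns False, the file was most likely produced for a newer engine than the one in the app, and either the engine or the traineddata needs to change.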

@Shreeshrii I will update the traineddata file with the wordlist too.

@Skeetfly for LPR, you can apply a regex to the recognised result from Tesseract.

I hope it's useful.
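
As a rough illustration of the regex idea: the sketch below assumes a hypothetical plate format of four digits followed by three Cyrillic capital letters, and assumes the OCR result is already available as a plain string. Adjust the pattern to the plates you actually need to read.

```python
import re

# Hypothetical plate format: four digits followed by three Cyrillic capitals.
PLATE_RE = re.compile(r"\d{4}[А-ЯӨҮЁ]{3}")


def extract_plates(ocr_text):
    """Return plate-like substrings found in raw OCR output."""
    # Upper-case the text and drop whitespace and common separator noise,
    # since OCR often splits a plate into several fragments.
    cleaned = re.sub(r"[\s\-_.]", "", ocr_text.upper())
    return PLATE_RE.findall(cleaned)


print(extract_plates("12 34 уба"))  # ['1234УБА']
```

Stripping all whitespace can also glue unrelated tokens together, so for noisy images it may be safer to match per text line rather than on the whole page.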

@suyie001

suyie001 commented Mar 1, 2024

Is there any progress in the work on traditional Mongolian?

@stweil
Contributor

stweil commented Mar 1, 2024

I don't know of anyone who works on it.
