Need Mongolian traineddata #85

Skeetfly · 2017-11-16T07:56:21Z

I'm thinking about using Tesseract for LPR. How good is it?

scubess · 2018-02-08T17:41:46Z

Does anyone have an update on training the Mongolian language?

stweil · 2018-02-17T16:50:01Z

There are some repositories on GitHub: khangaikh/tesseract-mon, dolugen/tesseract-mnc, maybe more.

But there seems to be code missing in Tesseract for Mongolian, see ccmain/pageiterator.cpp.

Shreeshrii · 2018-02-17T18:18:46Z

http://www.alanwood.net/unicode/mongolian.html

The Mongolian range was introduced with version 3.0 of the Unicode Standard. Mongolian is the caseless script used for writing Menggu (the language of the Chinese province of Nei Menggu) and for the Manchu, Sibe and Todo languages. It was formerly used for Khalkha, the national language of Mongolia, but is now mainly restricted to religious texts, having been replaced by Cyrillic for other uses. Mongolian is written vertically from left to right.

khangaikh/tesseract-mon, dolugen/tesseract-mnc,

Both of these are for Mongolian-Cyrillic.

The Tesseract repos also have mon.traineddata; I'm not sure whether it is Cyrillic or otherwise.

https://github.com/tesseract-ocr/tessdata_fast/blob/master/mon.traineddata

https://github.com/tesseract-ocr/tessdata_best/blob/master/mon.traineddata

Shreeshrii · 2018-02-18T12:26:41Z

I checked the wordlist from mon.traineddata. Here is a sample from it:

Хи
Хил
Хилчид
Хилчин
Хилчний
Хилээс
Хилээр
Хилэн
Хилэнц
Хилэнцийн
Хилэнцийнхэн
Хилари
Хилл
Хиллари

So it looks like it is Mongolian-Cyrillic.
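The same check can be automated instead of eyeballing the wordlist. The following is a minimal sketch, assuming the wordlist has already been extracted to a plain list of words; the function names `script_of` and `dominant_script` are made up for illustration and are not part of Tesseract. It only relies on the Unicode block ranges for Cyrillic (U+0400 to U+04FF) and Mongolian (U+1800 to U+18AF).

```python
from collections import Counter

# Unicode block boundaries (inclusive), as defined by the Unicode Standard.
CYRILLIC = (0x0400, 0x04FF)
MONGOLIAN = (0x1800, 0x18AF)


def script_of(word):
    """Return 'cyrillic', 'mongolian', 'mixed', or 'other' for one word."""
    seen = set()
    for ch in word:
        cp = ord(ch)
        if CYRILLIC[0] <= cp <= CYRILLIC[1]:
            seen.add("cyrillic")
        elif MONGOLIAN[0] <= cp <= MONGOLIAN[1]:
            seen.add("mongolian")
        elif ch.isalpha():
            # Any other letter (Latin, CJK, ...) counts as 'other'.
            seen.add("other")
    if len(seen) == 1:
        return seen.pop()
    return "mixed" if seen else "other"


def dominant_script(words):
    """Return the most common script label across a list of words."""
    counts = Counter(script_of(w) for w in words)
    if not counts:
        return "other"
    return counts.most_common(1)[0][0]


if __name__ == "__main__":
    sample = ["Хи", "Хил", "Хилчид", "Хилчин", "Хиллари"]
    print(dominant_script(sample))
```

Run on the sample above, every word falls in the Cyrillic block, which matches the manual reading of the list.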

The most recent Mongolian alphabet is based on the Cyrillic script, more specifically the Russian alphabet plus the letters Өө /ö/ and Үү /ü/. It was introduced in the 1940s and has been in use as the official writing system of Mongolia ever since.

ref: https://en.wikipedia.org/wiki/Mongolian_writing_systems

@Skeetfly @scubess

Were you looking for Mongolian-Cyrillic or the traditional Mongolian traineddata?

Shreeshrii · 2018-02-18T12:30:07Z

@stweil

Mongolian written in the Mongolian script is written vertically from left to right. https://github.com/tesseract-ocr/tesseract/blob/master/ccmain/pageiterator.cpp#L543 seems related to that.

However, mon.traineddata, which is Mongolian in Cyrillic, does not require it.

Here is a sample of a wordlist for Mongolian written in the Mongolian script, taken from http://crubadan.org/languages/mn-Mong

ᠢᠨ 3802
ᠡᠮᠦᠨᠡᠡ 2800
ᠠ 2670
ᠰᠠᠷᠠᠠ 2083
ᠢ 1830
ᠦᠭᠡᠢ 1574
ᠳᠤ 1543
ᠡ 1501
ᠦᠨ 1453
ᠨᠢ 1422
ᠭᠠᠷᠠᠭ 1388
ᠪᠠᠢᠨᠠᠠ 1315
ᠤᠨ 1220
ᠶᠢᠨ 1178
ᠤ 1058
ᠳᠦ 1026

Shreeshrii · 2018-02-18T12:40:37Z

Related Info:

http://scriptsource.org/cms/scripts/page.php?item_id=script_detail&key=Mong

https://www.ethnologue.com/language/mvf

https://groups.google.com/forum/#!msg/tesseract-ocr/EjnYPwmx7UM/lmzi37oKjQsJ
how add a new language
tesseract mvf.baiti.exp0.tif mvf.baiti.exp0 -l mvf batch.nochop makebox

http://www.babelstone.co.uk/Mongolian/Report170.pdf
http://www.babelstone.co.uk/Mongolian/Report170A.pdf
http://www.babelstone.co.uk/Mongolian/Report170B.pdf

https://r12a.github.io/mongolian-variants/
https://r12a.github.io/scripts/links?script=mongolian

scubess · 2018-02-18T21:35:20Z

@Shreeshrii @stweil Hi guys,

Thanks for your replies! As you mentioned, @Shreeshrii, I am also not sure whether the tessdata_best mon.traineddata file was trained on traditional or Cyrillic script.
On the other hand, I tried to integrate the mon.traineddata file into the iOS app I am working on. I tried the tessdata_best mon.traineddata, but it crashes every time with:

actual_tessdata_num_entries_ <= TESSDATA_NUM_ENTRIES

I found that the traineddata version does not match the Tesseract engine version, which is a different issue from what we are talking about here. I got it working on Cyrillic text when I trained with
Tesseract 3.03-rc1
Leptonica 1.71
Based on the sample training text, I can see that Mongolian Cyrillic is recognised correctly. I put it in a repo for people who are looking for Mongolian Cyrillic trained data: https://github.com/scubess/Tesseract-Mongolian-Training

@Shreeshrii I will update the traineddata file with the wordlist too.

@Skeetfly for LPR, you can apply a regex to the recognised result from Tesseract.
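As a rough illustration of that regex step, here is a minimal sketch. The plate format (four digits followed by three Cyrillic capital letters, with an optional space between them) is an assumption made for the example, and `extract_plates` is a hypothetical helper, not a Tesseract function. The input is simply whatever text the OCR step returned.

```python
import re

# ASSUMED plate format, for illustration only: 4 digits, optional space,
# then 3 Cyrillic capital letters (Russian capitals plus Ө and Ү).
PLATE_RE = re.compile(r"\b(\d{4})\s?([А-ЯӨҮ]{3})\b")


def extract_plates(ocr_text):
    """Return plate-like strings found in raw OCR text, spaces removed."""
    return [m.group(1) + m.group(2) for m in PLATE_RE.finditer(ocr_text)]


if __name__ == "__main__":
    # Example OCR output with surrounding noise.
    print(extract_plates("noise 1234 УБА ??"))
```

Anything that does not fit the pattern is dropped, so the regex acts as a filter for OCR noise rather than a correction step. Adjust the pattern to the actual plate layout you need to read.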

I hope it's useful.

suyie001 · 2024-03-01T12:58:26Z

Is there any progress in the work on traditional Mongolian?

stweil · 2024-03-01T13:01:57Z

I don't know of anyone who works on it.
