Modern Greek data issues #160

chopinesque · 2022-06-03T09:34:54Z

There are 2 major issues with the Greek data.

The models tend to produce µ (U+00B5, micro sign) instead of μ (U+03BC, Greek small letter mu). Also, despite choosing Modern Greek (ell), some output characters carry accents that belong to polytonic Greek.
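A minimal sketch of how both problems could be spotted in a piece of OCR output or training text. It assumes the text is in NFC form, so that polytonic letters appear as precomposed code points in the Greek Extended block (U+1F00 to U+1FFF); the function names are hypothetical and not part of Tesseract.

```python
import unicodedata
from collections import Counter

MICRO_SIGN = "\u00b5"  # µ, easily confused with U+03BC (μ)


def is_polytonic(ch: str) -> bool:
    """True for code points in the Greek Extended block (U+1F00..U+1FFF)."""
    return 0x1F00 <= ord(ch) <= 0x1FFF


def find_suspect_characters(text: str) -> Counter:
    """Count micro signs and Greek Extended characters in NFC text."""
    return Counter(ch for ch in text if ch == MICRO_SIGN or is_polytonic(ch))


if __name__ == "__main__":
    # Hypothetical sample: a micro sign, a correct mu, and one polytonic iota.
    sample = "5 \u00b5m, \u03bc\u03ad\u03c4\u03c1\u03bf, \u1f76"
    for ch, count in find_suspect_characters(sample).items():
        print(f"U+{ord(ch):04X} {unicodedata.name(ch, '?')} x{count}")
```

Running the same function over the training text and word list in a langdata_lstm checkout would show how widespread the two issues are there.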

stweil · 2022-06-03T11:11:18Z

https://github.com/tesseract-ocr/langdata_lstm/tree/main/ell contains training text and a word list with the same issues, so the model was trained to produce such results.

chopinesque · 2022-06-03T11:22:16Z

Right, so how can this be fixed? For example, both https://github.com/tesseract-ocr/langdata_lstm/blob/main/ell/desired_characters and https://github.com/tesseract-ocr/langdata_lstm/blob/main/ell/ell.unicharset contain polytonic characters that should not be there.

stweil · 2022-06-03T11:27:18Z

As a first step, you could send a pull request for langdata_lstm that fixes the files there. Ultimately, though, new training is required, maybe based on the existing models for Greek.

chopinesque · 2022-06-03T11:56:59Z

OK, I may need some guidance, please. I created a fork. So do I simply have to remove the invalid characters from the files mentioned above?

I also see

tessedit_load_sublangs grc
https://github.com/chopinesque/langdata_lstm_modern_greek/blob/main/ell/ell.config#L2

I am not sure whether this line should be there going forward.

stweil · 2022-06-03T12:42:22Z

> So do I simply have to remove non-valid characters from above mentioned files?

Remove or replace, whichever fits better.
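For the "replace" route, a rough sketch of one way to map polytonic characters to monotonic ones in training text. It is an assumption, not an established langdata_lstm procedure: breathings and the iota subscript are dropped, grave and perispomeni become an acute (tonos), and the micro sign becomes a Greek mu. It does not apply the monotonic spelling rule that leaves most monosyllables unaccented, so the output would still need review. `to_monotonic` is a hypothetical name.

```python
import unicodedata

ACUTE = "\u0301"                        # combining acute, used as tonos
DROP = {"\u0313", "\u0314", "\u0345"}   # psili, dasia, ypogegrammeni
TO_ACUTE = {"\u0300", "\u0342"}         # varia (grave), perispomeni


def _is_greek(ch: str) -> bool:
    """True for the Greek and Coptic block and the Greek Extended block."""
    cp = ord(ch)
    return 0x0370 <= cp <= 0x03FF or 0x1F00 <= cp <= 0x1FFF


def _convert_char(ch: str) -> str:
    """Convert one precomposed Greek character; leave everything else alone."""
    if not _is_greek(ch):
        return ch
    decomposed = unicodedata.normalize("NFD", ch)
    kept = [ACUTE if c in TO_ACUTE else c for c in decomposed if c not in DROP]
    return unicodedata.normalize("NFC", "".join(kept))


def to_monotonic(text: str) -> str:
    """Replace the micro sign with mu and strip polytonic marks (NFC input assumed)."""
    text = text.replace("\u00b5", "\u03bc")
    return "".join(_convert_char(ch) for ch in text)


if __name__ == "__main__":
    # U+1F76 (iota with varia) becomes U+03AF (iota with tonos).
    print(to_monotonic("\u1f76") == "\u03af")
```

Non-Greek characters are passed through unchanged, so a Latin à in mixed text keeps its grave accent.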

chopinesque · 2022-06-03T13:08:47Z

Thanks. If I replace, I need to know about the structure, for example,

ὶ 3 0,255,0,255,0,0,0,0,0,0 Greek 124 0 124 ὶ # ὶ [1f76 ]a

How is the 124 0 124 derived?

stweil · 2022-06-03T13:11:33Z

You can keep the unicharset file unmodified. A replacement will be created when a new training is run.
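For orientation only, a small sketch that splits the entry quoted above into named fields. The field order used here (character, properties, glyph metrics, script, other_case, direction, mirror, normalized form) is taken from my reading of the unicharset(5) man page and should be treated as an assumption. Under that reading, `124 0 124` would be the other_case index, the direction, and the mirror index, with both indices pointing at entries of the same unicharset. `parse_unichar_line` is a hypothetical helper.

```python
from typing import NamedTuple, Tuple


class UnicharEntry(NamedTuple):
    char: str
    properties: str              # property bit mask, kept as text
    glyph_metrics: Tuple[str, ...]
    script: str
    other_case: int              # index of the other-case entry
    direction: int
    mirror: int                  # index of the mirrored entry
    normed: str


def parse_unichar_line(line: str) -> UnicharEntry:
    """Split one unicharset entry; anything after the 8th field is ignored."""
    fields = line.split(None, 8)
    return UnicharEntry(
        char=fields[0],
        properties=fields[1],
        glyph_metrics=tuple(fields[2].split(",")),
        script=fields[3],
        other_case=int(fields[4]),
        direction=int(fields[5]),
        mirror=int(fields[6]),
        normed=fields[7],
    )


if __name__ == "__main__":
    entry = parse_unichar_line(
        "ὶ 3 0,255,0,255,0,0,0,0,0,0 Greek 124 0 124 ὶ # ὶ [1f76 ]a"
    )
    print(entry.other_case, entry.direction, entry.mirror)
```

Because these values are regenerated during training, the sketch is meant for inspecting an existing file, not for editing one by hand.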

stweil · 2022-06-03T13:13:30Z

> tessedit_load_sublangs grc

That line tells Tesseract to always use grc in addition to ell. Therefore wrong glyphs can also come from grc as long as that configuration is there.

chopinesque · 2022-06-03T13:17:39Z

> You can keep the unicharset file unmodified. A replacement will be created when a new training is run.

Not sure then which files I should change. I don't think I have the knowledge to do any training (I also use Windows).

chopinesque · 2022-06-03T13:17:56Z

> tessedit_load_sublangs grc
>
> That line tells Tesseract to always use grc in addition to ell. Therefore wrong glyphs can also come from grc as long as that configuration is there.

So this line should be removed.
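A short sketch of how that line could be stripped from a config file's text. The sample config content is made up for illustration, and `drop_sublang_lines` is a hypothetical helper; it drops the whole `tessedit_load_sublangs` line whenever the given language appears on it.

```python
def drop_sublang_lines(config_text: str, sublang: str = "grc") -> str:
    """Return config_text without tessedit_load_sublangs lines naming sublang."""
    kept = []
    for line in config_text.splitlines():
        parts = line.split()
        if (
            len(parts) >= 2
            and parts[0] == "tessedit_load_sublangs"
            and sublang in parts[1:]
        ):
            continue  # skip the line that pulls in the extra language
        kept.append(line)
    result = "\n".join(kept)
    # Preserve a trailing newline if the input had one.
    if config_text.endswith("\n") and kept:
        result += "\n"
    return result


if __name__ == "__main__":
    # Hypothetical config content, not the real ell.config.
    sample = "# example config\ntessedit_load_sublangs grc\nsome_other_param 1\n"
    print(drop_sublang_lines(sample), end="")
```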

stweil added the help wanted label May 17, 2023
