Synthetical comparison with Abbyy #108

Open
jbarth-ubhd opened this issue Aug 3, 2018 · 18 comments

Comments

@jbarth-ubhd

jbarth-ubhd commented Aug 3, 2018

Dear Reader,

I did a comparison using random text.

  • Random text tests the raw engine performance rather than the dictionaries.
  • This matters because foreign names, possibly transliterated, sometimes look like "sviyazhsk", "kozhva", "jizzax", ...
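A test text of this kind could be generated along the following lines. This is only a sketch: the word length of 5, the share of capitalized words, the words per line and the function names are assumptions for illustration, not the script that produced the original text.

```python
import random
import string


def random_words(count, length=5, capital_ratio=0.1, seed=0):
    """Return `count` random lowercase words, some with a capital first letter.

    `length` and `capital_ratio` are illustrative assumptions.
    """
    rng = random.Random(seed)  # fixed seed so the ground truth is reproducible
    words = []
    for _ in range(count):
        word = "".join(rng.choice(string.ascii_lowercase) for _ in range(length))
        if rng.random() < capital_ratio:
            word = word.capitalize()
        words.append(word)
    return words


def to_lines(words, per_line=10):
    """Join words into text lines with `per_line` words each."""
    rows = [" ".join(words[i:i + per_line]) for i in range(0, len(words), per_line)]
    return "\n".join(rows)


if __name__ == "__main__":
    print(to_lines(random_words(50)))
```

Keeping the seed fixed means the same ground truth can be regenerated for later runs with other fonts or engines.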

Here is the original random text:
original text

Here is the generated image (font: GaramondNo8):
Image of "Scan"

Result:

| Tesseract version | Filename | Levenshtein distance |
|---|---|---|
| | abbyy11-English.txt | 5 |
| | abbyy11-GermanLuxembourg.txt | 2 |
| | orig.txt | 0 |
| v3.04.01 | tess3-eng.txt | 1273 |
| v3.04.01 | tess3-engWithoutDict.txt | 763 |
| v4.0.0-beta.2-556-g607e | tess4-eng.txt | 222 |
| v4.0.0-beta.2-556-g607e | tess4-engWithoutDict.txt | 215 |
| v4.0.0-beta.2-556-g607e | tess4-scriptLatin.txt | 62 |
| v4.1.0 | tess4-scriptLatin.txt | 62 |
| v4.0.0-beta.2-556-g607e | tess4-scriptLatinWithoutDict.txt | 58 |
| v4.0.0-beta.2-556-g607e | tess4-scriptLatinWithoutDict.txt, ą replaced by q manually | 45 |
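For readers who want to reproduce such distances, here is a minimal sketch of a character-level Levenshtein distance in Python. It is a plain dynamic-programming version written for illustration, not necessarily the tool used for the table, and the function name is an assumption.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn `a` into `b`."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute or keep
            ))
        previous = current
    return previous[-1]


if __name__ == "__main__":
    # One inserted space separates "itsan" from "its an".
    print(levenshtein("itsan", "its an"))
```

The runtime grows with the product of the two text lengths, which is fine for a single page but slow in pure Python for long documents.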

The ABBYY language "GermanLuxembourg" has no "full dictionary". I don't know exactly what this means, but the results are better than with "English", because with English "itsan" would be recognized as "its an".

engWithoutDict was created using:

combine_tessdata -u ...
rm *-dawg 
combine_tessdata ...

Kind regards,
Jochen

@Shreeshrii
Contributor

Shreeshrii commented Aug 3, 2018 via email

@jbarth-ubhd
Author

jbarth-ubhd commented Aug 3, 2018

I've updated the table above. Thanks for the hint with script/Latin.traineddata.

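It seems that Tesseract 3 has relatively less "dictionary" left in its non-dict traineddata than Tesseract 4 (LSTM): removing the dawg files cut the Tesseract 3 distance from 1273 to 763, but the Tesseract 4 distance only from 222 to 215.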

Much better.

@jbarth-ubhd
Author

Here is the wdiff -3 output for tess4-scriptLatinWithoutDict-aq.txt (ą replaced by q):

characters replaced count
v y 1
i j 3
e c 4
í f 2
I l 1
i l 1
c r 1
o p 1

(statistics not including [-chwin-] {+thwtn+} ... )

 [-vrahi-] {+yrahj+}
 [-kemiw-] {+kcmiw+}
 [-UMWDYV tshhqq-] {+UMWDV tshhq+}
 [-tfcemr-] {+tfcmr+}
 [-byovsj-] {+byovs+}
 {+oanpv+}
 [-bfjw-] {+bfjwj+}
 [-druar|-] {+druar+}
[-víddh-] {+vfddh+}
 [-izabt-] {+jzabt+}
 [-Inblf-] {+lnblf+}
[-dyzłj j j-] {+dyzfj+}
 [-Wírbk cbked-] {+Wfrbk rbkcd+}
 [-ordkp-] {+prdkp+}
 [-hcecpn-] {+hccpn+}
 [-urvle Wihsx-] {+urvlc Wlhsx+}
 [-hmtfkj-] {+hmfkj+}
 [-czhi-] {+pczhj+}
 [-chwin-] {+thwtn+}
 [-hcans-] {+hcqns+}
 [-rzhje-] {+rzhjc+}
 [-Irngj-] {+lrngj+}
 [-o0xws-] {+ooxws+}
 [-ubemsc-] {+ubemc+}
[-Ibknv-] {+lbknv+}
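A count like the one in the table above could be derived from the wdiff output with a small script. The following is a sketch under assumptions: it only looks at `[-...-] {+...+}` chunks with the same number of words on both sides, it only counts substitutions in word pairs of equal length, and the function names are made up for illustration.

```python
import re
from collections import Counter

# One changed chunk in `wdiff -3` output: [-removed words-] {+added words+}
CHUNK = re.compile(r"\[-(.*?)-\]\s*\{\+(.*?)\+\}")


def word_pairs(diff_lines):
    """Pair up removed and added words, chunk by chunk."""
    pairs = []
    for line in diff_lines:
        for removed, added in CHUNK.findall(line):
            old_words, new_words = removed.split(), added.split()
            if len(old_words) == len(new_words):  # skip merged or split words
                pairs.extend(zip(old_words, new_words))
    return pairs


def substitution_counts(pairs):
    """Count character substitutions in word pairs of equal length."""
    counts = Counter()
    for removed, added in pairs:
        if len(removed) != len(added):
            continue  # insertions and deletions are not counted here
        for old_char, new_char in zip(removed, added):
            if old_char != new_char:
                counts[(old_char, new_char)] += 1
    return counts


if __name__ == "__main__":
    sample = [" [-vrahi-] {+yrahj+}", " [-Wírbk cbked-] {+Wfrbk rbkcd+}"]
    for (old_char, new_char), n in substitution_counts(word_pairs(sample)).most_common():
        print(old_char, new_char, n)
```

Pairs with differing lengths, such as `[-druar|-] {+druar+}`, are skipped by this sketch, which matches the note that some differences are left out of the statistics.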

@Shreeshrii
Contributor

If there is an actual use case for this, I would suggest fine-tuning the Latin traineddata with similarly generated random training text, using the Garamond font being tested, in the same way as fine-tuning for Impact, for 300-400 iterations only.

@amitdo

amitdo commented Aug 3, 2018

Thanks for sharing!

Testing with random characters can make the LSTM-based recognizer less accurate than it is on real-world text samples, because during training the network learns not just letter shapes but also builds a language model.

@stweil
Contributor

stweil commented Aug 13, 2018

Results from ABBYY (GermanLuxembourg):

$ LANG=C dwdiff -3 -s test.gt.txt test.abbyy-GermanLuxembourg.txt
======================================================================
 [-Ftvmn-] {+Ltvmn+}
======================================================================
old: 1018 words  1017 99% common  0 0% deleted  1 0% changed
new: 1018 words  1017 99% common  0 0% inserted  1 0% changed

Execution time was 6.6 s.
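Word statistics similar to the `-s` summary above can be approximated in Python. This is only a sketch: it uses `difflib.SequenceMatcher` from the standard library, which is not the algorithm dwdiff uses, so counts may differ on heavily changed texts. Truncating the percentages is an assumption based on the figures in this thread (1017 of 1018 words is shown as 99%), and the function name is hypothetical.

```python
from difflib import SequenceMatcher


def word_stats(old_text, new_text):
    """Compare two texts word by word and report common and changed words."""
    old, new = old_text.split(), new_text.split()
    matcher = SequenceMatcher(a=old, b=new, autojunk=False)
    # Sum the sizes of all runs of identical words found in both texts.
    common = sum(block.size for block in matcher.get_matching_blocks())
    return {
        "old_words": len(old),
        "new_words": len(new),
        "common": common,
        # Integer division truncates, e.g. 1017 of 1018 words gives 99.
        "common_pct_old": common * 100 // len(old) if old else 0,
        "common_pct_new": common * 100 // len(new) if new else 0,
        "not_common_old": len(old) - common,
        "not_common_new": len(new) - common,
    }


if __name__ == "__main__":
    print(word_stats("Ftvmn abcde fghij", "Ltvmn abcde fghij"))
```

The sketch does not distinguish changed words from deleted or inserted ones, as dwdiff does; it reports only the words that are not common to both texts.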

@stweil
Contributor

stweil commented Aug 13, 2018

Result from Tesseract 4.0.0-beta.4 (tessdata/eng, --oem 0):

$ LANG=C dwdiff -3 -s test.gt.txt test.tess0.txt 
======================================================================
 [-ossac-] {+033210+}
======================================================================
 [-Edqgd-] {+Edqu+}
======================================================================
 [-olpso gxgko-] {+01pso ngko+}
======================================================================
 [-Yfcpd fndtv-] {+chpd fndtV+}
[...]
 [-xczxj wbjif axggb ilboa Nhxmg qkgvt-] {+XCZXj ijif 21ngb 11b021 Nthg quVt+}
======================================================================
 [-Esxgd dgrjx jyelz-] {+ESng dgrjX jye1z+}
======================================================================
old: 1018 words  379 37% common  0 0% deleted  639 62% changed
new: 1045 words  379 36% common  0 0% inserted  666 63% changed

Execution time was 43.8 s.

With --psm 6, the result becomes much better:

old: 1018 words  772 75% common  0 0% deleted  246 24% changed
new: 1032 words  772 74% common  0 0% inserted  260 25% changed

Using the lat traineddata (which was trained with EB Garamond) further improves the recognition:

old: 1018 words  832 81% common  0 0% deleted  186 18% changed
new: 1034 words  832 80% common  0 0% inserted  202 19% changed

@stweil
Contributor

stweil commented Aug 13, 2018

Result from Tesseract 4.0.0-beta.4 (tessdata/eng, --oem 1):

$ LANG=C dwdiff -3 -s test.gt.txt test.tess1.txt 
======================================================================
 [-fguof-] {+fguoft+}
======================================================================
 [-byysf Yfcpd fndtv-] {+byyst Yicpd tndtv+}
======================================================================
 [-pkqff-] {+pkqtf+}
[...]
 [-xcwnc-] {+xcwne+}
======================================================================
 [-Vfwfi-] {+Viwti+}
======================================================================
 [-krfgr-] {+krigr+}
======================================================================
old: 1018 words  838 82% common  0 0% deleted  180 17% changed
new: 1021 words  838 82% common  0 0% inserted  183 17% changed

Execution time was 85.7 s.

@stweil
Contributor

stweil commented Aug 13, 2018

Result from Tesseract 4.0.0-beta.4 (tessdata/script/Latin):

$ LANG=C dwdiff -3 -s test.gt.txt test.tess-latin.txt 
[...]
old: 1018 words  975 95% common  0 0% deleted  43 4% changed
new: 1022 words  975 95% common  2 0% inserted  45 4% changed

Execution time was 121 s.

Surprisingly, the result improves with --psm 6, so layout and line detection seem to matter even for a very simple image like this test image:

old: 1018 words  988 97% common  0 0% deleted  30 2% changed
new: 1018 words  988 97% common  0 0% inserted  30 2% changed

All test runs with the default page segmentation mode report diacritics (although there are none), which might be related to the bad recognition rates:

Detected 143 diacritics

@Shreeshrii
Contributor

@stweil What about tessdata_fast and tessdata_best?

@amitdo

amitdo commented Aug 13, 2018

Apart from Shree's question:
does ABBYY use more than one thread by default?

@stweil
Contributor

stweil commented Aug 14, 2018

As tessdata uses fast data derived from tessdata_best, I don't expect very different results. Nevertheless, I can run a test later.

ABBYY used a single thread. The Tesseract timings were also single-threaded results. But my first focus is not execution time: the quality of the OCR results is much more important for our application on old books and journals. We have other reports that Tesseract beats ABBYY when the text is already split into single lines. That would imply that Tesseract is worse than ABBYY at layout recognition (line separation), and maybe also at binarization.

@stweil
Contributor

stweil commented Aug 14, 2018

Result from Tesseract 4.0.0-beta.4 (tessdata_best/script/Latin, --psm 6):

old: 1018 words  996 97% common  0 0% deleted  22 2% changed
new: 1018 words  996 97% common  0 0% inserted  22 2% changed

Execution time was 318 s.

Result from Tesseract 4.0.0-beta.4 (tessdata_fast/script/Latin, --psm 6):

old: 1018 words  976 95% common  0 0% deleted  42 4% changed
new: 1018 words  976 95% common  0 0% inserted  42 4% changed

Execution time was 71 s.

@Shreeshrii
Contributor

Shreeshrii commented Aug 14, 2018

| Traineddata | Words in common | Execution time |
|---|---|---|
| tessdata_best | 97% | 318 s |
| tessdata (uses fast data derived from tessdata_best) | 97% | 121 s |
| tessdata_fast | 95% | 71 s |

@Shreeshrii
Contributor

Shreeshrii commented Aug 14, 2018 via email

@stweil
Contributor

stweil commented Aug 14, 2018

> Would --psm 13 give better results?

Maybe – what would you suggest for the line separation?

@Shreeshrii
Contributor

Shreeshrii commented Aug 15, 2018 via email

@Shreeshrii
Contributor

Shreeshrii commented Aug 15, 2018 via email
