Synthetical comparison with Abbyy #108

jbarth-ubhd · 2018-08-03T09:28:35Z

Dear Reader,

I did a comparison using random text.

I chose random text in order to test the raw engine performance rather than the dictionaries, because foreign (possibly transliterated) names sometimes look like "sviyazhsk", "kozhva", "jizzax", ...
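
A minimal sketch of how random test text of this kind could be generated. The word length, the alphabet, the capitalization rate, and the function names are illustrative assumptions, not the script that actually produced orig.txt.

```python
import random
import string

def random_words(count, length=5, capital_rate=0.1, seed=None):
    """Return `count` pseudo-random words of `length` ASCII letters."""
    rng = random.Random(seed)  # a fixed seed makes the test text reproducible
    words = []
    for _ in range(count):
        word = "".join(rng.choice(string.ascii_lowercase) for _ in range(length))
        if rng.random() < capital_rate:
            word = word.capitalize()  # occasionally start with an upper-case letter
        words.append(word)
    return words

def to_lines(words, per_line=10):
    """Join words into text lines with `per_line` words each."""
    return "\n".join(
        " ".join(words[i:i + per_line]) for i in range(0, len(words), per_line)
    )
```

Because such words follow no language statistics, neither a dictionary nor a learned language model can help the recognizer.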

Here is the original random text:
original text <https://digi.ub.uni-heidelberg.de/diglitData/v/orig.txt>

Here is the generated image (font: GaramondNo8):

Result:

Filename	Levenshtein distance
abbyy11-English.txt	5
abbyy11-GermanLuxembourg.txt	2
orig.txt	0
v3.04.01 tess3-eng.txt	1273
v3.04.01 tess3-engWithoutDict.txt	763
v4.0.0-beta.2-556-g607e tess4-eng.txt	222
v4.0.0-beta.2-556-g607e tess4-engWithoutDict.txt	215
v4.0.0-beta.2-556-g607e tess4-scriptLatin.txt	62
v4.1.0 ______________ tess4-scriptLatin.txt	62
v4.0.0-beta.2-556-g607e tess4-scriptLatinWithoutDict.txt	58
v4.0.0-beta.2-556-g607e tess4-scriptLatinWithoutDict.txt, ą replaced by q manually	45
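
For readers who want to reproduce such numbers: the thread does not say which tool computed the Levenshtein distance, so the following is only a textbook dynamic-programming sketch, assuming the distance is taken character by character over the whole file content.

```python
def levenshtein(a, b):
    """Edit distance between strings a and b (insert, delete, substitute: cost 1)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop to save memory
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute, or free match
            ))
        previous = current
    return previous[-1]
```

For example, levenshtein("itsan", "its an") is 1, the single inserted space.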

The Abbyy language "GermanLuxembourg" has no "full dictionary". I don't know exactly what this means, but the results are better than with "English", because with English "itsan" would be recognized as "its an".

engWithoutDict was created using:

combine_tessdata -u ...
rm *-dawg 
combine_tessdata ...

Kind regards,
Jochen

Shreeshrii · 2018-08-03T09:36:10Z

Please test using script/Latin which supports all languages written in Latin script. That may give better results than eng. I am assuming that you used tesseract 4.0.0-beta. It is possible that legacy tesseract gives better results than LSTM based.


jbarth-ubhd · 2018-08-03T10:34:37Z

I've updated the table above. Thanks for the hint with script/Latin.traineddata.

It seems that Tesseract 3 has relatively less "dictionary" in the non-dict traineddata than Tesseract 4 (LSTM).

Much better.

jbarth-ubhd · 2018-08-03T11:34:56Z

Here is the wdiff -3 output for tess4-scriptLatinWithoutDict-aq.txt (ą replaced by q):

characters replaced	count
v y	1
i j	3
e c	4
í f	2
I l	1
i l	1
c r	1
o p	1

(statistics not including [-chwin-] {+thwtn+} ... )

 [-vrahi-] {+yrahj+}
 [-kemiw-] {+kcmiw+}
 [-UMWDYV tshhqq-] {+UMWDV tshhq+}
 [-tfcemr-] {+tfcmr+}
 [-byovsj-] {+byovs+}
 {+oanpv+}
 [-bfjw-] {+bfjwj+}
 [-druar|-] {+druar+}
[-víddh-] {+vfddh+}
 [-izabt-] {+jzabt+}
 [-Inblf-] {+lnblf+}
[-dyzłj j j-] {+dyzfj+}
 [-Wírbk cbked-] {+Wfrbk rbkcd+}
 [-ordkp-] {+prdkp+}
 [-hcecpn-] {+hccpn+}
 [-urvle Wihsx-] {+urvlc Wlhsx+}
 [-hmtfkj-] {+hmfkj+}
 [-czhi-] {+pczhj+}
 [-chwin-] {+thwtn+}
 [-hcans-] {+hcqns+}
 [-rzhje-] {+rzhjc+}
 [-Irngj-] {+lrngj+}
 [-o0xws-] {+ooxws+}
 [-ubemsc-] {+ubemc+}
[-Ibknv-] {+lbknv+}
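
A tally like the one above can be derived from the wdiff output automatically. The sketch below is an assumption about how one might do it, not the method used for the table: it only looks at changes where both sides have the same number of words and compares words of equal length position by position, so its counts can differ from the hand-made statistics (which leave out some pairs).

```python
import re
from collections import Counter

# Matches one wdiff change of the form: [-old words-] {+new words+}
CHANGE = re.compile(r"\[-(.*?)-\]\s*\{\+(.*?)\+\}")

def substitution_counts(wdiff_text):
    """Count (deleted_char, inserted_char) pairs in same-length word changes."""
    counts = Counter()
    for old, new in CHANGE.findall(wdiff_text):
        old_words, new_words = old.split(), new.split()
        if len(old_words) != len(new_words):
            continue  # word counts differ, so there is no reliable alignment
        for old_word, new_word in zip(old_words, new_words):
            if len(old_word) != len(new_word):
                continue  # insertion or deletion inside the word: skip it
            for a, b in zip(old_word, new_word):
                if a != b:
                    counts[(a, b)] += 1
    return counts
```

Calling substitution_counts(text).most_common() lists the most frequent confusions first.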

Shreeshrii · 2018-08-03T11:41:33Z

If there is an actual use case for this, I would suggest fine-tuning the Latin traineddata with similarly generated random training text, using the Garamond font being tested, in the same way as the "fine-tune for Impact" example: 300-400 iterations only.

amitdo · 2018-08-03T14:02:43Z

Thanks for sharing!

Testing with random characters can make the LSTM-based recognizer less accurate than it is on real-world text samples, because during training the network learns not just letter shapes but also builds a language model.

stweil · 2018-08-13T13:19:17Z

Results from ABBYY (GermanLuxembourg):

$ LANG=C dwdiff -3 -s test.gt.txt test.abbyy-GermanLuxembourg.txt
======================================================================
 [-Ftvmn-] {+Ltvmn+}
======================================================================
old: 1018 words  1017 99% common  0 0% deleted  1 0% changed
new: 1018 words  1017 99% common  0 0% inserted  1 0% changed

Execution time was 6.6 s.
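
For readers without dwdiff, a rough equivalent of the "common words" statistic can be sketched with Python's standard difflib. This is an approximation under the assumption that words are separated by whitespace; dwdiff's own alignment may differ in detail. The percentages printed in this thread appear to be truncated rather than rounded (1017 of 1018 is shown as 99%), which the helper below mimics.

```python
from difflib import SequenceMatcher

def word_stats(ground_truth, ocr_output):
    """Return (number of ground-truth words, number of words in common)."""
    old_words = ground_truth.split()
    new_words = ocr_output.split()
    # autojunk=False disables the heuristic that ignores very frequent items
    matcher = SequenceMatcher(None, old_words, new_words, autojunk=False)
    common = sum(block.size for block in matcher.get_matching_blocks())
    return len(old_words), common

def percent(part, whole):
    """Integer percentage, truncated, e.g. 1017 of 1018 gives 99."""
    return 100 * part // whole
```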

stweil · 2018-08-13T13:28:56Z

Result from Tesseract 4.0.0-beta.4 (tessdata/eng, --oem 0):

$ LANG=C dwdiff -3 -s test.gt.txt test.tess0.txt 
======================================================================
 [-ossac-] {+033210+}
======================================================================
 [-Edqgd-] {+Edqu+}
======================================================================
 [-olpso gxgko-] {+01pso ngko+}
======================================================================
 [-Yfcpd fndtv-] {+chpd fndtV+}
[...]
 [-xczxj wbjif axggb ilboa Nhxmg qkgvt-] {+XCZXj ijif 21ngb 11b021 Nthg quVt+}
======================================================================
 [-Esxgd dgrjx jyelz-] {+ESng dgrjX jye1z+}
======================================================================
old: 1018 words  379 37% common  0 0% deleted  639 62% changed
new: 1045 words  379 36% common  0 0% inserted  666 63% changed

Execution time was 43.8 s.

With --psm 6, the result becomes much better:

old: 1018 words  772 75% common  0 0% deleted  246 24% changed
new: 1032 words  772 74% common  0 0% inserted  260 25% changed

Using the lat traineddata (which was trained with EB Garamond) further improves the recognition:

old: 1018 words  832 81% common  0 0% deleted  186 18% changed
new: 1034 words  832 80% common  0 0% inserted  202 19% changed

stweil · 2018-08-13T13:34:10Z

Result from Tesseract 4.0.0-beta.4 (tessdata/eng, --oem 1):

$ LANG=C dwdiff -3 -s test.gt.txt test.tess1.txt 
======================================================================
 [-fguof-] {+fguoft+}
======================================================================
 [-byysf Yfcpd fndtv-] {+byyst Yicpd tndtv+}
======================================================================
 [-pkqff-] {+pkqtf+}
[...]
 [-xcwnc-] {+xcwne+}
======================================================================
 [-Vfwfi-] {+Viwti+}
======================================================================
 [-krfgr-] {+krigr+}
======================================================================
old: 1018 words  838 82% common  0 0% deleted  180 17% changed
new: 1021 words  838 82% common  0 0% inserted  183 17% changed

Execution time was 85.7 s.

stweil · 2018-08-13T13:43:19Z

Result from Tesseract 4.0.0-beta.4 (tessdata/script/Latin):

$ LANG=C dwdiff -3 -s test.gt.txt test.tess-latin.txt 
[...]
old: 1018 words  975 95% common  0 0% deleted  43 4% changed
new: 1022 words  975 95% common  2 0% inserted  45 4% changed

Execution time was 121 s.

Surprisingly the result becomes better with --psm 6, so the layout / line detection seems to have effects even for a very simple image like the present test image:

old: 1018 words  988 97% common  0 0% deleted  30 2% changed
new: 1018 words  988 97% common  0 0% inserted  30 2% changed

All test runs with the default page segmentation mode report diacritics (although there are none) which might be related to bad recognition rates:

Detected 143 diacritics
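
The runs above differ only in traineddata, engine mode (--oem), and page segmentation mode (--psm). A small wrapper makes such a series easier to repeat. This is a sketch, assuming the tesseract executable is on the PATH; the function names and the image name test.tif in the usage note are hypothetical.

```python
import subprocess

def tesseract_cmd(image, lang="eng", psm=None, oem=None, tessdata_dir=None):
    """Build a tesseract command line that writes recognized text to stdout."""
    cmd = ["tesseract", image, "stdout", "-l", lang]
    if tessdata_dir is not None:
        cmd += ["--tessdata-dir", tessdata_dir]
    if oem is not None:
        cmd += ["--oem", str(oem)]
    if psm is not None:
        cmd += ["--psm", str(psm)]
    return cmd

def run_ocr(image, **options):
    """Run tesseract (must be installed) and return the recognized text."""
    result = subprocess.run(tesseract_cmd(image, **options),
                            capture_output=True, text=True, check=True)
    return result.stdout
```

For example, run_ocr("test.tif", lang="script/Latin", psm=6) would correspond to the script/Latin run with --psm 6.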

Shreeshrii · 2018-08-13T15:01:19Z

@stweil What about tessdata_fast and tessdata_best?

amitdo · 2018-08-13T15:34:53Z

Apart from Shree's question: does Abbyy use more than one thread by default?

stweil · 2018-08-14T06:26:01Z

As tessdata uses fast data derived from tessdata_best, I don't expect much different results. Nevertheless I can run a test later.

ABBYY used a single thread. The Tesseract timings were also single-threaded results. But my first focus is not execution time: quality of the OCR results is much more important for our application on old books and journals. We have other reports that Tesseract beats ABBYY when the text is already split into single lines. That would imply that Tesseract is less good than ABBYY at layout recognition (line separation), and maybe also at binarization.

stweil · 2018-08-14T06:42:02Z

Result from Tesseract 4.0.0-beta.4 (tessdata_best/script/Latin, --psm 6):

old: 1018 words  996 97% common  0 0% deleted  22 2% changed
new: 1018 words  996 97% common  0 0% inserted  22 2% changed

Execution time was 318 s.

Result from Tesseract 4.0.0-beta.4 (tessdata_fast/script/Latin, --psm 6):

old: 1018 words  976 95% common  0 0% deleted  42 4% changed
new: 1018 words  976 95% common  0 0% inserted  42 4% changed

Execution time was 71 s.

Shreeshrii · 2018-08-14T08:21:35Z


traineddata	common words	execution time	note
tessdata_best	97%	318 s
tessdata	97%	121 s	tessdata uses `fast` data derived from tessdata_best
tessdata_fast	95%	71 s

Shreeshrii · 2018-08-14T15:03:33Z

https://github.com/tesseract-ocr/tesseract/wiki/NeuralNetsInTesseract4.00#integration-with-tesseract

The Tesseract 4.00 neural network subsystem is integrated into Tesseract as a line recognizer. It can be used with the existing layout analysis to recognize text within a large document, or it can be used in conjunction with an external text detector to recognize text from an image of a single textline.

The neural network engine is the default for 4.00. To recognize text from an image of a single text line, use SetPageSegMode(PSM_RAW_LINE). This can be used from the command-line with -psm 13

@stweil Would --psm 13 give better results?

stweil · 2018-08-14T20:28:30Z

> Would --psm 13 give better results?

Maybe – what would you suggest for the line separation?

Shreeshrii · 2018-08-15T04:11:37Z

Oh, thanks for pointing this out, that would need to be done externally.
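
As an illustration of what external line separation could look like for a clean synthetic page, here is a minimal horizontal projection sketch. It is a deliberately simple stand-in, not what Tesseract or Leptonica do internally, and it assumes a binarized, deskewed image given as rows of 0 (background) and 1 (ink), with text lines separated by at least one blank pixel row.

```python
def split_lines(rows):
    """Return (top, bottom) row ranges of text lines in a binary image.

    `rows` is a list of pixel rows; each pixel is 1 for ink, 0 for background.
    `bottom` is exclusive, so rows[top:bottom] is the cropped line.
    """
    lines = []
    start = None
    for y, row in enumerate(rows):
        has_ink = any(row)
        if has_ink and start is None:
            start = y                     # first inked row of a new line
        elif not has_ink and start is not None:
            lines.append((start, y))      # a blank row closes the current line
            start = None
    if start is not None:
        lines.append((start, len(rows)))  # the image ends inside a line
    return lines
```

Each cropped range could then be saved as its own image and recognized with --psm 13. Real scans with skew, noise, or touching lines need a more robust method.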


Shreeshrii · 2018-08-15T11:34:15Z

@stweil In case you are looking at improving line detection/page segmentation in Tesseract, Leptonica has some 'newer' functions which gave good results in tests with Arabic and Devanagari: DanBloomberg/leptonica#236


stweil mentioned this issue Aug 24, 2018

Recognition on x32 tesseract-ocr/tesseract#1838

Closed
