Text Breaking when used for Gurmukhi (Punjabi) script #42

anagha-choudhari19 · 2022-08-21T12:05:58Z

I want to extract text from a PDF written in Gurmukhi script (the Punjabi language),
but the characters are read incorrectly when the text is extracted from the PDF.

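```python
from pyxpdf import Document
from pyxpdf.xpdf import TextControl

pdf_path = '/content/Punjab2_new.pdf'
doc = Document(pdf_path)

# Physical layout mode, with a byte-order mark inserted in the output
text_control = TextControl("physical", insert_bom=True)
for page in range(len(doc)):
    # Extract text only from the given page area
    out_res = doc[page].text((0, 90, 155, 700), text_control)
    print('\n_______________New_page_output_________________________\n')
    print(out_res)
```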

Here are my expected and actual result images.
The expected image is a sample of my input:

With the text function, I am seeing incorrect character recognition:

PDF
download.pdf

It would be a great help if any pyxpdf parameters could solve this issue.
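
To make the problem easier to quantify, here is a minimal sketch that measures how much of an extracted string actually falls in the Unicode Gurmukhi block (U+0A00 to U+0A7F). It uses only the Python standard library; the helper name `gurmukhi_ratio` is my own and is not part of pyxpdf.

```python
import unicodedata

# Unicode Gurmukhi block boundaries
GURMUKHI_START, GURMUKHI_END = 0x0A00, 0x0A7F


def gurmukhi_ratio(text: str) -> float:
    """Return the fraction of letters and combining marks in the Gurmukhi block.

    Hypothetical helper for diagnosis only; not part of pyxpdf.
    """
    # Only letters (L*) and marks (M*) are counted, so digits, spaces and
    # punctuation do not skew the ratio.
    relevant = [ch for ch in text if unicodedata.category(ch)[0] in ("L", "M")]
    if not relevant:
        return 0.0
    in_block = sum(GURMUKHI_START <= ord(ch) <= GURMUKHI_END for ch in relevant)
    return in_block / len(relevant)
```

Calling `gurmukhi_ratio(out_res)` on each page's output should give a value close to 1.0 for a page of Punjabi text. A value near 0.0 may suggest that the characters are being mapped to other code points during extraction, for example because of how the fonts in the PDF are encoded, though I have not confirmed that this is the cause here.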
