OCR_Conversion_JPEG2PDF - Windows

This code is very rough.

JPEG to OCR'd PDF conversion using Tesseract v4 through cmd. Includes OCR'ing the JPEGs and combining the page-wise PDFs into a single PDF.

This is a simple implementation that drives Tesseract from Python, using os.system to run it through the command line. It works well on Windows. However, I couldn't find a way to do PDF-to-PDF conversion from the command line, because that requires reading a PDF as input. Reading a JPEG, on the other hand, is possible with the following libraries present on Windows: libgif 5.1.4, libjpeg 8d (libjpeg-turbo 1.5.3), libpng 1.6.34, libtiff 4.0.9, zlib 1.2.11, libwebp 0.6.1, libopenjp2 2.2.0.
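The command-line call described above can be sketched roughly as follows. This is not the repository's code: it swaps os.system for subprocess.run so that arguments are passed as a list, and the function names, the `lang` parameter, and its `"eng"` default are hypothetical. It assumes the `tesseract` executable is on PATH and relies on Tesseract's `pdf` output config, which writes `<output_base>.pdf`.

```python
import subprocess
from pathlib import Path


def build_tesseract_command(image_path, output_base, lang="eng"):
    """Build the argument list for: tesseract <image> <output_base> -l <lang> pdf"""
    # The trailing "pdf" config tells Tesseract to write <output_base>.pdf
    return ["tesseract", str(image_path), str(output_base), "-l", lang, "pdf"]


def ocr_image_to_pdf(image_path, output_dir, lang="eng"):
    """Run Tesseract on one JPEG and return the path of the searchable PDF."""
    image_path = Path(image_path)
    # Name the output after the image, without its extension
    output_base = Path(output_dir) / image_path.stem
    # check=True raises CalledProcessError if Tesseract exits with an error
    subprocess.run(build_tesseract_command(image_path, output_base, lang), check=True)
    return Path(str(output_base) + ".pdf")
```

Passing the arguments as a list avoids quoting problems with file names that contain spaces, which are easy to hit when a command string is assembled for os.system.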

I'm currently trying to get ocrmypdf working on Windows, where it raises an error in leptonica.py about the DLL. It's not impossible; if anyone finds a way, you can make the changes in a new branch of the repository.

Requirements

Make sure to install the libraries in the same order as listed below.

libjpeg : libpng : libtiff : zlib : libwebp : libopenjp2
leptonica (v1.78) (you can use any version, but you would need to change the location of liblept.so in the code)
Tesseract (any version)
Tesseract Language Data (big tessdata)
ocrmypdf library

Workflow

You need to provide the JPEGs converted from the PDFs to the code
Naming convention for the JPEGs: PDFname_count (if you want to change it, change the regex too)
All the JPEGs must be present in a single folder
An OCR folder will be created in the root folder
PDFs will be created page-wise
Page-wise PDFs will be merged into one parent PDF automatically
Parent PDFs will be placed in the OCR folder
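
As an illustration of the PDFname_count naming convention, here is a minimal sketch of how the JPEGs in a folder could be grouped by parent PDF and put in page order before OCR and merging. The regular expression and the function name are assumptions for illustration, not the regex used in the repository; it treats everything before the last underscore as the PDF name.

```python
import re
from collections import defaultdict

# Assumed pattern for PDFname_count: name, underscore, page number, JPEG extension
PAGE_PATTERN = re.compile(r"^(?P<name>.+)_(?P<count>\d+)\.jpe?g$", re.IGNORECASE)


def group_pages(filenames):
    """Map each PDF name to its JPEG file names, sorted by page number."""
    groups = defaultdict(list)
    for filename in filenames:
        match = PAGE_PATTERN.match(filename)
        if match is None:
            continue  # skip files that do not follow the naming convention
        groups[match.group("name")].append((int(match.group("count")), filename))
    # Sort numerically so that page 10 comes after page 2, not after page 1
    return {name: [f for _, f in sorted(pages)] for name, pages in groups.items()}
```

Sorting on the integer page count matters because a plain string sort would place `report_10.jpg` before `report_2.jpg` and merge the pages in the wrong order.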

Licensing

You can use this repository in any way you need. Kindly make any changes in a different branch.

lakshay1296/OCR_Conversion_JPEG2PDF

Folders and files

Latest commit

History

Repository files navigation

OCR_Conversion_JPEG2PDF - Windows

This is a very rough code.

Requirements

Workflow

Licensing

About

Topics

Resources

Stars

Watchers

Forks

Languages