This repository has been archived by the owner on Dec 3, 2023. It is now read-only.

This is a Python script that converts any PDF to text using Tesseract-OCR (for text-locked PDFs).


Kaushal1011/PyPDFtoText


PyPDFtoText

  • This is a Python script that converts any PDF to text using Tesseract-OCR. I made it to process PDFs in which the text is not selectable.
  • Please do not use it on normal PDFs from which you can simply copy the text, as OCR is a heavy and slow task.
  • It works best on simple PDFs that present their content in a plain book format (results also depend on your Tesseract installation). More updates may come.
  • It uses the Tesseract-OCR binaries and the pytesseract, PyMuPDF, and PIL packages.
  • If you cannot install fitz, try "pip install PyMuPDF".
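
The workflow described above can be sketched roughly as follows. This is a minimal illustration and not the script from this repository: the function names `dpi_to_zoom`, `needs_ocr`, and `pdf_to_text`, the 300 DPI default, and the 20-character threshold are all assumptions made for the example. It assumes a recent PyMuPDF release (where `Page.get_pixmap`, `Page.get_text`, and `Pixmap.tobytes` are available; older releases used different method names) and a working Tesseract installation for pytesseract to call. Pages that already expose selectable text skip OCR, in line with the advice not to OCR normal PDFs.

```python
import io


def dpi_to_zoom(dpi: float) -> float:
    """PDF pages are measured in points (72 per inch), so zoom = dpi / 72."""
    return dpi / 72.0


def needs_ocr(extracted_text: str, min_chars: int = 20) -> bool:
    """Hypothetical heuristic: a page with almost no selectable text
    is treated as a scanned image that needs OCR."""
    return len(extracted_text.strip()) < min_chars


def pdf_to_text(pdf_path: str, dpi: int = 300, lang: str = "eng") -> str:
    """Return the text of every page, running OCR only where needed."""
    # Third-party imports are local so the helpers above work without them.
    import fitz  # installed with "pip install PyMuPDF"
    import pytesseract
    from PIL import Image

    zoom = dpi_to_zoom(dpi)
    page_texts = []
    with fitz.open(pdf_path) as doc:
        for page in doc:
            text = page.get_text()  # fast path: existing text layer
            if needs_ocr(text):
                # Render the page to a bitmap at the requested resolution.
                pix = page.get_pixmap(matrix=fitz.Matrix(zoom, zoom))
                image = Image.open(io.BytesIO(pix.tobytes("png")))
                # Slow path: let Tesseract read the rendered image.
                text = pytesseract.image_to_string(image, lang=lang)
            page_texts.append(text)
    return "\n\n".join(page_texts)
```

Calling `pdf_to_text("scan.pdf")` (the file name is a placeholder) would return one string with pages separated by blank lines. Lowering `dpi` speeds up rendering and OCR, usually at some cost in recognition accuracy.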
