zaakki-ahamed/Arabic_OCR_From_PDF

Arabic PDF OCR - Searchable PDF

This script performs Optical Character Recognition (OCR) on a scanned PDF file containing Arabic text. It uses Tesseract OCR to extract text from each page, generates a searchable PDF, and saves the OCR text as a separate text file. This can help with digitizing Arabic text from PDFs and creating searchable documents.

Requirements

Input / Output

  • Input: the filePath variable, which points to your input PDF file.
  • Output: a new PDF file with searchable text generated from the OCR results, and a text file containing the extracted Arabic text for each page.
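The relationship between the input path and the two output files can be sketched as follows. This is only an illustration: the helper name output_paths and the "_searchable" and "_ocr" suffixes are assumptions, not names taken from the script.

```python
from pathlib import Path


def output_paths(file_path):
    """Return (searchable_pdf, text_file) paths located next to the input PDF.

    The "_searchable" and "_ocr" suffixes are hypothetical, chosen for illustration.
    """
    src = Path(file_path)
    pdf_out = src.with_name(src.stem + "_searchable.pdf")
    txt_out = src.with_name(src.stem + "_ocr.txt")
    return pdf_out, txt_out


# Both outputs land in the same directory as the input file.
pdf_out, txt_out = output_paths("/path/to/your/input.pdf")
print(pdf_out)
print(txt_out)
```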

Usage

  1. Install the required libraries from requirements.txt.
  2. Modify the filePath variable to point to your input PDF file.
  3. If needed, set the path to the Tesseract OCR command in the script by modifying the line pytesseract.pytesseract.tesseract_cmd = r'/usr/bin/tesseract'.
  4. Run the script. The combined PDF and the extracted text will be saved in the same directory.
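For orientation, the per-page flow described above (render each page, OCR it, collect a searchable page and its text) might look roughly like the sketch below. It is not the repository's script: the contents of requirements.txt are not shown here, so the use of pdf2image (which needs Poppler), pytesseract, and pypdf is an assumption, as are the function names, the output file names, and the 300 dpi default.

```python
import io
from pathlib import Path


def format_page_text(page_number, text):
    """Label one page of OCR text so pages stay distinguishable in the text file."""
    return f"--- Page {page_number} ---\n{text.strip()}\n"


def ocr_pdf(file_path, lang="ara", dpi=300):
    """OCR every page of a scanned PDF; write a searchable PDF and a text file.

    Assumes pdf2image (with Poppler), pytesseract, and pypdf are installed.
    """
    # Imported lazily so the helper above works without the OCR stack.
    from pdf2image import convert_from_path
    from pypdf import PdfReader, PdfWriter
    import pytesseract

    src = Path(file_path)
    writer = PdfWriter()
    page_texts = []

    for number, image in enumerate(convert_from_path(str(src), dpi=dpi), start=1):
        # image_to_pdf_or_hocr returns the bytes of a one-page PDF with a text layer.
        pdf_bytes = pytesseract.image_to_pdf_or_hocr(image, extension="pdf", lang=lang)
        writer.add_page(PdfReader(io.BytesIO(pdf_bytes)).pages[0])
        text = pytesseract.image_to_string(image, lang=lang)
        page_texts.append(format_page_text(number, text))

    pdf_out = src.with_name(src.stem + "_searchable.pdf")  # hypothetical name
    txt_out = src.with_name(src.stem + "_ocr.txt")         # hypothetical name
    with open(pdf_out, "wb") as handle:
        writer.write(handle)
    txt_out.write_text("\n".join(page_texts), encoding="utf-8")
    return pdf_out, txt_out
```

Each page is OCR'd twice in this sketch (once for the PDF layer, once for plain text), which keeps the code short at the cost of extra run time.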

Example:

# Set the path to the input PDF file
filePath = '/path/to/your/input.pdf'

# Set the path to the Tesseract OCR command
pytesseract.pytesseract.tesseract_cmd = '/path/to/your/tesseract'

# Run the script (from a shell, not from inside the Python file)
python script.py
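Before running, it can help to confirm that the Tesseract binary is reachable and that Arabic language data ("ara") is installed. The check below is a minimal sketch using the tesseract --list-langs command. The helper names are hypothetical, and the parsing assumes the usual output shape of one header line followed by one language code per line.

```python
import shutil
import subprocess


def parse_languages(list_langs_output):
    """Extract language codes from `tesseract --list-langs` output.

    The first line is assumed to be a header; the rest are one code per line.
    """
    lines = [line.strip() for line in list_langs_output.splitlines()]
    return [line for line in lines[1:] if line]


def arabic_available(tesseract_cmd="tesseract"):
    """Return True if the given Tesseract binary reports the 'ara' language."""
    resolved = shutil.which(tesseract_cmd)
    if resolved is None:
        return False  # binary not found on PATH or at the given path
    result = subprocess.run(
        [resolved, "--list-langs"], capture_output=True, text=True, check=False
    )
    # Some Tesseract versions may print the list to stderr, so check both streams.
    return ("ara" in parse_languages(result.stdout)
            or "ara" in parse_languages(result.stderr))


if __name__ == "__main__":
    print("Arabic data available:", arabic_available("tesseract"))
```

If the check returns False, the Arabic trained data may need to be installed separately for your Tesseract distribution.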
