🚀 Feature: OCR #924

Fagner-lourenco · 2024-04-13T06:39:31Z

🔖 Feature description

I would like to suggest enabling OCR during PDF ingestion. I am evaluating the project for use in the legal field, where many documents are scanned images with no searchable text layer, so OCR is required to extract their text. The idea: if standard extraction returns fewer than X characters for a page, OCR is triggered for that page.

🎤 Why is this feature needed ?

I wrote the code in docs_parser.py below, but I am an amateur and did not consider speed or performance. It would be great if you could review it and implement this functionality in an optimized way that does not hurt performance. The code checks whether standard text extraction returns fewer than X characters for a page. If so, the page is probably a scanned image, and OCR is triggered. Does that make sense?
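The check described above can be isolated into a tiny helper. This is only a minimal sketch: the name `needs_ocr` and the default threshold of 10 are assumptions for illustration, and stripping whitespace before counting is an added choice so that pages containing only newlines are treated as empty.

```python
def needs_ocr(page_text: str, threshold: int = 10) -> bool:
    """Return True when a page has too little extractable text.

    `needs_ocr` and `threshold` are hypothetical names used for illustration.
    """
    # Strip whitespace so a page made only of spaces or newlines counts as empty.
    return len(page_text.strip()) < threshold
```

A page with a proper text layer returns `False` and skips OCR, while an empty or whitespace-only page returns `True`.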

✌️ How do you aim to achieve this?

docs_parser.py

from pathlib import Path
from typing import Dict

import fitz  # PyMuPDF
import pytesseract
from PIL import Image

from application.parser.file.base_parser import BaseParser


class PDFParser(BaseParser):
    """PDF parser with optional OCR support."""

    def __init__(self, use_ocr: bool = False, ocr_threshold: int = 10):
        """
        Initializes the PDF parser.
        :param use_ocr: Flag to enable OCR for pages that don't have enough extractable text.
        :param ocr_threshold: Minimum number of extracted characters a page must have; pages below it are sent to OCR.
        """
        self.use_ocr = use_ocr
        self.ocr_threshold = ocr_threshold

    def _init_parser(self) -> Dict:
        """Init parser."""
        return {}

    def parse_file(self, file: Path, errors: str = "ignore") -> str:
        """Parse file."""
        text_list = []

        # The context manager closes the document once parsing is done.
        with fitz.open(file) as pdf:
            for page in pdf:
                page_text = page.get_text()

                # Fall back to OCR when the page has too little extractable text.
                # strip() keeps whitespace-only pages from passing the check.
                if self.use_ocr and len(page_text.strip()) < self.ocr_threshold:
                    page_text = self._extract_text_with_ocr(page)

                text_list.append(page_text)

        return "\n".join(text_list)

    def _extract_text_with_ocr(self, page) -> str:
        """
        Extracts text from a PDF page using OCR.
        :param page: The PDF page from PyMuPDF.
        :return: Extracted text using OCR.
        """
        # Render the page to an RGB pixmap (no alpha channel by default).
        # A higher resolution may improve OCR accuracy at the cost of speed.
        pix = page.get_pixmap()
        img = Image.frombytes("RGB", (pix.width, pix.height), pix.samples)
        return pytesseract.image_to_string(img)

🔄️ Additional Information

No response

👀 Have you spent some time to check if this feature request has been raised before?

I checked and didn't find a similar issue.

Are you willing to submit PR?

Yes, I am willing to submit a PR!


Fagner-lourenco · 2024-04-30T00:43:12Z

I have not been able to optimize the code enough to open a pull request. The function works on my computer, but very slowly. If anyone can take on this improvement, I would be grateful. I believe it would be a substantial improvement to the tool, not only for me but for many other use cases.

@dartpain
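One possible direction for the slowness, offered only as a rough sketch and not as a tested fix for this project: send only the sparse pages to OCR and run those OCR calls concurrently. Everything here is an assumption for illustration, including the helper name `fill_with_ocr` and the `ocr_page` callable (which would wrap the Tesseract call for a given page index). Since pytesseract runs the Tesseract binary as a separate process, a thread pool may be enough to overlap the work.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Optional


def fill_with_ocr(
    page_texts: List[str],
    ocr_page: Callable[[int], str],
    threshold: int = 10,
    max_workers: Optional[int] = None,
) -> List[str]:
    """Replace sparse page texts with OCR output, running OCR calls concurrently.

    `ocr_page` is a hypothetical callable: page index in, OCR text out.
    """
    # Only pages below the threshold go to OCR; the rest are kept unchanged.
    sparse = [i for i, text in enumerate(page_texts) if len(text.strip()) < threshold]
    result = list(page_texts)
    if not sparse:
        return result

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results line up with `sparse`.
        for i, ocr_text in zip(sparse, pool.map(ocr_page, sparse)):
            result[i] = ocr_text
    return result
```

In practice the page images would probably need to be rendered in the main thread first, with only the Tesseract call handed to the pool, because sharing a PyMuPDF document across threads is, as far as I know, not supported.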

dartpain · 2024-04-30T09:42:37Z

Appreciate the attempt, @Fagner-lourenco

dartpain assigned Fagner-lourenco Apr 13, 2024

dartpain unassigned Fagner-lourenco Apr 30, 2024
