Not able to fetch all text data & Not able to extract text, table data in proper format #205

Open
reema93jain opened this issue Jan 31, 2024 · 1 comment
Labels
bug Something isn't working

Comments

reema93jain commented Jan 31, 2024

Hi Team,

I am using Layout Parser and Detectron2 to detect everything except figures (text, titles, lists, and tables) in a PDF, which I convert to images using pdf2image. I then want to extract the detected text, titles, tables, and lists to a .txt file.

Issues:
1) The model does not seem to detect all of the text correctly.
2) When extracting the data to a .txt file:
a) I am not able to write the text in the order in which it appears in the PDF.
b) I am not able to extract table data in tabular format.

Can you please suggest how I can resolve the above issues? Thank you!
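
On issue 2b, one possible direction, offered as a sketch rather than a confirmed Layout Parser feature: running OCR on a table crop returns one flat string, so the tabular structure has to be rebuilt from word positions. The example below assumes word boxes are already available as (text, left, top) tuples, for instance derived from the output of pytesseract.image_to_data. The names words_to_rows, to_tsv, and row_tol are hypothetical, and the sample words are made up for illustration.

```python
from typing import List, Tuple

# (text, left, top) in pixels; assumed to come from a word-level OCR result
Word = Tuple[str, float, float]


def words_to_rows(words: List[Word], row_tol: float = 10.0) -> List[List[str]]:
    """Group words into rows by vertical position, then order each row left to right."""
    rows: List[List[Word]] = []
    for word in sorted(words, key=lambda w: w[2]):
        # A word joins the current row if its top edge is close to the row's first word
        if rows and abs(word[2] - rows[-1][0][2]) <= row_tol:
            rows[-1].append(word)
        else:
            rows.append([word])
    return [[w[0] for w in sorted(row, key=lambda w: w[1])] for row in rows]


def to_tsv(rows: List[List[str]]) -> str:
    """Join cells with tabs and rows with newlines."""
    return "\n".join("\t".join(row) for row in rows)


# Made-up sample: a 2x2 table whose words arrive out of order
sample_words = [("Speed", 200, 12), ("Port", 20, 10), ("10G", 200, 41), ("Eth1", 20, 40)]
table_rows = words_to_rows(sample_words)
table_tsv = to_tsv(table_rows)
```

This treats every word as its own cell, so multi-word cells would be split; merging words by horizontal gap, or using a dedicated table-structure model, would be needed for real tables.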

Code:
# Install necessary libraries
# Install detectron2
!pip install 'git+https://github.com/facebookresearch/detectron2.git@v0.4#egg=detectron2'
# Install layoutparser
!pip install layoutparser
!pip install layoutparser[ocr]
# Install opencv, numpy, matplotlib
!pip install opencv-python numpy matplotlib
!pip3 install pdf2image
!sudo apt install build-essential libpoppler-cpp-dev pkg-config python3-dev
!apt-get install poppler-utils
!pip install --upgrade google-cloud-vision
!pip uninstall google-cloud-vision
!pip install google-cloud-vision
!apt install tesseract-ocr
!apt install libtesseract-dev
!pip install pytesseract

import os
from pdf2image import convert_from_path
import shutil
import cv2
import numpy as np
import layoutparser as lp

# Define the PDF path
pdf_file = '7050X_Q_A.pdf'

# Define the output file name
output_file = 'output.txt'

with open(output_file, 'w', encoding='utf-8') as f:
    for i, page_img in enumerate(convert_from_path(pdf_file)):
        img = np.asarray(page_img)

        model3 = lp.models.Detectron2LayoutModel(
            'lp://PubLayNet/faster_rcnn_R_50_FPN_3x/config',
            extra_config=["MODEL.ROI_HEADS.SCORE_THRESH_TEST", 0.5],
            label_map={0: "Text", 1: "Title", 2: "List", 3: "Table", 4: "Figure"}
        )

        layout_result3 = model3.detect(img)

        text_blocks = lp.Layout([b for b in layout_result3 if b.type != "Figure"])

        h, w = img.shape[:2]

        left_interval = lp.Interval(0, w / 2 * 1.05, axis='x').put_on_canvas(img)

        left_blocks = text_blocks.filter_by(left_interval, center=True)
        left_blocks.sort(key=lambda b: b.coordinates[1])

        right_blocks = [b for b in text_blocks if b not in left_blocks]
        right_blocks.sort(key=lambda b: b.coordinates[1])

        text_blocks = lp.Layout([b.set(id=idx) for idx, b in enumerate(left_blocks + right_blocks)])
        viz = lp.draw_box(img, text_blocks, box_width=10, show_element_id=True)
        display(viz)
        ocr_agent = lp.TesseractAgent(languages='eng')
        for block in text_blocks:
            segment_image = (block
                             .pad(left=5, right=5, top=5, bottom=5)
                             .crop_image(img))

            text = ocr_agent.detect(segment_image)
            block.set(text=text, inplace=True)

        # Write text to the output file
        for txt in text_blocks.get_texts():
            #print(txt, end='\n---\n')
            f.write(txt + '\n---\n')

print("Text extraction completed. Check the output file:", output_file)
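
On issue 2a, a possible cause worth checking: in layoutparser 0.3.x, Layout.sort appears to return a new layout unless inplace=True is passed (the project's Deep Layout Parsing tutorial passes it), so the left_blocks.sort(...) call in the code above may leave the blocks in detection order. This should be verified against the installed version. The sketch below shows the intended ordering on plain (x1, y1, x2, y2) tuples, independent of Layout Parser. The function name reading_order and the sample boxes are hypothetical; the 0.525 split mirrors the w / 2 * 1.05 boundary used in the code.

```python
from typing import List, Tuple

# (x1, y1, x2, y2) in pixels, same corner order as a layoutparser block's coordinates
Box = Tuple[float, float, float, float]


def reading_order(boxes: List[Box], page_width: float, split: float = 0.525) -> List[Box]:
    """Return boxes ordered left column first, then right column, each top to bottom."""
    boundary = page_width * split
    # A box belongs to the left column when its horizontal center is within the boundary
    left = [b for b in boxes if (b[0] + b[2]) / 2 <= boundary]
    right = [b for b in boxes if (b[0] + b[2]) / 2 > boundary]
    # sorted() returns a new list, so the result must be used rather than discarded
    return sorted(left, key=lambda b: b[1]) + sorted(right, key=lambda b: b[1])


# Made-up sample: four boxes on a 1000 px wide, two-column page, in shuffled order
sample_boxes = [
    (600, 50, 900, 100),
    (50, 300, 400, 350),
    (50, 50, 400, 100),
    (600, 300, 900, 350),
]
ordered_boxes = reading_order(sample_boxes, page_width=1000)
```

The two-column split is itself an assumption. If the PDF is single-column, sorting all blocks by their top coordinate alone may match the page order better, because narrow blocks centered on the right half would otherwise be moved to the end.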

Environment

  1. Windows
  2. Layout Parser & layoutparser[ocr] version 0.3.4
  3. PyTorch version: 2.1.0+cu121
    !pip install torch==2.1.0 torchvision==0.16.0 torchaudio==2.1.0 --index-url https://download.pytorch.org/whl/cu121
  4. google-cloud-vision version: 3.5.0
  5. google-api-core version: 2.11.1
  6. Python 3.10.6

Thanks
Reema Jain

@reema93jain reema93jain added the bug Something isn't working label Jan 31, 2024
@reema93jain
Author

Hi Team,

Can someone please help resolve the above issues?

Thank you for the help!
Reema Jain
