Feat/chipper repetitions #314

Open
wants to merge 41 commits into
base: main

ajjimeno
Contributor

@ajjimeno ajjimeno commented Jan 2, 2024

In some cases Chipper repeats elements. This PR adds mechanisms to detect these repetitions during decoding, and adds filters for repetitions that cannot be identified during decoding.

Repetition detection:

  • Tables benefit from beam search size = 3. When a table is detected using beam search size = 1, the generation restarts with beam search size = 3.
  • To avoid interfering with genuine repetitions, a context window has been defined, so repeated text is only looked for within that window rather than across all the generated elements.
  • A specific mechanism has been added to detect repetitions inside tables, which were not detected before.
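To illustrate the context-window idea from the list above, here is a minimal sketch of an n-gram repetition check that only looks at the most recent tokens. It is not the code in this PR: the function name `has_repeated_ngram` and the parameters `ngram_size` and `context_window` are assumptions made for the example.

```python
from typing import Sequence


def has_repeated_ngram(tokens: Sequence[int], ngram_size: int, context_window: int) -> bool:
    """Return True if the trailing n-gram already occurs earlier in the window.

    Hypothetical helper: only the last `context_window` tokens are inspected,
    so a genuine repetition that lies further back is not flagged.
    """
    if ngram_size <= 0 or context_window <= 0:
        return False
    window = list(tokens[-context_window:])
    if len(window) < 2 * ngram_size:
        return False
    last = window[-ngram_size:]
    # Compare the trailing n-gram with every earlier, non-overlapping position.
    for start in range(len(window) - 2 * ngram_size + 1):
        if window[start:start + ngram_size] == last:
            return True
    return False
```

With a small window, the sequence `[1, 2, 3, 9, 9, 9, 9, 1, 2, 3]` is not flagged for `ngram_size=3, context_window=6`, because the first `1, 2, 3` falls outside the window; with `context_window=10` it is flagged.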

Additional filtering:

  • Remove empty tables
  • Remove Picture elements that get repeated
  • Remove repeated texts. This uses the bounding boxes and matching on the element text to identify repetitions.
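The filtering steps above could look roughly like the following sketch. It is only an illustration under stated assumptions: the `Element` dataclass, the `iou` helper, the `filter_elements` function, and the 0.9 overlap threshold are all hypothetical, and the PR's actual matching rules may differ.

```python
from dataclasses import dataclass
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


@dataclass
class Element:
    # Hypothetical stand-in for a layout element.
    type: str
    text: str
    bbox: Box


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def filter_elements(elements: List[Element], iou_threshold: float = 0.9) -> List[Element]:
    """Drop empty tables and elements that repeat an already kept element."""
    kept: List[Element] = []
    for el in elements:
        # Remove empty tables.
        if el.type == "Table" and not el.text.strip():
            continue
        # Assumed rule: a repetition has the same type, the same text,
        # and a strongly overlapping bounding box.
        is_repeat = any(
            prev.type == el.type
            and prev.text == el.text
            and iou(prev.bbox, el.bbox) >= iou_threshold
            for prev in kept
        )
        if not is_repeat:
            kept.append(el)
    return kept
```

Requiring both a text match and a box overlap keeps text that legitimately appears twice in different places on the page.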

Here are two example images. The first is processed as a table, since it was modelled that way in the ground truth. The problem is that the table is never finished: Chipper generates an unlimited number of "" token pairs. The second has some repetitions that may be caused by the images in the document. With this PR, these repetitions disappear.

The PDF document made Chipper generate repetitions but only under Linux.

Example code for the images.

from unstructured_inference.inference.layout import DocumentLayout
from unstructured_inference.models.base import get_model

image_file_name = "path/to/image.png"  # change to your image path

model = get_model("chipper")
doc = DocumentLayout.from_image_file(
    image_file_name,
    detection_model=model,
)

print(*[element.__dict__ for element in doc.pages[0].elements])

print(*[element for element in doc.pages[0].elements], sep="\n")

Example code for the PDF file:

from unstructured_inference.inference.layout import DocumentLayout
from unstructured_inference.models.base import get_model

pdf_file_name = "path/to/document.pdf"  # change to your PDF path

model = get_model("chipper")
doc = DocumentLayout.from_file(
    pdf_file_name,
    detection_model=model,
    pdf_image_dpi=300,
)

print(*[element.__dict__ for element in doc.pages[0].elements])

print(*[element for element in doc.pages[0].elements], sep="\n")

RAND_RRA2977-1.pdf

286089231-8fef08e1-b3c6-4795-8746-1256af5288e7

286091300-aff403ef-4488-4e8e-8029-c2afa43cb355

A previous PR had issues with unstructured when running some mini holistic documents. This was due to a problem with bounding boxes and also with nested elements (e.g. List and List-items). An exception would happen when running the mini holistic documents below. The example code below allows testing it. You might need to use the main branch of unstructured, since a bug related to layout elements was identified recently and a different exception might happen otherwise. The code prints the unstructured version, so it is possible to check which version is being used. The output of the processing is stored in the out.un.json file.

from unstructured import __version__
print(__version__.__version__)

import json
from unstructured.partition.auto import partition
from unstructured.staging.base import elements_to_json

filename = "path/to/document.pdf"  # change to the path of the document to test

elements = json.loads(elements_to_json(partition(filename=filename, strategy="hi_res", model_name="chipper")))

with open('out.un.json', 'w') as w:
    json.dump(elements, w)

soldering-iron-manual.pdf

TX-penal-code-T8-36-1-3.pdf

@ajjimeno ajjimeno marked this pull request as ready for review January 3, 2024 05:56
@badGarnet
Collaborator

I tested the PR locally with the soldering PDF (file 1). Compared to main, the PR output has:

  • repeated element for "Determining if Heating Element has Expired and Needs to be Replaced"
  • repeated element for "C. A "Beep" means the element is good. No beep means the heating element has expired and needs replacing."
  • revision of two figure caption elements (correctly) into Image elements
  • better url text (no extra space)
    main.json
    pr.json
    diff.txt

ajjimeno and others added 3 commits January 31, 2024 15:27
Co-authored-by: Yao You <theyaoyou@gmail.com>
Co-authored-by: Yao You <theyaoyou@gmail.com>
@ajjimeno
Contributor Author

@badGarnet Just in case it is relevant: when comparing the main branch and this one, "chipper" points to "chipperv2" in the main branch and to "chipperv3" in this branch. You have probably noticed this already, but just in case.

@qued
Contributor

qued commented Jan 31, 2024

I confirmed what Yao found, that for the file soldering-iron-manual.pdf on page 3, there are no repeated elements on the main branch while repeated elements are introduced on this branch.

This is concerning because it's one example of the PR making the problem worse. It may be a one-off and this PR may fix more cases than it breaks, but it suggests we need to look at the statistics to determine whether this makes things better or worse.
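One possible way to gather such statistics is to count texts that occur more than once in each output and compare the two branches. The sketch below assumes each JSON file is a list of element dictionaries with a "text" key, as written by the example script earlier in this thread; the helper names are hypothetical, and a repeated text is not necessarily an error (headers and footers can repeat legitimately).

```python
import json
from collections import Counter
from typing import Dict, List


def load_elements(path: str) -> List[dict]:
    """Load a list of element dictionaries from a JSON file."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def repeated_texts(elements: List[dict]) -> Dict[str, int]:
    """Map each non-empty text that occurs more than once to its count."""
    counts = Counter((el.get("text") or "").strip() for el in elements)
    return {text: n for text, n in counts.items() if text and n > 1}


def newly_repeated(main_elements: List[dict], pr_elements: List[dict]) -> Dict[str, int]:
    """Texts repeated more often in the PR output than in the main output."""
    before = repeated_texts(main_elements)
    after = repeated_texts(pr_elements)
    return {text: n for text, n in after.items() if n > before.get(text, 1)}
```

For the files attached above, this could be called as `newly_repeated(load_elements("main.json"), load_elements("pr.json"))`, and the same comparison could be run over a larger set of documents to see whether the PR reduces or increases repetitions overall.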

@ajjimeno
Contributor Author

ajjimeno commented Feb 1, 2024

@qued @badGarnet The repetition has been fixed. There was a condition for moving from the initial beam search size = 1 to beam search size = 3 that I had tried to optimize. I have reverted it to the original implementation of NGramRepetitonStoppingCriteria.
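For readers unfamiliar with the restart logic discussed here, the control flow can be sketched as follows. This is not the PR's implementation: `generate` and `needs_retry` are hypothetical callables standing in for the model's decoding call and for whatever condition triggers the switch to the larger beam.

```python
from typing import Callable, List


def generate_with_fallback(
    generate: Callable[[int], List[str]],
    needs_retry: Callable[[List[str]], bool],
    initial_beams: int = 1,
    fallback_beams: int = 3,
) -> List[str]:
    """Decode once with a small beam and restart with a larger one if needed.

    `generate(num_beams)` is assumed to return the decoded tokens, and
    `needs_retry(tokens)` is assumed to report whether the first pass hit
    the restart condition (for example, a table or a repetition was detected).
    """
    output = generate(initial_beams)
    if needs_retry(output):
        # Restart from scratch with the larger beam size.
        output = generate(fallback_beams)
    return output
```

The restart costs a second full decoding pass, so the choice of condition in `needs_retry` decides both the speed and the output quality.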


3 participants