Feat/chipper repetitions #314

ajjimeno · 2024-01-02T22:20:09Z

In some cases Chipper repeats elements. This PR adds mechanisms to detect these repetitions during decoding, and adds post-processing filters for the repetitions that cannot be identified during decoding.

Repetition detection:

Tables benefit from beam search size = 3. When a table is detected using beam search size = 1, the generation restarts with beam search size = 3.
To avoid interfering with genuine repetitions, a context window has been defined so that repeated text is not searched for across all the generated elements.
A specific mechanism has been added to detect repetitions inside tables, which were not detected before.
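
The context-window idea can be illustrated with a minimal sketch. This is not the code in this PR: the function name `has_repeated_ngram` and the parameters `ngram_size`, `context_window` and `max_repeats` are hypothetical, chosen only to show how restricting the check to the most recent tokens leaves distant, genuine repetitions alone.

```python
from typing import Sequence


def has_repeated_ngram(
    token_ids: Sequence[int],
    ngram_size: int = 4,
    context_window: int = 20,
    max_repeats: int = 3,
) -> bool:
    """Check whether the latest n-gram repeats too often in the recent context.

    Hypothetical helper for illustration only; names and defaults are assumptions.
    """
    if len(token_ids) < ngram_size:
        return False

    # Only look at the most recent tokens, not at everything generated so far.
    window = list(token_ids[-context_window:])
    last_ngram = window[-ngram_size:]

    # Count how many times the latest n-gram occurs inside the window
    # (the latest occurrence itself is included in the count).
    count = 0
    for start in range(len(window) - ngram_size + 1):
        if window[start:start + ngram_size] == last_ngram:
            count += 1

    return count >= max_repeats
```

With a small window, an n-gram that reappears much later in the output is not flagged, whereas a tight loop of the same tokens is.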

Additional filtering:

Remove empty tables
Remove Picture elements that get repeated
Remove repeated texts; this uses the bounding boxes and a match on the elements' text to identify repetitions
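
A minimal sketch of the repeated-text filter, assuming an element is a repetition when its text equals that of an already kept element and their bounding boxes overlap strongly. The `Element` class, `drop_repeated` function and `iou_threshold` value are hypothetical and are not taken from the PR; how the PR actually combines the two signals may differ. The guards for a missing box and a zero union area correspond to cases named in this PR's commit titles.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

BBox = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


@dataclass
class Element:
    """Hypothetical stand-in for a layout element."""
    text: str
    bbox: Optional[BBox]


def iou(a: Optional[BBox], b: Optional[BBox]) -> float:
    """Intersection over union of two boxes; 0.0 if a box is missing or degenerate."""
    if a is None or b is None:
        return 0.0
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    if union == 0.0:
        return 0.0
    return inter / union


def drop_repeated(elements: List[Element], iou_threshold: float = 0.8) -> List[Element]:
    """Keep the first occurrence of each (text, location) pair, preserving order."""
    kept: List[Element] = []
    for el in elements:
        is_duplicate = any(
            el.text == k.text and iou(el.bbox, k.bbox) >= iou_threshold
            for k in kept
        )
        if not is_duplicate:
            kept.append(el)
    return kept
```

Requiring both signals means that the same text at a different position on the page (for example, a label that legitimately appears twice) is kept.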

Here are two example images. One is processed as a table because it was modelled that way in the ground truth; the problem is that the table is never finished, and Chipper keeps generating "" token pairs without limit. The second has some repetitions that may be caused by the images in the document. With this PR, these repetitions disappear.

The PDF document made Chipper generate repetitions, but only under Linux.

Example code for the images:

from unstructured_inference.inference.layout import DocumentLayout
from unstructured_inference.models.base import get_model

image_file_name = "path/to/image.png"  # replace with the path to your image

model = get_model("chipper")
doc = DocumentLayout.from_image_file(
    image_file_name,
    detection_model=model,
)

print(*[element.__dict__ for element in doc.pages[0].elements])

print(*[element for element in doc.pages[0].elements], sep="\n")

Example code for the PDF file:

from unstructured_inference.inference.layout import DocumentLayout
from unstructured_inference.models.base import get_model

pdf_file_name = "path/to/document.pdf"  # replace with the path to your PDF

model = get_model("chipper")
doc = DocumentLayout.from_file(
    pdf_file_name,
    detection_model=model,
    pdf_image_dpi=300,
)

print(*[element.__dict__ for element in doc.pages[0].elements])

print(*[element for element in doc.pages[0].elements], sep="\n")

RAND_RRA2977-1.pdf

A previous PR had issues with unstructured when running some of the mini holistic documents. The cause was a problem with bounding boxes as well as with nested elements (e.g. List and List-items). An exception was raised when running the mini holistic documents below. The example code below allows testing it. You might need to use the main branch of unstructured, since a bug related to layout elements was recently identified and a different exception might be raised otherwise. The code prints the unstructured version, so it is possible to check which version is being used. The output of the processing is stored in the out.un.json file.

from unstructured import __version__
print(__version__.__version__)

import json
from unstructured.partition.auto import partition
from unstructured.staging.base import elements_to_json

filename = "path/to/document.pdf"  # replace with the path to one of the documents below
elements = json.loads(elements_to_json(partition(filename=filename, strategy="hi_res", model_name="chipper")))

with open('out.un.json', 'w') as w:
    json.dump(elements, w)

soldering-iron-manual.pdf

TX-penal-code-T8-36-1-3.pdf


unstructured_inference/models/chipper.py

badGarnet · 2024-01-31T02:15:49Z

I tested the PR locally with the soldering PDF and found that, compared to main (file 1), we got:

repeated element for "Determining if Heating Element has Expired and Needs to be Replaced"
repeated element for "C. A "Beep" means the element is good. No beep means the heating element has expired and needs replacing."
revision of two figure caption elements (correctly) into Image elements
better url text (no extra space)
main.json
pr.json
diff.txt

Co-authored-by: Yao You <theyaoyou@gmail.com>

ajjimeno · 2024-01-31T09:14:21Z

@badGarnet Just in case it is relevant: when comparing the main branch and this one, note that "chipper" points to "chipperv2" in the main branch and to "chipperv3" in this branch. You have probably noticed this already, but just in case.

qued · 2024-01-31T17:39:57Z

I confirmed what Yao found, that for the file soldering-iron-manual.pdf on page 3, there are no repeated elements on the main branch while repeated elements are introduced on this branch.

This is concerning because it's one example of the PR making the problem worse. It may be a one-off and this PR may fix more cases than it breaks, but it suggests we need to look at the statistics to determine whether this makes things better or worse.
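
One rough way to gather such statistics is sketched below, assuming the element JSON files (such as main.json and pr.json) are lists of dicts with a "text" key, as produced by elements_to_json. The helper names `repeated_texts` and `repeated_texts_in_file` are hypothetical. Note that this also counts genuine repetitions such as running headers, so it is a signal for comparing branches rather than an error count.

```python
import json
from collections import Counter
from typing import Dict, List


def repeated_texts(elements: List[dict], min_length: int = 1) -> Dict[str, int]:
    """Map each element text that occurs more than once to its number of occurrences.

    Texts shorter than `min_length` (after stripping whitespace) are ignored.
    """
    counts = Counter((el.get("text") or "").strip() for el in elements)
    return {
        text: n
        for text, n in counts.items()
        if n > 1 and len(text) >= min_length
    }


def repeated_texts_in_file(path: str) -> Dict[str, int]:
    """Load a JSON list of elements from `path` and report its repeated texts."""
    with open(path, encoding="utf-8") as f:
        return repeated_texts(json.load(f))
```

Running `repeated_texts_in_file` on the output of each branch for the same set of documents and comparing the totals would give a first indication of whether repetitions increase or decrease overall.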

ajjimeno · 2024-02-01T00:44:22Z

@qued @badGarnet The repetition has been resolved. I had tried to optimize the condition that switches from the initial beam search size = 1 to beam search size = 3; I have reverted it to the original implementation of NGramRepetitonStoppingCriteria.

Antonio Jimeno Yepes and others added 29 commits November 22, 2023 16:40

New stop criterion and logits processor for tables

8032818

Revised repetition setting

6fc6a3a

Adding table processing

239d463

Hyperparameter selection

e09548a

Merge branch 'main' into feat/chipper-repetitions

78b255c

Moving cleaning in post-processing

2795e54

Moving cleaning in post-processing

312b264

Updated version

93cfbba

Added docstrings

3dc9003

IoU union_are == 0.0

91606d4

Linting

1dfc3d7

Linting

c571014

Fixed bbox == None in iou

5446658

Linting

a7de901

Merge branch 'main' into feat/chipper-repetitions

1422f99

Merge branch 'main' into feat/chipper-repetitions

a4c791e

Revised after merging with main

2811001

Revised version

380d92c

Additional tests

6845489

Fixing model selection

420202a

Moving element to cpu

787af10

Not remove nested list parent/children

0f5321f

Fixed bug on repetition removal

c7b7faa

Fixed bbox errors

d984279

Fixed bbox errors

e3b2123

Merge branch 'feat/chipper-repetitions' of github.com:Unstructured-IO…

491ec7d

…/unstructured-inference into feat/chipper-repetitions

Merge branch 'main' into feat/chipper-repetitions

081d5f9

Do not sort Chipper elements

4f56cbf

Updated version and fixed tests

6fdc6b4

ajjimeno marked this pull request as ready for review January 3, 2024 05:56

ajjimeno requested review from leah1985 and qued January 3, 2024 05:56

cragwolfe requested a review from badGarnet January 25, 2024 21:09