Delete or prune StructTreeRoot for --force-ocr/--redo-ocr/--skip-text and post warnings #1163

jbarlow83 · 2023-10-09T09:56:25Z

If a document has a StructTreeRoot, it's even more likely that we're dealing with a bona fide OCR + tagged document, but the user can still request --force-ocr. There may also be tricky cases, such as a mixed-content document that combines digital output with scanned pages. This interacts with --pages as well.

First, we need to warn if this object is present (and covers all requested --pages?), since its presence is a stronger signal against running OCR.
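A minimal sketch of the presence check, not the project's actual implementation. The helper name `warn_if_tagged` and the logger name are hypothetical. It assumes the document catalog is a mapping-like object keyed by PDF names such as "/StructTreeRoot" (a plain dict here; pikepdf's `pdf.Root` is assumed to behave the same way for `in`). It does not attempt the harder question of whether the tree covers the requested --pages.

```python
import logging

# Hypothetical logger name, used only for this sketch.
log = logging.getLogger("structtree_sketch")

STRUCT_TREE_KEY = "/StructTreeRoot"


def warn_if_tagged(catalog, logger=log):
    """Return True and log a warning if the catalog has a StructTreeRoot.

    `catalog` is any mapping-like document catalog that supports
    `key in catalog` with PDF name strings as keys.
    """
    if STRUCT_TREE_KEY in catalog:
        logger.warning(
            "Document has a StructTreeRoot (tagged PDF); "
            "OCR is probably not appropriate for this file."
        )
        return True
    return False
```

Returning a boolean as well as logging lets the caller decide whether to escalate, for example when --force-ocr is also requested.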

Currently, using --force-ocr leaves an invalid StructTreeRoot full of pointers to deleted objects. If we're doing --force-ocr, we should delete all objects in the tree on each processed page.

For --skip-text we ought to leave StructTreeRoot intact. Except in contrived cases, the StructTreeRoot will not reference any pages with known text.

For --redo-ocr we also need to discard all objects, because our new objects may not match the old ones.
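The per-mode behavior proposed above can be summarized as a small policy table. This is a sketch under stated assumptions: the names `StructTreeAction`, `POLICY`, and `apply_struct_tree_policy` are hypothetical, the catalog is again treated as a mapping-like object, and discarding is modeled as removing the whole "/StructTreeRoot" entry, which is coarser than the per-page pruning the issue describes.

```python
from enum import Enum


class StructTreeAction(Enum):
    KEEP = "keep"
    DISCARD = "discard"


# Proposed handling per OCR mode, as described in this issue.
POLICY = {
    "force-ocr": StructTreeAction.DISCARD,  # old objects are deleted
    "redo-ocr": StructTreeAction.DISCARD,   # new objects may not match old ones
    "skip-text": StructTreeAction.KEEP,     # pages with text are left alone
}

STRUCT_TREE_KEY = "/StructTreeRoot"


def apply_struct_tree_policy(catalog, mode):
    """Apply the policy for `mode` to a mapping-like catalog.

    Simplification: drops the entire StructTreeRoot entry rather than
    pruning only the elements that belong to processed pages.
    Returns the action that was taken.
    """
    action = POLICY[mode]
    if action is StructTreeAction.DISCARD and STRUCT_TREE_KEY in catalog:
        del catalog[STRUCT_TREE_KEY]
    return action
```

A real implementation would also need to account for --pages, since only some pages may be processed while the tree still references the rest.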
