Delete or prune StructTreeRoot for --force-ocr/--redo-ocr/--skip-text and post warnings #1163

Open
jbarlow83 opened this issue Oct 9, 2023 · 0 comments

@jbarlow83
Collaborator

If a document has a StructTreeRoot, it's even more likely we're dealing with a bona fide OCR + tagged document, but the user can still request --force-ocr. There may also be tricky cases, such as a mixed-content document that combines digital output with scanned pages. This interacts with --pages as well.

First, we need to warn if this object is present (and covers all requested --pages?), since its presence is a stronger signal against running OCR.

Currently, using --force-ocr leaves behind an invalid StructTreeRoot full of pointers to deleted objects. If we're doing --force-ocr, we should delete every object in the tree that belongs to a processed page.
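
To make the per-page pruning idea concrete, here is a rough sketch on a simplified in-memory model rather than on real PDF objects. The dict layout is an assumption for illustration only: each node has an optional `Pg` (a page index) and a `K` list whose entries are either child nodes or integer marked-content IDs that live on the node's own page. The function name `prune_struct_tree` is hypothetical and is not part of OCRmyPDF or pikepdf.

```python
def prune_struct_tree(node, processed_pages):
    """Return a pruned copy of `node`, or None if nothing survives.

    Simplified model (an assumption, not real PDF objects):
      - node["Pg"] is an optional page index
      - node["K"] holds child nodes (dicts) or integer content IDs
        that refer to content on node["Pg"]
    """
    on_processed_page = node.get("Pg") in processed_pages
    kept = []
    for kid in node.get("K", []):
        if isinstance(kid, dict):
            # Recurse into child structure elements.
            pruned = prune_struct_tree(kid, processed_pages)
            if pruned is not None:
                kept.append(pruned)
        elif not on_processed_page:
            # Integer content ID on a page we did not touch: keep it.
            kept.append(kid)
    if not kept:
        # Every child pointed at a processed page, so drop this node too.
        return None
    return {**node, "K": kept}


tree = {
    "S": "Document",
    "K": [
        {"S": "P", "Pg": 0, "K": [0, 1]},
        {"S": "P", "Pg": 1, "K": [0]},
    ],
}
remaining = prune_struct_tree(tree, processed_pages={0})
```

A real implementation would also have to deal with marked-content reference dictionaries that carry their own page, the parent tree, and object references, none of which this model covers.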

For --skip-text we ought to leave StructTreeRoot intact. Except in contrived cases, the StructTreeRoot will not reference any pages with known text.

For --redo-ocr we also need to discard all objects, because our new objects may not match the old ones.
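
The per-mode behaviour proposed above could be summarised as a small lookup, sketched here. The names `TreeAction` and `plan_struct_tree` and the warning text are hypothetical, and warning in every mode whenever the tree is present is an assumption about how the warning rule would apply.

```python
from enum import Enum


class TreeAction(Enum):
    KEEP = "keep"
    PRUNE_PROCESSED_PAGES = "prune processed pages"


# Proposed handling per OCR mode, as described in this issue.
_POLICY = {
    "force-ocr": TreeAction.PRUNE_PROCESSED_PAGES,
    "redo-ocr": TreeAction.PRUNE_PROCESSED_PAGES,
    "skip-text": TreeAction.KEEP,
}


def plan_struct_tree(mode, has_struct_tree):
    """Return (action, warning) for the given mode.

    `warning` is None when the document has no StructTreeRoot.
    """
    if not has_struct_tree:
        # Nothing to prune and nothing to warn about.
        return TreeAction.KEEP, None
    action = _POLICY[mode]
    # Hypothetical wording; assumes we warn in all three modes.
    warning = (
        "Document has a StructTreeRoot (tagged PDF); "
        f"--{mode} will {action.value} its structure tree."
    )
    return action, warning
```

For example, `plan_struct_tree("skip-text", True)` yields `TreeAction.KEEP` together with a warning, while a document without a StructTreeRoot yields no warning in any mode.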
