[Bug]: OCRmyPDF does not preserve existing XMP metadata #1220
Comments
When run with higher verbosity
It looks like you used pikepdf to set dc:contributor, and set it to a "singleton" text string. pikepdf does not block you from setting metadata to a type that is not consistent with the XML schema, unfortunately. exiftool displays it even though it's the wrong type, as a best-effort fallback I suppose.

Depending on what you're doing, you could also use libexempi3 (python-xmp-toolkit), which is a more comprehensive implementation of the XMP spec, but also very difficult to use in my experience. (When it comes down to it, XMP is a ridiculously overengineered spec, so there's only so much one can do to wrangle its complexity into a clean interface.) There are some complex XMP data structures that pikepdf cannot generate.

dc:contributor's type is rdf:Bag, that is, an unordered list/set: there are potentially multiple contributors to a work and no priority is assumed. If you assign it using a set, pikepdf will generate an rdf:Bag and the correct metadata is generated.

    In [5]: with p.open_metadata() as m:
       ...:     del m['dc:contributor']
       ...:     m['dc:contributor'] = {'Contributor One', 'Contributor Two'}
       ...:

    In [6]: p.save('issuepdf/1220.fixed.pdf')

It's actually Ghostscript that silently strips out incorrect metadata when it is run. Then OCRmyPDF reports that some metadata was missing. Using the procedure above you can determine appropriate types for the other metadata fields of interest and fix them.

Since OCRmyPDF warns about removal of metadata, there's nothing to fix in its codebase. I could see adding an enhancement to pikepdf to warn about assigning wrong types for the most important metadata fields (Dublin Core, mainly).
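To make the "determine appropriate types" step concrete, here is a minimal sketch that inspects an XMP packet and flags Dublin Core properties whose container does not match what the XMP specification expects. It uses only the Python standard library. The function name `check_dc_types`, the `EXPECTED` table and the `SAMPLE` packet are illustrative and not part of pikepdf or OCRmyPDF. Whether a flagged property is actually what Ghostscript drops is an assumption that would need to be checked against a real file.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC = "http://purl.org/dc/elements/1.1/"

# Container expected for each Dublin Core property in XMP.
# None means a simple (non-array) text value.
EXPECTED = {
    "contributor": "Bag", "creator": "Seq", "date": "Seq",
    "description": "Alt", "language": "Bag", "publisher": "Bag",
    "relation": "Bag", "rights": "Alt", "subject": "Bag",
    "title": "Alt", "type": "Bag",
    "coverage": None, "format": None, "identifier": None, "source": None,
}


def check_dc_types(xmp_xml):
    """Return a list of (property, problem) tuples for dc: properties."""
    problems = []
    root = ET.fromstring(xmp_xml)
    for desc in root.iter(f"{{{RDF}}}Description"):
        # Properties may be child elements or (for simple values) attributes.
        found = [(child.tag, child) for child in desc]
        found += [(attr, None) for attr in desc.attrib]
        for tag, elem in found:
            if not tag.startswith(f"{{{DC}}}"):
                continue  # not a Dublin Core property
            name = tag[len(DC) + 2:]  # strip the "{namespace}" prefix
            if name not in EXPECTED:
                problems.append((name, "not a Dublin Core 1.1 element"))
                continue
            want = EXPECTED[name]
            got = None
            # An array value has exactly one rdf:Bag/Seq/Alt child.
            if elem is not None and len(elem) and elem[0].tag.startswith(f"{{{RDF}}}"):
                got = elem[0].tag[len(RDF) + 2:]
            if got != want:
                problems.append(
                    (name, f"expected {want or 'simple text'}, found {got or 'simple text'}")
                )
    return problems


# Hypothetical packet: dc:contributor is a singleton string, dc:subject is a proper Bag.
SAMPLE = f"""<x:xmpmeta xmlns:x="adobe:ns:meta/">
<rdf:RDF xmlns:rdf="{RDF}">
<rdf:Description rdf:about="" xmlns:dc="{DC}">
<dc:contributor>Contributor One</dc:contributor>
<dc:subject><rdf:Bag><rdf:li>ocr</rdf:li></rdf:Bag></dc:subject>
</rdf:Description>
</rdf:RDF>
</x:xmpmeta>"""

print(check_dc_types(SAMPLE))
```

With pikepdf, the packet could be obtained as a string via `str(pdf.open_metadata())` and passed to the function. Applied to XMP that uses a property name outside the table, such as `dc:created`, the sketch reports it as not being one of the fifteen Dublin Core 1.1 elements.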
Thanks a lot for the detailed answer! And yeah, I agree, XMP seems to be one of these overengineered XML specs. 😄
Just tried it again with this pikepdf snippet to create the metadata:

    import pikepdf
    import sys
    from datetime import datetime
    from pikepdf.models.metadata import encode_pdf_date

    d = encode_pdf_date(datetime(year=2023, month=12, day=25))
    pdf = pikepdf.open(sys.argv[1])
    with pdf.open_metadata() as meta:
        meta['dc:contributor'] = { "Test Contributor" }
        meta['dc:title'] = "Title"
        meta['dc:created'] = d
    pdf.save(sys.argv[2])

The metadata generated by pikepdf looks ok:

<x:xmpmeta xmlns:x="adobe:ns:meta/" x:xmptk="pikepdf">
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
<rdf:Description rdf:about=""><dc:contributor xmlns:dc="http://purl.org/dc/elements/1.1/"><rdf:Bag><rdf:li>Test Contributor</rdf:li></rdf:Bag></dc:contributor></rdf:Description><rdf:Description rdf:about=""><dc:title xmlns:dc="http://purl.org/dc/elements/1.1/"><rdf:Alt><rdf:li xml:lang="x-default">Title</rdf:li></rdf:Alt></dc:title></rdf:Description><rdf:Description xmlns:dc="http://purl.org/dc/elements/1.1/" rdf:about="" dc:created="D:20231225000000"/><rdf:Description xmlns:xmp="http://ns.adobe.com/xap/1.0/" rdf:about="" xmp:MetadataDate="2024-01-18T06:31:13.412391+00:00"/><rdf:Description xmlns:pdf="http://ns.adobe.com/pdf/1.3/" rdf:about="" pdf:Producer="pikepdf 8.10.1"/></rdf:RDF>
</x:xmpmeta>

Still, the
I'd also rather use
Describe the bug
OCRmyPDF does not preserve XMP metadata tags, e.g., from the Dublin Core set, like contributor, created, subject.
Steps to reproduce
Files
annotated.pdf
test.pdf
How did you download and install the software?
Docker container
OCRmyPDF version
16.0.2
Relevant log output
No response