[Bug]: OCRmyPDF does not preserve existing XMP metadata #1220

jkorinth · 2023-12-27T12:20:55Z

Describe the bug

OCRmyPDF does not preserve XMP metadata tags, e.g., from the Dublin Core set, like contributor, created, subject.

Steps to reproduce

1. ocrmypdf --output-type pdfa --skip-text --language eng --rotate-pages-threshold 12.0 annotated.pdf test.pdf
2. exiftool annotated.pdf | sort
3. exiftool test.pdf | sort

Below is a diff of the output of exiftool showing that several tags are missing from the OCRmyPDF output.

Files

annotated.pdf
test.pdf

--- annotated.meta	2023-12-27 13:00:29.061991775 +0100
+++ test.meta	2023-12-27 12:59:14.375328088 +0100
@@ -1,22 +1,27 @@
-Contributor                     : Test Contributor
-Created                         : 2023-12-25
+Author                          : 
+Conformance                     : B
+Create Date                     : 2023:12:25 17:14:41+01:00
+Creator                         : 
+Creator Tool                    : ocrmypdf 16.0.2 / Tesseract OCR-PDF 5.3.3
 Directory                       : .
+Document ID                     : uuid:4224de48-db5d-11f9-0000-daf39fd7444b
 ExifTool Version Number         : 12.70
-File Access Date/Time           : 2023:12:25 17:08:45+01:00
-File Inode Change Date/Time     : 2023:12:25 17:08:39+01:00
-File Modification Date/Time     : 2023:12:25 17:08:39+01:00
-File Name                       : annotated.pdf
+File Access Date/Time           : 2023:12:27 12:58:05+01:00
+File Inode Change Date/Time     : 2023:12:25 17:14:41+01:00
+File Modification Date/Time     : 2023:12:25 17:14:41+01:00
+File Name                       : test.pdf
 File Permissions                : -rw-r--r--
-File Size                       : 14 kB
+File Size                       : 6.7 kB
 File Type Extension             : pdf
 File Type                       : PDF
+Format                          : application/pdf
+Language                        : en
 Linearized                      : No
-Metadata Date                   : 2023:12:25 16:08:39.337381+00:00
+Metadata Date                   : 2023:12:25 16:14:41.291096+00:00
 MIME Type                       : application/pdf
+Modify Date                     : 2023:12:25 16:14:41+00:00
 Page Count                      : 1
-PDF Version                     : 1.5
+Part                            : 2
+PDF Version                     : 1.7
 Producer                        : pikepdf 8.10.1
-PTEX Fullbanner                 : This is pdfTeX, Version 3.141592653-2.6-1.40.25 (TeX Live 2023/Arch Linux) kpathsea version 6.3.5
-Subject                         : Subject
-Trapped                         : False
-XMP Toolkit                     : pikepdf
+XMP Toolkit                     : XMP toolkit 2.9.1-13, framework 1.6

How did you download and install the software?

Docker container

OCRmyPDF version

16.0.2

Relevant log output

No response


jbarlow83 · 2023-12-28T22:56:15Z

When run with higher verbosity (-v 1) and --output-type pdfa, ocrmypdf logs the following:

Some input metadata could not be copied because it is not permitted in PDF/A. You may wish to examine the output PDF's XMP metadata.                                                                                                                                                                      _metadata.py:62
The following metadata fields were not copied: {'{http://purl.org/dc/elements/1.1/}created', '{http://purl.org/dc/elements/1.1/}contributor', '{http://ns.adobe.com/xap/1.0/}MetadataDate', '{http://purl.org/dc/elements/1.1/}subject'}                                                                  _metadata.py:67
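
The field names in this log are written as {namespace}localname. As a reading aid, here is a minimal sketch that maps them back to the familiar prefixed form. The helper name clark_to_prefixed and the KNOWN_PREFIXES table are illustrative, not part of OCRmyPDF or pikepdf; the prefixes are just the conventional ones for the namespaces that appear in this thread.

```python
import re

# Conventional prefixes for the namespaces seen in this thread (an assumption
# for display purposes only; extend the table as needed).
KNOWN_PREFIXES = {
    "http://purl.org/dc/elements/1.1/": "dc",
    "http://ns.adobe.com/xap/1.0/": "xmp",
    "http://ns.adobe.com/pdf/1.3/": "pdf",
}

_CLARK = re.compile(r"^\{(?P<ns>[^}]*)\}(?P<local>.+)$")


def clark_to_prefixed(name: str) -> str:
    """Turn '{namespace}local' into 'prefix:local' when the namespace is known."""
    match = _CLARK.match(name)
    if match is None:
        return name  # not in {namespace}local form, leave untouched
    prefix = KNOWN_PREFIXES.get(match["ns"])
    if prefix is None:
        return name  # unknown namespace, leave untouched
    return f"{prefix}:{match['local']}"


# The set reported in the log above.
dropped = {
    "{http://purl.org/dc/elements/1.1/}created",
    "{http://purl.org/dc/elements/1.1/}contributor",
    "{http://ns.adobe.com/xap/1.0/}MetadataDate",
    "{http://purl.org/dc/elements/1.1/}subject",
}

for field in sorted(clark_to_prefixed(f) for f in dropped):
    print(field)
```

For the log above this lists dc:contributor, dc:created, dc:subject and xmp:MetadataDate.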

It looks like you used pikepdf to set dc:contributor, and set it to a "singleton" text string. pikepdf does not block you from setting metadata to a type that is not consistent with the XML schema, unfortunately. exiftool displays it even though it's the wrong type as a best-effort fallback I suppose.

Depending on what you're doing, you could also use libexempi3 (python-xmp-toolkit), which is a more comprehensive implementation of the XMP spec but also very difficult to use, in my experience. (When it comes down to it, XMP is a ridiculously overengineered spec, so there's only so much one can do to wrangle its complexity into a clean interface.) There are some complex XMP data structures that pikepdf cannot generate.

dc:contributor's type is rdf:Bag, that is, an unordered list/set: there are potentially multiple contributors to a work and no priority is assumed. If you assign it using a Python set, pikepdf will generate an rdf:Bag, which is the correct type for this field.

In [5]: with p.open_metadata() as m:
   ...:     del m['dc:contributor']
   ...:     m['dc:contributor'] = {'Contributor One', 'Contributor Two'}
   ...: 

In [6]: p.save('issuepdf/1220.fixed.pdf')

It's actually Ghostscript that silently strips out incorrect metadata when it is run. Then OCRmyPDF reports that some metadata was missing.

Using the procedure above, you can determine the appropriate types for the other metadata fields of interest and fix them.
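
To check which fields are currently stored with which type, it can help to look at the raw XMP packet. The following is a rough sketch using only the standard library; the function name container_kinds and the SAMPLE packet are made up for illustration. It reports, for each property, whether it is an rdf:Bag, rdf:Seq, rdf:Alt, or a simple value. It assumes you already have the packet as text from whatever tool you use to dump it, and it does not handle nested structures.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC = "http://purl.org/dc/elements/1.1/"
CONTAINERS = {f"{{{RDF}}}{name}" for name in ("Bag", "Seq", "Alt")}


def container_kinds(xmp_packet: str) -> dict[str, str]:
    """Map each property ('{namespace}local') to 'Bag', 'Seq', 'Alt' or 'simple'."""
    root = ET.fromstring(xmp_packet)
    kinds: dict[str, str] = {}
    for desc in root.iter(f"{{{RDF}}}Description"):
        # Properties written as XML attributes can only hold simple values.
        for attr in desc.attrib:
            if not attr.startswith(f"{{{RDF}}}"):  # skip rdf:about and friends
                kinds[attr] = "simple"
        # Properties written as child elements may wrap an RDF container.
        for prop in desc:
            kind = "simple"
            for child in prop:
                if child.tag in CONTAINERS:
                    kind = child.tag.split("}", 1)[1]
            kinds[prop.tag] = kind
    return kinds


# Hypothetical packet: contributor stored as a Bag, subject as a plain string
# (the "singleton" situation described above).
SAMPLE = """<x:xmpmeta xmlns:x="adobe:ns:meta/">
 <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
          xmlns:dc="http://purl.org/dc/elements/1.1/">
  <rdf:Description rdf:about="">
   <dc:contributor><rdf:Bag><rdf:li>Test Contributor</rdf:li></rdf:Bag></dc:contributor>
   <dc:subject>Subject</dc:subject>
  </rdf:Description>
 </rdf:RDF>
</x:xmpmeta>"""

for name, kind in container_kinds(SAMPLE).items():
    print(name, "->", kind)
```

Any field that shows up as simple but should be a container is a candidate for reassignment with a set or list, as in the dc:contributor example.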

Since OCRmyPDF warns about removal of metadata, there's nothing to fix in its codebase. I could see adding an enhancement to pikepdf to warn about assigning wrong types for the most important metadata fields (Dublin Core, mainly).

jkorinth · 2024-01-17T21:16:27Z

Thanks a lot for the detailed answer! And yeah, I agree, XMP seems to be one of these overengineered XML specs. 😄

jkorinth · 2024-01-18T06:40:29Z

Just tried it again with this pikepdf snippet to create the metadata:

import pikepdf
import sys
from datetime import datetime
from pikepdf.models.metadata import encode_pdf_date

d = encode_pdf_date(datetime(year=2023, month=12, day=25))
pdf = pikepdf.open(sys.argv[1])
with pdf.open_metadata() as meta:
    meta['dc:contributor'] = { "Test Contributor" }
    meta['dc:title'] = "Title"
    meta['dc:created'] = d

pdf.save(sys.argv[2])

The metadata generated by pikepdf looks ok:

<x:xmpmeta xmlns:x="adobe:ns:meta/" x:xmptk="pikepdf">
 <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
 <rdf:Description rdf:about=""><dc:contributor xmlns:dc="http://purl.org/dc/elements/1.1/"><rdf:Bag><rdf:li>Test Contributor</rdf:li></rdf:Bag></dc:contributor></rdf:Description><rdf:Description rdf:about=""><dc:title xmlns:dc="http://purl.org/dc/elements/1.1/"><rdf:Alt><rdf:li xml:lang="x-default">Title</rdf:li></rdf:Alt></dc:title></rdf:Description><rdf:Description xmlns:dc="http://purl.org/dc/elements/1.1/" rdf:about="" dc:created="D:20231225000000"/><rdf:Description xmlns:xmp="http://ns.adobe.com/xap/1.0/" rdf:about="" xmp:MetadataDate="2024-01-18T06:31:13.412391+00:00"/><rdf:Description xmlns:pdf="http://ns.adobe.com/pdf/1.3/" rdf:about="" pdf:Producer="pikepdf 8.10.1"/></rdf:RDF>
</x:xmpmeta>

Still, the dc:contributor and dc:created get dropped:

The following metadata fields were not copied:                                             _metadata.py:67
{'{http://purl.org/dc/elements/1.1/}contributor',                                                         
'{http://purl.org/dc/elements/1.1/}created', '{http://ns.adobe.com/xap/1.0/}MetadataDate'}

I'd also rather use dc:subject instead of dc:title, but it also gets dropped. 😦

jkorinth added the bug label Dec 27, 2023

jkorinth assigned jbarlow83 Dec 27, 2023

jbarlow83 closed this as completed Dec 28, 2023

jbarlow83 mentioned this issue Jan 18, 2024

Add type checking for setting XMP metadata pikepdf/pikepdf#555

Open

jbarlow83 reopened this Jan 29, 2024
