1.24.2/1.24.3: spurious characters introduced when using subset_fonts and insert_pdf #3494

cbm755 · 2024-05-16T23:14:16Z

Description of the bug

Maybe a duplicate of, or at least related to, #3470. When I use insert_pdf to copy a page from one PDF to a new document and then call subset_fonts on the new document, I get spurious letters. In my example below, it's an "E", but we've also seen "M".

How to reproduce the bug

MWE:

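import fitz

# Source document containing the page to copy
doc_in = fitz.open("version1.pdf")

# New, empty target document
d = fitz.open()

# Copy only the first page; start_at=-1 appends it at the end of d
d.insert_pdf(
    doc_in,
    from_page=0,
    to_page=0,
    start_at=-1,
)

# The spurious characters appear after this call
d.subset_fonts()

d.save("output.pdf")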

version1.pdf is: version1.pdf
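
One possible way to check for the problem without eyeballing the output is to compare the text extracted from the source page with the text extracted from the copied page. The helper below is a minimal sketch; `extra_characters` is a hypothetical name, and it assumes the spurious glyphs actually show up in text extraction, which this report does not confirm.

```python
from collections import Counter


def extra_characters(before: str, after: str) -> Counter:
    """Return characters that occur more often in `after` than in `before`.

    Whitespace is ignored because text extraction may reflow lines.
    (Hypothetical helper, not part of PyMuPDF.)
    """
    count_before = Counter(ch for ch in before if not ch.isspace())
    count_after = Counter(ch for ch in after if not ch.isspace())
    # Counter subtraction keeps only the positive differences.
    return count_after - count_before
```

With the MWE above, the two strings could come from `doc_in[0].get_text()` and `d[0].get_text()` after the `subset_fonts()` call; a non-empty result would point at added characters.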

PyMuPDF version

1.24.3

Operating system

Linux

Python version

3.12


cbm755 · 2024-05-16T23:15:31Z

I can also reproduce this with 1.24.2 but NOT with 1.24.1.

cbm755 · 2024-05-16T23:16:09Z

Downstream link: https://gitlab.com/plom/plom/-/issues/3374

cbm755 · 2024-05-16T23:19:26Z

In this image, the spurious letters can take fonts that come from a different page of the document.

JorjMcKie · 2024-05-16T23:25:33Z

This is indeed related to #3470 and will thus be solved with the next PyMuPDF version containing this MuPDF fix.

JorjMcKie · 2024-05-17T00:59:50Z

Just in case you are not aware:
Creating font subsets has been implemented in MuPDF. While it is still officially an experimental feature, we at PyMuPDF are very interested in replacing the current solution, which is pure Python and creates an external dependency on another package (fontTools).
So we have a vital interest in deprecating this solution in the short to medium term.
There are also some secondary advantages:

The MuPDF solution is at least 15 times faster and it covers a larger set of font types compared to fontTools (which is restricted to TTF and OTF formats).
Being a MuPDF solution, not only MuPDF itself but all its language bindings will immediately benefit from it. These are currently Java, JavaScript and the new C# bindings, MuPDF.net, which is on the verge of being published this week or early next week.

Given this background, we will not continue fixing any issues around the fontTools-based solution.

cbm755 · 2024-05-17T14:08:15Z

Thanks, sounds very promising! Is there a timeline or issue I can follow?

In the meantime, perhaps I'll try to scale back our use of subset_fonts() to only those cases where we used PyMuPDF to add non-ASCII text.
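
The scaling back described above could look roughly like the sketch below: track the strings that were inserted and only subset when one of them contains a non-ASCII character. `needs_font_subsetting` is a hypothetical helper name, and the sketch assumes the caller keeps a list of the inserted strings.

```python
def needs_font_subsetting(inserted_texts) -> bool:
    """Return True if any inserted string contains a non-ASCII character.

    Hypothetical helper: `inserted_texts` is whatever iterable of strings
    the caller recorded while adding text to the document.
    """
    return any(not text.isascii() for text in inserted_texts)
```

A caller would then run `doc.subset_fonts()` only when `needs_font_subsetting(...)` returns True, and skip the call for documents that received ASCII-only text.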

JorjMcKie · 2024-05-17T14:28:20Z

We have been testing this feature for a considerable time now. That new fix should actually be it.
Enough to remove the "experimental" label, I mean.
We will nonetheless keep providing the fallback option for some more time, of course.

For my own purposes, I am using the new version all the time.
Especially if you use Page.insert_htmlbox or other Story-based code, MuPDF is likely to pull in needed fonts all over the place. Here, subset fonts can be a life saver. Have a look at this (certainly extreme) example. Not using font subsets lets you save a 2 MB file, subsets reduce it to 80 KB.

MuPDF just recently has introduced rich text support for FreeText annotations (not yet supported in PyMuPDF). And the technique used is ... again the Story class!

Fixes Issue #3374, by falling back on the deprecated in-python fonttools based technique for doing subsetting. To be removed once the new MuPDF-based code is a little more mature, or at least once [1, 2] are fixed. [1] pymupdf/PyMuPDF#3470 [2] pymupdf/PyMuPDF#3494
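
A minimal sketch of such a fallback follows. It relies on the `fallback` keyword of `Document.subset_fonts`, which, as far as the PyMuPDF documentation describes it, selects the deprecated fontTools-based algorithm (so fontTools must be installed). The names `AFFECTED_VERSIONS`, `parse_version` and `subset_fonts_with_fallback` are hypothetical, and the version set only lists the releases reported as affected in this thread.

```python
# Versions reported as affected in this thread; 1.24.1 was reported as fine.
AFFECTED_VERSIONS = {(1, 24, 2), (1, 24, 3)}


def parse_version(version: str) -> tuple:
    """Turn a plain 'X.Y.Z' version string into a tuple of ints."""
    return tuple(int(part) for part in version.split(".")[:3])


def subset_fonts_with_fallback(doc, version: str) -> bool:
    """Subset fonts, using the fontTools-based path on affected versions.

    `doc` is expected to behave like a PyMuPDF Document; `version` is the
    installed PyMuPDF version string. Returns whether the fallback was used.
    """
    use_fallback = parse_version(version) in AFFECTED_VERSIONS
    doc.subset_fonts(fallback=use_fallback)
    return use_fallback
```

In the MWE this would replace the bare `d.subset_fonts()` call, with the version string taken from the installed package (for example `fitz.VersionBind`). Pre-release version strings are not handled by this simple parser.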

cbm755 changed the title ~~1.24.3: spurious "E" introduced when using subset_fonts and insert_pdf~~ 1.24.2/1.24.3: spurious characters introduced when using subset_fonts and insert_pdf May 16, 2024

JorjMcKie added the fix developed release schedule to be determined label May 16, 2024
