UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 5: invalid continuation byte #577

Closed
amargerandhw opened this issue Apr 24, 2024 · 3 comments


Calling annotation.AP.N.keys() on radio buttons whose options contain accented characters such as é, è, ê, etc. raises the following error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 5: invalid continuation byte

Should the keys returned by the binding at src/core/object.cpp:610:

.def("keys",
            [](QPDFObjectHandle &h) {
                if (h.isStream())
                    return h.getDict().getKeys();
                return h.getKeys();
            })

be encoded before being returned, or is there a way to encode them when calling the function from Python, so that I can get all the available options?

As an example, one radio button has the following options:

  • /Célibataire
  • /Marié(e)
  • /Off
  • /Pacsé(e)
  • /Union Libre

Currently I can do the following:

try:
    print(str(kid.AP.N.keys()))
except UnicodeDecodeError as e:
    print(e)

And I get these logs:

'utf-8' codec can't decode byte 0xe9 in position 5: invalid continuation byte
{'/Off', '/Union Libre'}
'utf-8' codec can't decode byte 0xe9 in position 5: invalid continuation byte
'utf-8' codec can't decode byte 0xe9 in position 2: invalid continuation byte

amargerandhw commented Apr 24, 2024

I found a workaround to fetch the options:

            states = set()
            try:
                for key in kid.AP.N:
                    states.update([key])
            except UnicodeDecodeError as e:
                states.update([e.object.decode('iso-8859-1')])
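
The workaround relies on UnicodeDecodeError.object holding the full byte string that failed to decode. A minimal sketch of that idea without pikepdf, assuming the raw names in the file are ISO-8859-1 bytes (raw_names and decode_with_fallback are hypothetical stand-ins, not pikepdf API); decoding each name on its own means one failure does not stop the others from being collected:

```python
def decode_with_fallback(raw: bytes) -> str:
    """Decode as UTF-8, falling back to ISO-8859-1 (assumed file encoding)."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError as exc:
        # exc.object is the complete byte string that could not be decoded.
        return exc.object.decode("iso-8859-1")


# Hypothetical raw key bytes, standing in for what the malformed PDF contains.
raw_names = [
    b"/C\xe9libataire",
    b"/Mari\xe9(e)",
    b"/Off",
    b"/Pacs\xe9(e)",
    b"/Union Libre",
]

states = {decode_with_fallback(name) for name in raw_names}
```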

But I have another question: I would like to set the radio button to a value containing accented characters.

I tried it this way:

            values = {}
            try:
                for key in kid.AP.N:
                    values[key] = key
            except UnicodeDecodeError as e:
                values[e.object.decode('iso-8859-1')] = e.object
            if str(value) in values.keys():
                value = Name(str(values[str(value)], encoding='pdfdoc', errors='replace'))
                kid.AS=value
            elif '/AS' in kid:
                kid.AS=Name.Off

but it does not work.

I also tried to use the "original" value:

Name(values[str(value)])

That does not work either, because the value is bytes.

Name(str(values[str(value)], encoding='pdfdoc'))

also does not work, because from Python's point of view the value does not start with /.

Do you have any ideas on how to do that?


amargerandhw commented Apr 25, 2024

@jbarlow83

When I comment out the part that rejects bytes, like this:

[Screenshot, 2024-04-25 14:48:56]

https://github.com/pikepdf/pikepdf/blob/main/src/pikepdf/objects.py#L101

I can successfully set the radio button value using the "original" value:

values = {}
try:
    for key in kid.AP.N:
        values[key] = key
except UnicodeDecodeError as e:
    values[e.object.decode('iso-8859-1')] = e.object
if str(value) in values.keys():
    value = Name(values[str(value)])
    kid.AS=value
elif '/AS' in kid:
    kid.AS=Name.Off

Is there a reason why bytes are not allowed? If there is, what is the supported method to update a field with a value containing accented characters?

jbarlow83 (Member) commented Apr 26, 2024

The input PDF is malformed in a way that pikepdf cannot correct for.

Analysis

In PDF, Dictionary objects are key-value maps like a Python dict, except that each key must be a PDF Name object.

A Name object begins with a / and what follows must be encoded in a specific way. A Name cannot store arbitrary bytes; specifically, it cannot store the null character.

The process of encoding a Name is:

  • convert str to utf-8 bytes
  • replace all b'#' with b'#23'
  • replace all b'/' with b'#2f'
  • replace each byte < 0x21 or > 0x7e with its two-digit hex escape, e.g. b'#20' or b'#7f'
    (After this transformation, the encoded form of a Name contains only bytes 0x21 to 0x7e.)

So the expected encoding of Célibataire as seen in a hex editor should be:

>>> import re
>>> re.sub(
...     br'[^\x21-\x7e]',
...     lambda m: b'#' + hex(ord(m.group(0))).upper()[2:].encode(),
...     'Célibataire'.encode('utf-8'),
... )  # this regex does not handle all cases of encoding; I just developed it while exploring the issue
b'C#C3#A9libataire'
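
The steps listed above can also be sketched as a single pass over the UTF-8 bytes. This is only an illustration of those four steps, not pikepdf's or qpdf's actual implementation; encode_pdf_name is a hypothetical helper, and it emits uppercase hex digits to match the output above:

```python
def encode_pdf_name(text: str) -> bytes:
    """Encode text as the body of a PDF Name (without the leading slash).

    Follows only the steps described above: UTF-8 encode, then escape
    '#', '/', and every byte outside 0x21..0x7e as '#' plus two hex digits.
    """
    out = bytearray()
    for byte in text.encode("utf-8"):
        if byte in b"#/" or not 0x21 <= byte <= 0x7E:
            out += b"#%02X" % byte  # zero-padded, so a tab becomes #09
        else:
            out.append(byte)
    return bytes(out)


encoded = encode_pdf_name("Célibataire")
```

Escaping byte by byte avoids re-escaping the '#' characters introduced by earlier replacements, which a sequence of separate replace calls would have to order carefully.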

The error message suggests your file has one of these two encoding errors:

b'C#E9libataire'   # valid name encoding, but not valid utf-8
or
b'C\xE9libataire'   # invalid name
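
Both candidates lead to the same failure once the #xx escapes are expanded. A rough sketch reproducing the reported error in plain Python, assuming the name is expanded to raw bytes and then decoded as UTF-8 with its leading slash (expand_name_escapes is a hypothetical helper, not a pikepdf function):

```python
import re


def expand_name_escapes(raw: bytes) -> bytes:
    """Replace each #xx escape with the single byte it denotes."""
    return re.sub(
        rb"#([0-9A-Fa-f]{2})",
        lambda m: bytes([int(m.group(1), 16)]),
        raw,
    )


candidates = [b"C#E9libataire", b"C\xE9libataire"]
expanded = [expand_name_escapes(c) for c in candidates]

errors = []
for raw in expanded:
    try:
        (b"/" + raw).decode("utf-8")
    except UnicodeDecodeError as exc:
        # Record where decoding failed and why.
        errors.append((exc.start, exc.reason))
```

With the leading slash included, the 0xe9 byte of Célibataire sits at position 2, which matches the last log line in the original report; Marié(e) and Pacsé(e) put it at position 5.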

Big picture

AP.N does not contain the user-visible text, at least not from what I can see in the PDF manual. The names in this dictionary are "slugs" used internally by the PDF. The printed text should be stored at AP.N[Name.button1].T, which is allowed to be an arbitrary Unicode string. If the dictionary had been created correctly, the issue would not occur.

I can see this is frustrating, but it looks like the input is a malformed file that needs forensic repair.

Design note

For better or for worse, pikepdf automatically converts str to pikepdf.Name when you interact with pikepdf.Dictionary, which means for malformed input files you get difficult exceptions like in this issue.

In retrospect this was probably a design mistake on my part; in a way it is analogous to automatic str/bytes conversion in Python 2. Initially I wanted to make porting pypdf2 code to pikepdf as easy as possible, and pypdf2 does the same. At some point I will probably deprecate this and force pikepdf.Dictionary to function strictly as a MutableMapping[pikepdf.Name, pikepdf.Object]. That would probably make issues like this one more obvious to the library user.
