UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 5: invalid continuation byte #577

amargerandhw · 2024-04-24T12:24:21Z

Calling annotation.AP.N.keys() on radio buttons whose options contain accented characters such as é, è, ê, etc. throws the following error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 5: invalid continuation byte

Should the options in src/core/object.cpp:610 :

.def("keys",
            [](QPDFObjectHandle &h) {
                if (h.isStream())
                    return h.getDict().getKeys();
                return h.getKeys();
            })

be encoded before they are returned, or is there a way to decode them from Python when calling the function, so that I can get all the available options?

As an example on a radio button I have the following options :

/Célibataire
/Marié(e)
/Off
/Pacsé(e)
/Union Libre

Currently I can do the following:

try:
    print(str(kid.AP.N.keys()))
except UnicodeDecodeError as e:
    print(e)

And I have these logs :

'utf-8' codec can't decode byte 0xe9 in position 5: invalid continuation byte
{'/Off', '/Union Libre'}
'utf-8' codec can't decode byte 0xe9 in position 5: invalid continuation byte
'utf-8' codec can't decode byte 0xe9 in position 2: invalid continuation byte

amargerandhw · 2024-04-24T15:59:57Z

I found a solution to fetch the options :

            states = set()
            try:
                for key in kid.AP.N:
                    states.update([key])
            except UnicodeDecodeError as e:
                states.update([e.object.decode('iso-8859-1')])
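
The fallback above relies on the `object` attribute of UnicodeDecodeError, which holds the bytes that failed to decode. A minimal standalone sketch of that mechanism follows. It uses a hard-coded byte string as a stand-in for the name bytes, the helper name `recover_name` is hypothetical, and ISO-8859-1 is only an assumption about how the file was written:

```python
def recover_name(raw: bytes) -> str:
    """Decode name bytes as UTF-8, falling back to ISO-8859-1 (assumed)."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError as e:
        # e.object is the original byte string; e.start is the offset of
        # the first byte that could not be decoded.
        return e.object.decode("iso-8859-1")


# Stand-in for the undecodable key; 0xe9 is 'é' in ISO-8859-1.
print(recover_name(b"/C\xe9libataire"))
print(recover_name(b"/Off"))
```

Note that an exception raised while iterating ends the for loop, so the try/except around the loop recovers at most one undecodable key per dictionary.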

But I have another question: I would like to set the radio button to a value containing accented characters.

I tried this way :

            values = {}
            try:
                for key in kid.AP.N:
                    values[key] = key
            except UnicodeDecodeError as e:
                values[e.object.decode('iso-8859-1')] = e.object
            if str(value) in values.keys():
                value = Name(str(values[str(value)], encoding='pdfdoc', errors='replace'))
                kid.AS=value
            elif '/AS' in kid:
                kid.AS=Name.Off

but it does not work

I also tried to use the "original" value :

Name(values[str(value)])

That does not work either, because the value is bytes.

Name(str(values[str(value)], encoding='pdfdoc'))

That also does not work, because to Python the decoded name does not start with /.

Do you have any idea how to do that?

amargerandhw · 2024-04-25T12:52:28Z

@jbarlow83

When I comment out the part that refuses bytes like this

https://github.com/pikepdf/pikepdf/blob/main/src/pikepdf/objects.py#L101

I can successfully set the radio button value using the "original" value :

values = {}
try:
    for key in kid.AP.N:
        values[key] = key
except UnicodeDecodeError as e:
    values[e.object.decode('iso-8859-1')] = e.object
if str(value) in values.keys():
    value = Name(values[str(value)])
    kid.AS = value
elif '/AS' in kid:
    kid.AS=Name.Off

Is there a reason why bytes are not allowed? If there is, what is the supported method to update a field with a value containing accented characters?

jbarlow83 · 2024-04-26T06:50:06Z

The input PDF is malformed in a way that pikepdf cannot correct for.

Analysis

In PDF, Dictionary objects are key value maps like Python dict, except that the key is restricted in that it must be a PDF Name object.

A Name object is denoted by a leading / and what follows must be encoded in a specific way. A Name cannot store arbitrary bytes; specifically it cannot store the null character.

The process of encoding a Name is:

1. convert str to utf-8 bytes
2. replace all b'#' with b'#23'
3. replace all b'/' with b'#2f'
4. replace every byte < 0x21 or > 0x7e with b'#' followed by its two-digit hex code, e.g. b'#20' or b'#7f'

(After this transformation, the encoded form of a Name contains only bytes 0x21 to 0x7e.)
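
The steps above can be sketched in a few lines of Python. This is only an illustration of the steps as listed, not pikepdf's or QPDF's actual implementation. The function name `encode_pdf_name` is hypothetical, the leading / is left out, and uppercase hex digits are used for every escape:

```python
def encode_pdf_name(text: str) -> bytes:
    """Encode the part of a PDF Name after the leading '/' (sketch only)."""
    out = bytearray()
    for byte in text.encode("utf-8"):
        # '#' (0x23) and '/' (0x2f) are escaped explicitly; everything
        # outside the range 0x21..0x7e is escaped as well.
        if byte in (0x23, 0x2F) or byte < 0x21 or byte > 0x7E:
            out += b"#%02X" % byte
        else:
            out.append(byte)
    return bytes(out)


print(encode_pdf_name("Célibataire"))
print(encode_pdf_name("Union Libre"))
```

Escaping byte by byte in a single pass avoids re-escaping the # characters that the escapes themselves introduce.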

So the expected encoding of Célibataire as seen in a hex editor should be:

>>> import re
>>> re.sub(
...     br'[^\x21-\x7e]',
...     lambda m: b'#' + hex(ord(m.group(0))).upper()[2:].encode(),
...     'Célibataire'.encode('utf-8'),
... )  # this regex does not handle all cases of encoding, I just developed it in exploring the issue
b'C#C3#A9libataire'

The error message suggests your file has one of these two encoding errors:

b'C#E9libataire'   # valid name encoding, but not valid utf-8
or
b'C\xE9libataire'   # invalid name
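
To illustrate why both forms end in the same UnicodeDecodeError, here is a hedged sketch that undoes the #XX escapes and then tries to decode the result. The helper names `unescape_name` and `decode_name_lenient` are hypothetical, and the ISO-8859-1 fallback is an assumption about what the producer of the file intended, not something the PDF format guarantees:

```python
import re


def unescape_name(encoded: bytes) -> bytes:
    """Replace each #XX escape with the single byte it stands for."""
    return re.sub(
        rb"#([0-9A-Fa-f]{2})",
        lambda m: bytes([int(m.group(1), 16)]),
        encoded,
    )


def decode_name_lenient(encoded: bytes) -> str:
    """Decode as UTF-8; fall back to ISO-8859-1 (assumed) on failure."""
    raw = unescape_name(encoded)
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode("iso-8859-1")


# Both malformed variants reduce to the same raw bytes.
print(unescape_name(b"C#E9libataire"))
print(decode_name_lenient(b"C\xe9libataire"))
print(decode_name_lenient(b"C#C3#A9libataire"))
```

After unescaping, both variants contain the lone byte 0xe9, which is not a complete UTF-8 sequence, so a strict UTF-8 decode fails on either one.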

Big picture

AP.N does not contain the user-visible text, at least not from what I can see in the PDF manual. The names in this dictionary are "slugs" used internally by the PDF. The printed text should be stored at AP.N[Name.button1].T, which is allowed to be an arbitrary Unicode string. If the dictionary were created correctly, the issue would not occur.

I can see this is frustrating, but it looks like the input is a malformed file that needs forensic repair.

Design note

For better or for worse, pikepdf automatically converts str to pikepdf.Name when you interact with pikepdf.Dictionary, which means for malformed input files you get difficult exceptions like in this issue.

In retrospect this was probably a design mistake on my part; in a way it is analogous to automatic str/bytes conversion in Python 2. Initially I wanted to make porting pypdf2 code to pikepdf as easy as possible, and pypdf2 does the same. At some point I will probably deprecate this and force pikepdf.Dictionary to function strictly as a MutableMapping[pikepdf.Name, pikepdf.Object]. That would probably make some issues like this more obvious to the library user.

jbarlow83 closed this as completed May 17, 2024
