UnicodeEncodeError: 'charmap' codec can't encode character #166

Open
swahareddy opened this issue Aug 23, 2020 · 2 comments

@swahareddy

This was my command:

    python photon.py -u "https://en.wikipedia.org/wiki/Tom_Crean_(explorer)" -l 2

and this was the output:

    ____  __          __
     / __ \/ /_  ____  / /_____  ____
    / /_/ / __ \/ __ \/ __/ __ \/ __ \
   / ____/ / / / /_/ / /_/ /_/ / / / /
  /_/   /_/ /_/\____/\__/\____/_/ /_/ v1.3.2

 Level 1: 1 URLs
 Progress: 1/1
 Level 2: 478 URLs
 Progress: 478/478
 Crawling 1 JavaScript files
 Progress: 1/1
Traceback (most recent call last):
  File "photon.py", line 385, in <module>
    writer(datasets, dataset_names, output_dir)
  File "C:\Users\Tejaswa\Documents\GitHub\Photon\core\utils.py", line 85, in writer
    out_file.write(str(joined.encode('utf-8').decode('utf-8')))
  File "C:\Python38\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u0142' in position 17758: character maps to <undefined>

I (think I) added a mapping in lib\encodings\cp1252.py by doing:

    .
    .
    '\xff'     #  0xFF -> LATIN SMALL LETTER Y WITH DIAERESIS
    '\u0142'     #  0xFF -> LATIN SMALL LETTER L WITH DIAERESIS
    )

    ### Encoding table
    encoding_table=codecs.charmap_build(decoding_table)

But I doubt this is correct (the hex values max out at \xff, too).

Is there any parameter to ignore such encoding problems that I can specify with photon itself? Or some underlying file to edit?

Thanks
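
For context, here is a minimal Python sketch of why the traceback above occurs. It assumes the output file was opened without an explicit encoding, so Windows fell back to cp1252, as the cp1252.py frame in the traceback suggests. The sample string and the can_encode helper are illustrative and are not part of Photon.

```python
# U+0142 is LATIN SMALL LETTER L WITH STROKE. cp1252 is a single-byte
# code page (at most 256 slots) and has no slot for this character.
text = "Micha\u0142"  # illustrative sample, not taken from the crawl


def can_encode(s, encoding):
    """Return True if s can be encoded with the given codec."""
    try:
        s.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


print(can_encode(text, "cp1252"))  # False
print(can_encode(text, "utf-8"))   # True

# The encode/decode round trip on line 85 of utils.py returns the same
# string, so it cannot prevent the error raised later by out_file.write().
assert text.encode("utf-8").decode("utf-8") == text
```

Because the failure happens when the file object encodes the text, the usual remedies are to change the encoding of the file or to pick an error handler. Editing cp1252.py is not one of them.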

@ege-del

ege-del commented Mar 7, 2021

Check my pull request #178. This option might fix your problem:

    --encoding-error "ignore"

Also check the other possible values:
https://docs.python.org/3/library/stdtypes.html#str.encode
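
As a rough illustration of the values documented at that link, the sketch below shows how the standard error handlers of str.encode treat the character from the traceback. It uses plain Python only. How the option from pull request #178 passes the value into Photon is not shown in this thread, and the sample string is illustrative.

```python
text = "Micha\u0142"  # illustrative sample containing U+0142

# 'strict' is the default handler and raises UnicodeEncodeError.
handlers = ("ignore", "replace", "backslashreplace", "xmlcharrefreplace")

# Encode the same text to cp1252 with each non-strict handler.
results = {h: text.encode("cp1252", errors=h) for h in handlers}

for name, encoded in results.items():
    print(name, encoded)
# ignore b'Micha'
# replace b'Micha?'
# backslashreplace b'Micha\\u0142'
# xmlcharrefreplace b'Micha&#322;'
```

Note that "ignore" silently drops the character, so crawled URLs or text containing it would be written incomplete. The other handlers keep a visible trace of what was lost.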

@DaveCrim

I encountered a similar issue and tracked it to the writer function in utils.py, line 83. I fixed it like this:

    with open(filepath, 'w+', encoding='utf-8') as out_file:

I'm not smart enough to do pull requests or anything...
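
Below is a self-contained sketch of that fix for anyone who wants to try it outside Photon. The write_lines helper, the file name, and the sample text are hypothetical. They stand in for the writer function in core/utils.py, whose full body is not shown in this thread.

```python
import os
import tempfile


def write_lines(filepath, lines):
    """Write lines to filepath as UTF-8, whatever the OS default encoding is."""
    joined = "\n".join(lines)
    # Passing encoding explicitly avoids the cp1252 fallback on Windows.
    with open(filepath, "w+", encoding="utf-8") as out_file:
        out_file.write(joined)


# Hypothetical output location and sample data containing U+0142.
out_path = os.path.join(tempfile.mkdtemp(), "links.txt")
sample = ["first line", "Micha\u0142"]
write_lines(out_path, sample)

# Read the file back with the same encoding to confirm the round trip.
with open(out_path, encoding="utf-8") as in_file:
    content = in_file.read()

assert content == "first line\nMicha\u0142"
```

An alternative that needs no file edits, assuming Python 3.7 or later, is to set the environment variable PYTHONUTF8=1 before running the script. This enables Python's UTF-8 mode, in which open() defaults to UTF-8 instead of the locale encoding.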
