UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position... while reading Microsoft Access .mdb file #39

Open
ryszard314159 opened this issue Oct 16, 2020 · 1 comment

ryszard314159 commented Oct 16, 2020

I am trying to read a Microsoft Access (.mdb) file (created by ChemFinder on Windows), but I get a
UnicodeDecodeError: 'utf-8' codec... error even though I pass the encoding reported by
meza.io.get_encoding(), which is TIS-620.

I would appreciate any suggestions...

Details below:

import meza

fn = 'test.mdb'
enc = meza.io.get_encoding(fn)  # detect the file's encoding
print(enc)  # TIS-620
records = meza.io.read_mdb(fn, encoding=enc)
z = list(records)  # raises the UnicodeDecodeError shown below
~/anaconda3/lib/python3.8/site-packages/meza/io.py in read_mdb(filepath, table, **kwargs)
    636     # https://stackoverflow.com/a/17698359/408556
    637     with Popen(['mdb-export', filepath, table], **pkwargs).stdout as pipe:
--> 638         first_line = StringIO(str(pipe.readline()))
    639         names = next(csv.reader(first_line, **kwargs))
    640         uscored = ft.underscorify(names) if sanitize else names

~/anaconda3/lib/python3.8/codecs.py in decode(self, input, final)
    320         # decode input (taking the buffer into account)
    321         data = self.buffer + input
--> 322         (result, consumed) = self._buffer_decode(data, self.errors, final)
    323         # keep undecoded input until the next call
    324         self.buffer = data[consumed:]

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 70: invalid start byte
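
For readers following along, here is a minimal sketch of why this message appears. It assumes the pipe is being decoded as UTF-8 (which the codecs.py frame in the traceback suggests) and uses made-up bytes, not data from the actual .mdb file. In UTF-8, 0x80 is a continuation byte and can never begin a character, so a strict decode fails at that offset.

```python
# Hypothetical bytes for illustration only; not taken from the real file.
raw = b"name,\x80value\n"

try:
    raw.decode("utf-8")  # strict decoding, as a text-mode pipe would do
except UnicodeDecodeError as err:
    message = str(err)    # keep the details; `err` is cleared after the block
    position = err.start  # offset of the first undecodable byte
    print(message)
```

The same error class is raised whenever the bytes coming out of the subprocess are not valid in the codec used to read them, regardless of which encoding was detected for the file on disk.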

I tried passing the encoding through to pkwargs, which is handed to Popen (in meza/io.py):

    pkwargs = {'stdout': PIPE, 'bufsize': 1, 'universal_newlines': True}
--> pkwargs['encoding'] = kwargs.get('encoding', None)

    # https://stackoverflow.com/a/2813530/408556
    # https://stackoverflow.com/a/17698359/408556
    with Popen(['mdb-export', filepath, table], **pkwargs).stdout as pipe:

This does not resolve the issue. With this modification I get:

UnicodeDecodeError: 'charmap' codec can't decode byte 0xff in position 327: character maps to <undefined>
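
One possible way to narrow this down, sketched under assumptions: read the subprocess output as raw bytes and decode it explicitly with errors="replace", so undecodable bytes show up as U+FFFD instead of aborting the read. This is not meza's implementation. The child command below is a stand-in for mdb-export that emits made-up bytes, and read_lines is a hypothetical helper name.

```python
import subprocess
import sys

# Stand-in for `mdb-export`: a child process that writes bytes which are not
# valid UTF-8 (0x80) and are unassigned in Python's TIS-620 codec (0xff).
# The data is hypothetical and only serves to trigger the same failures.
child = [
    sys.executable,
    "-c",
    "import sys; sys.stdout.buffer.write(b'id,name\\n1,\\x80\\xff\\n')",
]


def read_lines(cmd, encoding, errors="replace"):
    """Run cmd, read its stdout as bytes, and decode each line explicitly."""
    with subprocess.Popen(cmd, stdout=subprocess.PIPE) as proc:
        return [line.decode(encoding, errors=errors) for line in proc.stdout]


lines = read_lines(child, "tis-620")
print(lines)
```

If replacement characters appear in many rows, that would suggest the detected encoding does not match what the export tool actually writes, which is a separate question from the encoding of the .mdb file itself.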

reubano commented Dec 23, 2021

Can you send me a file to test with?
