UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 0: invalid start byte #12

liwzhi · 2018-01-22T18:26:09Z

Hi,

I am trying to load Chinese pretrained word2vec vectors:

word_vectors = KeyedVectors.load_word2vec_format(path, binary=True) # C binary format

It throws the error in the title.

wiwengweng · 2018-01-26T05:48:12Z

Of course, the vectors should be trained using the proper codec. It seems the model was trained in a different encoding environment. Can you check that?

lxw0109 · 2018-01-30T08:44:15Z

I have come across the same error. Can anybody help? Thank you ~

galuhsahid · 2018-01-30T14:53:53Z

I came across the same error as well. I changed:

word_vectors = KeyedVectors.load_word2vec_format(path, binary=True)

into

word_vectors = KeyedVectors.load(path)

It turns out that load_word2vec_format is used when we're trying to load word vectors that are trained using the original implementation of word2vec (in C). Since these pre-trained word vectors are trained using Python (gensim), we can use load instead.
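
A quick way to tell which loader a file needs is to look at its first bytes. The sketch below is only a heuristic, and the function name guess_vector_file_format is made up for illustration. It relies on three facts: a pickle written with protocol 2 or later starts with byte 0x80 (the byte in this issue's title), a gzip file starts with 0x1f 0x8b, and the word2vec C formats start with a plain-text header line holding the vocabulary size and the vector dimension.

```python
def guess_vector_file_format(path):
    """Guess how a word-vector file was written by inspecting its first bytes.

    Returns one of "gzip", "pickle", "word2vec" or "unknown".
    This is a heuristic sketch, not part of gensim.
    """
    with open(path, "rb") as f:
        head = f.read(64)

    # gzip magic number: the file must be decompressed before anything else.
    if head[:2] == b"\x1f\x8b":
        return "gzip"

    # Pickle protocol 2+ starts with 0x80; files written by gensim's save()
    # are pickles, so KeyedVectors.load / Word2Vec.load is the likely loader.
    if head[:1] == b"\x80":
        return "pickle"

    # word2vec C format (text or binary) begins with "<vocab_size> <dim>\n",
    # which is what load_word2vec_format expects.
    first_line = head.split(b"\n", 1)[0].split()
    if len(first_line) == 2 and all(tok.isdigit() for tok in first_line):
        return "word2vec"

    return "unknown"
```

If this returns "pickle", load_word2vec_format will fail with the UnicodeDecodeError above, because it tries to decode the pickle header as the text header line.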

lxw0109 · 2018-01-31T02:31:08Z

@galuhsahid Thank you so much, it works now. : )

anavaldi · 2018-03-09T10:55:29Z

I have tried to read the files as you pointed, but I got the next error:

 File "C:\ProgramData\Anaconda2\lib\site-packages\gensim\models\base_any2vec.py", line 380, in syn1neg
    self.trainables.syn1neg = value

AttributeError: 'Word2Vec' object has no attribute 'trainables'

:(

Priya22 · 2018-03-18T23:12:41Z

Same error as @anavaldi . Any solution?

anavaldi · 2018-03-19T11:39:10Z

I solve this error by executing on my own word embeddings with the .sh file.

hinanmu · 2018-04-25T11:53:23Z

I have come across the same error. I changed gensim.models.KeyedVectors.load_word2vec_format()
into gensim.models.Word2Vec.load(). Then it works.

changhyub · 2018-04-26T07:11:32Z

@hinanmu It works, thanks.

gilgtc · 2018-05-22T23:24:07Z

@anavaldi

I solve this error by executing on my own word embeddings with the .sh file.

What do you mean?

caitaozhan · 2019-01-17T08:06:04Z

I have tried to read the files as you pointed, but I got the next error:

 File "C:\ProgramData\Anaconda2\lib\site-packages\gensim\models\base_any2vec.py", line 380, in syn1neg
    self.trainables.syn1neg = value

AttributeError: 'Word2Vec' object has no attribute 'trainables'

:(

I solved this issue by downgrading my gensim version from 3.6 to 3.0.

kusumlata123 · 2019-06-18T05:44:14Z

UnpicklingError Traceback (most recent call last)
in ()
3 #model=gensim.models.Word2Vec.load_word2vec_format('model_file', binary=True) Word2Vec.load_word2vec_format
4 #model_bin = KeyedVectors.load_word2vec_format(model_file,binary=True)
----> 5 model=gensim.models.Word2Vec.load(model_file)
6 #model=gensim.Word2Vec.load_word2vec_format('model_file',binary=True) word_vectors = KeyedVectors.load(path)
Why is it giving this error?

Koteswara-ML · 2019-08-06T10:22:28Z

@kusumlata123 I am getting that UnpicklingError too.

bright1993ff66 · 2019-09-16T06:00:31Z

I am also getting the unpickling error...
Any ideas? My code is:

chinese_model = gensim.models.Word2Vec.load(os.path.join(desktop, 'cc.zh.300.bin.gz'))

bright1993ff66 · 2019-09-16T06:15:22Z

I also tried to save the text file and load it via the function provided by the fastText official site. I first changed the file extension from gz to txt and used the following function:

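import io
import os

# Loader from the fastText site; it expects an uncompressed UTF-8 text .vec file.
def load_vectors(fname):
    fin = io.open(fname, 'r', encoding='utf-8', newline='\n', errors='ignore')
    n, d = map(int, fin.readline().split())
    data = {}
    for line in fin:
        tokens = line.rstrip().split(' ')
        data[tokens[0]] = list(map(float, tokens[1:]))  # map() is lazy in Python 3, so materialize it
    return data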

model = load_vectors(os.path.join(desktop, 'cc.zh.300.vec.txt'))

However, I got the following errors:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-4-d67f52bde947> in <module>
----> 1 model = load_vectors(os.path.join(desktop, 'cc.zh.300.vec.txt'))

<ipython-input-3-0f69b5ce62b8> in load_vectors(fname)
      1 def load_vectors(fname):
      2     fin = io.open(fname, 'r', encoding='utf-8', newline='\n', errors='ignore')
----> 3     n, d = map(int, fin.readline().split())
      4     data = {}
      5     for line in fin:

ValueError: invalid literal for int() with base 10: '\x08\x08p[\x00\x03cc.zh.300.vec\x00\\ͮfMr7?W3ۀ0|Szдl\x14I\x132'
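
The bytes in that ValueError, including the embedded name cc.zh.300.vec, look like a gzip header, which suggests that renaming the file from gz to txt left it compressed. A minimal sketch of one way around this, assuming the download really is a gzip-compressed text .vec file: open it with gzip.open in text mode instead of io.open. The name load_vectors_gz is made up for this example.

```python
import gzip


def load_vectors_gz(fname):
    """Read a gzip-compressed .vec text file.

    Returns (n, d, data): the vocabulary size and dimension from the header
    line, and a dict mapping each word to its vector as a list of floats.
    """
    # "rt" makes gzip.open decompress and decode to text on the fly.
    with gzip.open(fname, "rt", encoding="utf-8", newline="\n", errors="ignore") as fin:
        n, d = map(int, fin.readline().split())
        data = {}
        for line in fin:
            tokens = line.rstrip().split(" ")
            data[tokens[0]] = list(map(float, tokens[1:]))
    return n, d, data
```

Point it at the original .gz download rather than the renamed copy. The full Chinese vectors are large, so holding every word in a dict of Python lists will use a lot of memory.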

thejastr · 2020-04-19T13:00:56Z

I tried the above solution but I am getting this error:
UnpicklingError: invalid load key, '\x1f'
My code:
from gensim import models

word2vec_path = 'GoogleNews-vectors-negative300.bin.gz.2'
word2vec = models.KeyedVectors.load(word2vec_path)

ashutoshsoni891 · 2021-08-04T07:39:15Z

I came across the same error as well. I changed:

word_vectors = KeyedVectors.load_word2vec_format(path, binary=True)

into

word_vectors = KeyedVectors.load(path)

It turns out that load_word2vec_format is used when we're trying to load word vectors that are trained using the original implementation of word2vec (in C). Since these pre-trained word vectors are trained using Python (gensim), we can use load instead.

When I tried this, I got: UnpicklingError: unpickling stack underflow

trungluu91 · 2021-12-23T05:18:01Z

I came across the same error as well. I changed:
word_vectors = KeyedVectors.load_word2vec_format(path, binary=True)
into
word_vectors = KeyedVectors.load(path)
It turns out that load_word2vec_format is used when we're trying to load word vectors that are trained using the original implementation of word2vec (in C). Since these pre-trained word vectors are trained using Python (gensim), we can use load instead.

When I tried this, I got: UnpicklingError: unpickling stack underflow

For the Korean language, I got this error:
AttributeError: Can't get attribute 'Vocab' on <module 'gensim.models.word2vec' from 'C:\Users\ductr\Python\lib\site-packages\gensim\models\word2vec.py'>
Would you mind letting me know what this error means?

Louislazarus · 2023-08-07T14:41:39Z

I tried the above solution but I am getting this error:
UnpicklingError: invalid load key, '\x1f'
My code:
from gensim import models

word2vec_path = 'GoogleNews-vectors-negative300.bin.gz.2'
word2vec = models.KeyedVectors.load(word2vec_path)

I get the same error after using:

from gensim.models import Word2Vec
from gensim.models.keyedvectors import KeyedVectors
model = Word2Vec.load(model_path)

What am I doing wrong?
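
One possible reading of the '\x1f' load key: 0x1f is the first byte of the gzip magic number, so the file handed to load() is probably still gzip-compressed and is not a pickle at all. The GoogleNews vectors are distributed in the word2vec C binary format, so a plausible route is to decompress the file first and then read it with load_word2vec_format rather than load. The sketch below covers only the decompression step using the standard library. The helper name gunzip and the output path are hypothetical.

```python
import gzip
import shutil


def gunzip(src_path, dst_path):
    """Decompress the gzip file at src_path into dst_path and return dst_path."""
    # copyfileobj streams in chunks, so a multi-gigabyte file never has to
    # fit in memory at once.
    with gzip.open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return dst_path


# Usage sketch (not run here; assumes gensim is installed and the file is
# the gzip-compressed C binary format):
#
#   from gensim.models import KeyedVectors
#   path = gunzip('GoogleNews-vectors-negative300.bin.gz.2',
#                 'GoogleNews-vectors-negative300.bin')
#   word2vec = KeyedVectors.load_word2vec_format(path, binary=True)
```

If the decompressed file then fails with the UnicodeDecodeError from the title instead, it was most likely saved by gensim, and load() is the right call after all.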
