UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 0: invalid start byte #12

Open
liwzhi opened this issue Jan 22, 2018 · 19 comments

@liwzhi

liwzhi commented Jan 22, 2018

Hi,

I am trying to load Chinese pretrained word2vec vectors:
word_vectors = KeyedVectors.load_word2vec_format(path, binary=True) # C binary format

It throws this error.

@wiwengweng

Of course, the vectors should be trained using the proper codec. It seems the model was trained in a different encoding environment. Can you check that?

@lxw0109

lxw0109 commented Jan 30, 2018

I have come across the same error. Can anybody help? Thank you.

@galuhsahid

I came across the same error as well. I changed:

word_vectors = KeyedVectors.load_word2vec_format(path, binary=True)

into

word_vectors = KeyedVectors.load(path)

It turns out that load_word2vec_format is used when we're trying to load word vectors that are trained using the original implementation of word2vec (in C). Since these pre-trained word vectors are trained using Python (gensim), we can use load instead.
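
One way to tell the two cases apart before choosing a loader is to look at the first bytes of the file. The following is a minimal sketch, not part of gensim: sniff_vector_file is a hypothetical helper name, and the checks are heuristics. It assumes that a file written by gensim's save() is a pickle with protocol 2 or later (whose first byte is 0x80, the byte named in the error in the title), that gzip files start with the bytes 1f 8b, and that the word2vec format starts with a header line of two integers (vocabulary size and vector dimension).

```python
import gzip
import os
import pickle
import re
import tempfile


def sniff_vector_file(path):
    """Guess the on-disk format of a vector file from its first bytes (heuristic)."""
    with open(path, 'rb') as f:
        head = f.read(64)
    if head[:2] == b'\x1f\x8b':
        return 'gzip'      # still compressed; decompress before loading
    if head[:1] == b'\x80':
        return 'pickle'    # likely a gensim save(); try KeyedVectors.load or Word2Vec.load
    if re.match(rb'\d+ \d+\r?\n', head):
        return 'word2vec'  # "<vocab_size> <vector_size>" header; try load_word2vec_format
    return 'unknown'


# Small self-check on throwaway files (contents are made up for illustration).
with tempfile.TemporaryDirectory() as tmp:
    samples = {
        'vectors.pkl': pickle.dumps({'word': [0.1, 0.2]}, protocol=2),
        'vectors.vec': b'1 2\nword 0.1 0.2\n',
        'vectors.vec.gz': gzip.compress(b'1 2\nword 0.1 0.2\n'),
    }
    detected = {}
    for name, payload in samples.items():
        path = os.path.join(tmp, name)
        with open(path, 'wb') as f:
            f.write(payload)
        detected[name] = sniff_vector_file(path)
```

A result of 'pickle' would be consistent with the fix described above (use load instead of load_word2vec_format), while 'gzip' suggests the file needs decompressing first.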

@lxw0109

lxw0109 commented Jan 31, 2018

@galuhsahid Thank you so much, it works now. : )

@anavaldi

anavaldi commented Mar 9, 2018

I tried to read the files as you suggested, but I got the following error:

 File "C:\ProgramData\Anaconda2\lib\site-packages\gensim\models\base_any2vec.py", line 380, in syn1neg
    self.trainables.syn1neg = value

AttributeError: 'Word2Vec' object has no attribute 'trainables'

:(

@Priya22

Priya22 commented Mar 18, 2018

Same error as @anavaldi. Any solution?

@anavaldi

anavaldi commented Mar 19, 2018

I solved this error by executing on my own word embeddings with the .sh file.

@hinanmu

hinanmu commented Apr 25, 2018

I have come across the same error. I changed gensim.models.KeyedVectors.load_word2vec_format()
into gensim.models.Word2Vec.load(). Then it works.

@changhyub

@hinanmu It works, thanks.

@gilgtc

gilgtc commented May 22, 2018

@anavaldi

I solved this error by executing on my own word embeddings with the .sh file.

What do you mean?

@caitaozhan

I tried to read the files as you suggested, but I got the following error:

 File "C:\ProgramData\Anaconda2\lib\site-packages\gensim\models\base_any2vec.py", line 380, in syn1neg
    self.trainables.syn1neg = value

AttributeError: 'Word2Vec' object has no attribute 'trainables'

:(

I solved this issue by downgrading my gensim version from 3.6 to 3.0.

@kusumlata123

UnpicklingError Traceback (most recent call last)
in ()
3 #model=gensim.models.Word2Vec.load_word2vec_format('model_file', binary=True) Word2Vec.load_word2vec_format
4 #model_bin = KeyedVectors.load_word2vec_format(model_file,binary=True)
----> 5 model=gensim.models.Word2Vec.load(model_file)
6 #model=gensim.Word2Vec.load_word2vec_format('model_file',binary=True) word_vectors = KeyedVectors.load(path)
Why is it giving this error?

@Koteswara-ML

@kusumlata123 I am also getting that UnpicklingError.

@bright1993ff66

bright1993ff66 commented Sep 16, 2019

I am also getting the unpickling error...
Any ideas? My code is:

chinese_model = gensim.models.Word2Vec.load(os.path.join(desktop, 'cc.zh.300.bin.gz')) 

@bright1993ff66

bright1993ff66 commented Sep 16, 2019

I also tried to save the text file and load it with the function provided on the official fastText site. I first changed the file extension from gz to txt and then used the following function:

import io
import os

def load_vectors(fname):
    fin = io.open(fname, 'r', encoding='utf-8', newline='\n', errors='ignore')
    # First line is the header: vocabulary size and vector dimension
    n, d = map(int, fin.readline().split())
    data = {}
    for line in fin:
        tokens = line.rstrip().split(' ')
        # list() materializes the values; map() alone is lazy in Python 3
        data[tokens[0]] = list(map(float, tokens[1:]))
    return data

model = load_vectors(os.path.join(desktop, 'cc.zh.300.vec.txt'))

However, I got the following error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-4-d67f52bde947> in <module>
----> 1 model = load_vectors(os.path.join(desktop, 'cc.zh.300.vec.txt'))

<ipython-input-3-0f69b5ce62b8> in load_vectors(fname)
      1 def load_vectors(fname):
      2     fin = io.open(fname, 'r', encoding='utf-8', newline='\n', errors='ignore')
----> 3     n, d = map(int, fin.readline().split())
      4     data = {}
      5     for line in fin:

ValueError: invalid literal for int() with base 10: '\x08\x08p[\x00\x03cc.zh.300.vec\x00\\ͮfMr7?W3ۀ0|Szдl\x14I\x132'
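
The string in this ValueError looks like a gzip header (compression method, flags, timestamp, then the stored file name cc.zh.300.vec), which suggests that renaming the file from .gz to .txt left it compressed. Below is a minimal sketch of reading the compressed file directly, assuming it is a gzip-compressed fastText text (.vec) file. load_vectors_gz is a hypothetical name, and the tiny sample file exists only for illustration.

```python
import gzip
import os
import tempfile


def load_vectors_gz(fname):
    """Read a gzip-compressed .vec text file into a dict of word -> list of floats."""
    with gzip.open(fname, 'rt', encoding='utf-8', newline='\n', errors='ignore') as fin:
        n, d = map(int, fin.readline().split())  # header: vocabulary size, dimension
        data = {}
        for line in fin:
            tokens = line.rstrip().split(' ')
            data[tokens[0]] = list(map(float, tokens[1:]))
    return n, d, data


# Illustration with a made-up two-word file, not the real cc.zh.300.vec.gz.
with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, 'sample.vec.gz')
    with gzip.open(sample, 'wt', encoding='utf-8', newline='\n') as fout:
        fout.write('2 3\n你好 0.1 0.2 0.3\n世界 0.4 0.5 0.6\n')
    n, d, vectors = load_vectors_gz(sample)
```

Decompressing the download once (for example with gunzip) and then calling the original load_vectors on the resulting .vec file should be equivalent.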

@thejastr

I tried the above solution, but I am getting this error:
UnpicklingError: invalid load key, '\x1f'
My code:
from gensim import models

word2vec_path = 'GoogleNews-vectors-negative300.bin.gz.2'
word2vec = models.KeyedVectors.load(word2vec_path)
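
The byte '\x1f' is the first byte of a gzip file, so this error suggests the file is still compressed. The GoogleNews vectors are also distributed in the C binary word2vec format rather than as a gensim pickle. A minimal sketch, assuming both hold here: decompress with the standard library, then pass the result to load_word2vec_format(..., binary=True). gunzip_to is a hypothetical helper, and the gensim call is left commented out so the snippet runs without gensim installed.

```python
import gzip
import os
import shutil
import tempfile


def gunzip_to(src_path, dst_path):
    """Decompress a gzip file to dst_path, streaming so large files are not held in memory."""
    with gzip.open(src_path, 'rb') as src, open(dst_path, 'wb') as dst:
        shutil.copyfileobj(src, dst)
    return dst_path


# Illustration on a throwaway file; the payload is made up.
payload = b'1 2\nword \x00\x00\x80?\x00\x00\x00@\n'
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, 'vectors.bin.gz.2')  # odd suffix, like the file above
    with gzip.open(src, 'wb') as f:
        f.write(payload)
    dst = gunzip_to(src, os.path.join(tmp, 'vectors.bin'))
    with open(dst, 'rb') as f:
        restored = f.read()

# With the real file, the decompressed output would then be loaded as:
# from gensim.models import KeyedVectors
# word2vec = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)
```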

@ashutoshsoni891

I came across the same error as well. I changed:

word_vectors = KeyedVectors.load_word2vec_format(path, binary=True)

into

word_vectors = KeyedVectors.load(path)

It turns out that load_word2vec_format is used when we're trying to load word vectors that are trained using the original implementation of word2vec (in C). Since these pre-trained word vectors are trained using Python (gensim), we can use load instead.

When I tried this, I got: UnpicklingError: unpickling stack underflow

@trungluu91

I came across the same error as well. I changed:
word_vectors = KeyedVectors.load_word2vec_format(path, binary=True)
into
word_vectors = KeyedVectors.load(path)
It turns out that load_word2vec_format is used when we're trying to load word vectors that are trained using the original implementation of word2vec (in C). Since these pre-trained word vectors are trained using Python (gensim), we can use load instead.

When I tried this, I got: UnpicklingError: unpickling stack underflow

For the Korean language, I got this error:
'AttributeError: Can't get attribute 'Vocab' on <module 'gensim.models.word2vec' from 'C:\Users\ductr\Python\lib\site-packages\gensim\models\word2vec.py'>'
Would you mind letting me know what the error is?

@Louislazarus

I tried the above solution but I am getting error as:
UnpicklingError: invalid load key, '\x1f'
My code:
from gensim import models

word2vec_path = 'GoogleNews-vectors-negative300.bin.gz.2'
word2vec = models.KeyedVectors.load(word2vec_path)

I get the same error after using:

from gensim.models import Word2Vec
from gensim.models.keyedvectors import KeyedVectors
model = Word2Vec.load(model_path)

What am I doing wrong?
