fasttext file format seems wrong #14

Open
adodge opened this issue Mar 1, 2018 · 2 comments

adodge commented Mar 1, 2018

Thank you very much for this project. It seems very useful.

I don't seem to be able to use the fasttext files, at least not the Russian or Turkish ones. When attempting to load them with fasttext, I get this error:

$ fasttext print-word-vectors ru.bin
terminate called after throwing an instance of 'std::invalid_argument'
  what():  ru.bin has wrong file format!
Aborted

On closer inspection, the files are missing the fasttext magic number in their header. Fasttext binary files are expected to start with 0x2F4F16BA, and these don't.
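
For anyone who wants to run the same check on their own files, here is a minimal sketch. The function name `has_fasttext_magic` is made up for illustration, and it assumes the magic number is stored as a little-endian 32-bit integer at offset 0, which is not stated above.

```python
import struct

# Magic number quoted in this issue.
FASTTEXT_MAGIC = 0x2F4F16BA


def has_fasttext_magic(path):
    """Return True if the first four bytes of `path` decode to the magic number.

    Assumption: the value is a little-endian 32-bit integer at offset 0.
    """
    with open(path, "rb") as f:
        head = f.read(4)
    if len(head) < 4:
        # Too short to contain a header at all.
        return False
    (value,) = struct.unpack("<I", head)
    return value == FASTTEXT_MAGIC
```

Calling `has_fasttext_magic("ru.bin")` on one of the affected files should return `False` if the header really is missing.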

Were they created by some other software, or perhaps an older version of fasttext that had a different file format?

Thank you.


adodge commented Mar 1, 2018

I did a little poking around in the fasttext history, and, yes, they had a different file format a year ago. Compared with the current format, the old one differs in three ways:

  • There's no magic number or version at the top of the file.
  • There's no "pruneidx_size" value in the header for the dictionary object.
  • There's no "quant" boolean before each of the two matrix objects.
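
To make the three missing fields concrete, here is a rough sketch of how each could be encoded with `struct`. This is not the attached script. The field widths (two int32 values, one int64, one single-byte boolean), the version number 12, and the `-1` default for `pruneidx_size` are all assumptions, and the helper names are hypothetical.

```python
import struct

# Magic number quoted earlier in this issue.
FASTTEXT_MAGIC = 0x2F4F16BA
# Assumption: placeholder version; use whatever your fasttext build expects.
ASSUMED_VERSION = 12


def file_prelude(version=ASSUMED_VERSION):
    """Magic number plus version, assumed to be two little-endian int32 values."""
    return struct.pack("<Ii", FASTTEXT_MAGIC, version)


def pruneidx_size_field(value=-1):
    """Assumed int64 in the dictionary header; -1 is assumed to mean 'no pruning'."""
    return struct.pack("<q", value)


def quant_flag(quantized=False):
    """Assumed one-byte boolean written before each of the two matrices."""
    return struct.pack("<?", quantized)
```

A converter would write `file_prelude()` at the start of the file, add `pruneidx_size_field()` to the dictionary header, and put `quant_flag()` in front of each matrix, copying everything else through unchanged.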

This is a script that will convert one of the old fasttext files to something the current version can read:

fasttext_file_update.py.txt

$ echo merhaba | fasttext print-word-vectors tr.bin2
merhaba 0.12206 0.066014 0.093112 -0.043492 0.5207 0.057019 0.20127 0.20933 0.057977 -0.29209 0.087561 0.05825 0.50264 -0.17409 0.19332 -0.08724 0.35125 0.045985 0.21882 0.1872 0.16603 0.21172 0.17046 0.062976 -0.022134 -0.50327 -0.064927 0.1336 0.10681 -0.1902 0.030359 -0.075208 -0.19389 0.40742 0.078176 0.11845 -0.057126 0.52497 0.11417 0.36205 -0.055332 -0.2492 0.46497 0.72146 0.42214 0.082853 0.035755 -0.1644 -0.23566 0.1037 -0.079192 0.15678 -0.14464 -0.023746 0.11418 0.21951 -0.20679 -0.11682 -0.020332 -0.07834 0.27913 -0.59613 -0.15867 0.15623 0.066335 0.078509 -0.0045359 -0.15227 -0.025417 -0.14899 -0.25298 0.2158 -0.26728 0.071114 -0.86768 -0.39044 -0.36575 0.053666 0.38771 0.3328 0.085293 -0.12563 0.13022 -0.21437 0.31115 0.013396 0.02462 -0.25962 -0.51704 -0.55816 0.43276 0.25894 -0.55603 0.3785 -0.13968 0.0031102 0.23232 0.11755 0.17286 -0.14933 0.19528 0.36565 -0.19717 0.066704 -0.20812 -0.32329 -0.09979 -0.34596 0.12763 -0.26259 -0.13747 -0.056275 0.47636 -0.068787 0.05284 -0.16213 -0.57922 -0.15148 0.31464 0.23883 -0.43305 0.21852 -0.082744 0.26875 -0.28505 -0.379 -0.24597 -0.11538 0.22466 -0.17107 0.047522 0.31911 0.15056 0.21347 0.16531 -0.078537 0.14234 0.090975 -0.4294 0.067041 0.085503 0.41908 0.18248 0.18221 0.10699 -0.21135 0.1343 -0.05573 -0.16256 -0.39946 0.086395 -0.030858 -0.66857 0.58846 0.17388 0.56812 -0.088791 -0.024312 -0.054497 -0.075219 -0.0048822 -0.17311 0.070715 0.080788 0.14496 0.45174 0.071725 -0.14704 0.56277 0.058342 0.67329 0.22379 -0.13657 -0.11677 0.31955 0.21028 -0.24803 -0.34743 0.0019436 0.26037 0.49244 0.2648 -0.07083 -0.26863 -0.24654 -0.025958 -0.27783 -0.045067 -0.068344 0.16087 0.11595 -0.044365 0.029121 0.12629 0.28304 0.23161 -0.17879 -0.092399 -0.38922 -0.24235

@yaziciemre

Somehow the script does not work for me either:


Traceback (most recent call last):
  File "fast_convert.py", line 57, in <module>
    m,n = struct.unpack("@qq", M[offset:offset+span])
struct.error: unpack requires a string argument of length 16
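
`struct.unpack` raises this error whenever the slice it is given is not exactly 16 bytes long, which here means `offset` was within 16 bytes of the end of the buffer or past it. That can happen if the file is truncated or is not laid out the way the script assumes. A bounds-checked read, sketched below, would at least report where parsing went off the rails. The name `read_matrix_shape` is hypothetical, and `"<qq"` is used instead of `"@qq"` to pin the field sizes regardless of platform, on the assumption that the file is little-endian.

```python
import struct

# Two 64-bit integers (rows, cols); "<" fixes little-endian standard sizes.
MATRIX_HEADER = struct.Struct("<qq")


def read_matrix_shape(buf, offset):
    """Return (rows, cols) read at `offset`, or raise a descriptive ValueError."""
    end = offset + MATRIX_HEADER.size
    if offset < 0 or end > len(buf):
        raise ValueError(
            "matrix header needs bytes %d..%d but the buffer has only %d bytes"
            % (offset, end, len(buf))
        )
    return MATRIX_HEADER.unpack_from(buf, offset)
```

If this raises, comparing the reported offset with the file size should show whether the earlier sections were parsed with the wrong layout.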
