tca19/near-lossless-binarization

                 Near-lossless Binarization of Word Embeddings
                 =============================================

PREAMBLE

	This work is  one  of  the  contributions  of  my  PhD  thesis  entitled
	"Improving methods to learn word representations for efficient  semantic
	similarities computations" in which  I  propose  new  methods  to  learn
	better word embeddings. The  full  thesis  is  freely  available  online
	at https://github.com/tca19/phd-thesis.

ABOUT

	This repository contains source code to binarize  any  real-valued  word
	embeddings into binary  vectors.   It  also  contains  some  scripts  to
	evaluate the performance of the binary vectors  on  semantic  similarity
	tasks  and  top-k  queries.  The  related  paper   can   be   found   at
	https://aaai.org/ojs/index.php/AAAI/article/view/4692/4570.

	If you use this repository, please cite:

	@inproceedings{tissier2019near,
	  author    = {Tissier, Julien and Gravier, Christophe and Habrard, Amaury},
	  title     = {Near-Lossless Binarization of Word Embeddings},
	  booktitle = {Proceedings of the Thirty-Third {AAAI} Conference on
	               Artificial Intelligence, Honolulu, Hawaii, USA,
	               January 27 - February 1, 2019.},
	  volume    = {33},
	  pages     = {7104--7111},
	  year      = {2019},
	  url       = {https://aaai.org/ojs/index.php/AAAI/article/view/4692},
	  doi       = {10.1609/aaai.v33i01.33017104}
	}

INSTALLATION

	To compile the source files of this repository, you need to have on your
	system:
	  - OpenBLAS [1]
	  - a C compiler (gcc, clang ...)
	  - make

	Then run the command `make` to build the different  binary  executables.

	[1] https://github.com/xianyi/OpenBLAS/wiki/Precompiled-installation-packages

USAGE

	1. Binarize word vectors
	------------------------
	Run the executable `binarize` to transform real-valued  embeddings  into
	binary vectors.  The only mandatory command-line argument  is  `-input`,
	the name of the file containing the real-valued vectors.

	./binarize -input vectors.vec

	Documentation  for  all  the  other  flags  is  available   by   running
	`./binarize -h` or `./binarize --help`.

	Binary vectors are saved by default into the file  `binary_vectors.vec`.
	The first line of this file indicates the number of binary word  vectors
	and the number of bits in each vector. Each following line is  formatted
	as:

	WORD INTEGER_1 INTEGER_2 [...]

	Binary vectors are not saved as strings of zeros (0) and ones (1) but as
	groups of unsigned long integers. Each integer represents 64 bits so for
	a binary vector of 256 bits, there are 4 integers (4 * 64 =  256).   The
	binary  vector  of  a  word  is  the   concatenation   of   the   binary
	representations  of  all  the  integers  on  the  rest  of   its   line.
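
	The format described above can be decoded with a few lines of Python.
	This is only a minimal sketch, not code from this repository: the
	helper name `decode_line` is hypothetical, and it assumes that the
	integers are written in base 10 and that the bits of each integer are
	read most-significant first, zero-padded to 64 digits.

```python
def decode_line(line, bits_per_int=64):
    """Split 'WORD INTEGER_1 INTEGER_2 ...' into the word and its bit string."""
    word, *fields = line.split()
    # Assumption: each field is a base-10 unsigned integer whose bits are
    # read most-significant first and zero-padded to `bits_per_int` digits.
    bits = "".join(format(int(f), "0{}b".format(bits_per_int)) for f in fields)
    return word, bits


# A made-up line with 4 integers, i.e. a 256-bit vector (4 * 64 = 256).
word, bits = decode_line("queen 1 9223372036854775808 0 18446744073709551615")
print(word, len(bits))
```

	Under a different bit ordering the decoded string would be permuted
	within each group of 64 bits, but counts of matching bits between two
	vectors would stay the same.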

	2. Evaluate semantic similarity
	-------------------------------
	Run  the  executable  `similarity_binary`  to  evaluate   the   semantic
	similarity  correlation  scores  of   the   produced   binary   vectors.

	./similarity_binary binary_vectors.vec

	This repository includes some semantic similarity datasets:
	  - MEN
	  - Rare Word (RW)
	  - SimVerb 3500 (SimVerb)
	  - SimLex 999 (SimLex)
	  - WordSim 353 (WS353)
	To evaluate on other semantic  similarity  datasets,  add  them  to  the
	datasets/ folder and run the `./similarity_binary` executable again.
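
	To give an idea of what such an evaluation involves, here is a hedged
	Python sketch, not the code of `similarity_binary`. It assumes that
	the similarity of two binary vectors is the fraction of bits on which
	they agree, and that the reported score is the Spearman rank
	correlation between these similarities and the human ratings of a
	dataset. The function names and the toy data are illustrative only.

```python
def binary_similarity(a, b, n_bits):
    """Fraction of bit positions on which two binary vectors agree."""
    # XOR leaves a 1 exactly where the two vectors differ.
    differing = sum(bin(x ^ y).count("1") for x, y in zip(a, b))
    return (n_bits - differing) / n_bits


def ranks(values):
    """1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman correlation, computed as the Pearson correlation of ranks."""
    rx, ry = ranks(xs), ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Toy 64-bit vectors and made-up human ratings (not taken from any dataset).
vectors = {"cat": [0b1111], "dog": [0b1110], "car": [0b0001], "bus": [0b0011]}
pairs = [("cat", "dog", 9.0), ("cat", "car", 2.0), ("car", "bus", 8.0)]
predicted = [binary_similarity(vectors[a], vectors[b], 64) for a, b, _ in pairs]
human = [score for _, _, score in pairs]
print(spearman(predicted, human))
```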

	3. Top-K queries
	----------------
	Run the executable `topk_binary` to  compute  the  K  closest  neighbors
	of a QUERY word, along with their respective similarity to it.

	./topk_binary binary_vectors.vec K QUERY

	The program reports the closest words  and  their  similarity,  as  well
	as the time needed to compute the K closest neighbors. To  run  multiple
	top-k queries at once, replace the QUERY  word  with  a  space-separated
	list of words, for example:

	./topk_binary binary_vectors.vec 10 queen automobile man moon computer
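
	A top-K query over binary vectors can be sketched in Python as a
	linear scan that keeps the K best candidates. This is an illustration
	under assumptions, not the implementation of `topk_binary`: the
	function `top_k` is hypothetical, and similarity is taken to be the
	fraction of matching bits between two vectors.

```python
import heapq


def top_k(vectors, query, k, n_bits):
    """Return the k words closest to `query`, with their similarity."""
    q = vectors[query]

    def similarity(word):
        # Assumed measure: share of bit positions where both vectors agree.
        differing = sum(bin(x ^ y).count("1") for x, y in zip(q, vectors[word]))
        return (n_bits - differing) / n_bits

    candidates = (w for w in vectors if w != query)
    best = heapq.nlargest(k, candidates, key=similarity)
    return [(w, similarity(w)) for w in best]


# Toy 64-bit vectors, made up for the example.
vectors = {"queen": [0b1111], "king": [0b1110], "man": [0b1100], "moon": [0b0000]}
for word, score in top_k(vectors, "queen", 2, 64):
    print(word, score)
```

	Because XOR followed by a bit count works on whole 64-bit integers,
	each comparison touches only a handful of integers per word.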

AUTHOR

	Written  by  Julien  Tissier  <30314448+tca19@users.noreply.github.com>.

COPYRIGHT

	This software is licensed under the GNU GPLv3 license.  See the  LICENSE
	file for more details.
