
LSHVec: A Vector Representation of DNA Sequences Using Locality Sensitive Hashing

July 2021: check out lshvec-upcxx, which is a pure C++ implementation.

Summary

LSHVec is a k-mer/sequence embedding and classification software which extends FastText. It applies LSH (Locality Sensitive Hashing) to reduce the size of the k-mer vocabulary and improve the performance of the embedding.
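For intuition only, here is a minimal sketch (not the actual LSHVec implementation; all names and parameters are made up for illustration) of how random-hyperplane LSH can map one-hot encoded k-mers into a small number of buckets, so that similar k-mers tend to land in the same bucket:

    import numpy as np

    def kmer_to_vector(kmer):
        # One-hot encode a k-mer over the alphabet A, C, G, T.
        lookup = {"A": 0, "C": 1, "G": 2, "T": 3}
        vec = np.zeros(4 * len(kmer))
        for i, base in enumerate(kmer):
            if base in lookup:
                vec[4 * i + lookup[base]] = 1.0
        return vec

    def lsh_bucket(kmer, hyperplanes):
        # Signed random projection: one bit per hyperplane; the bits form the bucket id.
        bits = (hyperplanes @ kmer_to_vector(kmer)) > 0
        return int("".join("1" if b else "0" for b in bits), 2)

    rng = np.random.default_rng(0)
    k, m = 15, 12                                  # k-mer size, 2^m possible buckets
    planes = rng.standard_normal((m, 4 * k))
    print(lsh_bucket("ACGTACGTACGTACG", planes))   # similar k-mers tend to collide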

Besides building from source code, LSHVec can run using Docker or Singularity.

Please refer to A Vector Representation of DNA Sequences Using Locality Sensitive Hashing for the idea and experiments.

There are also some pretrained models that can be used; please see PyLSHvec for details.

Requirements

Here is the environment I worked in. Other versions may also work. Python 3 should work, but I don't use it much.

  1. Linux, gcc with C++11
  2. Python 2.7 or Python 3.6 or 3.7
    • joblib 0.12.4
    • tqdm 4.28.1
    • numpy 1.15.0
    • pandas 0.23.4
    • sklearn 0.19.1 (only for evaluation)
    • MulticoreTSNE (only for visualization)
    • cython 0.28.5
    • csparc (included)

Build from Source

  • clone from git

    git clone https://LizhenShi@bitbucket.org/LizhenShi/lshvec.git

    cd lshvec

  • install csparc, which wraps a C version of the k-mer generator I used in another project

    for Python 2.7

    pip install pysparc-0.1-cp27-cp27mu-linux_x86_64.whl

    or for Python 3.6

    pip install pysparc-0.1-cp36-cp36m-linux_x86_64.whl

    or for Python 3.7

    pip install pysparc-0.1-cp37-cp37m-linux_x86_64.whl

  • make

    make

Jupyter Notebook Examples

A toy example, which is laptop-friendly and should finish in 10 minutes, can be found in Tutorial_Toy_Example.ipynb. Because of randomness, the results may differ.


A practical example, which uses the ActinoMock Nanopore data, can be found in Tutorial_ActinoMock_Nanopore.ipynb. The notebook ran on a 16-core node with 64 GB of memory and took a few hours (I think 32 GB of memory should work too).


Command line options

fastqToSeq.py

Convert a fastq file to a seq file.

python fastqToSeq.py -i <fastq_file> -o <out seq file> -s <1 to shuffle, 0 otherwise>
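For reference, a fastq record spans four lines (header, sequence, separator, qualities). The sketch below shows the kind of conversion fastqToSeq.py performs; the assumption that the seq file holds one sequence per line is mine, so treat it as an illustration rather than the script's exact output format.

    def fastq_sequences(fastq_path):
        # Yield the sequence line (second line of every 4-line record) from a fastq file.
        with open(fastq_path) as fh:
            for i, line in enumerate(fh):
                if i % 4 == 1:
                    yield line.strip()

    # Hypothetical usage: write one sequence per line to data.seq
    with open("data.seq", "w") as out:
        for seq in fastq_sequences("data.fastq"):
            out.write(seq + "\n")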

hashSeq.py

Encode reads in a seq file using a chosen hashing method.

python hashSeq.py -i <seq_file> --hash <fnv or lsh> -o <outfile> [-k <kmer_size>] [--n_thread <n>] [--hash_size <m>] [--batch_size <n>] [--bucket <n>] [--lsh_file <file>] [--create_lsh_only]

  --hash_size <m>:        only used by lsh; defines 2^m buckets.
  --bucket <n>:           number of buckets for the hashing trick; not used for onehot.
                          For fnv and lsh it limits the maximum number of words.
                          For lsh the maximum number of words is min(2^m, n).
  --batch_size <b>:       how many reads are processed at a time. A smaller value uses less memory.
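As a quick worked example of how the two limits interact for lsh (the numbers here are arbitrary):

    # With --hash_size 22 and --bucket 2000000, the vocabulary is capped by --bucket:
    hash_size = 22            # up to 2^22 = 4,194,304 LSH buckets
    bucket = 2_000_000        # hashing-trick cap on the number of words
    print(min(2 ** hash_size, bucket))   # 2000000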

lshvec

Please refer to the fastText options. Note, however, that the wordNgrams, minn, and maxn options do not work with lshvec.
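fastText writes the trained vectors to a text file named <output>.vec (first line: vocabulary size and dimension; then one word and its vector per line). Assuming lshvec inherits this output format from FastText, a minimal sketch for loading the vectors looks like this:

    import numpy as np

    def load_vec(path):
        # Parse a fastText-style .vec file into a {word: vector} dict (assumed format).
        vectors = {}
        with open(path) as fh:
            n_words, dim = map(int, fh.readline().split())
            for line in fh:
                parts = line.rstrip().split(" ")
                vectors[parts[0]] = np.array(parts[1:1 + dim], dtype=float)
        return vectors

    # Hypothetical usage after `lshvec skipgram -input data.hash -output model`:
    # embeddings = load_vec("model.vec")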

Example of Docker Run

Pull from docker hub:

docker pull lizhen0909/lshvec:latest

Assume the data.fastq file is in the host folder /path/in/host.

convert fastq to a seq file:

docker run -v /path/in/host:/host lizhen0909/lshvec:latest bash -c "cd /host && fastqToSeq.py -i data.fastq -o data.seq"

create LSH:

docker run -v /path/in/host:/host lizhen0909/lshvec:latest bash -c "cd /host && hashSeq.py -i data.seq --hash lsh -o data.hash -k 15"

run lshvec:

docker run -v /path/in/host:/host lizhen0909/lshvec:latest bash -c "cd /host && lshvec skipgram -input data.hash -output model"

Example of Singularity Run

When running with Singularity, you are probably in an HPC environment. Usage is similar to Docker; however, depending on the Singularity version, commands and paths might differ, especially between 2.x and 3.x. Here is an example for version 2.5.0.

It is also better to specify the number of threads explicitly; otherwise the maximum number of cores will be used, which is usually not desired in an HPC environment.

Pull from Singularity Hub:

singularity pull --name lshvec.sif shub://Lizhen0909/LSHVec

Put the data.fastq file in /tmp on the host, since Singularity automatically mounts the /tmp folder.

convert fastq to a seq file:

singularity run /path/to/lshvec.sif bash -c "cd /tmp && fastqToSeq.py  -i data.fastq -o data.seq"

create LSH:

singularity run /path/to/lshvec.sif bash -c "cd /tmp && hashSeq.py -i data.seq --hash lsh -o data.hash -k 15 --n_thread 12"

run lshvec:

singularity run /path/to/lshvec.sif bash -c "cd /tmp && lshvec skipgram -input data.hash -output model -thread 12"

Questions

  • lshvec gets stuck at Read xxxM words

Search for MAX_VOCAB_SIZE in the source code and change it to a bigger value. When a word's index exceeds that number, a loop is run to look it up, which is costly. The number is 30M in FastText, which is fine for natural languages but too small for k-mers. It has already been increased to 300M in FastSeq, but for large and/or high-error-rate data it may still not be enough.

  • I have big data

    hashSeq reads all the data into memory in order to sample k-mers for the hyperplanes. If the data is too big, it may not fit into memory. One can:

    1. Try sampling. DNA reads generally have high coverage, and such high coverage may not be necessary.
    2. Or use --create_lsh_only to create the LSH on a small (sampled) dataset; then split your data into multiple files and run hashSeq with the --lsh_file option on many nodes.
  • core dumped when hashing

    An error like

    terminate called after throwing an instance of 'std::out_of_range'
    what(): map::at
    Aborted (core dumped)
    

    is most likely caused by a sequence containing characters other than A, C, G, T, or N. Please convert any such characters to N's.
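
    A simple way to do that cleanup in Python (a sketch; adapt it to however your sequences are stored):

    import re

    def clean_sequence(seq):
        # Replace every character that is not A, C, G, or T with N.
        return re.sub(r"[^ACGT]", "N", seq.upper())

    print(clean_sequence("acgtRYacgt"))   # ACGTNNACGT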

License

The license is inherited from FastText, which uses the BSD License.