Skip to content
/ hftrie Public

index binary vectors for efficient nearest neighbor search

License

Notifications You must be signed in to change notification settings

starkdg/hftrie

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

16 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

HFTrie

A dynamic data structure for indexing 64-bit binary codes for efficient nearest neighbor search. The index offers the capability to add, delete and range search for all points within a given radius of a target point.

Description

The HFTrie is a data structure where the indexed elements - or data points - are treated as a sequence of bits. Each layer of the trie sorts the data points according to a prefix portion of bits. For example, layer one sorts the data point into a subnode corresponding to the value of its first bit. The second layer sorts the datapoints into the subnode for the value of its first two bits. And so on.

flowchart TD
		  root[R] --> L[0xxx]
		  root[R] --> R[1xxx]
		  L --> LL[00xx]
		  L --- RR[01xx]
		  R --> LL_[10xx]
		  R --> RR_[11xx]

That's the basic idea, but for efficiency, the bits are chunked together. So, instead of considering only one bit per level, multiple bits (e.g. 4) are used from the bit sequence at each level to sort into child nodes. This naturally increases the fanout of each node. For two bits per layer, each node will sort into four possible child nodes. For 4 bits per layer, there are 16 possible child nodes. And so on.

Performance

Here are the performance metrics for the HFTrie for various index sizes. The insert and query times are the average times it takes to insert or query the index. Query Operations is for the average number of distance calculations performed for a range search, normalized to the size of the index and expressed as a percentage. The last column is for the recall percentage, or the average proportion of the cluster that should be returned with each query. The fast range search sacrfices a bit in accuracy for a significant boost in speed efficiency.

Fast Range Search

N MEM Insert Time Query Opers. Query Time Recall
100K 4.0MB 135 ns 0.97% 37.1 μs 87.2%
200K 6.5MB 146 ns 0.96% 72.9 μs 89.4%
400K 12MB 172 ns 0.89% 124 μs 88.2%
800K 35MB 241 ns 0.43% 206 μs 86.2%
1M 45MB 263 ns 0.34% 250 μs 86.0%
2M 75MB 297 ns 0.30% 387 μs 86.2%
4M 127MB 330 ns 0.30% 618 μs 85.6%
8M 268MB 373 ns 0.25% 953 μs 85.0%
16M 727MB 461 ns 0.11% 1.75 ms 85.2%
32M 1.2GB 538 ns 0.09% 3.27 ms 86.4%
64M 1.9GB 613 ns 0.09% 5.21 ms 86.4%
128M 4.3GB 674 ns 0.08% 7.39 ms 92.4%

For a constant index size, N = 4M, the query performance varies for different radius values. Better accuracy is guaranteed for lower radius searches with modest reductions in accuracy for larger range searches.

Radius Query Opers. Query Time Recall
0 0.0003% 1.46 μs 100%
2 0.02% 49.0 μs 99.8%
4 0.02% 434 μs 96.6%
6 0.30% 613 μs 93.4%
8 0.30% 617 μs 88.6%
10 0.30% 616 μs 83.2%
12 0.30% 619 μs 88.8%

Standard Range Search

A standard range search guarantees query accuracy with significant added costs to speed.

N MEM Insert time Query Opers. Query Time Recall
100K 4.0MB 134 ns 89.5% 3.40 ms 100%
200K 6.5MB 142 ns 89.5% 6.54 ms 100%
400K 12MB 169 ns 86.7% 11.7 ms 100%
800K 35MB 238 ns 65.7% 30.4 ms 100%
1M 45MB 257 ns 60.7% 39.7 ms 100%
2M 75MB 295 ns 58.8% 68.5 ms 100%
4M 127MB 327 ns 58.6% 111 ms 100%
8M 268MB 363 ns 51.2% 191 ms 100%
16M 727MB 457 ns 29.0% 454 ms 100%
32M 1.2GB 533 ns 27.1% 783 ms 100%
64M 1.9GB 604 ns 27.0% 1.29 s 100%
128M 4.3GB 665 ns 22.8% 1.95 s 100%

Comparison

Here is a comparison of different methods of indexing a uniform distributed data set of 4 million. Query operations notes the number of distance computations performed per query, normalized to the size of the dataset. All search queries are peformed for a radius = 10. Note that all methods underperform a simple sequential search at such a large radius. Only the HFTrie's fast range search outperforms the sequential search with the drawback that it is only able to achieve 85% recall of all cluster objects for an extended radius. It is, however, guaranteed to return all exact matches for smaller range searches.

For data set size, N = 4M and query radius = 10:

Method MEM Insert Time Query Opers. Query Time Recall
SequentialSearch 64MB n/a 100% 9.08 ms 100%
MVPTree 2.67GB 43.9μs 25.0% 276 ms 100%
MTree 113MB 631ns 69.5% 68.2 ms 100%
HWTree 312MB 1.69 μs 17.6% 222 ms 100%
HFTrie-std 124MB 327 ns 58.6% 111 ms 100%
HFTrie-fast 124MB 330 ns 0.30% 618 μs 85.6%

You can run these tests with the compiled program runhftrie. View the source in tests/run_hftrie.cpp.

Install

cmake .
make
make test
make install

Simple API

HFTrie trie;

long long id;
uint64_t code;

// insert an element
trie.Insert({ id, code });

// delete an element
trie.Delete({ id, code });


uint64_t target;
int radius = 10;
vector<hf_t> results trie.RangeSearch(target, radius);

size_t sz = trie.Size();

size_t nbytes = trie.MemoryUsage();

// for debugging
trie.Print(cout);