
When distance computation is expensive how to gradually build graph #194

Open
jianshu93 opened this issue Jul 16, 2022 · 3 comments

@jianshu93

Hello pynndescent team,

I am in a situation where distance computation is very expensive (genomic sequences), so computing the full all-versus-all distance matrix as input is not possible (it takes forever). I am wondering how to compute distances only when necessary while building the index with NNDescent(), that is, build the k-graph gradually without computing all-versus-all distances for the dataset, only the ones that are needed; e.g., for two data points that end up far apart in the k-graph, we do not need their distance. Any idea how to do this, with the original data as input and our own distance function?
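For concreteness, the setup I have in mind looks roughly like this (a minimal sketch; `approx_alignment_distance` is a hypothetical stand-in for my real sequence-comparison routine, and the random array stands in for a numeric encoding of the sequences):

```python
import numpy as np
import numba
from pynndescent import NNDescent

@numba.njit(fastmath=True)
def approx_alignment_distance(x, y):
    # placeholder body: custom pynndescent metrics need to be
    # numba-compilable functions of two 1-d numeric arrays
    return np.sum(np.abs(x - y))

# stand-in for a numeric encoding (e.g. sketches/profiles) of the sequences
data = np.random.random((1000, 64)).astype(np.float32)

# building the index computes the approximate knn graph
index = NNDescent(data, metric=approx_alignment_distance, n_neighbors=15)
knn_indices, knn_dists = index.neighbor_graph
```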

Thanks,

Jianshu

@jlmelville
Collaborator

Others may have better advice, but here is my perspective (FWIW I have spent a reasonable chunk of my spare time looking at how approximate nearest neighbor methods, and nearest neighbor descent specifically, can give sub-optimal results for high-dimensional datasets).

The problem I have seen with high dimensional datasets is the existence of hubs, which in turn implies the existence of anti-hubs, objects which have the following unfortunate properties:

  • they are not in the k-nearest neighbors of other objects
  • the heuristic that neighbors-of-neighbors are good candidates to try tends to be less true for them

This is bad news for the efficiency of nearest neighbor descent, which already tends to repeat distance calculations a lot. This is a scenario where setting low_memory=False might help if you have enough RAM, because storing previously calculated pairs and repeatedly applying the hash function is still cheaper than repeating an expensive distance calculation.
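A rough sketch of what I mean (the data here is a placeholder; the only point is the `low_memory=False` flag):

```python
import numpy as np
from pynndescent import NNDescent

data = np.random.random((1000, 64)).astype(np.float32)

index = NNDescent(
    data,
    n_neighbors=30,
    low_memory=False,  # cache previously evaluated pairs instead of recomputing them
    verbose=True,
)
```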

Apart from that, tweaking the nearest neighbor descent parameters to be more thorough doesn't seem to be very cost-effective compared to: using default build parameters, creating the search graph, and then querying the dataset (with some amount of backtracking via the epsilon parameter). I think https://github.com/lmcinnes/pynndescent/blob/master/doc/how_to_use_pynndescent.ipynb is still a good guide for that.
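Roughly, that workflow looks like this (a hedged sketch with illustrative parameter values; `prepare()` builds the search graph, and a larger `epsilon` makes `query()` backtrack more during the search, i.e. better recall at higher cost):

```python
import numpy as np
from pynndescent import NNDescent

data = np.random.random((2000, 32)).astype(np.float32)

index = NNDescent(data, n_neighbors=30)  # default build parameters
index.prepare()                          # construct the search graph

# query the training data against itself, with some backtracking
neighbors, distances = index.query(data, k=15, epsilon=0.2)
```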

Depending on your distance function, you could also try reducing the dimensionality of the data with something cheap like a truncated SVD, running nearest neighbor descent on that, and then running nearest neighbor descent on the high-dimensional data using the knn graph from the previous step as initialization via init_graph. Obviously this requires those initial steps to be fast, and the reduced-space graph has to be a better guess at the true result than the RP tree initialization in the high-dimensional space gives you. Unfortunately I haven't been impressed with the results I've got from this, but you never know how it might work for you.
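A sketch of that two-stage idea, assuming sklearn's TruncatedSVD for the cheap reduction (dimensions and neighbor counts are illustrative only):

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from pynndescent import NNDescent

data = np.random.random((5000, 512)).astype(np.float32)

# stage 1: cheap knn graph on a low-dimensional projection
reduced = TruncatedSVD(n_components=32).fit_transform(data)
coarse_indices, _ = NNDescent(reduced, n_neighbors=30).neighbor_graph

# stage 2: NND on the original data, seeded with the coarse graph
index = NNDescent(
    data,
    n_neighbors=30,
    init_graph=coarse_indices,
)
```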

@jianshu93
Author

Hello James,

Thanks for the response and detailed explanation. In my case, dimensionality is not the problem: distance computation is expensive not because there are many dimensions but because of the genomic sequence manipulation involved (some sequence-alignment-related steps). In the case you mentioned, where nearest neighbor descent tends to repeat distance calculations a lot, is there a rough figure for what fraction of all pairwise distances ends up being computed to get a very good k-graph (if pairwise distances were the input, how many of them would be used)? 90%? Or all of them? I am comparing this to HNSW, where only a very small fraction of pairwise distances is used to build the graph (I noticed that in HNSW a distance is computed only when it is actually needed), which makes it very efficient when distance computation is expensive. However, HNSW only works for metric distances such as Hamming or Euclidean. The distance I am using this time for the k-graph is not a metric, so HNSW does not work.

Thanks,

Jianshu

@jlmelville
Collaborator

In the case you mentioned, where nearest neighbor descent tends to repeat distance calculations a lot, is there a rough figure for what fraction of all pairwise distances ends up being computed to get a very good k-graph (if pairwise distances were the input, how many of them would be used)? 90%? Or all of them?

I doubt there is any way to know, because it will be highly dataset dependent. The more hubness the true nearest neighbor graph displays, the more repeated calculations there will be: by definition, hubs appear in the neighbor lists of other objects at a much higher frequency than other observations. That means they get involved in a large number of distance calculations, many of which are redundant. If hubness is an issue for your dataset, it can be detected even when the nearest neighbor graph is quite rough, e.g. it will be apparent after one iteration of NND. Unfortunately, by that point you have already carried out a large number of distance calculations, and there isn't a good way to fix the problem.
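If you want to check for it, one crude diagnostic is the skew of the k-occurrence counts from even a rough knn graph (a hedged sketch, not something pynndescent computes for you):

```python
import numpy as np
from scipy.stats import skew

def k_occurrence_skewness(knn_indices):
    # knn_indices: (n_samples, k) array of neighbor indices,
    # e.g. index.neighbor_graph[0] after one or more NND iterations
    n = knn_indices.shape[0]
    counts = np.bincount(knn_indices.ravel(), minlength=n)
    # large positive skew = a few hubs dominate the graph;
    # counts of zero mark the anti-hubs
    return skew(counts)
```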
