nested loop sequential or parallelization runtime #188

Open

lyelibi opened this issue Jun 3, 2022 · 1 comment

lyelibi commented Jun 3, 2022

I have recently started using PyNNDescent to create nearest neighbor graphs. It is much faster and more stable than scikit-learn's nearest neighbor graph alternative.
The problem I am trying to solve is one of scale: I generate a data set and compute a nearest neighbor graph for every pair of its columns, which means that for a 500 x 50 matrix (the example below) I have to compute 50*(50-1)/2 = 1225 nearest neighbor graphs, i.e. 1225 calls to pynndescent. This takes 9 seconds on my machine, which is very limiting because I would like to do this for significantly larger data sets (e.g. N = 1000 columns, which would mean 499,500 graphs, as opposed to 50).

Are there ways to optimize this beyond a plain for loop? I flattened the naive nested loop into a single loop over column pairs, but Python threading and multiprocessing haven't produced better results than the solution below.

import itertools
import numpy as np
from pynndescent import NNDescent

data = np.random.normal(0, 1, (500, 50))

# All 50 * 49 / 2 = 1225 column pairs.
idx = list(itertools.combinations(range(data.shape[1]), 2))
count = len(idx)

s = np.empty(count, dtype=np.float16)
for i in range(count):
    # Nearest neighbor graph on one column pair; sum the neighbor distances.
    s[i] = NNDescent(data[:, idx[i]], metric='euclidean', n_neighbors=5).neighbor_graph[1].sum()
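
A process-based variant of the same loop, sketched here with joblib (the specific library and worker setup are assumptions on my part, since the threading/multiprocessing attempts mentioned above are not shown):

import itertools
import numpy as np
from joblib import Parallel, delayed
from pynndescent import NNDescent

def pair_score(data, pair):
    # Sum of neighbor distances for the graph built on one column pair.
    return NNDescent(data[:, pair], metric='euclidean',
                     n_neighbors=5).neighbor_graph[1].sum()

data = np.random.normal(0, 1, (500, 50))
pairs = list(itertools.combinations(range(data.shape[1]), 2))

# n_jobs=-1 uses all cores; each worker still pays numba JIT compilation
# and full per-call index construction cost, which is consistent with the
# report above that multiprocessing did not beat the sequential loop.
s = np.array(Parallel(n_jobs=-1)(delayed(pair_score)(data, p) for p in pairs),
             dtype=np.float16)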

lyelibi changed the title from "nested loop sequential or parrelization runtime" to "nested loop sequential or parallelization runtime" on Jun 3, 2022

lmcinnes (Owner) commented Jun 4, 2022

Off the top of my head I don't see any obvious way to improve this significantly. There is simply a lot of computational work to be done, and I'm not sure there are ways around that. I think you would have to look into deeper algorithmic approaches to cut down on the work.
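
One generic lever worth noting (an editorial sketch, not a suggestion from the thread) is to cheapen each individual call through NNDescent's construction parameters; for tiny 500 x 2 inputs the defaults may do more work than a 5-neighbor graph requires, though the accuracy trade-off would need to be checked:

import numpy as np
from pynndescent import NNDescent

data = np.random.normal(0, 1, (500, 2))  # one column pair from the example

# n_trees, n_iters and max_candidates are real NNDescent construction
# parameters; the specific values here are illustrative guesses.
graph = NNDescent(
    data,
    metric='euclidean',
    n_neighbors=5,
    n_trees=4,          # fewer random-projection trees for initialization
    n_iters=4,          # fewer NN-descent refinement iterations
    max_candidates=20,  # smaller candidate pool per iteration
).neighbor_graph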
