
Cannot use random projection trees #178

Open · Petkomat opened this issue Apr 21, 2022 · 8 comments
Labels: bug (Something isn't working)
Petkomat (Collaborator) commented Apr 21, 2022

Problem description:

My dataset cannot be represented as an N_INSTANCES x N_DIMENSIONS table of floats: the instances are rows from the main table in a relational database.

Thus, my custom distance measure takes the ids of two instances (i1 and i2 in the code below) as its arguments and computes the distance between the corresponding instances by accessing data from many tables (imagine computing the distance between two movies, given the actors that appeared in them, their genres, etc.).

This is why random projection trees cannot help: randomly dividing instances according to their ids does not make sense.

This is why I use pynndescent.NNDescent(..., tree_init=False). However, when index.prepare() is called, pynndescent again tries to grow some trees, and the minimal working example below raises an error ("no hyperplanes of adequate size were found").

An ugly solution:

It turns out that providing examples as [id, id, ..., id] instead of [id] resolves the issue (the more examples there are, the more copies of the id are necessary), but I still think that the behaviour of prepare() should be considered a bug.
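The workaround can be sketched as follows; N_COPIES is a placeholder name, since the issue only notes that larger datasets need more copies of the id:

```python
import numpy as np

N_INSTANCES = 100
N_COPIES = 4  # assumption: how many copies suffice depends on the dataset size

# Row i is [i, i, ..., i]: the instance id repeated N_COPIES times, so that
# the data no longer look effectively 1-dimensional to the tree builder.
xs = np.repeat(
    np.arange(N_INSTANCES, dtype=np.float32)[:, None], N_COPIES, axis=1
)
```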

A possible solution [needs to be implemented]:

Perhaps the search could start from a random node, from the closest among k random nodes, or from k random nodes.
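One of these options, starting from the closest among k random nodes, could look roughly like this. Note that random_start and dist are hypothetical names, not pynndescent API; dist(i, query) is assumed to return the distance from node i to the query:

```python
import numpy as np

def random_start(n_nodes, query, dist, k=5, seed=None):
    # Sample k distinct candidate nodes and start the search from the one
    # closest to the query (a stand-in for descending a projection tree).
    rng = np.random.default_rng(seed)
    candidates = rng.choice(n_nodes, size=min(k, n_nodes), replace=False)
    return min(candidates, key=lambda i: dist(i, query))

# Toy usage: ids index into a flat data array, as in the issue's example.
DATA = np.arange(10, dtype=np.float32)
start = random_start(10, 3.2, lambda i, q: abs(DATA[i] - q), k=10)
```

With k equal to the number of nodes every node is a candidate, so this degenerates to exact search; smaller k trades accuracy for speed.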

Traceback:

Traceback (most recent call last):
  File "C:/Users/.../fast_nn/mwe.py", line 31, in <module>
    index.prepare()
  File "C:\Users\...\pynnd\lib\site-packages\pynndescent\pynndescent_.py", line 1555, in prepare
    self._init_search_graph()
  File "C:\Users\...\pynnd\lib\site-packages\pynndescent\pynndescent_.py", line 963, in _init_search_graph
    for tree in rp_forest
  File "C:\Users\...\pynnd\lib\site-packages\pynndescent\pynndescent_.py", line 963, in <listcomp>
    for tree in rp_forest
  File "C:\Users\...\pynnd\lib\site-packages\pynndescent\rp_trees.py", line 1158, in convert_tree_format
    hyperplane_dim = dense_hyperplane_dim(tree.hyperplanes)
  File "C:\Users\...\pynnd\lib\site-packages\pynndescent\rp_trees.py", line 1140, in dense_hyperplane_dim
    raise ValueError("No hyperplanes of adequate size were found!")
ValueError: No hyperplanes of adequate size were found!

Minimal (not)working example:

import pynndescent
import numpy as np
import numba

np.random.seed(1234)


N_INSTANCES = 100
# The actual data live outside the index; the index only ever sees instance ids.
DATA = np.random.random(N_INSTANCES).astype(np.float32)


@numba.njit(
    numba.types.float32(
        numba.types.Array(numba.types.float32, 1, "C", readonly=True),
        numba.types.Array(numba.types.float32, 1, "C", readonly=True),
    ),
    fastmath=True,
    locals={
        "i1": numba.types.int64,
        "i2": numba.types.int64,
    },
)
def my_metric(i1_array, i2_array):
    # Each "point" is a 1-element array holding an instance id; the metric
    # looks up the corresponding values and compares those.
    i1 = numba.int64(i1_array[0])
    i2 = numba.int64(i2_array[0])
    return np.abs(DATA[i1] - DATA[i2])


# Each instance is represented only by its id.
xs = np.array([[i for _ in range(1)] for i in range(N_INSTANCES)]).astype(np.float32)
index = pynndescent.NNDescent(xs, n_jobs=1, metric=my_metric, n_neighbors=1, tree_init=False)
index.prepare()  # raises ValueError: No hyperplanes of adequate size were found!
neighbors, distances = index.query(xs)

lmcinnes (Owner)

I agree, that's a bug. I think there are some somewhat sensible ways I can fix that, but I'm not sure when I can get to it.

@lmcinnes lmcinnes self-assigned this Apr 21, 2022
@lmcinnes lmcinnes added the bug Something isn't working label Apr 21, 2022
lmcinnes (Owner)

I just pushed some fixes for this to the master branch. If you have the time to check it out and see if it resolves your problem I would greatly appreciate it.

Petkomat (Collaborator, Author) commented May 5, 2022

I pip-uninstalled the version that I got via pip and installed the version from this repository. Now prepare() works; however, the code crashes at the line neighbors, distances = index.query(xs). Here is the full report (together with an irrelevant warning):

C:\Users\...\pynnd\lib\site-packages\scipy\sparse\_index.py:125: SparseEfficiencyWarning: Changing the sparsity structure of a csr_matrix is expensive. lil_matrix is more efficient.
  self._set_arrayXarray(i, j, x)
Traceback (most recent call last):
  File "C:/Users/.../fast_nn/mwe.py", line 32, in <module>
    neighbors, distances = index.query(xs)
  File "C:\Users\...\pynnd\lib\site-packages\pynndescent-0.5.5-py3.7.egg\pynndescent\pynndescent_.py", line 1659, in query
    indices = self._vertex_order[indices]
AttributeError: 'NNDescent' object has no attribute '_vertex_order'

lmcinnes (Owner) commented May 6, 2022

Ah yes, I think I see what the problem there would be. I think I can fix it; hopefully I'll get to it in the next few days.

Petkomat (Collaborator, Author) commented May 20, 2022

I have now looked into the code. If I understand correctly, the exception is raised because the code finds the hyperplanes, but their normal vectors are of dimension 1 (which is the dimension of the dummy normal vectors [-1.0]).

Since the dimension of the normal vectors equals the dimension of the data, it seems that changing the signature of rp_trees.convert_tree_format(tree, data_size) to rp_trees.convert_tree_format(tree, data_size, data_dim) and setting hyperplane_dim to data_dim resolves the issue.

Of course, by doing so we skip the "degenerate-tree" check in rp_trees.dense_hyperplane_dim(tree.hyperplanes), but it seems that this check should be equivalent to the condition number of examples in data > leaf_size (present in both make_angular_tree and make_euclidean_tree), which could easily be tested if desired.

However, (why) do we even care whether the trees contain more than one node, i.e., whether at least one hyperplane was found?

I played around a bit (using hyperplane_dim = data_dim and leaf_size = 200 while N_INSTANCES = 100) and everything seems to work normally (1D data and/or degenerate single-node trees).

EDIT: All the existing pytest tests also pass (but there is no test yet that covers 1D data).

lmcinnes (Owner) commented May 20, 2022

Ah yes, I see. The 1D data is potentially going to be an issue. I'll add a test for 1D data and see if I can wring out the last issues.

lmcinnes added a commit that referenced this issue May 20, 2022
lmcinnes (Owner)

With regard to the changes you are proposing here: they seem to make sense, but I admit that (without line references) I don't entirely follow exactly what you are proposing. I would certainly welcome a PR with these changes if you can manage it.

Petkomat (Collaborator, Author) commented May 23, 2022

By mistake, I also included my implementation of NNDescent.update_with_changed_data, which provides an option for efficient updates of data that change over time (as is the case in my use case). In that method, the updated rows (in the raw data) are modified in place (instead of being appended to the existing data). Distances that did not change are kept as they are (this is the role of the init_dist argument in the constructor). If any new data arrive, "the official" update is called.

(tested in test_update_with_changed_data)
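The in-place update idea could be sketched as follows; changed_rows is a hypothetical helper, not the actual implementation. The point is that only distances involving changed rows need recomputation, while all others can be kept:

```python
import numpy as np

def changed_rows(old_data, new_data):
    # Indices of rows whose raw data differ: only distances involving
    # these rows need to be recomputed; the rest are kept as-is.
    return np.flatnonzero(np.any(old_data != new_data, axis=1))

# Toy usage: row 1 changed, rows 0 and 2 did not.
old = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]])
new = old.copy()
new[1] = [1.5, 1.0]
changed = changed_rows(old, new)
```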
