
Cannot use random projection trees #178

Open · Petkomat opened this issue Apr 21, 2022 · 8 comments
Labels: bug (Something isn't working)
Petkomat (Collaborator) commented Apr 21, 2022

Problem description:

My dataset cannot be represented as an N_INSTANCES x N_DIMENSIONS table of floats: the instances are rows from the main table in a relational database.

Thus, my custom distance measure takes the ids of two instances (i1 and i2 in the code below) as its arguments and computes the distance between the corresponding instances by accessing data from many tables (imagine computing the distance between two movies, given the actors that appeared in them, their genres, etc.).

This is why random projection trees cannot help: randomly dividing instances according to their ids does not make sense.

This is why I use pynndescent.NNDescent(..., tree_init=False). However, when index.prepare() is called, pynndescent again tries to grow some trees, and the minimal working example below raises an error ("no hyperplanes of adequate size were found").

An ugly solution:

It turns out that providing examples as [id, id, ..., id] instead of [id] resolves the issue (the more examples there are, the more copies of the id are necessary), but I still think that the behaviour of prepare() should be considered a bug.
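The workaround can be sketched as follows; N_COPIES is a placeholder name, since the issue only notes that larger datasets need more copies of the id:

```python
import numpy as np

N_INSTANCES = 100
N_COPIES = 4  # assumption: how many copies suffice depends on the dataset size

# Row i is [i, i, ..., i]: the instance id repeated N_COPIES times, so that
# the data no longer look effectively 1-dimensional to the tree builder.
xs = np.repeat(
    np.arange(N_INSTANCES, dtype=np.float32)[:, None], N_COPIES, axis=1
)
```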

A possible solution [needs to be implemented]:

Perhaps the search could start from a random node, from the closest among k random nodes, or from k random nodes.
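One of these options, starting from the closest among k random nodes, could look roughly like this. Note that random_start and dist are hypothetical names, not pynndescent API; dist(i, query) is assumed to return the distance from node i to the query:

```python
import numpy as np

def random_start(n_nodes, query, dist, k=5, seed=None):
    # Sample k distinct candidate nodes and start the search from the one
    # closest to the query (a stand-in for descending a projection tree).
    rng = np.random.default_rng(seed)
    candidates = rng.choice(n_nodes, size=min(k, n_nodes), replace=False)
    return min(candidates, key=lambda i: dist(i, query))

# Toy usage: ids index into a flat data array, as in the issue's example.
DATA = np.arange(10, dtype=np.float32)
start = random_start(10, 3.2, lambda i, q: abs(DATA[i] - q), k=10)
```

With k equal to the number of nodes every node is a candidate, so this degenerates to exact search; smaller k trades accuracy for speed.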

Traceback:

Traceback (most recent call last):
  File "C:/Users/.../fast_nn/mwe.py", line 31, in <module>
    index.prepare()
  File "C:\Users\...\pynnd\lib\site-packages\pynndescent\pynndescent_.py", line 1555, in prepare
    self._init_search_graph()
  File "C:\Users\...\pynnd\lib\site-packages\pynndescent\pynndescent_.py", line 963, in _init_search_graph
    for tree in rp_forest
  File "C:\Users\...\pynnd\lib\site-packages\pynndescent\pynndescent_.py", line 963, in <listcomp>
    for tree in rp_forest
  File "C:\Users\...\pynnd\lib\site-packages\pynndescent\rp_trees.py", line 1158, in convert_tree_format
    hyperplane_dim = dense_hyperplane_dim(tree.hyperplanes)
  File "C:\Users\...\pynnd\lib\site-packages\pynndescent\rp_trees.py", line 1140, in dense_hyperplane_dim
    raise ValueError("No hyperplanes of adequate size were found!")
ValueError: No hyperplanes of adequate size were found!

Minimal (not)working example:

import pynndescent
import numpy as np
import numba

np.random.seed(1234)


N_INSTANCES = 100
# The actual data live outside the index; the index only ever sees instance ids.
DATA = np.random.random(N_INSTANCES).astype(np.float32)


@numba.njit(
    numba.types.float32(
        numba.types.Array(numba.types.float32, 1, "C", readonly=True),
        numba.types.Array(numba.types.float32, 1, "C", readonly=True),
    ),
    fastmath=True,
    locals={
        "i1": numba.types.int64,
        "i2": numba.types.int64,
    },
)
def my_metric(i1_array, i2_array):
    # Each "point" is a 1-element array holding an instance id; the metric
    # looks up the corresponding values and compares those.
    i1 = numba.int64(i1_array[0])
    i2 = numba.int64(i2_array[0])
    return np.abs(DATA[i1] - DATA[i2])


# Each instance is represented only by its id.
xs = np.array([[i for _ in range(1)] for i in range(N_INSTANCES)]).astype(np.float32)
index = pynndescent.NNDescent(xs, n_jobs=1, metric=my_metric, n_neighbors=1, tree_init=False)
index.prepare()  # raises ValueError: No hyperplanes of adequate size were found!
neighbors, distances = index.query(xs)

lmcinnes (Owner)

I agree, that's a bug. I think there are some somewhat sensible ways I can fix that, but I'm not sure when I can get to it.

@lmcinnes lmcinnes self-assigned this Apr 21, 2022
@lmcinnes lmcinnes added the bug Something isn't working label Apr 21, 2022
lmcinnes (Owner)

I just pushed some fixes for this to the master branch. If you have the time to check it out and see if it resolves your problem I would greatly appreciate it.

Petkomat (Collaborator, Author) commented May 5, 2022

I pip-uninstalled the version that I got via pip and installed the version from this repository. Now prepare() works; however, the code crashes at the line neighbors, distances = index.query(xs). Here is the full report (together with an irrelevant warning):

C:\Users\...\pynnd\lib\site-packages\scipy\sparse\_index.py:125: SparseEfficiencyWarning: Changing the sparsity structure of a csr_matrix is expensive. lil_matrix is more efficient.
  self._set_arrayXarray(i, j, x)
Traceback (most recent call last):
  File "C:/Users/.../fast_nn/mwe.py", line 32, in <module>
    neighbors, distances = index.query(xs)
  File "C:\Users\...\pynnd\lib\site-packages\pynndescent-0.5.5-py3.7.egg\pynndescent\pynndescent_.py", line 1659, in query
    indices = self._vertex_order[indices]
AttributeError: 'NNDescent' object has no attribute '_vertex_order'

lmcinnes (Owner) commented May 6, 2022

Ah yes, I think I see what the problem there would be. I think I can fix it; hopefully I'll get to it in the next few days.

Petkomat (Collaborator, Author) commented May 20, 2022

I have now looked into the code. If I understand correctly, the exception is raised because the code finds the hyperplanes, but their normal vectors are of dimension 1 (which is the dimension of the dummy normal vectors [-1.0]).

Since the dimension of the normal vectors equals the dimension of the data, it seems that changing the signature of rp_trees.convert_tree_format(tree, data_size) to rp_trees.convert_tree_format(tree, data_size, data_dim) and setting hyperplane_dim to data_dim resolves the issue.

Of course, by doing so we skip the "degenerate-tree" check in rp_trees.dense_hyperplane_dim(tree.hyperplanes), but it seems that this check should be equivalent to the condition number of examples in data > leaf_size (present in both make_angular_tree and make_euclidean_tree), which could easily be tested if desired.

However, (why) do we even care whether the trees contain more than one node, i.e., whether at least one hyperplane was found?

I played around a bit (using hyperplane_dim = data_dim and leaf_size = 200 while N_INSTANCES = 100) and everything seems to work normally (1D data and/or degenerate single-node trees).

EDIT: All the existing pytest tests also pass (but there is no test yet that covers 1D data).

lmcinnes (Owner) commented May 20, 2022

Ah yes, I see. The 1D data is potentially going to be an issue. I'll add a test for 1D data and see if I can wring out the last issues.

lmcinnes added a commit that referenced this issue May 20, 2022
lmcinnes (Owner)

With regard to the changes you are proposing here: they seem to make sense, but I admit that (without line references) I don't entirely follow exactly what you are proposing. I would certainly welcome a PR with these changes if you can manage it.

Petkomat (Collaborator, Author) commented May 23, 2022

By mistake, I also included my implementation of NNDescent.update_with_changed_data, which provides an option for efficient updates of data that change over time (as is the case in my use case). In that method, the updated rows (in the raw data) are modified in place (instead of being appended to the existing data). Distances that did not change are kept as they are (this is the role of the init_dist argument in the constructor). If any new data arrive, "the official" update is called.

(tested in test_update_with_changed_data)
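The in-place update idea could be sketched as follows; changed_rows is a hypothetical helper, not the actual implementation. The point is that only distances involving changed rows need recomputation, while all others can be kept:

```python
import numpy as np

def changed_rows(old_data, new_data):
    # Indices of rows whose raw data differ: only distances involving
    # these rows need to be recomputed; the rest are kept as-is.
    return np.flatnonzero(np.any(old_data != new_data, axis=1))

# Toy usage: row 1 changed, rows 0 and 2 did not.
old = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]])
new = old.copy()
new[1] = [1.5, 1.0]
changed = changed_rows(old, new)
```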
