Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug: Vector NN search count does not respect WHERE clause #4000

Open
2 tasks done
orimay opened this issue May 8, 2024 · 5 comments · May be fixed by #4036
Open
2 tasks done

Bug: Vector NN search count does not respect WHERE clause #4000

orimay opened this issue May 8, 2024 · 5 comments · May be fixed by #4036
Assignees
Labels
feature New feature or request topic:indexing This is related to indexing and full-text search topic:surrealql This is related to the SurrealQL query language

Comments

@orimay
Copy link

orimay commented May 8, 2024

Describe the bug

I want to be able to perform query against vector field and some scalar fields at the same time. Say, I have vector V. I want to find closest 10 points, that have some flag set to true.

It doesn't seem to be possible to utilize vector index for such things. If we query

SELECT
  id, flag, v
FROM
  target
WHERE
  flag = true &&
  v <|10|> $point
;

it will take the closest 10 points and then filter those by flag. This way, I end up with less then 10 points.

Same issue comes into play when I try to paginate the results. If I want to specify START, vector query must take it into account, and I need to query <|50|> with START 40 if we want to take 10 records with the offset of 40.

It is impossible to build composite index on multiple fields if one of the fields is intended to be used for vector search.

Steps to reproduce

DELETE FROM pts;

DEFINE INDEX mt_pt1 ON pts FIELDS point MTREE DIMENSION 1;

INSERT INTO pts [
	{ id: pts:1, point: [ 1f ], flag: true },
	{ id: pts:2, point: [ 2f ], flag: false },
	{ id: pts:3, point: [ 3f ], flag: true },
	{ id: pts:4, point: [ 4f ], flag: false },
	{ id: pts:5, point: [ 5f ], flag: true },
	{ id: pts:6, point: [ 6f ], flag: false },
	{ id: pts:7, point: [ 7f ], flag: true }
];

LET $pt = [4.5f];

SELECT
    id,
    i,
    flag,
    vector::similarity::cosine(point, $pt) AS similarity
FROM
    pts
WHERE
    flag = true &&
    point <|2|> $pt
ORDER BY
    similarity DESC PARALLEL;

SELECT
    id,
    i,
    flag,
    vector::similarity::cosine(point, $pt) AS similarity
FROM
    pts
WHERE
    flag = true &&
    point <|2|> $pt
ORDER BY
    similarity DESC
EXPLAIN
;

outputs

[
	{
		flag: true,
		i: NONE,
		id: pts:5,
		similarity: 1
	}
]

-------- Query 6 (200.001µs) --------

[
	{
		detail: {
			plan: {
				index: 'mt_pt1',
				operator: '<2>',
				value: [
					4.5f
				]
			},
			table: 'pts'
		},
		operation: 'Iterate Index'
	},
	{
		detail: {
			type: 'Store'
		},
		operation: 'Collector'
	}
]

Expected behaviour

-------- Query 5 (299.999µs) --------

[
	{
		flag: true,
		i: NONE,
		id: pts:5,
		similarity: 1
	},
	{
		flag: true,
		i: NONE,
		id: pts:3,
		similarity: 1
	}
]

SurrealDB version

https://surrealist.app/

Contact Details

dmitrii.a.baranov@gmail.com

Is there an existing issue for this?

  • I have searched the existing issues

Code of Conduct

  • I agree to follow this project's Code of Conduct
@orimay orimay added bug Something isn't working triage This issue is new labels May 8, 2024
@orimay orimay changed the title Bug: Bug: Vector NN search count does not respect WHERE clause May 11, 2024
@phughk
Copy link
Contributor

phughk commented May 13, 2024

Assigning to @emmanuel-keller who is able to answer this better.

@phughk phughk added topic:indexing This is related to indexing and full-text search and removed triage This issue is new labels May 13, 2024
@emmanuel-keller
Copy link
Contributor

emmanuel-keller commented May 14, 2024

The result you get is consistent with the current query logic, in the sense that each predicate returns a subset of the records, and the result is the intersection of these two subsets:

  • flag = true returns [pts:1, pts:3, pts5, pts7]
  • point <|2|> $pt returns [pts:4, pts5]

The intersection of both returns is [pts:5], which is correct from a technical point of view.

That said, it is true that we don't currently have a way to filter the KNN operation so you can have the N results matching the filter.

As a workaround, you could store two tables. One table would store points where the flag is true, and the other would store the points where the flag is false.

DELETE FROM pts_true;
DELETE FROM pts_false;

DEFINE INDEX mt_pt1 ON pts_true FIELDS point MTREE DIMENSION 1;
DEFINE INDEX mt_pt1 ON pts_false FIELDS point MTREE DIMENSION 1;

INSERT INTO pts_true [
	{ id: pts_true:1, point: [ 1f ], flag: true },
	{ id: pts_true:3, point: [ 3f ], flag: true },
	{ id: pts_true:5, point: [ 5f ], flag: true },
	{ id: pts_true:7, point: [ 7f ], flag: true }
];

INSERT INTO pts_false [
	{ id: pts_false:2, point: [ 2f ], flag: false },
	{ id: pts_false:4, point: [ 4f ], flag: false },
	{ id: pts_false:6, point: [ 6f ], flag: false },
];

LET $pt = [4.5f];

SELECT
    id,
    vector::similarity::cosine(point, $pt) AS similarity
FROM
    pts_true
WHERE
    point <|2|> $pt
ORDER BY
    similarity DESC;

Run it with Surrealist

That would work for this simple example, but I agree that it would not be suitable for something more complex involving more intricate filters.

To meet this requirement we could introduce the following syntax:

SELECT
    id,
    flag,
    vector::similarity::cosine(point, $pt) AS similarity
FROM
    pts
WHERE
    flag = true &&
    point <||> $pt
ORDER BY
    similarity DESC
LIMIT 2

In this case, the KNN operator would not limit the result and would only stop providing results once the limit is reached. That would be compatible with the way SurrealDB executes queries, allowing for any complexity in the filtering as well as pagination.

Would that work for you / Would you be happy with this syntax?

@emmanuel-keller emmanuel-keller added feature New feature or request topic:surrealql This is related to the SurrealQL query language and removed bug Something isn't working labels May 14, 2024
@orimay
Copy link
Author

orimay commented May 14, 2024

Yes, this would be amazing, thank you! And thank you for mentioning it in the stream :)

Will it be the only syntax instead of the current one? Having it work only by LIMIT may simplify it

@emmanuel-keller
Copy link
Contributor

We have started working on the implementation. For this to work efficiently, we need to make sure that the query is primarily ordered by the KNN distance. So, we will add a vector::distance::knn() function that will be mandatory to be placed in the ORDER BY clause. It will also return the computed distance, so the distance does not need to be recomputed.

SELECT
    id,
    flag,
    vector::distance::knn() AS distance
FROM
    pts
WHERE
    flag = true &&
    point <||> $pt
ORDER BY
   vector::distance::knn() DESC
LIMIT 2

I think both syntaxes may coexist.

@orimay
Copy link
Author

orimay commented May 17, 2024

Awesome! Will it be possible to order by distance alias? Will it work for cosine? And will it be possible to build index on multiple fields, e.g. vector + boolean?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature New feature or request topic:indexing This is related to indexing and full-text search topic:surrealql This is related to the SurrealQL query language
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants