Randomness of Milvus Collection #32733

Chahnwoo · 2024-04-30T08:14:57Z

I am working with the LangChain integration of Milvus and wanted to ask about how the Milvus HNSW algorithm works. Is there any degree of randomness introduced at any point of graph construction or search?

Here is a general overview of the collection generation process that I am using:

Retrieve a list of texts
Iteratively upload those texts into the Milvus collection with a pre-defined embedding model

I am not using any randomization in the insert process and have been using the same embedding model and question set for all tests, yet seem to be getting different search results each time I create a new collection. Does anyone know what the problem might be?

yhmo · 2024-04-30T08:35:19Z

Collaborator

Some questions:

How many texts are uploaded to the Milvus collection?
Did you specify the "index_params" when you initialized the Milvus VectorStore?
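
For readers unfamiliar with the second question, here is a rough sketch of what HNSW index and search parameters commonly look like in Milvus. The specific values are illustrative assumptions, not settings taken from this thread, and passing a separate `search_params` dict alongside `index_params` is likewise an assumption about the setup.

```python
# Illustrative values only; none of these numbers come from this thread.
# "M" and "efConstruction" control HNSW graph construction,
# "ef" controls how widely the graph is explored at search time.
index_params = {
    "index_type": "HNSW",
    "metric_type": "L2",
    "params": {"M": 8, "efConstruction": 64},
}

search_params = {
    "metric_type": "L2",
    "params": {"ef": 64},
}

print(index_params["index_type"], search_params["params"]["ef"])
```

If these are left unspecified, whatever defaults the integration picks are used, which is worth checking before comparing runs.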

10 replies

Chahnwoo May 20, 2024
Author

Thank you for the answer! Can I take that to mean that there is no randomness introduced during the search process, the way it is during index building?

yhmo May 20, 2024
Collaborator

Milvus splits data into segments, and each segment has its own independent index. Segment size can range from 100MB+ to 1GB+. The search engine retrieves the topk results from each segment and merges the N topk lists into the final topk result, so different segment sizes slightly affect the search result. Segment size is affected by how data is inserted (batch by batch or row by row, the insert interval, etc.).
The index algorithm also affects search results.
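
The per-segment search and merge step described above can be sketched in plain Python. This is a toy model, not Milvus code: `exact_topk` and `merge_topk` are hypothetical helpers, the "vectors" are single floats, and the per-segment search is brute force rather than HNSW.

```python
import heapq


def exact_topk(segment, query, k):
    """Brute-force top-k inside one segment (stand-in for a per-segment index)."""
    return heapq.nsmallest(
        k, ((abs(value - query), doc_id) for doc_id, value in segment)
    )


def merge_topk(per_segment_hits, k):
    """Merge per-segment (distance, id) hits into one global top-k."""
    all_hits = (hit for hits in per_segment_hits for hit in hits)
    return heapq.nsmallest(k, all_hits)


# Toy 1-D "vectors" as (id, value) pairs.
values = [0.9, 3.1, 2.2, 7.5, 4.0, 1.4, 6.3, 5.8]
data = [(i, v) for i, v in enumerate(values)]

# The same data split into segments in two different ways.
layout_a = [data[:4], data[4:]]
layout_b = [data[:3], data[3:6], data[6:]]

query, k = 3.0, 3
result_a = merge_topk([exact_topk(s, query, k) for s in layout_a], k)
result_b = merge_topk([exact_topk(s, query, k) for s in layout_b], k)

print([doc_id for _, doc_id in result_a])
print(result_a == result_b)
```

With an exact per-segment search, both layouts return the same ids, because the global topk is always contained in the union of the per-segment topk lists. Differences between layouts therefore come from the per-segment index being approximate (as HNSW is), not from the merge itself.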

Chahnwoo May 20, 2024
Author

But once the vector database has been fully created (no new inserts or deletions or updates take place), those same indices are used consistently for search, right? Assuming no modifications to a created vector database, there is no randomness introduced to search...?

yhmo May 20, 2024
Collaborator

I mean that two different full test processes might end up with different segment sizes.
Let's say we want to test 1M entities with different index types.
Process 1: create collection A, insert data batch by batch with 10000 entities per batch, create the index, test search.
Process 2: create collection B, insert data batch by batch with 500 entities per batch, create the index, test search.

Assume the total data size is 500MB.
Process 1 might get these segments: 130MB + 130MB + 130MB + 110MB.
Process 2 might get these segments: 120MB + 120MB + 120MB + 140MB.

The data distribution is different, so the search results might be a bit different.

If you use the same collection to test and no additional data is inserted, then yes, segment size is not a problem.
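
To put a number on "a bit different" when comparing collection A against collection B, one option is to measure how much the two topk lists overlap. The sketch below is a hypothetical helper, not part of Milvus or LangChain; it assumes results are compared by a stable key such as the document text, since ids generated per collection may not match.

```python
def topk_overlap(results_a, results_b):
    """Fraction of items shared by two top-k result lists, ignoring order."""
    set_a, set_b = set(results_a), set(results_b)
    if not set_a and not set_b:
        return 1.0  # two empty result lists are trivially identical
    return len(set_a & set_b) / max(len(set_a), len(set_b))


# Made-up results for the same query against two separately built collections.
from_collection_a = ["doc 12", "doc 7", "doc 33", "doc 2"]
from_collection_b = ["doc 12", "doc 33", "doc 7", "doc 41"]

overlap = topk_overlap(from_collection_a, from_collection_b)
print(overlap)
```

Running the same question set against both collections and averaging this value gives a rough sense of whether the differences are small topk boundary effects or something larger.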

Answer selected by Chahnwoo

Chahnwoo May 20, 2024
Author

Thank you so much! Your answers have been so helpful.
