feat: optimize specific indices #2192
wjones127
commented
Apr 12, 2024
edited
- Allows optimizing a specific set of indices
- Allows creating a delta index for a specific set of fragments
- Eliminates the index metadata caching in Python, since it had no invalidation.
ACTION NEEDED: Lance follows the Conventional Commits specification for release automation. The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification. For details on the error, please inspect the "PR Title Check" action.
Will need to rebase on #2132, so waiting on that to merge before this is ready for review.
feat: implement new parameters
refactor: move scalar index optimize into a different file
expose options in Python
test in python
4ed9fc1 to 55a60bc
 def list_indices(self) -> List[Dict[str, Any]]:
-    if getattr(self, "_list_indices_res", None) is None:
-        self._list_indices_res = self._ds.load_indices()
-    return self._list_indices_res
+    return self._ds.load_indices()
Removing this cache. There was no invalidation so it is often wrong.
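To make the staleness problem concrete, here is a minimal sketch of the removed pattern next to the replacement. `FakeDataset`, `CachedWrapper`, `UncachedWrapper`, and `create_index` are hypothetical stand-ins written for this illustration, not Lance classes; only `load_indices`, `list_indices`, and `_list_indices_res` come from the diff above.

```python
from typing import Any, Dict, List


class FakeDataset:
    """Hypothetical stand-in for the inner dataset object."""

    def __init__(self) -> None:
        self._indices: List[Dict[str, Any]] = []

    def create_index(self, name: str) -> None:
        # Hypothetical helper: registers a new index.
        self._indices.append({"name": name})

    def load_indices(self) -> List[Dict[str, Any]]:
        return list(self._indices)


class CachedWrapper:
    """Old behavior: the first result is cached and never invalidated."""

    def __init__(self, ds: FakeDataset) -> None:
        self._ds = ds

    def list_indices(self) -> List[Dict[str, Any]]:
        if getattr(self, "_list_indices_res", None) is None:
            self._list_indices_res = self._ds.load_indices()
        return self._list_indices_res


class UncachedWrapper:
    """New behavior: always ask the dataset."""

    def __init__(self, ds: FakeDataset) -> None:
        self._ds = ds

    def list_indices(self) -> List[Dict[str, Any]]:
        return self._ds.load_indices()
```

In this sketch, calling `list_indices()` on the cached wrapper once and then creating an index leaves the cached wrapper reporting the old list, while the uncached wrapper sees the new index.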
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## main #2192 +/- ##
==========================================
- Coverage 81.23% 80.99% -0.24%
==========================================
Files 187 190 +3
Lines 54698 55797 +1099
Branches 54698 55797 +1099
==========================================
+ Hits 44434 45195 +761
- Misses 7771 8081 +310
- Partials 2493 2521 +28
    self,
    merge_indices: bool | int | List[str] = 1,
    index_new_data: bool | List[int] = True,
Should we do self, *, merge_indices, index_new_data?
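For reference, a bare `*` in a Python signature makes every parameter after it keyword-only. The sketch below shows the effect; `optimize_indices_sketch` is a hypothetical free function written for this illustration (the real method takes `self`), and it only echoes its arguments. The defaults are copied from the diff above.

```python
from typing import List, Union


def optimize_indices_sketch(
    *,
    merge_indices: Union[bool, int, List[str]] = 1,
    index_new_data: Union[bool, List[int]] = True,
) -> dict:
    # Hypothetical body: just echo the arguments so the call shape is visible.
    return {"merge_indices": merge_indices, "index_new_data": index_new_data}
```

With this signature, `optimize_indices_sketch(merge_indices=False)` works, while `optimize_indices_sketch(False)` raises `TypeError`, so callers cannot accidentally swap two loosely typed positional arguments.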
    merge_indices: bool | int | List[str]
        If True, all indices will be merged. If False, no indices will be
        merged and instead a new index delta will be created. If an integer,
        that number of indices will be merged. If a list of UUID strings,
        those specific indices will be merged.
I got a little confused here because I didn't know if you were talking about "pass in UUIDs if you only want to update some columns (e.g. multiple vector embeddings, each with an index, and only update a few)" or if you were talking about "pass in UUIDs here if you have multiple deltas in the same column to merge together"
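One way to make the four documented cases easier to reason about is to normalize the union-typed argument into an explicit tag up front. This is only a sketch of that idea: `describe_merge_indices` and its tag names are hypothetical, and it does not settle which of the two readings above is intended for the UUID list.

```python
from typing import Any, List, Tuple, Union


def describe_merge_indices(value: Union[bool, int, List[str]]) -> Tuple[str, Any]:
    """Hypothetical helper mapping the argument to one of four cases."""
    # bool must be tested before int: isinstance(True, int) is True in Python.
    if isinstance(value, bool):
        # True: merge all indices. False: merge none, create a new delta.
        return ("all", None) if value else ("none", None)
    if isinstance(value, int):
        if value < 0:
            raise ValueError("merge_indices must not be negative")
        # An integer: merge that number of indices.
        return ("count", value)
    if isinstance(value, list) and all(isinstance(v, str) for v in value):
        # A list of UUID strings: merge those specific indices.
        return ("uuids", list(value))
    raise TypeError(
        "merge_indices must be a bool, an int, or a list of UUID strings"
    )
```

Note that under this scheme `True` and `1` land in different cases even though `True == 1` in Python, which is one reason the bool check has to come first.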
@@ -2298,8 +2299,34 @@ def optimize_indices(self, **kwargs):
        the new data to existing partitions. This means an update is much quicker
        than retraining the entire index but may have less accuracy (especially
        if the new data exhibits new patterns, concepts, or trends)
        """
Can we update some of this wording so that we also explain what "merging" is and why you would want to do it?
    options.new_data_handling = NewDataHandling::Fragments(index_new_data_ids);
} else {
    return Err(PyValueError::new_err(
        "index_new_data must be a boolean value.",
Or a list of fragment ids?
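A sketch of the check with an error message that names both accepted forms, written in Python rather than Rust to keep it self-contained. `parse_index_new_data` and its return tags are hypothetical and only mirror the shape of the binding code quoted above; the wording of the message is a suggestion, not what the PR ended up with.

```python
from typing import Any, Tuple


def parse_index_new_data(value: Any) -> Tuple[str, Any]:
    """Hypothetical Python analogue of the binding's argument check."""
    # Check bool first: bool is a subclass of int in Python.
    if isinstance(value, bool):
        return ("bool", value)
    # A list of fragment ids: plain ints only, no bools hiding as ints.
    if isinstance(value, list) and all(
        isinstance(v, int) and not isinstance(v, bool) for v in value
    ):
        return ("fragments", list(value))
    raise ValueError(
        "index_new_data must be a boolean value or a list of fragment ids."
    )
```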
// Note: since we put other delta indexes into data stream, this
// is also the case where we merge indices.
(Some(existing_index), Some(new_data_stream)) => {
    // TODO: how can I downcast `existing_index` and use that?
I don't think I ever did find a good way to solve this problem.