New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: Support Indexing options for Astra DB columns #2919
base: main
Are you sure you want to change the base?
feat: Support Indexing options for Astra DB columns #2919
Conversation
@potter-potter Keeping this as a draft for now as its a fairly decent restructuring of the initialization process, but we've had users that have had issues with the integration because they have some columns that are very long text columns which by default get indexed. The goal of this PR is to allow the users to specify at creation time which columns are not to be indexed, because Astra has a limit internally. Would love to run the lint and other checks on this if possible! I tried to run as much as possible locally for now. |
Marked it as ready for review now after some testing internally with the team. The primary change here is we give flexibility in which Astra DB fields to index. By default, we deny indexing on the metadata field (which can sometimes be very long due to the parsed HTML from PDFs) but users can override this either in advance, or at collection creation time. |
@erichare I'll check this out tomorrow. Thanks. |
Thanks @potter-potter ! |
dimension=embedding_dimension, | ||
options=_options, | ||
) | ||
except APIRequestError: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems clunky to have all this happen under an except. Is there a better way to prequalify the collection?
Also, I assume .create_collection just connects if the collection is already created.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@potter-potter you are right. to be honest, this logic was just taken from another integration with a different library, but i think a lot of it is legacy in terms of how the library behaves. In this instance, just creating the collection (which as you said, will connect if the collection already exists) will be enough.
If there is an APIRequestError due to legacy indexing settings, which is what the logic was intending to handle, there are obvious ways now for the user to address it, AND i dont expect this to be the case for Unstructured users in particular.
All that said, good call out and i'll just clean up this code by removing the try / except clause - connect or create, any error will be raised (which should almost never occur)
self._astra_db_collection = self._astra_db.collection( | ||
collection_name=collection_name, | ||
) | ||
else: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This else doesn't seem right. It will run, i believe, if the try completes successfully.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Commented above, but this will no longer be part of the code
@@ -35,6 +34,8 @@ class SimpleAstraConfig(BaseConnectorConfig): | |||
access_config: AstraAccessConfig | |||
collection_name: str | |||
embedding_dimension: int | |||
namespace: t.Optional[str] = None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if you want them exposed to the cli, namespace and requested_indexing_policy need to be added to
https://github.com/Unstructured-IO/unstructured/blob/main/unstructured/ingest/cli/cmds/astra.py
Which will also be helpful since they will have some documentation there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done! i tried to match the data model to the "required" option for the CLI, hopefully this looks good.
@potter-potter just tried to address your comments. agree with all of them and explained a little for the prior (misguided, lol) motivation :) let me know if this looks better! |
["--requested-indexing-policy"], | ||
required=False, | ||
default=None, | ||
type=str, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
import Dict at the top
from unstructured.ingest.cli.interfaces import CliConfig, Dict
type=Dict(),
help="The indexing policy to use for the collection."
'example: \'{"blablabla":"blablabla"}\' ',
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated!
@erichare This is looking good. I can take over once you make the little dict change. |
Thanks @potter-potter ! I made the update, does it look okay? |
@erichare Looking good! I'll bring it to the finish line tomorrow. Thanks! |
Thank you very much! |
@erichare just to keep you updated. I have this in a branch. And was going to just include the |
This pull requests adds support for specifying the indexing options for various columns in Astra DB, allowing users to avoid a situation where long text columns are by-default indexed.