Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Support Indexing options for Astra DB columns #2919

Open
wants to merge 14 commits into
base: main
Choose a base branch
from

Conversation

erichare
Copy link

This pull requests adds support for specifying the indexing options for various columns in Astra DB, allowing users to avoid a situation where long text columns are by-default indexed.

@erichare
Copy link
Author

@potter-potter Keeping this as a draft for now as its a fairly decent restructuring of the initialization process, but we've had users that have had issues with the integration because they have some columns that are very long text columns which by default get indexed. The goal of this PR is to allow the users to specify at creation time which columns are not to be indexed, because Astra has a limit internally.

Would love to run the lint and other checks on this if possible! I tried to run as much as possible locally for now.

@erichare erichare changed the title FEAT: Support Indexing options for Astra DB columns feat: Support Indexing options for Astra DB columns Apr 22, 2024
@erichare erichare marked this pull request as ready for review April 22, 2024 18:33
@erichare
Copy link
Author

Marked it as ready for review now after some testing internally with the team. The primary change here is we give flexibility in which Astra DB fields to index. By default, we deny indexing on the metadata field (which can sometimes be very long due to the parsed HTML from PDFs) but users can override this either in advance, or at collection creation time.

@potter-potter
Copy link
Contributor

@erichare I'll check this out tomorrow. Thanks.

@erichare
Copy link
Author

Thanks @potter-potter !

dimension=embedding_dimension,
options=_options,
)
except APIRequestError:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems clunky to have all this happen under an except. Is there a better way to prequalify the collection?

Also, I assume .create_collection just connects if the collection is already created.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@potter-potter you are right. to be honest, this logic was just taken from another integration with a different library, but i think a lot of it is legacy in terms of how the library behaves. In this instance, just creating the collection (which as you said, will connect if the collection already exists) will be enough.

If there is an APIRequestError due to legacy indexing settings, which is what the logic was intending to handle, there are obvious ways now for the user to address it, AND i dont expect this to be the case for Unstructured users in particular.

All that said, good call out and i'll just clean up this code by removing the try / except clause - connect or create, any error will be raised (which should almost never occur)

self._astra_db_collection = self._astra_db.collection(
collection_name=collection_name,
)
else:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This else doesn't seem right. It will run, i believe, if the try completes successfully.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Commented above, but this will no longer be part of the code

@@ -35,6 +34,8 @@ class SimpleAstraConfig(BaseConnectorConfig):
access_config: AstraAccessConfig
collection_name: str
embedding_dimension: int
namespace: t.Optional[str] = None
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if you want them exposed to the cli, namespace and requested_indexing_policy need to be added to

https://github.com/Unstructured-IO/unstructured/blob/main/unstructured/ingest/cli/cmds/astra.py

Which will also be helpful since they will have some documentation there.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done! i tried to match the data model to the "required" option for the CLI, hopefully this looks good.

@erichare
Copy link
Author

@potter-potter just tried to address your comments. agree with all of them and explained a little for the prior (misguided, lol) motivation :) let me know if this looks better!

["--requested-indexing-policy"],
required=False,
default=None,
type=str,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

import Dict at the top
from unstructured.ingest.cli.interfaces import CliConfig, Dict

type=Dict(),
help="The indexing policy to use for the collection."
'example: \'{"blablabla":"blablabla"}\' ',

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated!

@potter-potter
Copy link
Contributor

@erichare This is looking good. I can take over once you make the little dict change.

@erichare
Copy link
Author

@erichare This is looking good. I can take over once you make the little dict change.

Thanks @potter-potter ! I made the update, does it look okay?

@potter-potter
Copy link
Contributor

@erichare Looking good! I'll bring it to the finish line tomorrow. Thanks!

@erichare
Copy link
Author

@erichare Looking good! I'll bring it to the finish line tomorrow. Thanks!

Thank you very much!

@potter-potter
Copy link
Contributor

potter-potter commented May 2, 2024

@erichare just to keep you updated. I have this in a branch. And was going to just include the feat: Astra DB Source Connector Support at the same time. (better to do everything at once to get it merged.) But Astra DB Source has some issues I need to debug. So working on that. Will keep you updated. And may ask for your help.

#2964

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants