feat: Support Indexing options for Astra DB columns #2919

erichare · 2024-04-22T14:58:39Z

This pull request adds support for specifying indexing options for individual columns in Astra DB, so users can avoid the situation where long text columns are indexed by default.

erichare · 2024-04-22T15:05:12Z

@potter-potter Keeping this as a draft for now, as it's a fairly substantial restructuring of the initialization process. We've had users run into issues with the integration because some of their columns hold very long text and get indexed by default. The goal of this PR is to let users specify, at creation time, which columns should not be indexed, because Astra has an internal limit.

Would love to run the lint and other checks on this if possible! I tried to run as much as possible locally for now.

erichare · 2024-04-22T18:34:13Z

Marked it as ready for review now, after some internal testing with the team. The primary change here is that we give flexibility in which Astra DB fields to index. By default, we deny indexing on the metadata field (which can sometimes be very long due to the parsed HTML from PDFs), but users can override this either in advance or at collection creation time.
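As a rough illustration of the default described above, the sketch below builds collection options that deny indexing on the metadata field unless the user supplies their own policy. The helper `build_collection_options` and the constant `DEFAULT_INDEXING_POLICY` are hypothetical names, and the `{"indexing": {"deny": [...]}}` shape is an assumption based on the Astra DB allow/deny indexing option; it may not match the PR's exact code.

```python
from typing import Any, Dict, Optional

# Hypothetical default: skip indexing on the (potentially very long) metadata field.
DEFAULT_INDEXING_POLICY: Dict[str, Any] = {"deny": ["metadata"]}


def build_collection_options(
    requested_indexing_policy: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Return collection options, preferring a user-supplied indexing policy."""
    policy = requested_indexing_policy or DEFAULT_INDEXING_POLICY
    # Copy the policy so callers cannot mutate the shared default.
    return {"indexing": dict(policy)}
```

A user who only wants a specific field indexed could pass something like `{"allow": ["text"]}` instead of relying on the default.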

potter-potter · 2024-04-23T00:28:55Z

@erichare I'll check this out tomorrow. Thanks.

erichare · 2024-04-24T15:02:58Z

Thanks @potter-potter !

potter-potter · 2024-04-25T20:21:20Z

unstructured/ingest/connector/astra.py

+                    dimension=embedding_dimension,
+                    options=_options,
+                )
+            except APIRequestError:


This seems clunky to have all this happen under an except. Is there a better way to prequalify the collection?

Also, I assume .create_collection just connects if the collection is already created.

@potter-potter you are right. To be honest, this logic was taken from another integration with a different library, and I think a lot of it is a legacy of how that library behaves. In this instance, just creating the collection (which, as you said, will connect if the collection already exists) will be enough.

If there is an APIRequestError due to legacy indexing settings, which is what the logic was intended to handle, there are now obvious ways for the user to address it, and I don't expect this to be the case for Unstructured users in particular.

All that said, good callout. I'll clean up this code by removing the try/except clause: connect or create, and any error will be raised (which should almost never occur).
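The simplified "connect or create" flow agreed on here might look roughly like the sketch below. `FakeAstraDB` is a hypothetical stand-in so the example runs without a database, and its behavior (returning the existing collection when the name is already present) is the assumption stated in this thread, not verified library behavior. `get_collection` is also a hypothetical name.

```python
from typing import Any, Dict, Optional


class FakeAstraDB:
    """Hypothetical stand-in for the real client, only to make the sketch runnable."""

    def __init__(self) -> None:
        self.collections: Dict[str, Dict[str, Any]] = {}

    def create_collection(
        self,
        collection_name: str,
        *,
        dimension: Optional[int] = None,
        options: Optional[Dict[str, Any]] = None,
    ) -> Dict[str, Any]:
        # Assumed behavior: return the existing collection if it is already there.
        return self.collections.setdefault(
            collection_name,
            {"name": collection_name, "dimension": dimension, "options": options},
        )


def get_collection(
    client: FakeAstraDB,
    collection_name: str,
    embedding_dimension: int,
    options: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    # No try/except: any error raised by the client propagates to the caller.
    return client.create_collection(
        collection_name, dimension=embedding_dimension, options=options
    )
```

Calling `get_collection` twice with the same name returns the same collection, which is the property the removed fallback branch was working around.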

potter-potter · 2024-04-25T20:23:23Z

unstructured/ingest/connector/astra.py

+                        self._astra_db_collection = self._astra_db.collection(
+                            collection_name=collection_name,
+                        )
+                else:


This else doesn't seem right. It will run, I believe, if the try completes successfully.

Commented above, but this will no longer be part of the code.
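For reference, the reviewer's point about `else` on a `try` statement can be checked with a small standalone example. This is generic Python behavior, not code from the PR; the function name `run` is made up for the illustration.

```python
from typing import List


def run(should_fail: bool) -> List[str]:
    """Record which branches of try/except/else execute."""
    steps: List[str] = []
    try:
        steps.append("try")
        if should_fail:
            raise ValueError("simulated failure")
    except ValueError:
        steps.append("except")
    else:
        # Runs only when the try block raised no exception.
        steps.append("else")
    return steps
```

So an `else` attached to a `try` is the success path, not a fallback, which is why it looked wrong next to the collection-lookup code.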

potter-potter · 2024-04-25T20:26:03Z

unstructured/ingest/connector/astra.py

@@ -35,6 +34,8 @@ class SimpleAstraConfig(BaseConnectorConfig):
    access_config: AstraAccessConfig
    collection_name: str
    embedding_dimension: int
+    namespace: t.Optional[str] = None


If you want them exposed to the CLI, namespace and requested_indexing_policy need to be added to

https://github.com/Unstructured-IO/unstructured/blob/main/unstructured/ingest/cli/cmds/astra.py

That will also be helpful, since they will get some documentation there.

Done! I tried to match the data model to the "required" option for the CLI; hopefully this looks good.

erichare · 2024-04-25T20:40:35Z

@potter-potter I just tried to address your comments. I agree with all of them and explained a little of the prior (misguided, lol) motivation :) Let me know if this looks better!

potter-potter · 2024-04-27T17:51:58Z

unstructured/ingest/cli/cmds/astra.py

+                ["--requested-indexing-policy"],
+                required=False,
+                default=None,
+                type=str,


Import Dict at the top:
from unstructured.ingest.cli.interfaces import CliConfig, Dict

Then use it for the option:
type=Dict(),
help="The indexing policy to use for the collection. "
'example: \'{"blablabla":"blablabla"}\' ',
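To show why a dict-aware option type helps here, the sketch below parses a JSON string from the command line into a Python dict. It deliberately uses the standard library's `argparse` rather than the project's own CLI machinery and `Dict` type, so it is an analogy only; `json_dict` and `build_parser` are hypothetical names, and the example policy value is illustrative.

```python
import argparse
import json
from typing import Any, Dict


def json_dict(value: str) -> Dict[str, Any]:
    """Parse a JSON object passed on the command line into a dict."""
    try:
        parsed = json.loads(value)
    except json.JSONDecodeError as exc:
        raise argparse.ArgumentTypeError(f"not valid JSON: {exc}") from exc
    if not isinstance(parsed, dict):
        raise argparse.ArgumentTypeError("expected a JSON object")
    return parsed


def build_parser() -> argparse.ArgumentParser:
    """Build a parser exposing the two new, optional settings."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--namespace", required=False, default=None, type=str)
    parser.add_argument(
        "--requested-indexing-policy",
        required=False,
        default=None,
        type=json_dict,  # with type=str the value would stay an unparsed string
        help="The indexing policy to use for the collection, as a JSON object.",
    )
    return parser
```

With `type=str`, downstream code would receive the raw JSON text and have to parse it itself; converting at the CLI boundary keeps the config field a real dict.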

potter-potter · 2024-04-27T17:52:55Z

@erichare This is looking good. I can take over once you make the little dict change.

erichare · 2024-04-28T00:04:16Z

> @erichare This is looking good. I can take over once you make the little dict change.

Thanks @potter-potter! I made the update. Does it look okay?

potter-potter · 2024-04-29T01:57:33Z

@erichare Looking good! I'll bring it to the finish line tomorrow. Thanks!

erichare · 2024-04-29T01:58:46Z

> @erichare Looking good! I'll bring it to the finish line tomorrow. Thanks!

Thank you very much!

potter-potter · 2024-05-02T21:22:06Z

@erichare Just to keep you updated: I have this in a branch and was going to include the "feat: Astra DB Source Connector Support" PR at the same time (better to do everything at once to get it merged). But Astra DB Source has some issues I need to debug, so I'm working on that. I will keep you updated and may ask for your help.

#2964

erichare added 2 commits April 22, 2024 07:57

FEAT: Support Indexing options for Astra DB columns

c814b55

FIX lint check

35a84ea

erichare changed the title ~~FEAT: Support Indexing options for Astra DB columns~~ feat: Support Indexing options for Astra DB columns Apr 22, 2024

erichare added 5 commits April 22, 2024 09:16

Pass the namespace param

527ea09

Use pre-referenced collection name var

d194b08

Set the separator to an underscore for metadata fields

7ed3c3a

Deny indexing on full metadata by default

ee4a5ed

Ruff check fix

3df6417

erichare marked this pull request as ready for review April 22, 2024 18:33

FIX: no default for the options for compat with other integrations

4976fb0

Merge branch 'main' into feat/astra-db-indexing-options

275674b

potter-potter reviewed Apr 25, 2024

Clean up initialization based on feedback

9c87fec

Merge branch 'main' into feat/astra-db-indexing-options

acc2edb

potter-potter reviewed Apr 27, 2024

Update indexing policy type and provide example

66bee62

Merge branch 'main' into feat/astra-db-indexing-options

78350a2

Remove mistaken venv commit

d302e7b
