Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SQL generated by LLM be limited to the selected training data #317

Open
tzh5477 opened this issue Mar 27, 2024 · 6 comments
Open

SQL generated by LLM be limited to the selected training data #317

tzh5477 opened this issue Mar 27, 2024 · 6 comments

Comments

@tzh5477
Copy link

tzh5477 commented Mar 27, 2024

Due to data security control requirements, can the SQL generated by LLM be limited to the selected training data instead of all the trained data?

@andreped
Copy link
Contributor

andreped commented Mar 27, 2024

Im not sure I understand what you are asking. In terms of sql-training, you feed question-sql pairs and store embeddings for each in a vector store. Then the LLM will use a subset of these question-sql pairs to build the dynamic context. For this retrieval, all pairs in the vector store can be used.

I guess what you want is for some applications, only to make a smaller set of the vector store available to the application? Why would you do this? Why don't you just make two separate vector stores, and then for application A you can use one vector store and application B use another?

Anyways, if you don't want to make certain data available in application A or B, you can just setup a different DB user with access to different tables. That way, if the LLM were to try to make an SQL to retrieve data from parts of the DB that it cannot access, it will not be able to.

What are you trying to do?

@tzh5477
Copy link
Author

tzh5477 commented Mar 27, 2024

You understanding are correct and the solution is good .

I'm a newbie, how can I create and use different vector stores? the relevant code?

Im not sure I understand what you are asking. In terms of sql-training, you feed question-sql pairs and store embeddings for each in a vector store. Then the LLM will use a subset of these question-sql pairs to build the dynamic context. For this retrieval, all pairs in the vector store can be used.

I guess what you want is for some applications, only to make a smaller set of the vector store available to the application? Why would you do this? Why don't you just make two separate vector stores, and then for application A you can use one vector store and application B use another?

Anyways, if you don't want to make certain data available in application A or B, you can just setup a different DB user with access to different tables. That way, if the LLM were to try to make an SQL to retrieve data from parts of the DB that it cannot access, it will not be able to.

What are you trying to do?

@andreped
Copy link
Contributor

I'm a newbie, how can I create and use different vector stores? the relevant code?

But do you want multiple vector stores or multiple DB users with different rights using the same vector store? I am not sure what is most optimal for your use case. But in general, if data privacy is important, you likely want multiple DB users with different rights. And then when communicating with one part of the DB, you use connection A, and then for the other you use another connection B.

Easiest way to solve this is likely to just setup two separate vanna instances (with different DB connectors):
https://github.com/vanna-ai/vanna?tab=readme-ov-file#import

And then in your application, you handle yourself which user can access which vanna instance (and in turn which part of the DB). Does that make sense?

Could also be that instead of having completely separate Vanna instances, you can just override the run_sql method like so:

vanna_instance.run_sql = execute_sql_query_to_dataframe_method_application_A

But this might be problematic if you have multiple users using the same vanna_instance. You dont want to accidentally give user B access to data in connection A.

@tzh5477
Copy link
Author

tzh5477 commented Mar 27, 2024

I'm a newbie, how can I create and use different vector stores? the relevant code?

But do you want multiple vector stores or multiple DB users with different rights using the same vector store? I am not sure what is most optimal for your use case. But in general, if data privacy is important, you likely want multiple DB users with different rights. And then when communicating with one part of the DB, you use connection A, and then for the other you use another connection B.

Easiest way to solve this is likely to just setup two separate vanna instances (with different DB connectors): https://github.com/vanna-ai/vanna?tab=readme-ov-file#import

And then in your application, you handle yourself which user can access which vanna instance (and in turn which part of the DB). Does that make sense?

Could also be that instead of having completely separate Vanna instances, you can just override the run_sql method like so:

vanna_instance.run_sql = execute_sql_query_to_dataframe_method_application_A

But this might be problematic if you have multiple users using the same vanna_instance. You dont want to accidentally give user B access to data in connection A.

multiple database users with different rights using the same vector store.

I don't think overriding the run_sql is a good solution.

@andreped
Copy link
Contributor

multiple database users with different rights using the same vector store.

I don't think overriding the run_sql is a good solution.

Then you need one vanna instance per DB user. That should be rather straight forward to implement. Let me know how it goes :]

@tzh5477
Copy link
Author

tzh5477 commented Apr 3, 2024

I came up with a new solution .

Add a new column to the vector store to distinguish different user groups , based on this field, we can limit the scope of vector searches when users ask questions.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants