
(FeatureRequest)Get LLM to use specific SQL in the question-sql pair in the chroma database #331

Open
jankariwo opened this issue Apr 2, 2024 · 11 comments

Comments

@jankariwo

I've been setting up the Chroma database with DDL, SQL, and business documentation.
vn.generate_sql is not doing very well at generating SQL when the question contains company-specific terms (which have been added to the documentation).
When I used similar1 = vn.get_similar_question_sql(question), I got 10 results, with the first one matching what I have in Chroma. So I appended this to the prompt, vn.generate_sql(question + similar1[0]), and the results were better.
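A rough sketch of that workaround, prepending the best retrieved question-SQL pair to the prompt before generation. The retrieval and prompt-building functions below are simplified, self-contained stand-ins for Vanna's get_similar_question_sql and generate_sql, not the real API:

```python
# Hypothetical sketch of the workaround: append the most similar stored
# question-SQL pair to the question before sending it to the LLM.

def get_similar_question_sql(question, store):
    """Toy retrieval: rank stored pairs by word overlap with the question.
    A real vector store would rank by embedding similarity instead."""
    def overlap(pair):
        return len(set(question.lower().split()) & set(pair["question"].lower().split()))
    return sorted(store, key=overlap, reverse=True)

def build_augmented_prompt(question, store):
    """Append the top-matching stored pair to the question, as in the workaround."""
    similar = get_similar_question_sql(question, store)
    if not similar:
        return question
    top = similar[0]
    return (question
            + "\n-- A similar known question-SQL pair --\n"
            + "Q: " + top["question"] + "\nSQL: " + top["sql"])

store = [
    {"question": "total revenue by region",
     "sql": "SELECT region, SUM(rev) FROM sales GROUP BY region"},
    {"question": "active users last week",
     "sql": "SELECT COUNT(*) FROM users WHERE last_seen > now() - interval '7 days'"},
]
prompt = build_augmented_prompt("total revenue by region", store)
```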

Feature request:
If I ask the LLM the same question as one in a question-SQL pair that I loaded into ChromaDB, is there a way to get the LLM to use the corresponding SQL?

@jankariwo jankariwo changed the title Get LLM to use specific SQL in the question-sql pair in the chroma database (FeatureRequest)Get LLM to use specific SQL in the question-sql pair in the chroma database Apr 2, 2024
@andreped
Contributor

andreped commented Apr 2, 2024

If I ask the LLM the same question as one in a question-SQL pair that I loaded into ChromaDB, is there a way to get the LLM to use the corresponding SQL?

Hello, @jankariwo! :]

I'm not sure I follow. If you train on a question-SQL pair, it will be stored in the Chroma vector DB. If you then ask the same question again, that pair should be among the top 10 most similar question-SQL pairs, and it will therefore be part of the prompt sent for the new SQL completion. It is very likely that the LLM will realize the question is already part of the context and thus knows which SQL query to produce. So I would expect it to yield the same SQL query as output more often than not.

Are you not seeing this? Or are you seeing that it does not always hold, but works most of the time?

@zainhoda
Contributor

zainhoda commented Apr 5, 2024

@andreped I think the feature request is something like:

If the question is exactly the same as one of the trained question-SQL pairs, then instead of using the LLM to generate the SQL, it should bypass the LLM and just return the SQL that was associated with that question.

@jankariwo
Author

@andreped I think the feature request is something like:

If the question is exactly the same as one of the trained question-SQL pairs, then instead of using the LLM to generate the SQL, it should bypass the LLM and just return the SQL that was associated with that question.

That is the correct interpretation.

@jankariwo
Author

If I ask the LLM the same question as one in a question-SQL pair that I loaded into ChromaDB, is there a way to get the LLM to use the corresponding SQL?

Hello, @jankariwo! :]

I'm not sure I follow. If you train on a question-SQL pair, it will be stored in the Chroma vector DB. If you then ask the same question again, that pair should be among the top 10 most similar question-SQL pairs, and it will therefore be part of the prompt sent for the new SQL completion. It is very likely that the LLM will realize the question is already part of the context and thus knows which SQL query to produce. So I would expect it to yield the same SQL query as output more often than not.

Are you not seeing this? Or are you seeing that it does not always hold, but works most of the time?

Yes, you are right. You described the EXPECTED behavior, but the LLM still returns the wrong SQL. I've done several iterations with the same result, hence the feature request.

@andreped
Contributor

andreped commented Apr 5, 2024

Yes, you are right. You described the EXPECTED behavior, but the LLM still returns the wrong SQL. I've done several iterations with the same result, hence the feature request.

I believe I have seen the same, sadly. So it's great that you are requesting this!

I like the idea of simply checking whether the question is already part of the vector store and returning the stored SQL if there is a perfect match. That should resolve the issue. The check can be performed extremely fast and would also improve performance in these cases.

@jiangsiYang

Yes, you are right. You described the EXPECTED behavior, but the LLM still returns the wrong SQL. I've done several iterations with the same result, hence the feature request.

I believe I have seen the same, sadly. So it's great that you are requesting this!

I like the idea of simply checking whether the question is already part of the vector store and returning the stored SQL if there is a perfect match. That should resolve the issue. The check can be performed extremely fast and would also improve performance in these cases.

Thank you! So, can I understand it this way: the 'correct query' store holds questions; each time a search is conducted, it retrieves the question vectors in the 'correct query' dataset, and when the similarity exceeds a certain threshold, it directly returns the answer corresponding to the 'correct query'?

@jankariwo
Author

Yes, you are right. You described the EXPECTED behavior, but the LLM still returns the wrong SQL. I've done several iterations with the same result, hence the feature request.

I believe I have seen the same, sadly. So it's great that you are requesting this!
I like the idea of simply checking whether the question is already part of the vector store and returning the stored SQL if there is a perfect match. That should resolve the issue. The check can be performed extremely fast and would also improve performance in these cases.

Thank you! So, can I understand it this way: the 'correct query' store holds questions; each time a search is conducted, it retrieves the question vectors in the 'correct query' dataset, and when the similarity exceeds a certain threshold, it directly returns the answer corresponding to the 'correct query'?

Yup. It returns the SQL of the matched question instead of sending a prompt to the LLM to generate it.
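The threshold variant discussed above could look something like this minimal sketch. Here difflib's string ratio stands in for real embedding similarity, and the 0.95 threshold and function name are illustrative assumptions, not Vanna defaults:

```python
# Sketch: if a stored question is similar enough to the incoming one,
# return its SQL directly and skip the LLM call entirely.
from difflib import SequenceMatcher

def lookup_sql(question, question_sql_list, threshold=0.95):
    """Return the SQL of the most similar stored question if the similarity
    clears the threshold, otherwise None (caller falls back to the LLM)."""
    best_pair, best_score = None, 0.0
    for pair in question_sql_list:
        score = SequenceMatcher(
            None, question.lower().strip(), pair["question"].lower().strip()
        ).ratio()
        if score > best_score:
            best_pair, best_score = pair, score
    if best_pair is not None and best_score >= threshold:
        return best_pair["sql"]  # confident match: bypass the LLM
    return None  # no confident match: fall through to normal SQL generation

pairs = [{"question": "How many orders were placed today?",
          "sql": "SELECT COUNT(*) FROM orders WHERE order_date = CURRENT_DATE"}]
```

An exact repeat of the trained question scores 1.0 and short-circuits; an unrelated question falls through to the usual generation path.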

@andreped
Contributor

andreped commented Apr 9, 2024

@zainhoda I assume this feature was not part of the last release? It would be of great interest for our applications as well :]

@zainhoda
Contributor

zainhoda commented Apr 9, 2024

@andreped if you (or anyone else) would like to take a quick stab at this, I think this is a relatively simple change.

Add something like this:

    # Iterate through each item in the list
    for item in question_sql_list:
        # Check if the current item's question matches the input question
        if item['question'] == question:
            # If a match is found, return the corresponding sql value
            return item['sql']

right after this line:

https://github.com/vanna-ai/vanna/blob/main/src/vanna/base/base.py#L110

and then maybe add a quick test for this.

@jankariwo
Author

@andreped if you (or anyone else) would like to take a quick stab at this, I think this is a relatively simple change.

Add something like this:

    # Iterate through each item in the list
    for item in question_sql_list:
        # Check if the current item's question matches the input question
        if item['question'] == question:
            # If a match is found, return the corresponding sql value
            return item['sql']

right after this line:

https://github.com/vanna-ai/vanna/blob/main/src/vanna/base/base.py#L110

and then maybe add a quick test for this.

Thanks @zainhoda. I've implemented it: I added it to my base.py and tested it. I also created my own function in the Vanna class to test the outputs. It works perfectly. You just have to look out for trailing whitespace in the question, which might prevent an otherwise exact match, but this can easily be fixed with extra string-cleaning code before checking for a match. Thanks again.
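The string-cleaning step mentioned above might look like this; the helper names are illustrative, not part of Vanna. Normalizing whitespace and case before comparing means trailing spaces or doubled spaces in a stored question no longer break an otherwise exact match:

```python
# Sketch: whitespace- and case-insensitive exact matching of questions.

def normalize(text):
    """Collapse internal whitespace, strip edges, and lowercase."""
    return " ".join(text.split()).lower()

def exact_match_sql(question, question_sql_list):
    """Return the stored SQL whose normalized question equals the input, else None."""
    target = normalize(question)
    for item in question_sql_list:
        if normalize(item["question"]) == target:
            return item["sql"]
    return None

pairs = [{"question": "Top customers by  spend ",
          "sql": "SELECT customer, SUM(amount) FROM orders GROUP BY customer ORDER BY 2 DESC"}]
```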

@andreped
Contributor

@jankariwo Can you make a PR?
