Binder exception: Cannot retrieve child type of type ANY. LIST or ARRAY is expected. #3507

Open
andriichumak opened this issue May 16, 2024 · 3 comments

@andriichumak

Hi team. I'm trying to use Kuzu for semantic search using array_cosine_similarity and can't get it working. This looks like a bug to me.

Here is the minimal repro:

import kuzu
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('sentence-transformers/all-MiniLM-L12-v2')
embedded = model.encode(["test"])[0].tolist()  # list of 384 float numbers

db = kuzu.Database("./demo_db")
conn = kuzu.Connection(db)

conn.execute("CREATE NODE TABLE MyNode(id STRING, embedding DOUBLE[384], PRIMARY KEY(id))")
conn.execute("CREATE (d:MyNode {id: 'test', embedding: $emb})", {"emb": embedded})
response = conn.execute(
    """
    MATCH (d:MyNode)
    RETURN d.id, array_cosine_similarity(d.embedding, $emb)
    """, {"emb": embedded}
)

# Here I get exception
# RuntimeError: Binder exception: Cannot retrieve child type of type ANY. LIST or ARRAY is expected.

Kuzu 0.4.2
Python 3.11.9

@prrao87 prrao87 added the usability Issues related to better usability experience, including bad error messages label May 16, 2024
@prrao87
Member

prrao87 commented May 16, 2024

Hi @andriichumak, yup, it seems like we have to figure out the right way to promote types in this very common scenario for embeddings. Here's the sequence of steps leading to this issue:

  • sentence-transformers in Python returns each embedding as a list of floats, which for this model is guaranteed to have length 384.
  • Node table creation specifies that the embedding column is an ARRAY of length 384.
  • Node insertion uses the same embedded list of floats from Python, which, when passed to Kùzu, is cast to DOUBLE[], a LIST of doubles.
  • The array_cosine_similarity function works only on ARRAY types. In this case, the embedding is being compared against itself, and both arguments are of type LIST.

Note that in Kùzu, ARRAY is a special case of LIST - the only difference between a LIST and an ARRAY in Kùzu from a user perspective is that the ARRAY has a fixed length that's known beforehand. When the Python list of floats is passed to Kùzu, it's cast to a LIST (which is the correct behaviour for reasons of generality) because Python lists are dynamic in nature and their lengths cannot be assumed to be always fixed.
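Because a Python list carries no fixed length, one option is to check the length on the Python side before handing the embedding to Kùzu. The following is only a minimal sketch: `check_embedding` is a hypothetical helper name, not part of the Kùzu API, and the default of 384 simply matches the column definition above.

```python
def check_embedding(vec, dim=384):
    """Return vec as a list of floats, raising if it does not have exactly dim items.

    Hypothetical helper for illustration; it is not part of Kùzu.
    """
    values = [float(x) for x in vec]
    if len(values) != dim:
        raise ValueError(f"expected {dim} values, got {len(values)}")
    return values
```

This does not change how Kùzu types the parameter; it only guards against passing a list of the wrong length by accident.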

The workaround here is to perform explicit casting of the embedded variable to the type DOUBLE[384], which transforms the LIST to an ARRAY, and then it works:

# Slightly rephrase the MATCH query
response = conn.execute(
    """
    MATCH (d:MyNode)
    WITH d, CAST($emb, "DOUBLE[384]") AS emb
    RETURN d.id, array_cosine_similarity(d.embedding, emb)
    """, {"emb": embedded}
)

Result:

┌──────┬─────────────────────────────────┐
│ d.id ┆ ARRAY_COSINE_SIMILARITY(d.embe… │
│ ---  ┆ ---                             │
│ str  ┆ f64                             │
╞══════╪═════════════════════════════════╡
│ test ┆ 1.0                             │
└──────┴─────────────────────────────────┘
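If the cast is needed in more than one place, the query text could be generated from the embedding dimension instead of being typed by hand. This is just a sketch under assumptions: `similarity_query` is a hypothetical helper, the default `MyNode` label and the `embedding` and `id` properties come from the repro above, and the CAST syntax is copied from the workaround query.

```python
def similarity_query(dim, label="MyNode", param="emb"):
    """Build the workaround query with the parameter cast to DOUBLE[dim].

    Hypothetical helper for illustration; it only builds a string.
    """
    if type(dim) is not int or dim <= 0:
        raise ValueError("dim must be a positive integer")
    return (
        f"MATCH (d:{label})\n"
        f'WITH d, CAST(${param}, "DOUBLE[{dim}]") AS q\n'
        f"RETURN d.id, array_cosine_similarity(d.embedding, q)"
    )
```

It would then be used as `conn.execute(similarity_query(384), {"emb": embedded})`, with `conn` and `embedded` as defined in the repro.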

Potential Improvements

I think this usage pattern for embeddings is incredibly common though, and it's not ideal that the user has to perform explicit casting in this manner. The explicit casting syntax is also a little hard for new Kùzu users to remember. Maybe we could make some better assumptions about the fact that users will bring in embeddings from Python libraries like sentence-transformers, which are guaranteed to return a fixed-length list for an embedding. Could it then be considered "safe" for us to promote the LIST type to ARRAY inside the array_cosine_similarity function?

@andyfengHKU, I think we need to put a bit more thought into this, as I faced a similar issue (as no doubt others will) in #3481.

@andriichumak
Author

andriichumak commented May 16, 2024

Hey @prrao87. Thanks a lot for the quick feedback. The suggested solution works.

One thing I noticed is that if I define the column type as a LIST instead of a fixed-length ARRAY (i.e. DOUBLE[] instead of DOUBLE[384]), it still fails with the same error. Looks like the issue is not that the final execution argument is a list, but rather that it's considered to have the type ANY for some reason.

UPD: OK, I missed the part that array similarity function does not work on lists. Still, the error message is suspicious, I'd assume both ARRAY and LIST should be fine, and the issue is that the typing is lost somewhere along the way.

@prrao87
Member

prrao87 commented May 16, 2024

Yup, fully agree. There's something about the internal behaviour that we need to change to make this easier, because most users bring embeddings into Kùzu from numpy/Python. @andyfengHKU will have some ideas on this. Thanks for reporting!

Update:

> I'd assume both ARRAY and LIST should be fine, and the issue is that the typing is lost somewhere along the way.

Well, similarity calculation only works on same-size lists, so at least one of the two lists being compared must be an array. However, assuming that the elements of the LIST are of type ANY (to capture all possibilities) is way too broad, so we need to think about how to cast types internally for this use case without breaking other things.
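To illustrate the same-size requirement, here is a minimal pure-Python sketch of cosine similarity. It is not Kùzu's implementation, and the function name is only for illustration; it just shows that the computation is undefined when the two inputs differ in length.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length sequences of numbers."""
    if len(a) != len(b):
        raise ValueError(f"length mismatch: {len(a)} vs {len(b)}")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)
```

A fixed-length ARRAY type lets the length check happen once at bind time, whereas with two variable-length LISTs it could only be done per row.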
