[Feature Request]: Issue with Data Structure in Chroma DB Collections #2170

Open
donnadulcinea opened this issue May 9, 2024 · 2 comments
Labels
enhancement (New feature or request), needs-cip

Comments

@donnadulcinea

Describe the problem

I've noticed an issue with the way collections are structured in Chroma DB that makes data retrieval less efficient and more complex than it needs to be. When I retrieve a collection, I expect a collection of entities, but instead, I get many collections of entity components.

Here's an example of how I currently have to retrieve ids and some metadata from a collection:

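results = collection.get()  # fetch once instead of on every loop iteration
for x in range(len(results["ids"])):
    item_id = results["ids"][x]  # renamed to avoid shadowing the built-in id()
    metadata = results["metadatas"][x]
    source = str(x) + "-" + item_id + "-" + metadata["title"]
    print(source)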

This approach is not ideal from a syntactic point of view, and possibly from a performance perspective as well, because to project some features of an item, I need to retrieve the whole collection, then grab some items according to the ordinal position.

Conceptually, it feels like going to a car dealership to choose a car, but instead of seeing complete cars, you’re shown all the doors in one place and all the wheels in another. In the end, you can’t mix and match parts—you still have to choose items that belong to the same car.

I’m aware that it’s possible to decide whether to include embeddings or filter against features, but this doesn’t fully address the issue. I believe a more intuitive and efficient approach would be to structure collections as collections of entities, rather than collections of entity components.
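
For reference, a minimal sketch of narrowing the current columnar result and walking it row by row. It assumes `collection` is a Chroma collection whose `get()` accepts `ids` and `include` and returns parallel lists under `"ids"` and `"metadatas"`. The helper name `titles_by_id` and the `"title"` metadata key are illustrative only, and every requested item is assumed to have metadata.

```python
def titles_by_id(collection, ids):
    """Return {id: title} for the requested ids, fetching only metadata."""
    # One call to get(); "ids" is always part of the result, so only the
    # metadata column is requested explicitly.
    result = collection.get(ids=ids, include=["metadatas"])
    # The returned lists are parallel, so zip pairs each id with its metadata.
    # Assumption: every item has a metadata dict containing a "title" key.
    return {
        item_id: metadata["title"]
        for item_id, metadata in zip(result["ids"], result["metadatas"])
    }
```

This avoids indexing by ordinal position, but the result is still columnar, which is the point raised above.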

Has anyone else experienced this issue, or can anyone provide insight into why the data structure is designed this way?

Describe the proposed solution

Seems like a proposal has been made:
https://github.com/amikos-tech/chroma-go/blob/main/types/record.go

The solution should be as simple as a standard dictionary retrieval pattern:

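# Proposed API shape, not an existing one
collection = db.filter(['id01', 'id02']).include(['embeddings', 'metadatas'])
for entity in collection:
    # embeddings are numeric, so convert them before concatenating with strings
    print(str(entity['embeddings']) + "-" + entity['metadatas']['title'])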
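
Until something like that exists, the same iteration pattern can be approximated on the client side. The sketch below is an assumption-laden workaround, not a Chroma API: `iter_records` and `COLUMNS` are hypothetical names, and `result` is a plain dict shaped like the output of `collection.get()`, where columns that were not requested are `None`.

```python
# Columns that hold one value per item in a get()-style result.
COLUMNS = ("embeddings", "metadatas", "documents")


def iter_records(result):
    """Yield one dict per item from a columnar, get()-style result."""
    # Keep only the columns that were actually returned.
    present = {key: result[key] for key in COLUMNS if result.get(key) is not None}
    for i, item_id in enumerate(result["ids"]):
        record = {"id": item_id}
        for key, values in present.items():
            record[key] = values[i]
        yield record


# Hand-written sample data standing in for a real query result.
result = {
    "ids": ["id01", "id02"],
    "embeddings": None,
    "metadatas": [{"title": "A"}, {"title": "B"}],
    "documents": None,
}
for entity in iter_records(result):
    print(entity["id"] + "-" + entity["metadatas"]["title"])
```

Each yielded record is a regular dict, so the loop body reads like the pattern proposed above.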

Alternatives considered

No response

Importance

I cannot use Chroma without it

Additional Information

No response

donnadulcinea added the enhancement (New feature or request) label on May 9, 2024
@RichardScottOZ

Agreed. Or if not, a couple of good concrete examples of doing this?

@HammadB
Collaborator

HammadB commented May 9, 2024

I think this is a comment on a row-based vs column-based return format.

The main reason that Chroma works this way is that, at ingest time, most users have a columnar data structure, since that's how the embeddings are generated. Rather than munge that into a row format, the thought was that it would be nice if it could be dumped directly into Chroma. We felt it was a bit odd to accept columnar inputs but return row-based outputs.
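
To make the row versus column distinction concrete, here is a rough sketch of the transposition a row-oriented caller would need before ingest. The record keys (`"id"`, `"embedding"`, `"metadata"`) and the helper name `records_to_columns` are assumptions made for illustration. The output keys mirror the `ids`, `embeddings`, and `metadatas` arguments of `collection.add()`.

```python
def records_to_columns(records):
    """Transpose row-style records into parallel, columnar lists."""
    return {
        "ids": [record["id"] for record in records],
        "embeddings": [record["embedding"] for record in records],
        "metadatas": [record["metadata"] for record in records],
    }


# Hand-written sample rows; the values are placeholders.
rows = [
    {"id": "id01", "embedding": [0.1, 0.2], "metadata": {"title": "A"}},
    {"id": "id02", "embedding": [0.3, 0.4], "metadata": {"title": "B"}},
]
columns = records_to_columns(rows)
# The result could then be passed as keyword arguments,
# for example collection.add(**columns).
```

A row-based return format would be the inverse of this step, which is where the consistency concern comes from.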

I think this has been raised a couple of times
#282
#420

We are open to ideas here! I just think it's important that we are consistent.
