[Feature Request]: Issue with Data Structure in Chroma DB Collections #2170

donnadulcinea · 2024-05-09T00:43:14Z

Describe the problem

I've noticed an issue with the way collections are structured in Chroma DB that makes data retrieval less efficient and more complex than it needs to be. When I retrieve a collection, I expect a collection of entities, but instead, I get many collections of entity components.

Here's an example of how I currently have to retrieve ids and some metadata from a collection:

for x in range(len(collection.get()["ids"])):
    id = collection.get()["ids"][x]
    metadata = collection.get()["metadatas"][x]
    source = str(x) + "-" + id + "-" + metadata["title"]
    print(source)

This approach is awkward syntactically and possibly costly in performance as well: to project a few fields of an item, I have to retrieve the whole collection and then pick values out by ordinal position.
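For what it's worth, the repeated `collection.get()` calls in the loop above can be avoided even with the current format. Here is a minimal sketch, assuming the result is a dictionary of parallel lists keyed by "ids" and "metadatas". The `result` dictionary is a made-up stand-in for a single `collection.get(include=["metadatas"])` call, and `format_sources` is a hypothetical helper, not part of Chroma.

```python
# Stand-in for the dictionary returned by ONE collection.get() call.
# The ids and titles are made up for illustration.
result = {
    "ids": ["id01", "id02"],
    "metadatas": [{"title": "First"}, {"title": "Second"}],
}


def format_sources(result):
    """Build one 'position-id-title' string per entity from a columnar result."""
    return [
        f"{position}-{entity_id}-{metadata['title']}"
        for position, (entity_id, metadata) in enumerate(
            # zip pairs up the parallel lists so no manual indexing is needed
            zip(result["ids"], result["metadatas"])
        )
    ]


for source in format_sources(result):
    print(source)
```

This removes the redundant fetches, but the caller still has to know that the lists are parallel, which is the part I find unintuitive.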

Conceptually, it feels like going to a car dealership to choose a car, but instead of seeing complete cars, you’re shown all the doors in one place and all the wheels in another. In the end, you can’t mix and match parts—you still have to choose items that belong to the same car.

I’m aware that it’s possible to decide whether to include embeddings or filter against features, but this doesn’t fully address the issue. I believe a more intuitive and efficient approach would be to structure collections as collections of entities, rather than collections of entity components.

Has anyone else experienced this issue, or can anyone provide insight into why the data structure is designed this way?

Describe the proposed solution

Seems like a proposal has been made:
https://github.com/amikos-tech/chroma-go/blob/main/types/record.go

The solution should be as simple as a standard dictionary retrieval pattern:

collection = db.filter(['id01', 'id02']).include(['embeddings', 'metadatas'])
for entity in collection:
    print(str(entity['embeddings']) + "-" + entity['metadatas']['title'])
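Until something like that exists, the same per-entity iteration could probably be layered on top of the columnar result. The following is only a sketch under assumptions: `iter_records` is a hypothetical helper (not a Chroma API), the `result` dictionary is made-up sample data, and columns that were not requested are assumed to be missing or `None`.

```python
def iter_records(result, fields=("embeddings", "metadatas", "documents")):
    """Yield one dict per entity from a columnar result dictionary."""
    # Keep only the requested columns that are present and not None.
    columns = {
        name: result[name] for name in fields if result.get(name) is not None
    }
    for index, entity_id in enumerate(result["ids"]):
        record = {"id": entity_id}
        for name, values in columns.items():
            record[name] = values[index]
        yield record


# Made-up sample data in the columnar shape; "documents" was not requested.
result = {
    "ids": ["id01", "id02"],
    "embeddings": [[0.1, 0.2], [0.3, 0.4]],
    "metadatas": [{"title": "First"}, {"title": "Second"}],
    "documents": None,
}

for entity in iter_records(result):
    print(f"{entity['embeddings']}-{entity['metadatas']['title']}")
```

The field names are passed explicitly so that any extra keys in the result are ignored rather than transposed by accident.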

Alternatives considered

No response

Importance

I cannot use Chroma without it

Additional Information

No response


RichardScottOZ · 2024-05-09T12:42:41Z

Agreed. Or if not, a couple of good concrete examples of doing this?

HammadB · 2024-05-09T18:31:19Z

I think this is a comment on a row-based vs column-based return format.

The main reason Chroma works this way is that at ingest time most users have a columnar data structure, since that's how the embeddings are generated. Rather than munge that into a row format, the thought was that it would be nice if it could be dumped directly into Chroma. We felt it was a bit odd to accept columnar inputs but return row-based outputs.
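To make the two shapes concrete, here is a rough sketch of the "munging" a row-based caller would need before a columnar ingest. Everything in it is illustrative: `to_columns` is a hypothetical helper, the record keys (`id`, `embedding`, `metadata`) are assumptions, and the data is made up.

```python
def to_columns(records):
    """Split a list of per-entity dicts into parallel lists (columnar form)."""
    return {
        "ids": [record["id"] for record in records],
        "embeddings": [record["embedding"] for record in records],
        "metadatas": [record["metadata"] for record in records],
    }


# Row-based form: one dict per entity (made-up sample data).
records = [
    {"id": "id01", "embedding": [0.1, 0.2], "metadata": {"title": "First"}},
    {"id": "id02", "embedding": [0.3, 0.4], "metadata": {"title": "Second"}},
]

columns = to_columns(records)
print(columns["ids"])

# With a real collection, the columnar form could then be passed along as:
# collection.add(
#     ids=columns["ids"],
#     embeddings=columns["embeddings"],
#     metadatas=columns["metadatas"],
# )
```

Users whose embeddings already come out of a model as one batch can skip this step entirely, which is the convenience the columnar input was meant to preserve.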

I think this has been raised a couple of times
#282
#420

We are open to ideas here! I just think it's important that we are consistent.

donnadulcinea added the enhancement New feature or request label May 9, 2024

tazarov added the needs-cip label May 9, 2024
