Replies: 1 comment 1 reply
@EtienneT That's a good question. Our with_openai example may be helpful; we're about to publish a new blog post that discusses the example, RAG, and partitions. In your use case, a tradeoff could be to categorize your documents and use static partitions instead, i.e. a partition would be a category. Partitioning by document can easily become a problem in terms of resource usage, spinning up a huge number of containers, which should ideally be avoided. We discussed this internally before working on that example. The ideal would be to have categories that are small enough that re-creating embeddings per category is not a problem (i.e. not too many tokens per category, so re-creating embeddings is not too expensive), while keeping a deterministic number of partitions.
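For illustration, a minimal sketch of the static-partitions approach. The category names and the `load_documents_for_category` / `embed_and_upsert` helpers are hypothetical, not part of the example:

```python
from dagster import AssetExecutionContext, StaticPartitionsDefinition, asset

# One partition per document category; the set of categories is fixed
# and deterministic, unlike per-document dynamic partitions.
# (Category names are hypothetical placeholders.)
category_partitions = StaticPartitionsDefinition(["legal", "engineering", "marketing"])

@asset(partitions_def=category_partitions)
def document_embeddings(context: AssetExecutionContext) -> None:
    category = context.partition_key
    # Re-embed every document in the category; categories are kept
    # small enough that recomputing the whole category stays cheap.
    docs = load_documents_for_category(category)  # hypothetical helper
    embed_and_upsert(docs)                        # hypothetical helper
```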
Let's say I want to build a RAG pipeline in Dagster. I want to ingest and embed a lot of documents, and ideally re-embed a document whenever it changes.
Ideally I would make a software-defined asset called document_embedding with one dynamic partition per document, which would calculate the vector embedding for that particular document. This way I could make a sensor that triggers the materialization of a single partition when a document needs to be re-embedded, and I could also leverage auto-materialization for downstream assets. Something like the sketch below.
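A minimal sketch of what I have in mind, using Dagster's dynamic partitions API (`changed_document_ids` and `embed_document` are hypothetical helpers):

```python
from dagster import (
    AssetExecutionContext,
    AssetSelection,
    DynamicPartitionsDefinition,
    RunRequest,
    SensorEvaluationContext,
    SensorResult,
    asset,
    define_asset_job,
    sensor,
)

# One dynamic partition per document id.
documents_partitions = DynamicPartitionsDefinition(name="documents")

@asset(partitions_def=documents_partitions)
def document_embedding(context: AssetExecutionContext) -> None:
    doc_id = context.partition_key
    embed_document(doc_id)  # hypothetical: compute and store the embedding

embedding_job = define_asset_job(
    "embedding_job", selection=AssetSelection.assets(document_embedding)
)

@sensor(job=embedding_job)
def changed_documents_sensor(context: SensorEvaluationContext):
    doc_ids = changed_document_ids()  # hypothetical change detection
    return SensorResult(
        # Register any new documents as partitions...
        dynamic_partitions_requests=[documents_partitions.build_add_request(doc_ids)],
        # ...then request one run per changed document.
        run_requests=[RunRequest(partition_key=doc_id) for doc_id in doc_ids],
    )
```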
The problem is that, from what I can read, Dagster partitions are limited to 25k partitions.
The other way would be to use sensors to trigger a job that embeds a specific document, but then we don't leverage partitions at all. Roughly like the sketch below.
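A sketch of that non-partitioned alternative, passing the document id through run config instead of a partition key (again with a hypothetical `changed_document_ids` and `embed_document`):

```python
from dagster import Config, RunRequest, job, op, sensor

class EmbedConfig(Config):
    doc_id: str

@op
def embed_one_document(config: EmbedConfig) -> None:
    embed_document(config.doc_id)  # hypothetical embedding call

@job
def embed_document_job():
    embed_one_document()

@sensor(job=embed_document_job)
def embed_on_change_sensor():
    for doc_id in changed_document_ids():  # hypothetical change detection
        yield RunRequest(
            # run_key dedupes runs; in practice you'd include a content
            # hash so a later change to the same document still triggers.
            run_key=doc_id,
            run_config={
                "ops": {"embed_one_document": {"config": {"doc_id": doc_id}}}
            },
        )
```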
This article from the Dagster blog uses assets, but they are not partitioned, so I guess we would re-embed and re-insert all documents into the vector DB every time we materialize? That doesn't make any economic sense.
What would be the recommended way?
Thanks,