Replies: 1 comment 1 reply
@EtienneT That's a good question. Our with_openai example may be helpful; we're about to publish a new blog post that discusses the example, RAG, and partitions. In your use case, a tradeoff could be to categorize your documents and use static partitions instead, i.e. a partition would be a category. Partitioning by document can easily become a problem in terms of resource usage, spinning up a huge number of containers, which should ideally be avoided. We discussed this internally before working on that example. The ideal would be to have categories that are small enough that re-creating embeddings per category is not a problem (i.e. not too many tokens per category, so re-creating embeddings is not too expensive), while keeping a deterministic number of partitions.
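For illustration, a minimal sketch of the static-partitions approach. The category names and the `load_documents_for_category` / `embed_and_upsert` helpers are hypothetical, not part of the example:

```python
from dagster import AssetExecutionContext, StaticPartitionsDefinition, asset

# One partition per document category; the set of categories is fixed
# and deterministic, unlike per-document dynamic partitions.
# (Category names are hypothetical placeholders.)
category_partitions = StaticPartitionsDefinition(["legal", "engineering", "marketing"])

@asset(partitions_def=category_partitions)
def document_embeddings(context: AssetExecutionContext) -> None:
    category = context.partition_key
    # Re-embed every document in the category; categories are kept
    # small enough that recomputing the whole category stays cheap.
    docs = load_documents_for_category(category)  # hypothetical helper
    embed_and_upsert(docs)                        # hypothetical helper
```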
Let's say I want to build a RAG pipeline in Dagster. I want to ingest and embed a lot of documents, and ideally re-embed a document whenever it changes.
Ideally I would make a software-defined asset called document_embedding with one dynamic partition per document, which would calculate the vector embedding for that particular document. This way I could make a sensor that triggers the materialization of a single partition when a document needs to be re-embedded, and I could also leverage auto-materialization for downstream assets. Something like the sketch below.
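A minimal sketch of what I have in mind, using Dagster's dynamic partitions API (`changed_document_ids` and `embed_document` are hypothetical helpers):

```python
from dagster import (
    AssetExecutionContext,
    AssetSelection,
    DynamicPartitionsDefinition,
    RunRequest,
    SensorEvaluationContext,
    SensorResult,
    asset,
    define_asset_job,
    sensor,
)

# One dynamic partition per document id.
documents_partitions = DynamicPartitionsDefinition(name="documents")

@asset(partitions_def=documents_partitions)
def document_embedding(context: AssetExecutionContext) -> None:
    doc_id = context.partition_key
    embed_document(doc_id)  # hypothetical: compute and store the embedding

embedding_job = define_asset_job(
    "embedding_job", selection=AssetSelection.assets(document_embedding)
)

@sensor(job=embedding_job)
def changed_documents_sensor(context: SensorEvaluationContext):
    doc_ids = changed_document_ids()  # hypothetical change detection
    return SensorResult(
        # Register any new documents as partitions...
        dynamic_partitions_requests=[documents_partitions.build_add_request(doc_ids)],
        # ...then request one run per changed document.
        run_requests=[RunRequest(partition_key=doc_id) for doc_id in doc_ids],
    )
```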
The problem is that, from what I can read, Dagster partitions are limited to 25k partitions.
The other way would be to use sensors to trigger a job that embeds a specific document, but then we don't leverage partitions at all. Roughly like the sketch below.
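A sketch of that non-partitioned alternative, passing the document id through run config instead of a partition key (again with a hypothetical `changed_document_ids` and `embed_document`):

```python
from dagster import Config, RunRequest, job, op, sensor

class EmbedConfig(Config):
    doc_id: str

@op
def embed_one_document(config: EmbedConfig) -> None:
    embed_document(config.doc_id)  # hypothetical embedding call

@job
def embed_document_job():
    embed_one_document()

@sensor(job=embed_document_job)
def embed_on_change_sensor():
    for doc_id in changed_document_ids():  # hypothetical change detection
        yield RunRequest(
            # run_key dedupes runs; in practice you'd include a content
            # hash so a later change to the same document still triggers.
            run_key=doc_id,
            run_config={
                "ops": {"embed_one_document": {"config": {"doc_id": doc_id}}}
            },
        )
```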
This article from the Dagster blog uses assets, but they are not partitioned, so I guess we would re-embed and re-insert all documents into the vector DB every time we materialize? That doesn't make any economic sense.
What would be the recommended way?
Thanks,