Principles for dataflow consolidation #27086

Open
frankmcsherry opened this issue May 14, 2024 · 0 comments
Labels
C-refactoring Category: replacing or reorganizing code

We introduce defensive consolidation at several points in rendered dataflows, to guard against a potential explosion of updates that could cancel or otherwise consolidate. These consolidations are optional, but we do them to increase the probability that the dataflow has a good time and doesn't blow up. We have some guidance about when to apply them, for example after Union-Negate patterns where we anticipate some cancelation, but even these are potentially unnecessary: #22320
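To make the Union-Negate case concrete, here is a minimal sketch in plain Rust of what consolidation buys us there. It is not the rendering code: the `Update` alias and the `consolidate`, `negate`, and `union` helpers are hypothetical stand-ins that model a collection as a vector of `(data, time, diff)` triples.

```rust
/// An update triple in the style of differential dataflow: (data, time, diff).
/// This alias and the helpers below are illustrative assumptions only.
type Update = (String, u64, i64);

/// Sort updates, sum the diffs of identical (data, time) pairs, drop zeros.
fn consolidate(mut updates: Vec<Update>) -> Vec<Update> {
    // Sorting places updates with equal (data, time) next to each other.
    updates.sort();
    let mut result: Vec<Update> = Vec::new();
    for (data, time, diff) in updates {
        if let Some(last) = result.last_mut() {
            if last.0 == data && last.1 == time {
                last.2 += diff;
                continue;
            }
        }
        result.push((data, time, diff));
    }
    // Updates that fully canceled carry no information.
    result.retain(|update| update.2 != 0);
    result
}

/// Flip the sign of every diff, as a Negate operator would.
fn negate(updates: &[Update]) -> Vec<Update> {
    updates.iter().map(|(d, t, r)| (d.clone(), *t, -r)).collect()
}

/// Concatenate two collections, as a Union operator would.
fn union(a: &[Update], b: &[Update]) -> Vec<Update> {
    a.iter().chain(b).cloned().collect()
}
```

Without the `consolidate` call, `union(&a, &negate(&b))` carries every update from both inputs downstream, even when most of them cancel.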

NB: there are other reasons we must consolidate, like in preparation for an append-only block in SELECT evaluation. There are also performance reasons to consolidate, for example right after a massive projection. This is not meant to be about those consolidations, but rather the purely defensive ones. They are interesting too, though.

I propose that we scribble down some principles about where and when to consolidate, and I have a set of candidate principles to offer (though, I'm up for others with similar foundations).

We should consolidate any non-consolidated data at the moment just before we re-use the collection.

The principle here is that for defensive consolidations we are primarily concerned with avoiding update growth. Update growth happens chiefly when we re-use collections, as in Let/Get patterns, and in bespoke collection re-use in dataflows (e.g. internal to TopK fragment implementations). There are other moments of growth, like FlatMap and temporal filters, but the proposal above is that the killer moment is when we re-use collections.
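A rough illustration of why re-use is the killer moment, assuming a toy model where updates are `(key, diff)` pairs and times are ignored. The `churned` and `self_join` helpers are hypothetical; the point is only that a collection used twice in one join turns n cancelable updates into n² of them.

```rust
use std::collections::BTreeMap;

/// A toy update: (key, diff). Times are omitted to keep the sketch small.
type Update = (u32, i64);

/// Sum diffs per key and drop keys whose net diff is zero.
fn consolidate(updates: &[Update]) -> Vec<Update> {
    let mut sums: BTreeMap<u32, i64> = BTreeMap::new();
    for (key, diff) in updates {
        *sums.entry(*key).or_insert(0) += diff;
    }
    sums.into_iter().filter(|(_, diff)| *diff != 0).collect()
}

/// Join a collection with itself on the key; diffs multiply.
/// Every pair of matching input updates yields one output update.
fn self_join(updates: &[Update]) -> Vec<Update> {
    let mut output = Vec::new();
    for (k1, d1) in updates {
        for (k2, d2) in updates {
            if k1 == k2 {
                output.push((*k1, d1 * d2));
            }
        }
    }
    output
}

/// A collection whose net content is one record, written as `churn`
/// insertions and `churn - 1` retractions of the same key.
fn churned(churn: usize) -> Vec<Update> {
    let mut updates = vec![(7, 1); churn];
    updates.extend(vec![(7, -1); churn.saturating_sub(1)]);
    updates
}
```

In this model `self_join(&churned(100))` produces 199 × 199 updates that net out to a single record, while joining the consolidated input produces exactly one.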

The implication, I think, is that we could relocate the consolidations from Union/Negate blocks (and potentially other locations) to the Let binding where we "publish" the collection for multiple uses. This also happens to be a moment where we may want to perform work quite carefully: around snapshot time, for example, it is where we risk having multiple copies of the snapshot live. Perhaps it is a good time to put the data into consolidated containers that can be shared and fed to the multiple users in a controlled manner.
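One way to picture "consolidate at the Let binding, then share" is the sketch below. The `LetBinding` type and its `publish` and `get` methods are hypothetical names, not anything in the rendering code, and `Rc<[Update]>` merely stands in for whatever shared consolidated container we would actually use.

```rust
use std::rc::Rc;

/// A toy update: (data, diff). Times are omitted for brevity.
type Update = (String, i64);

/// Hypothetical stand-in for a Let binding: it consolidates once when the
/// collection is published, then hands each Get a cheap shared handle.
struct LetBinding {
    consolidated: Rc<[Update]>,
}

impl LetBinding {
    /// Consolidate the updates a single time, at the point of publication.
    fn publish(mut updates: Vec<Update>) -> Self {
        updates.sort();
        let mut merged: Vec<Update> = Vec::new();
        for (data, diff) in updates {
            if let Some(last) = merged.last_mut() {
                if last.0 == data {
                    last.1 += diff;
                    continue;
                }
            }
            merged.push((data, diff));
        }
        merged.retain(|(_, diff)| *diff != 0);
        LetBinding { consolidated: merged.into() }
    }

    /// Each use receives a reference-counted handle to the same data,
    /// so no second copy of the (possibly large) snapshot is made.
    fn get(&self) -> Rc<[Update]> {
        Rc::clone(&self.consolidated)
    }
}
```

The design choice being sketched is that the cost of consolidation is paid once per binding rather than once per use, and every user starts from the same compact representation.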

We would also want to carefully understand the basic blocks we already have in the code, to see if we re-use unconsolidated collections internally, and accept (or not) that risk.
