We introduce defensive consolidation at several points in rendered dataflows, to guard against an explosion of updates that could otherwise cancel or consolidate. These consolidations are optional, but we do them to increase the probability that the dataflow has a good time and doesn't blow up. We have some guidance about when to apply them, for example after `Union`-`Negate` patterns where we anticipate some cancellation, but these may also be unnecessary: #22320
NB: there are other reasons we must consolidate, like in preparation for an append-only block in `SELECT` evaluation. There are also performance reasons to consolidate, for example right after a massive projection. This is not meant to be about those consolidations, but rather the purely defensive ones. They are interesting too, though.
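To make the defensive case concrete, here is a minimal sketch of what consolidation does to a `Union`-`Negate` result. It is a toy model in plain Python, not the rendering code: updates are assumed to be `(data, time, diff)` triples, and the `consolidate` and `negate` helpers and the inputs `a` and `b` are hypothetical names chosen for illustration.

```python
from collections import defaultdict


def consolidate(updates):
    """Sum the diffs of identical (data, time) pairs and drop zero totals."""
    totals = defaultdict(int)
    for data, time, diff in updates:
        totals[(data, time)] += diff
    return sorted(
        (data, time, diff) for (data, time), diff in totals.items() if diff != 0
    )


def negate(updates):
    """Flip the sign of every diff, as a Negate operator would."""
    return [(data, time, -diff) for data, time, diff in updates]


# Hypothetical inputs: "a minus b", expressed as Union(a, Negate(b)).
a = [("x", 0, 1), ("y", 0, 1), ("z", 0, 1)]
b = [("x", 0, 1), ("y", 0, 1)]

unioned = a + negate(b)              # five updates flow downstream
consolidated = consolidate(unioned)  # only ("z", 0, 1) survives
```

In this toy case the unconsolidated union carries five updates where one would do; whether that saving is worth the cost of consolidating at that spot is the question this issue raises.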
I propose that we scribble down some principles about where and when to consolidate, and I have a set of candidate principles to offer (though, I'm up for others with similar foundations).
We should consolidate any non-consolidated data at the moment just before we re-use the collection.
The principle here is that for defensive consolidations we are primarily concerned with avoiding update growth. Update growth happens chiefly when we re-use collections, as in `Let`/`Get` patterns, and in bespoke collection re-use in dataflows (e.g. internal to `TopK` fragment implementations). There are other moments of growth, like `FlatMap` and temporal filters, but the proposal above is that the killer moment is when we re-use collections.
The implication, I think, is that we could move the consolidations out of `Union`/`Negate` blocks (and potentially other locations) and relocate them to the `Let` binding where we "publish" the collection for multiple uses. This also happens to be a moment where we may want to perform work quite carefully: around snapshot time, for example, it is where we risk having multiple copies of the snapshot live. Perhaps it is a good time to put the data into consolidated containers that can be shared, and fed to the multiple users in a controlled manner.
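A rough way to see why the binding is the interesting spot: every use of a bound collection receives every update published there, so unconsolidated churn is multiplied by the number of uses. The sketch below is a toy model under that assumption; `updates_delivered`, `churny`, and the use count of 3 are made up for illustration and do not correspond to anything in the rendering code.

```python
from collections import defaultdict


def consolidate(updates):
    """Sum the diffs of identical (data, time) pairs and drop zero totals."""
    totals = defaultdict(int)
    for data, time, diff in updates:
        totals[(data, time)] += diff
    return sorted(
        (data, time, diff) for (data, time), diff in totals.items() if diff != 0
    )


def updates_delivered(updates, uses, consolidate_first):
    """Model a Let binding: count the updates sent to all Get sites combined."""
    published = consolidate(updates) if consolidate_first else list(updates)
    # Each Get site receives its own copy of every published update.
    return uses * len(published)


# Hypothetical churn: one record inserted and retracted 1000 times, then inserted.
churny = [("k", 0, 1), ("k", 0, -1)] * 1000 + [("k", 0, 1)]

without = updates_delivered(churny, uses=3, consolidate_first=False)
with_consolidation = updates_delivered(churny, uses=3, consolidate_first=True)
```

Here `without` is 6003 and `with_consolidation` is 3. The model ignores the cost of the consolidation itself and any cancellation that happens downstream, so it only illustrates the multiplication effect, not the full trade-off.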
We would also want to carefully understand the basic blocks we already have in the code, to see if we re-use unconsolidated collections internally, and accept (or not) that risk.