Session Cleanup+Simplifications #7315
Conversation
Ended up spinning out the support for serving nested execs from the executor here (which ended up just becoming removal of the shim entirely). Coming back here now to finish up the rest of this refactor on top of that.
@vito FYI I think I may have hit the theoretical flake you described in the OTEL PR:
(tangential feature request - it should be easier to copy logs from the cloud traces output 😄) Does that look like it may be the flake you were imagining? This PR obviously changes all kinds of timing of almost everything, so not sure if it's just that or a legit issue.
Yep, that's the one. Sorry about that! I have an idea of how to fix it but it's a bit tricky. The problem is that trace and log data arrive independently, and we can't know whether a span has logs until we see logs for it for the first time, after which point we wait until EOF. But the test can still flake if we don't see the start of the logs before the test starts draining.

In terms of an actual fix, I'm thinking we would need to set an attribute on the span to indicate that logs, or at least an EOF, should be consumed for it, and update the draining logic accordingly. But the problem is we don't control the span creation. We can man-in-the-middle it and look for spans whose names start with a known prefix.
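For illustration, a rough sketch of the span-marking idea described above, using the OTel Go SDK's span processor hook; the attribute key and name prefix are made up for this sketch, not anything this PR actually defines:

```go
package telemetry

import (
	"context"
	"strings"

	"go.opentelemetry.io/otel/attribute"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

// logsExpectedProcessor wraps another span processor and marks spans whose
// names match a known prefix, so draining logic could know to wait for a log
// EOF on those spans before considering the trace fully drained.
type logsExpectedProcessor struct {
	sdktrace.SpanProcessor
}

func (p logsExpectedProcessor) OnStart(parent context.Context, s sdktrace.ReadWriteSpan) {
	if strings.HasPrefix(s.Name(), "exec ") { // hypothetical prefix
		s.SetAttributes(attribute.Bool("dagger.io/logs.expected", true)) // hypothetical attribute key
	}
	p.SpanProcessor.OnStart(parent, s)
}
```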
On a positive note, with the extra telemetry draining fix commits appended to this PR, I'm seeing full engine tests take as low as 9 minutes, which is the fastest I've seen in a very long time 🎉
Just nits/questions!
@@ -111,7 +111,7 @@ func (c *Client) diffcopy(ctx context.Context, opts engine.LocalImportOpts, msg

	ctx = opts.AppendToOutgoingContext(ctx)

-	clientCaller, err := c.GetSessionCaller(ctx, true)
+	clientCaller, err := c.GetSessionCaller(ctx, false)
Curious about this `true` becoming `false` - is it somehow guaranteed that it'll be available immediately?
Yep, previously the buildkit session attachables got connected in parallel with gql requests, but now the whole session is initialized by a synchronous request to connect the buildkit session attachables, so it's not possible to have a race here anymore.
I went ahead and changed these values because at one point I had a bug that caused the attachables to fail, but it manifested as these lines deadlocking. Fixed the bug, obviously, but now if something goes wrong in the future it will surface as an error rather than a deadlock 🙂
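To make the `true` → `false` change concrete, here is a hedged sketch of what a session-caller lookup like this typically delegates to (buildkit's `session.Manager.Get` takes a `noWait` flag); the wrapper name and exact flag semantics here are assumptions for illustration, not Dagger's actual code:

```go
package client

import (
	"context"

	bksession "github.com/moby/buildkit/session"
)

// getSessionCaller looks up the gRPC caller for a client's buildkit session.
// With the session attachables now connected by a synchronous request up
// front, callers can pass wait=false and treat a missing session as a hard
// error instead of blocking forever (the old deadlock mode described above).
func getSessionCaller(ctx context.Context, sm *bksession.Manager, clientID string, wait bool) (bksession.Caller, error) {
	// buildkit's Manager.Get takes "noWait": when true, it errors immediately
	// if the session isn't registered rather than waiting for it to appear.
	return sm.Get(ctx, clientID, !wait)
}
```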
	}}

	// there are fast retries server-side so we can start out with a large interval here
	if err := retry(ctx, 3*time.Second, func(elapsed time.Duration, ctx context.Context) error {
Nice seeing this retry logic going away. Guessing it's just not needed anymore, for reasons similar to the other comment?
Yep, once we've successfully made any request from a given client in a given session, we know that the client+session state are fully initialized. So because we have already made a request successfully at this point (to hook up the attachables) we know we're good to go and don't need this "retry a request" stuff anymore.
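For context, a sketch of what a retry helper with the signature shown in the removed call above might have looked like; this is a guess for illustration, not the actual deleted implementation:

```go
package client

import (
	"context"
	"time"
)

// retry calls fn until it succeeds or the context is canceled, waiting
// interval between attempts; fn receives the time elapsed since the first try.
func retry(ctx context.Context, interval time.Duration, fn func(elapsed time.Duration, ctx context.Context) error) error {
	start := time.Now()
	for {
		if err := fn(time.Since(start), ctx); err == nil {
			return nil
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(interval):
			// try again after the interval
		}
	}
}
```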
	return nil
}

func ConnectBuildkitSession(
neat!
engine/server/bk_session.go
I'm always gonna read this as "burger king session" and I'm OK with that
😆 I had the same thought while writing it, also OK with it
There are an enormous number of buildkit+containerd entities (snapshotters, various cache dbs, the solver, worker+executor, etc.) which are all important to understand, but previously they were set up all over the place, which was extremely confusing to follow. This consolidates them all to be set up and stored in one place. There's quite a bit there, but at least you don't have to search far and wide to know what exists and how it's configured.

Signed-off-by: Erik Sipsma <erik@sipsma.dev>
This commit focuses on making sessions + associated state management simpler:

* More comprehensible+centralized state management
  * Rather than spread all over the place and tied together in random places, all of the state associated with a given session is in a daggerSession object and all of the state associated with a given client in a session is in a daggerClient object
  * The code is also a lot more structured and "boring" in terms of locking/mutating state/etc. Not a rube goldberg machine anymore
  * The whole "pre-register a nested client's state before it calls" flow, which was a fountain of confusion and bugs, is gone
* No more insane gRPC tunneling, the engine API is just an HTTP server now
  * graphQL http requests are just that, don't have to tunnel them through gRPC streams
  * session attachables are still gRPC based, but over a hijacked http conn (as opposed to a gRPC stream embedded in another gRPC stream)
  * This allowed us to move off the Session method from buildkit's upstream controller interface
  * That in turn let us delete huge chunks of complicated code around handling conns (i.e. engine/server/conn.go) and no longer need to be paranoid about gRPC max message limits in as many places

There are more details in the PR description (#7315).

Signed-off-by: Erik Sipsma <erik@sipsma.dev>
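As a rough illustration of the "engine API is just an HTTP server" point above: a GraphQL query can now be an ordinary HTTP POST, with client metadata carried in a header rather than tunneled through nested gRPC streams. The URL and header name below are placeholders, not the engine's actual values:

```go
package client

import (
	"bytes"
	"encoding/json"
	"net/http"
)

// queryEngine sends a GraphQL query to the engine as a plain HTTP request,
// attaching serialized client metadata in a header (names are hypothetical).
func queryEngine(httpClient *http.Client, gql string, clientMetadataJSON string) (*http.Response, error) {
	body, err := json.Marshal(map[string]string{"query": gql})
	if err != nil {
		return nil, err
	}
	req, err := http.NewRequest(http.MethodPost, "http://dagger-engine.local/query", bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", "application/json")
	req.Header.Set("X-Dagger-Client-Metadata", clientMetadataJSON) // hypothetical header name
	return httpClient.Do(req)
}
```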
The --experimental-privileged-nesting flag was broken when used with terminal due to a panic around registering clients. This was fixed by commits before this one which completely removed the need to register clients, but backfilling the coverage now. Signed-off-by: Erik Sipsma <erik@sipsma.dev>
Otherwise we only use the otel spans from the initial client connect, not any per-request config. Signed-off-by: Erik Sipsma <erik@sipsma.dev>
Another part of the effort to support #6916 while also doing general cruft cleanup and setup for various upcoming efforts.
This changeset focuses on making sessions + associated state management simpler:

* More comprehensible+centralized state management
  * Rather than spread all over the place and tied together in random places, all of the state associated with a given session is in a `daggerSession` object and all of the state associated with a given client in a session is in a `daggerClient` object
  * The code is also a lot more structured and "boring" in terms of locking/mutating state/etc. Not a rube goldberg machine anymore
  * The whole "pre-register a nested client's state before it calls" flow, which was a fountain of confusion and bugs, is gone
* No more insane gRPC tunneling; the engine API is just an HTTP server now
  * graphQL http requests are just that, don't have to tunnel them through gRPC streams
  * Session attachables are still gRPC based, but over a hijacked http conn (as opposed to a gRPC stream embedded in another gRPC stream)
  * This allowed us to move off the `Session` method from buildkit's upstream controller interface
  * That in turn let us delete huge chunks of complicated code around handling conns (i.e. `engine/server/conn.go`) and no longer need to be paranoid about gRPC max message limits in as many places

Details
Objects + state + naming

* `Server` - formerly known as `BuildkitController`
  * Not to be confused with the old object also named `Server`, which was really more like the session state (and was thus confusing)
  * `clientMetadata` is now sent in an http header. Code.
  * We still serve the `BuildkitController` API since we do have some reliance on `ListWorker` at least, though we are free to change any+all of those to core API (i.e. gql) calls at any point (just got a bit too out of scope here)
* `daggerSession` and `daggerClient`
  * These track explicit states: `uninitialized`, `initialized`, `deleted` (not complicated enough to go full-on state machine, but this still makes it all more obvious and easy to follow I think, especially when it comes to locking the state for mutations)
  * Moved `core.Query` and `buildkit.Client` to be in these structs too so there's less places to look+think about
* `ClientCallContext` - instead of trying to register all of that, we're "stateless" in that the module+function-call metadata for a client is just plumbed through `ExecutionMetadata` + `ClientMetadata`, following the request path rather than both that and a pre-registration side-channel
  * `ClientMetadata` was plumbed in the requests made by the nested client
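A minimal sketch of the kind of per-client state tracking described above, with made-up field names; the real `daggerClient` holds considerably more (per the description, things like `core.Query` and `buildkit.Client`):

```go
package server

import (
	"fmt"
	"sync"
)

// clientState mirrors the three states described above.
type clientState string

const (
	clientStateUninitialized clientState = "uninitialized"
	clientStateInitialized   clientState = "initialized"
	clientStateDeleted       clientState = "deleted"
)

// daggerClient sketches per-client state living under a session; all
// mutations go through the mutex so locking stays boring and explicit.
type daggerClient struct {
	mu    sync.Mutex
	state clientState
	// ... per-client resources (query root, buildkit client, attachable callers, etc.)
}

// transition moves the client between states, erroring on unexpected
// transitions instead of silently continuing.
func (c *daggerClient) transition(from, to clientState) error {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.state != from {
		return fmt.Errorf("client in state %q, expected %q", c.state, from)
	}
	c.state = to
	return nil
}
```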
HTTP/2 Usage

* The engine API is now just an HTTP server (no more `BuildkitController.Session` API and its associated complications, etc.)
* `docker-container`, `kube-pod`, etc. (spawns a subprocess)

Session Attachables

* Session attachables are hooked up via a `/sessionAttachables` http endpoint, which the server hijacks and uses as a raw conn for establishing the gRPC streams
* Buildkit upstream has a `SessionManager.HandleHTTP` method, which is somewhat similar, but it didn't handle the switch from http->grpc synchronously and was resulting in data getting mixed sometimes
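To make the hijack flow concrete, here is a hedged sketch of what a handler for an endpoint like `/sessionAttachables` could look like: hijack the HTTP connection, acknowledge the protocol switch synchronously, then serve the gRPC streams over the raw conn. The details (the 101 response, the callback shape) are assumptions for illustration, not the engine's actual implementation:

```go
package server

import (
	"fmt"
	"net"
	"net/http"
)

// sessionAttachablesHandler hijacks the incoming HTTP connection and hands the
// raw conn to whatever serves the gRPC session attachables over it. Sending
// the acknowledgement before any gRPC bytes flow keeps the http->grpc switch
// synchronous, avoiding the mixed-data problem mentioned above.
func sessionAttachablesHandler(serveAttachables func(conn net.Conn) error) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		hj, ok := w.(http.Hijacker)
		if !ok {
			http.Error(w, "hijacking not supported", http.StatusInternalServerError)
			return
		}
		conn, buf, err := hj.Hijack()
		if err != nil {
			http.Error(w, err.Error(), http.StatusInternalServerError)
			return
		}
		defer conn.Close()

		// Acknowledge the switch so the client knows exactly where HTTP ends
		// and the gRPC stream begins.
		fmt.Fprint(buf, "HTTP/1.1 101 Switching Protocols\r\n\r\n")
		if err := buf.Flush(); err != nil {
			return
		}

		// From here on the conn carries the session attachables' gRPC traffic.
		_ = serveAttachables(conn)
	}
}
```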