Session Cleanup+Simplifications #7315

Merged
merged 4 commits into dagger:main from refactor-server-and-bk on Jun 12, 2024

Conversation

@sipsma (Contributor) commented May 8, 2024

Another part of the effort to support #6916 while also doing general cruft cleanup and setup for various upcoming efforts.

This changeset focuses on making sessions + associated state management simpler:

  1. More comprehensible+centralized state management
    • Rather than being spread all over the place and tied together in random spots, all of the state associated with a given session now lives in a daggerSession object, and all of the state associated with a given client in a session lives in a daggerClient object
    • The code is also a lot more structured and "boring" in terms of locking/mutating state/etc. Not a Rube Goldberg machine anymore
    • The whole "pre-register a nested client's state before it calls", which was a fountain of confusion and bugs, is gone.
      • e.g. a bug was reported recently with use of terminal and nested sessions that was caused by this registration, but this PR had already accidentally fixed it, so there's just a commit with test coverage here
  2. No more insane gRPC tunneling, the engine API is just an HTTP server now
    • graphQL HTTP requests are just that; they no longer have to be tunneled through gRPC streams (see the sketch after this list)
    • session attachables are still gRPC based, but over a hijacked http conn (as opposed to a gRPC stream embedded in another gRPC stream)
    • This allowed us to move off the Session method from buildkit's upstream controller interface
    • That in turn let us delete huge chunks of complicated code around handling conns (i.e. engine/server/conn.go) and no longer need to be paranoid about gRPC max message limits in as many places
    • This also allowed us to enable connection re-use for requests from the engine client to the engine server
  3. The overall engine-wide state (mostly various buildkit+containerd entities) is also centralized now rather than spread confusingly amongst many files, which is slightly tangential but supported the above efforts.
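
To make point 2 concrete, here's a minimal sketch (my own illustration, not the actual dagger client code) of what a GraphQL request looks like once the tunneling is gone: a plain HTTP POST to the engine, with client metadata carried in a request header. The endpoint path and header name below are hypothetical placeholders.

```go
// Hypothetical sketch: a GraphQL query as a plain HTTP POST to the engine.
// The "/query" path and the metadata header name are illustrative, not the
// engine's real values.
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
)

func queryEngine(baseURL, clientMetadataJSON, gqlQuery string) (string, error) {
	body, err := json.Marshal(map[string]string{"query": gqlQuery})
	if err != nil {
		return "", err
	}
	req, err := http.NewRequest(http.MethodPost, baseURL+"/query", bytes.NewReader(body))
	if err != nil {
		return "", err
	}
	req.Header.Set("Content-Type", "application/json")
	// client/session identity travels in an HTTP header rather than nested gRPC metadata
	req.Header.Set("X-Dagger-Client-Metadata", clientMetadataJSON) // hypothetical header name
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	out, err := io.ReadAll(resp.Body)
	return string(out), err
}

func main() {
	res, err := queryEngine("http://127.0.0.1:8080", `{"client_id":"example"}`,
		`query { __schema { queryType { name } } }`)
	fmt.Println(res, err)
}
```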

Details

Objects + state + naming

  • Server - formerly known as BuildkitController
    • This is not to be confused with the thing previously called Server, which was really more like the session state (and was thus confusing)
    • All the "global" state for various buildkit+containerd entities like snapshotters, various cache dbs, the solver, worker+executor, etc. Also top level state for which sessions currently exist
      • There's a lot in there, but I personally much prefer it in one place rather than spread all over.
    • Serves an HTTP API for gql queries, session attachables and shutdown, with requests scoped to the client based on clientMetadata (which is now sent in an HTTP header)
    • Still implements the BuildkitController API since we do have some reliance on ListWorker at least, though we are free to change any+all of those to core API (i.e. gql) calls at any point (just got a bit too out of scope here)
  • daggerSession and daggerClient
    • Basically what it says on the tin: the session-wide state for each session and the client-specific state for each client in a session
    • Does state tracking with an enum of possible states like uninitialized, initialized, deleted (not complicated enough to go full-on state machine, but this still makes it all more obvious and easy to follow I think, especially when it comes to locking the state for mutations)
    • I moved all the state that used to co-exist in core.Query and buildkit.Client into these structs too, so there are fewer places to look+think about
    • One notable thing gone is ClientCallContext - instead of trying to register all of that we're "stateless" in that the module+function-call metadata for a client is just plumbed through ExecutionMetadata+ClientMetadata, following the request path rather than both that and a pre-registration side-channel
      • e.g. here's where the executor supplies the ClientMetadata it was plumbed with in the requests made by the nested client
    • The logic for deciding when to end a session is now just "when the main client caller has no more active connections", at which point the session state is torn down and released. This is done by simply incrementing and decrementing that count at the beginning/end of each http request (see the sketch after this list)
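
As a rough illustration of that last bullet (a sketch only; the struct and field names are invented for this example, and in the real engine only the main client's requests would be counted):

```go
// Sketch: per-session refcounting of in-flight HTTP requests; when the count
// drops back to zero, the session state is torn down and released.
package main

import (
	"net/http"
	"sync"
)

type daggerSession struct {
	mu              sync.Mutex
	activeMainConns int
	teardown        func() // releases all session-wide state
}

// wrap increments the active-connection count when a request starts and
// decrements it when the request finishes; hitting zero tears the session down.
func (s *daggerSession) wrap(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		s.mu.Lock()
		s.activeMainConns++
		s.mu.Unlock()

		defer func() {
			s.mu.Lock()
			s.activeMainConns--
			done := s.activeMainConns == 0
			s.mu.Unlock()
			if done {
				s.teardown()
			}
		}()

		next.ServeHTTP(w, r)
	})
}

func main() {
	sess := &daggerSession{teardown: func() { /* release buildkit refs, temp state, etc. */ }}
	handler := sess.wrap(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("ok\n"))
	}))
	http.ListenAndServe("127.0.0.1:8080", handler)
}
```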

HTTP/2 Usage

  • I updated all the HTTP servers we create to explicitly support HTTP/2 (with h2c, aka no TLS), while also supporting HTTP/1 clients (see the sketch after this list)
  • The main motivation here was:
    • We wanted to get rid of the gRPC tunneling (for simplicity, performance, detaching from the BuildkitController.Session API and its associated complications, etc.)
    • But that meant that every time the http client needed to add to the connection pool it would have to invoke the connhelper (rather than open a new gRPC stream) which is an expensive operation for e.g. docker-container, kube-pod, etc. (spawns a subprocess)
    • HTTP/2 solves that problem though via stream multiplexing; go's http/2 client by default only needs a pool of 2 conns (one for reqs, one for resps) and can just multiplex everything from there
      • I briefly looked at HTTP/3 since just sending udp packets back and forth would be even simpler conceptually, but it's still too immature+low-level in the go ecosystem
  • This seems to have worked pretty seamlessly, other than one gotcha I hit where only the typescript tests using node 18 were erroring out (fix with details here)
  • There is also still a need to serve some gRPC APIs for the few remaining buildkit controller APIs we use and OTel, which is done via gRPC http handlers
    • The docs on that suggest there are some missing advanced gRPC features (e.g. BFD, big frame detection, which is a performance optimization), but none of them have been obviously relevant to our use case. In the absolute worst case, there is a fallback option of serving http + grpc on separate listeners, but we're avoiding that complication unless it proves 100% necessary
    • These can also be migrated to pure graphql/plain-http APIs as desired
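
For reference, this is the general h2c pattern with golang.org/x/net/http2 (a sketch of the technique, not the engine's exact wiring): a single listener serves HTTP/2 without TLS to clients that speak it and plain HTTP/1.1 to everyone else.

```go
// Sketch: serve h2c (HTTP/2 cleartext) and HTTP/1.1 on the same listener.
package main

import (
	"fmt"
	"log"
	"net/http"

	"golang.org/x/net/http2"
	"golang.org/x/net/http2/h2c"
)

func main() {
	mux := http.NewServeMux()
	mux.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		// r.Proto reports "HTTP/2.0" for h2c clients and "HTTP/1.1" for the rest
		fmt.Fprintf(w, "served over %s\n", r.Proto)
	})

	srv := &http.Server{
		Addr:    "127.0.0.1:8080",
		Handler: h2c.NewHandler(mux, &http2.Server{}), // accept h2c, fall back to HTTP/1
	}
	log.Fatal(srv.ListenAndServe())
}
```

On the client side, Go's http2.Transport can be pointed at such a server by setting AllowHTTP and dialing plain TCP, after which it multiplexes requests over a small number of connections, which is what avoids invoking the connhelper every time the pool would otherwise grow.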

Session Attachables

  • As mentioned above, session attachables no longer require 2 layers of gRPC tunnels; instead there's just a /sessionAttachables http endpoint, which the server hijacks and uses as a raw conn for establishing the gRPC streams
  • That "hijack and invert client/server relationship" process involves a small dance in order to be robust against accidentally mixing/overlapping http+gRPC traffic, which can, unsurprisingly, confuse the computer
    • Client-side and server-side implementation, with comments explaining. Basically just an http req/resp + a 1-byte ack to synchronize the switch to gRPC (see the sketch after this list)
    • I wanted to use upstream buildkit's builtin SessionManager.HandleHTTP method, which is somewhat similar, but it didn't handle the switch from http->grpc synchronously and was resulting in data getting mixed sometimes
  • A nice side effect of this in combination with the session state simplifications is that we no longer need to do the whole "retry making a request to verify the session is working"
    • Instead, if we successfully connect these session attachables, we can know the session as a whole was successfully initialized and we can unblock client connect, returning to the caller
    • That had the further nice side effect of reducing the possibility of race conditions when requesting the caller session in various server-side APIs
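
Here's a rough server-side sketch of that dance (illustrative only; the endpoint name comes from this PR, but the ack byte value and gRPC wiring are stand-ins for the real client/server implementations linked above). Note that net/http only allows Hijack on HTTP/1.x requests.

```go
// Sketch: hijack the /sessionAttachables request and switch the raw conn to gRPC,
// using an explicit ack byte so client and server agree on exactly when the
// protocol changes.
package main

import (
	"net"
	"net/http"
)

func sessionAttachablesHandler(serveGRPCOverConn func(net.Conn)) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		hj, ok := w.(http.Hijacker)
		if !ok {
			http.Error(w, "hijacking not supported", http.StatusInternalServerError)
			return
		}
		conn, bufrw, err := hj.Hijack()
		if err != nil {
			return
		}
		// Minimal HTTP response followed by a single ack byte, so the client knows
		// exactly when the connection stops carrying HTTP and becomes raw gRPC.
		bufrw.WriteString("HTTP/1.1 200 OK\r\n\r\n")
		bufrw.WriteByte(0x01) // 1-byte ack; the value here is illustrative
		if err := bufrw.Flush(); err != nil {
			conn.Close()
			return
		}
		// From here on the hijacked conn carries gRPC: the engine dials back over it
		// to reach the client's session attachables (the inverted relationship).
		serveGRPCOverConn(conn)
	}
}

func main() {
	http.HandleFunc("/sessionAttachables", sessionAttachablesHandler(func(conn net.Conn) {
		conn.Close() // placeholder: real code would run the gRPC session over conn
	}))
	http.ListenAndServe("127.0.0.1:8080", nil)
}
```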

@sipsma (Contributor Author) commented May 20, 2024

Ended up spinning out the support for serving nested execs from the executor here (that work ended up just becoming removal of the shim entirely). Coming back here now to finish up the rest of this refactor on top of that.

sipsma force-pushed the refactor-server-and-bk branch 2 times, most recently from 0c8f915 to 1878470 on May 24, 2024 02:50
sipsma force-pushed the refactor-server-and-bk branch 5 times, most recently from dd3c007 to a615f02 on June 4, 2024 19:22
sipsma force-pushed the refactor-server-and-bk branch 2 times, most recently from 6f2b173 to 55a8e04 on June 5, 2024 05:59
sipsma mentioned this pull request on Jun 5, 2024
Review thread on cmd/engine/main.go (outdated, resolved)
sipsma force-pushed the refactor-server-and-bk branch 5 times, most recently from 9542792 to 77406e4 on June 6, 2024 20:55
sipsma added this to the v0.11.7 milestone on Jun 6, 2024
sipsma force-pushed the refactor-server-and-bk branch 2 times, most recently from 927d700 to 2780ed7 on June 7, 2024 20:19
sipsma modified the milestones: v0.11.7, next on Jun 7, 2024
sipsma force-pushed the refactor-server-and-bk branch 2 times, most recently from 3c2985e to 632daea on June 7, 2024 23:43
sipsma changed the title from "WIP refactor of server/sessions/buildkit-interfaces" to "Session Cleanup+Simplifications" on Jun 8, 2024
sipsma marked this pull request as ready for review on June 8, 2024 01:31
sipsma requested review from vito and jedevc on June 8, 2024 01:35
Review thread on analytics/analytics.go (outdated, resolved)
@sipsma: this comment was marked as resolved.

@sipsma (Contributor Author) commented Jun 8, 2024

@vito FYI I think I may have hit the theoretical flake you described in the OTEL PR:

    client_test.go:104:
        	Error Trace:	/app/core/integration/client_test.go:104
        	Error:      	Not equal:
        	            	expected: 1
        	            	actual  : 0
        	Test:       	TestClientMultiSameTrace

(tangential feature request - easier to copy logs from the cloud traces output 😄)

Does that look like it may be the flake you were imagining? This PR obviously changes the timing of almost everything, so I'm not sure if it's just that or a legit issue.

@vito (Contributor) commented Jun 8, 2024

Does that look like it may be the flake you were imagining? This PR obviously changes the timing of almost everything, so I'm not sure if it's just that or a legit issue.

Yep, that's the one. Sorry about that! I have an idea of how to fix it but it's a bit tricky. The problem is that trace and log data arrive independently, and we can't know whether a span has logs until we see logs for it for the first time, after which point we wait until EOF. But the test can still flake if we don't see the start of the logs before calling Close(). There's an echo hey; sleep 0.5 to try to counteract that; I suppose we could bump that sleep, but it might still flake under load.

In terms of an actual fix, I'm thinking we would need to set an attribute on the span to indicate that logs (or at least an EOF) should be consumed for it, and update the draining logic accordingly. But the problem is we don't control the span creation. We can man-in-the-middle it and look for spans starting with exec, I suppose, similar to how we already man-in-the-middle the [internal] prefix into an attribute.
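
(For illustration, here's a sketch of that man-in-the-middle idea as an OpenTelemetry SpanProcessor; the attribute key, name prefix, and wiring are hypothetical, not what dagger actually does.)

```go
// Sketch: tag exec-looking spans with an attribute the telemetry draining logic
// could use to know it should wait for that span's log EOF before closing.
package telemetry

import (
	"context"
	"strings"

	"go.opentelemetry.io/otel/attribute"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

type expectLogsProcessor struct{}

// OnStart runs for every new span; spans whose name looks like an exec get a
// marker attribute without the span creator having to cooperate.
func (expectLogsProcessor) OnStart(_ context.Context, s sdktrace.ReadWriteSpan) {
	if strings.HasPrefix(s.Name(), "exec ") {
		s.SetAttributes(attribute.Bool("dagger.io/ui.expect-logs", true)) // hypothetical key
	}
}

func (expectLogsProcessor) OnEnd(sdktrace.ReadOnlySpan)      {}
func (expectLogsProcessor) Shutdown(context.Context) error   { return nil }
func (expectLogsProcessor) ForceFlush(context.Context) error { return nil }

// Registered when building the tracer provider, e.g.:
//   tp := sdktrace.NewTracerProvider(sdktrace.WithSpanProcessor(expectLogsProcessor{}))
var _ sdktrace.SpanProcessor = expectLogsProcessor{}
```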

@sipsma: this comment was marked as resolved.

@sipsma (Contributor Author) commented Jun 10, 2024

On a positive note, with the extra telemetry draining fix commits appended to this PR, I'm seeing full engine tests take as low as 9 minutes, which is the fastest I've seen in a very long time 🎉

@sipsma: this comment was marked as resolved.

Review thread on engine/client/client.go (outdated, resolved)
@vito (Contributor) left a comment

Just nits/questions!

@@ -111,7 +111,7 @@ func (c *Client) diffcopy(ctx context.Context, opts engine.LocalImportOpts, msg
 	ctx = opts.AppendToOutgoingContext(ctx)

-	clientCaller, err := c.GetSessionCaller(ctx, true)
+	clientCaller, err := c.GetSessionCaller(ctx, false)
Contributor

Curious about this true becoming false - is it somehow guaranteed that it'll be available immediately?

Contributor Author

Yep, previously the buildkit session attachables got connected in parallel with gql requests, but now the whole session is initialized by a synchronous request to connect the buildkit session attachables, so it's not possible to have a race here anymore.

I went ahead and changed these values because at one point I had a bug that caused the attachables to fail, but it manifested as these lines deadlocking. Fixed the bug obviously but now if something goes wrong in the future it will be an error rather than deadlock 🙂

}}

// there are fast retries server-side so we can start out with a large interval here
if err := retry(ctx, 3*time.Second, func(elapsed time.Duration, ctx context.Context) error {
Contributor

Nice seeing this retry logic going away. Guessing it's just not needed anymore, for reasons similar to the other comment?

Contributor Author

Yep, once we've successfully made any request from a given client in a given session, we know that the client+session state are fully initialized. So because we have already made a request successfully at this point (to hook up the attachables) we know we're good to go and don't need this "retry a request" stuff anymore.

return nil
}

func ConnectBuildkitSession(
Contributor

neat!

Contributor

I'm always gonna read this as "burger king session" and I'm OK with that

Contributor Author

😆 I had the same thought while writing it, also OK with it

Review threads (outdated, resolved): engine/server/session.go (×2), engine/buildkit/cleanup.go (×2), core/container_exec.go
There are an enormous number of buildkit+containerd entities like
snapshotters, various cache dbs, the solver, worker+executor, etc.
which are all important to understand, but previously they were set up
all over the place, which was extremely confusing to follow.

This consolidates them all to be set up and stored in one place. There's
quite a bit there, but at least you don't have to search far and wide to
know what exists and how it's configured.

Signed-off-by: Erik Sipsma <erik@sipsma.dev>
This commit focuses on making sessions + associated state management
simpler:

* More comprehensible+centralized state management
  * Rather than spread all over the place and tied together in random
    places, all of the state associated with a given session is in a
    daggerSession object and all of the state associated with a given client
    in a session is a daggerClient object
  * The code is also a lot more structured and "boring" in terms of
    locking/mutating state/etc. Not a rube goldberg machine anymore
  * The whole "pre-register a nested client's state before it calls", which
    was a fountain of confusion and bugs, is gone.
* No more insane gRPC tunneling, the engine API is just an HTTP server now
  * graphQL http requests are just that, don't have to tunnel them through
    gRPC streams
  * session attachables are still gRPC based, but over a hijacked http conn
    (as opposed to a gRPC stream embedded in another gRPC stream)
  * This allowed us to move off the Session method from buildkit's upstream
    controller interface
  * That in turn let us delete huge chunks of complicated code around
    handling conns (i.e. engine/server/conn.go) and no longer need to be
    paranoid about gRPC max message limits in as many places

There are more details in the PR description (7315).

Signed-off-by: Erik Sipsma <erik@sipsma.dev>
The --experimental-privileged-nesting flag was broken when used with
terminal due to a panic around registering clients.

This was fixed by earlier commits in this PR, which completely removed
the need to register clients; this commit backfills the test coverage.

Signed-off-by: Erik Sipsma <erik@sipsma.dev>
Otherwise we only use the otel spans from the initial client connect,
not any per-request config.

Signed-off-by: Erik Sipsma <erik@sipsma.dev>
@sipsma sipsma merged commit 0aa1af5 into dagger:main Jun 12, 2024
105 checks passed