Session Cleanup+Simplifications #7315

Merged
merged 4 commits into dagger:main from refactor-server-and-bk on Jun 12, 2024

Conversation

@sipsma (Contributor) commented May 8, 2024

Another part of the effort to support #6916 while also doing general cruft cleanup and setup for various upcoming efforts.

This changeset focuses on making sessions + associated state management simpler:

  1. More comprehensible+centralized state management
    • Rather than being spread all over the place and tied together in random spots, all of the state associated with a given session now lives in a daggerSession object, and all of the state associated with a given client in a session lives in a daggerClient object
    • The code is also a lot more structured and "boring" in terms of locking/mutating state/etc. Not a Rube Goldberg machine anymore
    • The whole "pre-register a nested client's state before it calls", which was a fountain of confusion and bugs, is gone.
      • e.g. a bug was reported recently with use of terminal and nested sessions that was caused by this registration, but this PR had already accidentally fixed it, so there's just a commit with test coverage here
  2. No more insane gRPC tunneling, the engine API is just an HTTP server now
    • graphQL HTTP requests are just that; they no longer have to be tunneled through gRPC streams (see the sketch after this list)
    • session attachables are still gRPC based, but over a hijacked http conn (as opposed to a gRPC stream embedded in another gRPC stream)
    • This allowed us to move off the Session method from buildkit's upstream controller interface
    • That in turn let us delete huge chunks of complicated code around handling conns (i.e. engine/server/conn.go) and no longer need to be paranoid about gRPC max message limits in as many places
    • This also allowed us to enable connection re-use for requests from the engine client to the engine server
  3. The overall engine-wide state (mostly various buildkit+containerd entities) is also centralized now rather than spread confusingly amongst many files, which is slightly tangential but supported the above efforts.
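
To make point 2 concrete, here's a minimal sketch (my own illustration, not the actual dagger client code) of what a GraphQL request looks like once the tunneling is gone: a plain HTTP POST to the engine, with client metadata carried in a request header. The endpoint path and header name below are hypothetical placeholders.

```go
// Hypothetical sketch: a GraphQL query as a plain HTTP POST to the engine.
// The "/query" path and the metadata header name are illustrative, not the
// engine's real values.
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
)

func queryEngine(baseURL, clientMetadataJSON, gqlQuery string) (string, error) {
	body, err := json.Marshal(map[string]string{"query": gqlQuery})
	if err != nil {
		return "", err
	}
	req, err := http.NewRequest(http.MethodPost, baseURL+"/query", bytes.NewReader(body))
	if err != nil {
		return "", err
	}
	req.Header.Set("Content-Type", "application/json")
	// client/session identity travels in an HTTP header rather than nested gRPC metadata
	req.Header.Set("X-Dagger-Client-Metadata", clientMetadataJSON) // hypothetical header name
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	out, err := io.ReadAll(resp.Body)
	return string(out), err
}

func main() {
	res, err := queryEngine("http://127.0.0.1:8080", `{"client_id":"example"}`,
		`query { __schema { queryType { name } } }`)
	fmt.Println(res, err)
}
```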

Details

Objects + state + naming

  • Server - formerly known as BuildkitController
    • This is not to be confused with the thing previously called Server, which was really more like the session state (and was thus confusing)
    • All the "global" state for various buildkit+containerd entities like snapshotters, various cache dbs, the solver, worker+executor, etc. Also top level state for which sessions currently exist
      • There's a lot in there, but I personally much prefer it in one place rather than spread all over.
    • Serves an HTTP API for gql queries, session attachables and shutdown, with requests scoped to the client based on clientMetadata (which is now sent in an HTTP header)
    • Still implements the BuildkitController API since we do have some reliance on ListWorker at least, though we are free to change any+all of those to core API (i.e. gql) calls at any point (just got a bit too out of scope here)
  • daggerSession and daggerClient
    • Basically what it says on the tin: the session-wide state for each session and the client-specific state for each client in a session
    • Does state tracking with an enum of possible states like uninitialized, initialized, deleted (not complicated enough to go full-on state machine, but this still makes it all more obvious and easy to follow I think, especially when it comes to locking the state for mutations)
    • I moved all the state that used to co-exist in core.Query and buildkit.Client into these structs too, so there are fewer places to look+think about
    • One notable thing gone is ClientCallContext - instead of trying to register all of that we're "stateless" in that the module+function-call metadata for a client is just plumbed through ExecutionMetadata+ClientMetadata, following the request path rather than both that and a pre-registration side-channel
      • e.g. here's where the executor supplies the ClientMetadata it was plumbed with in the requests made by the nested client
    • The logic for deciding when to end a session is now just "when the main client caller has no more active connections", at which point the session state is torn down and released. This is done by simply incrementing and decrementing that count at the beginning/end of each http request (see the sketch after this list)
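
As a rough illustration of that last bullet (a sketch only; the struct and field names are invented for this example, and in the real engine only the main client's requests would be counted):

```go
// Sketch: per-session refcounting of in-flight HTTP requests; when the count
// drops back to zero, the session state is torn down and released.
package main

import (
	"net/http"
	"sync"
)

type daggerSession struct {
	mu              sync.Mutex
	activeMainConns int
	teardown        func() // releases all session-wide state
}

// wrap increments the active-connection count when a request starts and
// decrements it when the request finishes; hitting zero tears the session down.
func (s *daggerSession) wrap(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		s.mu.Lock()
		s.activeMainConns++
		s.mu.Unlock()

		defer func() {
			s.mu.Lock()
			s.activeMainConns--
			done := s.activeMainConns == 0
			s.mu.Unlock()
			if done {
				s.teardown()
			}
		}()

		next.ServeHTTP(w, r)
	})
}

func main() {
	sess := &daggerSession{teardown: func() { /* release buildkit refs, temp state, etc. */ }}
	handler := sess.wrap(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("ok\n"))
	}))
	http.ListenAndServe("127.0.0.1:8080", handler)
}
```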

HTTP/2 Usage

  • I updated all the HTTP servers we create to explicitly support HTTP/2 (with h2c, aka no TLS), while also supporting HTTP/1 clients (see the sketch after this list)
  • The main motivation here was:
    • We wanted to get rid of the gRPC tunneling (for simplicity, performance, detaching from the BuildkitController.Session API and its associated complications, etc.)
    • But that meant that every time the http client needed to add to the connection pool it would have to invoke the connhelper (rather than open a new gRPC stream) which is an expensive operation for e.g. docker-container, kube-pod, etc. (spawns a subprocess)
    • HTTP/2 solves that problem though via stream multiplexing; go's http/2 client by default only needs a pool of 2 conns (one for reqs, one for resps) and can just multiplex everything from there
      • I briefly looked at HTTP/3 since just sending udp packets back and forth would be even simpler conceptually, but it's still too immature+low-level in the go ecosystem
  • This seems to have worked pretty seamlessly, other than one gotcha I hit where only the typescript tests using node 18 were erroring out (fix with details here)
  • There is also still a need to serve some gRPC APIs for the few remaining buildkit controller APIs we use and OTel, which is done via gRPC http handlers
    • The docs on that suggest there are some missing advanced gRPC features (e.g. BFD, big frame detection, which is a performance optimization), but none of them have been obviously relevant to our use case. In the absolute worst case, there is a fallback option of serving http + grpc on separate listeners, but we're avoiding that complication unless it proves 100% necessary
    • These can also be migrated to pure graphql/plain-http APIs as desired
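
For reference, this is the general h2c pattern with golang.org/x/net/http2 (a sketch of the technique, not the engine's exact wiring): a single listener serves HTTP/2 without TLS to clients that speak it and plain HTTP/1.1 to everyone else.

```go
// Sketch: serve h2c (HTTP/2 cleartext) and HTTP/1.1 on the same listener.
package main

import (
	"fmt"
	"log"
	"net/http"

	"golang.org/x/net/http2"
	"golang.org/x/net/http2/h2c"
)

func main() {
	mux := http.NewServeMux()
	mux.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		// r.Proto reports "HTTP/2.0" for h2c clients and "HTTP/1.1" for the rest
		fmt.Fprintf(w, "served over %s\n", r.Proto)
	})

	srv := &http.Server{
		Addr:    "127.0.0.1:8080",
		Handler: h2c.NewHandler(mux, &http2.Server{}), // accept h2c, fall back to HTTP/1
	}
	log.Fatal(srv.ListenAndServe())
}
```

On the client side, Go's http2.Transport can be pointed at such a server by setting AllowHTTP and dialing plain TCP, after which it multiplexes requests over a small number of connections, which is what avoids invoking the connhelper every time the pool would otherwise grow.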

Session Attachables

  • As mentioned above, session attachables no longer require 2 layers of gRPC tunnels; instead there's just a /sessionAttachables http endpoint, which the server hijacks and uses as a raw conn for establishing the gRPC streams
  • That "hijack and invert client/server relationship" process involves a small dance in order to be robust against accidentally mixing/overlapping http+gRPC traffic, which can, unsurprisingly, confuse the computer
    • Client-side and server-side implementation, with comments explaining. Basically just an http req/resp + a 1-byte ack to synchronize the switch to gRPC (see the sketch after this list)
    • I wanted to use upstream buildkit's builtin SessionManager.HandleHTTP method, which is somewhat similar, but it didn't handle the switch from http->grpc synchronously and was resulting in data getting mixed sometimes
  • A nice side effect of this in combination with the session state simplifications is that we no longer need to do the whole "retry making a request to verify the session is working"
    • Instead, if we successfully connect these session attachables, we can know the session as a whole was successfully initialized and we can unblock client connect, returning to the caller
    • That had the further nice side effect of reducing the possibility of race conditions when requesting the caller session in various server-side APIs
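
Here's a rough server-side sketch of that dance (illustrative only; the endpoint name comes from this PR, but the ack byte value and gRPC wiring are stand-ins for the real client/server implementations linked above). Note that net/http only allows Hijack on HTTP/1.x requests.

```go
// Sketch: hijack the /sessionAttachables request and switch the raw conn to gRPC,
// using an explicit ack byte so client and server agree on exactly when the
// protocol changes.
package main

import (
	"net"
	"net/http"
)

func sessionAttachablesHandler(serveGRPCOverConn func(net.Conn)) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		hj, ok := w.(http.Hijacker)
		if !ok {
			http.Error(w, "hijacking not supported", http.StatusInternalServerError)
			return
		}
		conn, bufrw, err := hj.Hijack()
		if err != nil {
			return
		}
		// Minimal HTTP response followed by a single ack byte, so the client knows
		// exactly when the connection stops carrying HTTP and becomes raw gRPC.
		bufrw.WriteString("HTTP/1.1 200 OK\r\n\r\n")
		bufrw.WriteByte(0x01) // 1-byte ack; the value here is illustrative
		if err := bufrw.Flush(); err != nil {
			conn.Close()
			return
		}
		// From here on the hijacked conn carries gRPC: the engine dials back over it
		// to reach the client's session attachables (the inverted relationship).
		serveGRPCOverConn(conn)
	}
}

func main() {
	http.HandleFunc("/sessionAttachables", sessionAttachablesHandler(func(conn net.Conn) {
		conn.Close() // placeholder: real code would run the gRPC session over conn
	}))
	http.ListenAndServe("127.0.0.1:8080", nil)
}
```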

@sipsma (Contributor Author) commented May 20, 2024

Ended up spinning out the support for serving nested execs from the executor here (that work ended up just becoming removal of the shim entirely). Coming back here now to finish up the rest of this refactor on top of that.

sipsma force-pushed the refactor-server-and-bk branch 2 times, most recently from 0c8f915 to 1878470 on May 24, 2024 02:50
sipsma force-pushed the refactor-server-and-bk branch 5 times, most recently from dd3c007 to a615f02 on June 4, 2024 19:22
sipsma force-pushed the refactor-server-and-bk branch 2 times, most recently from 6f2b173 to 55a8e04 on June 5, 2024 05:59
sipsma mentioned this pull request on Jun 5, 2024
Review thread on cmd/engine/main.go (outdated, resolved)
sipsma force-pushed the refactor-server-and-bk branch 5 times, most recently from 9542792 to 77406e4 on June 6, 2024 20:55
sipsma added this to the v0.11.7 milestone on Jun 6, 2024
sipsma force-pushed the refactor-server-and-bk branch 2 times, most recently from 927d700 to 2780ed7 on June 7, 2024 20:19
sipsma modified the milestones: v0.11.7, next on Jun 7, 2024
sipsma force-pushed the refactor-server-and-bk branch 2 times, most recently from 3c2985e to 632daea on June 7, 2024 23:43
sipsma changed the title from "WIP refactor of server/sessions/buildkit-interfaces" to "Session Cleanup+Simplifications" on Jun 8, 2024
sipsma marked this pull request as ready for review on June 8, 2024 01:31
sipsma requested review from vito and jedevc on June 8, 2024 01:35
Review thread on analytics/analytics.go (outdated, resolved)
@sipsma: this comment was marked as resolved.

@sipsma (Contributor Author) commented Jun 8, 2024

@vito FYI I think I may have hit the theoretical flake you described in the OTEL PR:

    client_test.go:104:
        	Error Trace:	/app/core/integration/client_test.go:104
        	Error:      	Not equal:
        	            	expected: 1
        	            	actual  : 0
        	Test:       	TestClientMultiSameTrace

(tangential feature request - easier to copy logs from the cloud traces output 😄)

Does that look like it may be the flake you were imagining? This PR obviously changes the timing of almost everything, so I'm not sure if it's just that or a legit issue.

@vito (Contributor) commented Jun 8, 2024

Does that look like it may be the flake you were imagining? This PR obviously changes the timing of almost everything, so I'm not sure if it's just that or a legit issue.

Yep, that's the one. Sorry about that! I have an idea of how to fix it but it's a bit tricky. The problem is that trace and log data arrive independently, and we can't know whether a span has logs until we see logs for it for the first time, after which point we wait until EOF. But the test can still flake if we don't see the start of the logs before calling Close(). There's an echo hey; sleep 0.5 to try to counteract that; I suppose we could bump that sleep, but it might still flake under load.

In terms of an actual fix, I'm thinking we would need to set an attribute on the span to indicate that logs (or at least an EOF) should be consumed for it, and update the draining logic accordingly. But the problem is we don't control the span creation. We can man-in-the-middle it and look for spans starting with exec, I suppose, similar to how we already man-in-the-middle the [internal] prefix into an attribute.
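
(For illustration, here's a sketch of that man-in-the-middle idea as an OpenTelemetry SpanProcessor; the attribute key, name prefix, and wiring are hypothetical, not what dagger actually does.)

```go
// Sketch: tag exec-looking spans with an attribute the telemetry draining logic
// could use to know it should wait for that span's log EOF before closing.
package telemetry

import (
	"context"
	"strings"

	"go.opentelemetry.io/otel/attribute"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

type expectLogsProcessor struct{}

// OnStart runs for every new span; spans whose name looks like an exec get a
// marker attribute without the span creator having to cooperate.
func (expectLogsProcessor) OnStart(_ context.Context, s sdktrace.ReadWriteSpan) {
	if strings.HasPrefix(s.Name(), "exec ") {
		s.SetAttributes(attribute.Bool("dagger.io/ui.expect-logs", true)) // hypothetical key
	}
}

func (expectLogsProcessor) OnEnd(sdktrace.ReadOnlySpan)      {}
func (expectLogsProcessor) Shutdown(context.Context) error   { return nil }
func (expectLogsProcessor) ForceFlush(context.Context) error { return nil }

// Registered when building the tracer provider, e.g.:
//   tp := sdktrace.NewTracerProvider(sdktrace.WithSpanProcessor(expectLogsProcessor{}))
var _ sdktrace.SpanProcessor = expectLogsProcessor{}
```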

@sipsma: this comment was marked as resolved.

@sipsma (Contributor Author) commented Jun 10, 2024

On a positive note, with the extra telemetry draining fix commits appended to this PR, I'm seeing full engine tests take as low as 9 minutes, which is the fastest I've seen in a very long time 🎉

@sipsma: this comment was marked as resolved.

Review thread on engine/client/client.go (outdated, resolved)
@vito (Contributor) left a comment

Just nits/questions!

@@ -111,7 +111,7 @@ func (c *Client) diffcopy(ctx context.Context, opts engine.LocalImportOpts, msg
 	ctx = opts.AppendToOutgoingContext(ctx)

-	clientCaller, err := c.GetSessionCaller(ctx, true)
+	clientCaller, err := c.GetSessionCaller(ctx, false)
Contributor

Curious about this true becoming false - is it somehow guaranteed that it'll be available immediately?

Contributor Author

Yep, previously the buildkit session attachables got connected in parallel with gql requests, but now the whole session is initialized by a synchronous request to connect the buildkit session attachables, so it's not possible to have a race here anymore.

I went ahead and changed these values because at one point I had a bug that caused the attachables to fail, but it manifested as these lines deadlocking. Fixed the bug obviously but now if something goes wrong in the future it will be an error rather than deadlock 🙂

}}

// there are fast retries server-side so we can start out with a large interval here
if err := retry(ctx, 3*time.Second, func(elapsed time.Duration, ctx context.Context) error {
Contributor

Nice seeing this retry logic going away. Guessing it's just not needed anymore, for reasons similar to the other comment?

Contributor Author

Yep, once we've successfully made any request from a given client in a given session, we know that the client+session state are fully initialized. So because we have already made a request successfully at this point (to hook up the attachables) we know we're good to go and don't need this "retry a request" stuff anymore.

return nil
}

func ConnectBuildkitSession(
Contributor

neat!

Contributor

I'm always gonna read this as "burger king session" and I'm OK with that

Contributor Author

😆 I had the same thought while writing it, also OK with it

Review threads (outdated, resolved): engine/server/session.go (×2), engine/buildkit/cleanup.go (×2), core/container_exec.go
There are an enormous number of buildkit+containerd entities like
snapshotters, various cache dbs, the solver, worker+executor, etc.
which are all important to understand, but previously they were set up
all over the place, which was extremely confusing to follow.

This consolidates them all to be set up and stored in one place. There's
quite a bit there, but at least you don't have to search far and wide to
know what exists and how it's configured.

Signed-off-by: Erik Sipsma <erik@sipsma.dev>
This commit focuses on making sessions + associated state management
simpler:

* More comprehensible+centralized state management
  * Rather than spread all over the place and tied together in random
    places, all of the state associated with a given session is in a
    daggerSession object and all of the state associated with a given client
    in a session is a daggerClient object
  * The code is also a lot more structured and "boring" in terms of
    locking/mutating state/etc. Not a rube goldberg machine anymore
  * The whole "pre-register a nested client's state before it calls", which
    was a fountain of confusion and bugs, is gone.
* No more insane gRPC tunneling, the engine API is just an HTTP server now
  * graphQL http requests are just that, don't have to tunnel them through
    gRPC streams
  * session attachables are still gRPC based, but over a hijacked http conn
    (as opposed to a gRPC stream embedded in another gRPC stream)
  * This allowed us to move off the Session method from buildkit's upstream
    controller interface
  * That in turn let us delete huge chunks of complicated code around
    handling conns (i.e. engine/server/conn.go) and no longer need to be
    paranoid about gRPC max message limits in as many places

There are more details in the PR description (7315).

Signed-off-by: Erik Sipsma <erik@sipsma.dev>
The --experimental-privileged-nesting flag was broken when used with
terminal due to a panic around registering clients.

This was fixed by earlier commits in this PR, which completely removed
the need to register clients; this commit backfills the test coverage.

Signed-off-by: Erik Sipsma <erik@sipsma.dev>
Otherwise we only use the otel spans from the initial client connect,
not any per-request config.

Signed-off-by: Erik Sipsma <erik@sipsma.dev>
@sipsma sipsma merged commit 0aa1af5 into dagger:main Jun 12, 2024
105 checks passed