Add telemetry to VNet in Connect #41587

Open · wants to merge 10 commits into base: r7s/teleterm-vnet

Conversation

@ravicious (Member) commented May 15, 2024

This PR makes tshd report a usage event once for each app accessed through VNet, as described in the RFD.

To accomplish this, I added a new method to the AppProvider interface called OnNewConnection. That method gets called by a new LocalProxy middleware that also wraps around client.CertChecker.
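
For illustration, the hook could look roughly like this; the package, names, and signature below are assumptions for the sketch, not the PR's actual code:

```go
package vnet // illustrative package name

import "context"

// AppProvider is a sketch of the interface described above; the real
// definition in the PR may differ in methods and signature.
type AppProvider interface {
	// OnNewConnection is called once for every new connection made to an app
	// through VNet; implementations can use it to report usage events.
	OnNewConnection(ctx context.Context, profileName, leafClusterName, appName string) error
}
```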

The implementation does not yet respect the usageReporting.enabled config setting of Connect. Support for this will be added in a subsequent PR, as that problem is not specific to VNet.

(Screenshot: connect.protocol.use events in staging PostHog.)

Challenges around sending usage events straight from lib/teleterm

Typically, Connect usage events are sent from the Electron app through the ReportUsageEvent RPC. To send a usage event, one must include four pieces of information in addition to the event itself: the cluster ID from the auth cluster, the installation ID (a UUID generated locally for each system user on first Connect start), the actual name of the root cluster, and the cluster username.

Those details are gathered from various places by the Electron app. They are persisted in the app's state, so that when the time comes to report an event, the Electron app just plucks them out of there. tsh daemon did not store this info in state anywhere.
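
For reference, the extra metadata amounts to roughly the following; the struct and field names here are illustrative, not the actual prehog/Connect definitions:

```go
// Illustrative only: the per-event metadata Connect needs alongside each
// usage event; the real protobuf definitions differ in naming.
type reportMetadata struct {
	AuthClusterID   string // cluster ID fetched from the auth service
	InstallationID  string // UUID generated per system user on first Connect start
	RootClusterName string // actual name of the root cluster
	ClusterUsername string
}
```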

Cluster ID

This is the piece of information that was most difficult to pass to the callsite that submits the VNet usage event.

The Electron app typically gets cluster IDs asynchronously during startup or when logging in to a cluster. During those actions, the Electron app requests details of each root cluster and tsh daemon extracts the cluster ID from the auth service:

```go
// Fetch the cluster name resource from the auth service and read the
// cluster ID off of it.
clusterName, err := authClient.GetClusterName()
if err != nil {
	return trace.Wrap(err)
}
authClusterID = clusterName.GetClusterID()
```

It'd be best if tsh daemon had a separate cache with cluster IDs that reaches out to the cluster if a cluster ID is missing. To save time, however, I decided to create a cache that's shared between teleterm/daemon.Service and teleterm/vnet.Service. Whenever the daemon service receives the RPC to fetch cluster details, it updates the cache. The VNet service reads from the cache whenever it needs to submit a usage event; see the sketch below.
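
A minimal sketch of such a shared cache, assuming hypothetical package, type, and method names (the actual implementation in lib/teleterm may differ):

```go
package clusteridcache

import "sync"

// Cache holds cluster IDs keyed by root cluster URI. The daemon service
// writes to it when it fetches cluster details; the VNet service reads from
// it when it needs to report a usage event.
type Cache struct {
	mu  sync.RWMutex
	ids map[string]string
}

// Store records the cluster ID for the given root cluster URI.
func (c *Cache) Store(clusterURI, clusterID string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.ids == nil {
		c.ids = make(map[string]string)
	}
	c.ids[clusterURI] = clusterID
}

// Load returns the cluster ID for the given root cluster URI and whether it
// has been fetched yet.
func (c *Cache) Load(clusterURI string) (string, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	id, ok := c.ids[clusterURI]
	return id, ok
}
```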

You may ask: what if the VNet service submits a usage event before the daemon service fetches the details? In that case, the VNet service will drop the usage event and let the VNet connection through. (Un)fortunately, it's not a new problem – the same design flaw exists in the Electron app today. In practice, it's not likely to happen in day-to-day use, as Connect waits for full cluster details to be synced on login. The situation where the issue would be most likely to surface is when the user opens Connect with already valid user certs and immediately reopens the previous session.

@ravicious added the no-changelog label (indicates that a PR does not require a changelog entry) May 15, 2024
@github-actions bot added the size/md, tsh, and ui labels May 15, 2024
@ravicious mentioned this pull request May 15, 2024
Commit messages:
- It will be needed to report usage events straight from tsh daemon. It used to be available only in the Electron app, which sent this ID with every ReportUsageEvent RPC.
- This way if initializing a service fails, we don't create a listener unnecessarily.
```go
go func() {
	uri := uri.NewClusterURI(profileName).AppendLeafCluster(leafClusterName).AppendApp(app.GetName())

	err := p.usageReporter.ReportApp(ctx, uri)
```
Contributor:

You might want to use a different context in case this is a very short-lived TCP connection and the context expires very quickly.

@ravicious (Member, Author):

Hmm yeah, I just realized that in a regular local proxy, the context that is passed is the context behind the local proxy, while here it's the context behind the connection.

What's the best practice here? My first thought was a background context with an arbitrary timeout, but it feels like it'd need more coordination, e.g., the usage reporter should cancel all such contexts when VNet is getting shut down.

@ravicious (Member, Author) commented May 24, 2024:

I'm thinking of adding a WaitGroup and a close channel. The VNet service would call usageReporter.Close after processManager.Wait(); this in turn would close the close channel and wait for the WaitGroup. Closing the close channel would cancel all contexts.
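
A minimal sketch of the close-channel part of that idea, with hypothetical names; closing the reporter cancels any in-flight report context:

```go
package usage

import "context"

// reporter is a sketch of a usage reporter with a close channel; the real
// type in the PR may be structured differently.
type reporter struct {
	closeC chan struct{}
}

func newReporter() *reporter {
	return &reporter{closeC: make(chan struct{})}
}

// Close unblocks any in-flight report by cancelling its context.
func (r *reporter) Close() {
	close(r.closeC)
}

// reportCtx derives a context that is cancelled when either the parent is
// done or the reporter is closed.
func (r *reporter) reportCtx(parent context.Context) (context.Context, context.CancelFunc) {
	ctx, cancel := context.WithCancel(parent)
	go func() {
		select {
		case <-r.closeC:
			cancel()
		case <-ctx.Done():
		}
	}()
	return ctx, cancel
}
```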

@ravicious (Member, Author):

I didn't actually need a WaitGroup. I forgot that only one ReportApp can be active at a time, so there's only ever one context to cancel.

lib/teleterm/vnet/service.go Outdated Show resolved Hide resolved
lib/teleterm/vnet/service_test.go Outdated Show resolved Hide resolved
lib/vnet/vnet_test.go Outdated Show resolved Hide resolved
@ravicious ravicious requested a review from nklaassen May 27, 2024 10:33