Add telemetry to VNet in Connect #41587

Open · wants to merge 10 commits into base: r7s/teleterm-vnet

Conversation

@ravicious (Member) commented May 15, 2024

This PR makes tshd report a usage event once for each app accessed through VNet, as described in the RFD.

To accomplish this, I added a new method to the AppProvider interface called OnNewConnection. That method gets called by a new LocalProxy middleware that also wraps around client.CertChecker.
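
For illustration, the hook could look roughly like this; the package, names, and signature below are assumptions for the sketch, not the PR's actual code:

```go
package vnet // illustrative package name

import "context"

// AppProvider is a sketch of the interface described above; the real
// definition in the PR may differ in methods and signature.
type AppProvider interface {
	// OnNewConnection is called once for every new connection made to an app
	// through VNet; implementations can use it to report usage events.
	OnNewConnection(ctx context.Context, profileName, leafClusterName, appName string) error
}
```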

The implementation does not yet respect the usageReporting.enabled config setting of Connect. Support for this will be added in a subsequent PR, as that problem is not specific to VNet.

(Screenshot: connect.protocol.use events in staging PostHog.)

Challenges around sending usage events straight from lib/teleterm

Typically, Connect usage events are sent from the Electron app through the ReportUsageEvent RPC. To send a usage event, one must include four pieces of information in addition to the event itself: the cluster ID from the auth cluster, the installation ID (a UUID generated locally for each system user on first Connect start), the actual name of the root cluster, and the cluster username.

Those details are gathered from various places by the Electron app. They are persisted in the app's state, so that when the time comes to report an event, the Electron app just plucks them out of there. tsh daemon did not store this info in state anywhere.
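
For reference, the extra metadata amounts to roughly the following; the struct and field names here are illustrative, not the actual prehog/Connect definitions:

```go
// Illustrative only: the per-event metadata Connect needs alongside each
// usage event; the real protobuf definitions differ in naming.
type reportMetadata struct {
	AuthClusterID   string // cluster ID fetched from the auth service
	InstallationID  string // UUID generated per system user on first Connect start
	RootClusterName string // actual name of the root cluster
	ClusterUsername string
}
```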

Cluster ID

This is the piece of information that was most difficult to pass to the callsite that submits the VNet usage event.

The Electron app typically gets cluster IDs asynchronously during startup or when logging in to a cluster. During those actions, the Electron app requests details of each root cluster and tsh daemon extracts the cluster ID from the auth service:

```go
// Fetch the cluster name resource from the auth service and read the
// cluster ID off of it.
clusterName, err := authClient.GetClusterName()
if err != nil {
	return trace.Wrap(err)
}
authClusterID = clusterName.GetClusterID()
```

It'd be best if tsh daemon had a separate cache with cluster IDs that reaches out to the cluster if a cluster ID is missing. To save time, however, I decided to create a cache that's shared between teleterm/daemon.Service and teleterm/vnet.Service. Whenever the daemon service receives the RPC to fetch cluster details, it updates the cache. The VNet service reads from the cache whenever it needs to submit a usage event; see the sketch below.
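
A minimal sketch of such a shared cache, assuming hypothetical package, type, and method names (the actual implementation in lib/teleterm may differ):

```go
package clusteridcache

import "sync"

// Cache holds cluster IDs keyed by root cluster URI. The daemon service
// writes to it when it fetches cluster details; the VNet service reads from
// it when it needs to report a usage event.
type Cache struct {
	mu  sync.RWMutex
	ids map[string]string
}

// Store records the cluster ID for the given root cluster URI.
func (c *Cache) Store(clusterURI, clusterID string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.ids == nil {
		c.ids = make(map[string]string)
	}
	c.ids[clusterURI] = clusterID
}

// Load returns the cluster ID for the given root cluster URI and whether it
// has been fetched yet.
func (c *Cache) Load(clusterURI string) (string, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	id, ok := c.ids[clusterURI]
	return id, ok
}
```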

You may ask: what if the VNet service submits a usage event before the daemon service fetches the details? In that case, the VNet service will drop the usage event and let the VNet connection through. (Un)fortunately, it's not a new problem – the same design flaw exists in the Electron app today. In practice, it's not likely to happen in day-to-day use, as Connect waits for full cluster details to be synced on login. The situation where the issue would be most likely to surface is when the user opens Connect with already valid user certs and immediately reopens the previous session.

@ravicious added the no-changelog label (indicates that a PR does not require a changelog entry) May 15, 2024
@github-actions bot added the size/md, tsh, and ui labels May 15, 2024
@ravicious mentioned this pull request May 15, 2024
Commit messages:
- It will be needed to report usage events straight from tsh daemon. It used to be available only in the Electron app, which sent this ID with every ReportUsageEvent RPC.
- This way if initializing a service fails, we don't create a listener unnecessarily.
```go
go func() {
	uri := uri.NewClusterURI(profileName).AppendLeafCluster(leafClusterName).AppendApp(app.GetName())

	err := p.usageReporter.ReportApp(ctx, uri)
```
Contributor:

You might want to use a different context in case this is a very short-lived TCP connection and the context expires very quickly.

@ravicious (Member, Author):

Hmm yeah, I just realized that in a regular local proxy, the context that is passed is the context behind the local proxy, while here it's the context behind the connection.

What's the best practice here? My first thought was a background context with an arbitrary timeout, but it feels like it'd need more coordination, e.g., the usage reporter should cancel all such contexts when VNet is getting shut down.

@ravicious (Member, Author) commented May 24, 2024:

I'm thinking of adding a WaitGroup and a close channel. The VNet service would call usageReporter.Close after processManager.Wait(); this in turn would close the close channel and wait for the WaitGroup. Closing the close channel would cancel all contexts.
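
A minimal sketch of the close-channel part of that idea, with hypothetical names; closing the reporter cancels any in-flight report context:

```go
package usage

import "context"

// reporter is a sketch of a usage reporter with a close channel; the real
// type in the PR may be structured differently.
type reporter struct {
	closeC chan struct{}
}

func newReporter() *reporter {
	return &reporter{closeC: make(chan struct{})}
}

// Close unblocks any in-flight report by cancelling its context.
func (r *reporter) Close() {
	close(r.closeC)
}

// reportCtx derives a context that is cancelled when either the parent is
// done or the reporter is closed.
func (r *reporter) reportCtx(parent context.Context) (context.Context, context.CancelFunc) {
	ctx, cancel := context.WithCancel(parent)
	go func() {
		select {
		case <-r.closeC:
			cancel()
		case <-ctx.Done():
		}
	}()
	return ctx, cancel
}
```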

@ravicious (Member, Author):

I didn't actually need a WaitGroup. I forgot that only one ReportApp can be active at a time, so there's only ever one context to cancel.

lib/teleterm/vnet/service.go Outdated Show resolved Hide resolved
lib/teleterm/vnet/service_test.go Outdated Show resolved Hide resolved
lib/vnet/vnet_test.go Outdated Show resolved Hide resolved
@ravicious ravicious requested a review from nklaassen May 27, 2024 10:33