Add telemetry to VNet in Connect #41587
base: r7s/teleterm-vnet
Conversation
The installation ID will be needed to report usage events straight from tsh daemon. It used to be available only in the Electron app, which sent this ID with every ReportUsageEvent RPC.

This way, if initializing a service fails, we don't create a listener unnecessarily.
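As a rough illustration of that ordering (names hypothetical, not the actual PR code): initialize the service first, and bind the listener only once initialization has succeeded.

```go
package example

import (
	"context"
	"fmt"
	"net"
)

// service is a hypothetical stand-in for one of the services being initialized.
type service struct{}

func newService(ctx context.Context) (*service, error) { return &service{}, nil }
func (s *service) Close() error                        { return nil }

// start initializes the service first and binds the listener only once that
// has succeeded, so a failed initialization never leaves an unused socket behind.
func start(ctx context.Context) (net.Listener, *service, error) {
	svc, err := newService(ctx)
	if err != nil {
		// No listener exists yet, so there's nothing to clean up.
		return nil, nil, fmt.Errorf("initializing service: %w", err)
	}

	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		_ = svc.Close()
		return nil, nil, fmt.Errorf("creating listener: %w", err)
	}

	return ln, svc, nil
}
```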
lib/teleterm/vnet/service.go
```go
go func() {
	uri := uri.NewClusterURI(profileName).AppendLeafCluster(leafClusterName).AppendApp(app.GetName())

	err := p.usageReporter.ReportApp(ctx, uri)
```
You might want to use a different context in case this is a very short-lived TCP connection and the context expires very quickly.
Hmm yeah, I just realized that in a regular local proxy, the context that is passed is the context behind the local proxy, while here it's the context behind the connection.
What's the best practice here? My first thought was a background context with an arbitrary timeout, but it feels like it'd need more coordination, e.g., the usage reporter should cancel all such contexts when VNet is getting shut down.
I'm thinking of adding a WaitGroup and a close channel. The VNet service would call usageReporter.Close after processManager.Wait(); this in turn would close the close channel and wait for the WaitGroup. Closing close would cancel all contexts.
I didn't actually need a WaitGroup. I forgot that only one ReportApp can be active at a time, so there's only ever one context to cancel.
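A minimal sketch of the resulting pattern, assuming a simplified usageReporter (names and signatures hypothetical, not the actual PR code): Close closes the channel, and ReportApp derives its context from the background context rather than from the short-lived connection, cancelling it on shutdown.

```go
package example

import (
	"context"
	"sync"
)

// usageReporter is a simplified stand-in for the reporter discussed above.
type usageReporter struct {
	closeOnce sync.Once
	closeC    chan struct{} // closed on shutdown
}

func newUsageReporter() *usageReporter {
	return &usageReporter{closeC: make(chan struct{})}
}

// Close is called by the VNet service after processManager.Wait() returns.
// Since only one ReportApp can be active at a time, there's only ever one
// in-flight context to cancel, so no WaitGroup is needed.
func (r *usageReporter) Close() {
	r.closeOnce.Do(func() { close(r.closeC) })
}

// ReportApp deliberately ignores the per-connection context, which may be
// very short-lived, and instead uses a context that's cancelled only when
// the reporter shuts down.
func (r *usageReporter) ReportApp(connCtx context.Context, appURI string) error {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	// Cancel the derived context once closeC is closed.
	go func() {
		select {
		case <-r.closeC:
			cancel()
		case <-ctx.Done():
		}
	}()

	return r.submit(ctx, appURI)
}

// submit stands in for the call that actually sends the usage event.
func (r *usageReporter) submit(ctx context.Context, appURI string) error { return nil }
```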
This PR makes it so that tshd reports a usage event once for each app accessed through VNet, as described in the RFD.

To accomplish this, I added a new method to the AppProvider interface called OnNewConnection. That method gets called by a new LocalProxy middleware that also wraps around client.CertChecker.

The implementation does not yet respect the usageReporting.enabled config setting of Connect. Support for this will be added in a subsequent PR, as that problem is not related to VNet. The events are visible as connect.protocol.use in staging PostHog.
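For illustration, here's a rough sketch of how such a middleware could be shaped. The interfaces below are trimmed-down, hypothetical stand-ins, not the real AppProvider, LocalProxy middleware, or client.CertChecker types:

```go
package example

import (
	"context"
	"net"
)

// appProvider is a trimmed-down stand-in for the AppProvider interface;
// only the new hook is shown, and its signature is an assumption.
type appProvider interface {
	OnNewConnection(ctx context.Context, profileName, leafClusterName, appName string) error
}

// certChecker stands in for client.CertChecker, which the middleware wraps.
type certChecker interface {
	OnNewConnection(ctx context.Context, conn net.Conn) error
}

// usageMiddleware first lets the cert checker do its work, then notifies the
// app provider so tshd can report the usage event for the app.
type usageMiddleware struct {
	checker  certChecker
	provider appProvider

	profileName, leafClusterName, appName string
}

func (m *usageMiddleware) OnNewConnection(ctx context.Context, conn net.Conn) error {
	if err := m.checker.OnNewConnection(ctx, conn); err != nil {
		return err
	}
	return m.provider.OnNewConnection(ctx, m.profileName, m.leafClusterName, m.appName)
}
```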
Challenges around sending usage events straight from lib/teleterm
Typically, Connect usage events are sent from the Electron app through the ReportUsageEvent RPC. To send a usage event, one must include four additional pieces of information in addition to the event itself: the cluster ID from the auth cluster, the installation ID (a UUID generated locally for each system user on first Connect start), the actual name of the root cluster, and the cluster username.

Those details are gathered from various places by the Electron app. They are persisted in the app's state, so that when the time comes to report an event, the Electron app just plucks them out of there. tsh daemon did not store this info in state anywhere.
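For illustration, those four pieces of information could be bundled like this (a sketch with hypothetical field names, not the actual message used by the RPC):

```go
package example

// usageEventContext bundles the four extra pieces of information that must
// accompany every reported event; field names here are hypothetical.
type usageEventContext struct {
	ClusterID       string // fetched from the auth cluster
	InstallationID  string // UUID generated per system user on first Connect start
	RootClusterName string // the actual name of the root cluster
	UserName        string // the cluster username
}
```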
Cluster ID
This is the piece of information that was most difficult to pass to the callsite that submits the VNet usage event.
The Electron app typically gets cluster IDs asynchronously during startup or when logging in to a cluster. During those actions, the Electron app requests details of each root cluster and tsh daemon extracts the cluster ID from the auth service:
(See teleport/lib/teleterm/clusters/cluster.go, lines 130 to 134 at 7c55bc3.)
It'd be best if tsh daemon had a separate cache with cluster IDs that reaches out to the cluster if a cluster ID is missing. To save time, however, I decided to create a cache that's shared between teleterm/daemon.Service and teleterm/vnet.Service. Whenever the daemon service receives the RPC to fetch cluster details, it updates the cache. The VNet service reads from the cache whenever it needs to submit a usage event.

You may ask: what if the VNet service submits a usage event before the daemon service fetches the details? In that case, the VNet service will drop the usage event and let the VNet connection through. (Un)fortunately, it's not a new problem – the same design flaw exists in the Electron app today. In practice, it's not likely to happen in day-to-day use, as Connect waits for full cluster details to be synced on login. The situation where the issue would be most likely to surface is if the user opens Connect with already valid user certs and immediately reopens the previous session.
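A minimal sketch of such a shared cache, assuming a plain map guarded by a mutex (illustrative names, not the actual PR code):

```go
package example

import "sync"

// clusterIDCache sketches the cache shared between teleterm/daemon.Service
// and teleterm/vnet.Service.
type clusterIDCache struct {
	mu  sync.RWMutex
	ids map[string]string // root cluster URI -> cluster ID
}

func newClusterIDCache() *clusterIDCache {
	return &clusterIDCache{ids: make(map[string]string)}
}

// Set is called by the daemon service whenever it fetches cluster details.
func (c *clusterIDCache) Set(clusterURI, clusterID string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.ids[clusterURI] = clusterID
}

// Get is called by the VNet service before submitting a usage event. When ok
// is false, the caller drops the event and lets the connection through.
func (c *clusterIDCache) Get(clusterURI string) (id string, ok bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	id, ok = c.ids[clusterURI]
	return id, ok
}
```

In this shape, the daemon service would call Set after each successful cluster-details fetch, while the VNet service calls Get right before submitting an event and drops the event on a miss.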