xds: refactor server code in preparation for a move #50515

kyessenov · 2024-04-17T20:06:35Z

Add several interfaces to encapsulate push context, generators, proxy, etc, that are not needed by SDS server.

Issue: #50134

Change-Id: I4049d536a21f40f4b8e612865a9dbb9bd64bd7d3 Signed-off-by: Kuat Yessenov <kuat@google.com>

Change-Id: Icda9014d7637ea3b408093278a570fb2e0614926

Change-Id: Iafe4ccf1fd7371cc04afba0fc0a85d03acc04a46 Signed-off-by: Kuat Yessenov <kuat@google.com>

Change-Id: I60aae2394bc259117aaf7c36ea894f25a08acb86 Signed-off-by: Kuat Yessenov <kuat@google.com>

Change-Id: If826ebd48bcf813c42660eadef13557be9c45e90

kyessenov · 2024-04-17T20:40:45Z

/retest

howardjohn

In preperation for what?

Its hard to review refactorings like this without a sense of where things are going

kyessenov · 2024-04-17T20:41:58Z

@howardjohn All the modified exported functions will be moved to pkg/xds. That's all the code needed for top-level Stream(). It's easier to review in-place before they are moved to pkg/xds.

howardjohn · 2024-04-17T20:49:02Z

What does top level Stream() give us? What dependency are we trying to remove from where?

Is this for isolating SDS or xdsproxy?

kyessenov · 2024-04-17T20:50:33Z

@howardjohn This is for SDS - the issue is linked. xDS proxy never really shared code with pilot - that's not relevant. SDS only needs the basic RPC scaffold - the push context and connection tracking is just overhead for SDS where it's only one client.

howardjohn · 2024-04-17T20:51:55Z

I had thought you decided against doing the SDS work was the main source of confusion. So you are going to make the existing xDS server able to be used without dependencies on rest of Istio for usage in SDS?

kyessenov · 2024-04-17T20:53:20Z

@howardjohn I didn't decide anything. This is just going to be hard - but doable if we split it right in pilot.

howardjohn · 2024-04-17T20:55:12Z

Ack, agree its hard. I thought the previous conclusion was "its hard; we will just disable SDS for our build". If it is instead "its hard; we will do a lot of work to make it happen" that is great.

Have you proofed out that there is a viable path towards doing it? So we don't do a bunch of refactoring only to find a few bits are infeasible to isolate?

kyessenov · 2024-04-17T20:58:06Z

I didn't validate end-to-end - but it looks pretty viable. We can hold on until I prepare the full change, and then stage the PRs.

Change-Id: I2238d9699f97ad267c79d20a92f43ee4aec421ea Signed-off-by: Kuat Yessenov <kuat@google.com>

kyessenov · 2024-04-18T00:15:19Z

Pushed the whole thing here.

kyessenov · 2024-04-18T00:22:41Z

Obviously, not complete since tests fail :) But you get the idea. The final binary size is 28MB.

Change-Id: Idfee89938763d8c1c053a7ad02b533dea12ec2e2 Signed-off-by: Kuat Yessenov <kuat@google.com>

istio-testing · 2024-04-18T01:18:24Z

@kyessenov: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name	Commit	Details	Required	Rerun command
unit-tests-arm64_istio	`2744ded`	link	true	`/test unit-tests-arm64`
lint_istio	`2744ded`	link	true	`/test lint`
gencheck_istio	`2744ded`	link	true	`/test gencheck`
unit-tests_istio	`2744ded`	link	true	`/test unit-tests`
integ-pilot_istio	`2744ded`	link	true	`/test integ-pilot`
integ-cni_istio	`2744ded`	link	true	`/test integ-cni`

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

ramaraochavali · 2024-04-18T03:22:07Z

May be we should do the end state PR in another PR using these changes to validate and we can review this PR first if every thing is found to be viable?

kyessenov · 2024-04-18T05:10:27Z

May be we should do the end state PR in another PR using these changes to validate and we can review this PR first if every thing is found to be viable?

Yes, that is the plan. Once John confirms, I'll revert back and review them in pieces.

howardjohn

Overall looks pretty good. A bit harder to follow the code due to the indirections/abstractions but that is pretty unavoidable, and I think its pretty reasonable.

We drop like 7mb off pilot-agent which is great

Since this is probably the most critical part of Istio code want to make sure we get thorough review from @hzxuzhonghu @ramaraochavali @keithmattix as well before moving forward on this (and myself -- I went through most of it but could use a deeper look)

howardjohn · 2024-04-18T14:26:58Z

pilot/pkg/xds/ads.go

-		return nil
-	}
+	con.ids = ids
+	con.s = s


Connection having a reference to the DiscoveryServer seems suspicious

This is a way around defining a discovery server interface. The context has all the info. I don't think it adds any circular data structures.

howardjohn · 2024-04-18T14:28:21Z

pkg/xds/server.go

+		// start := time.Now()
+		// XXX defer func() { recordSendTime(time.Since(start)) }()


Probably obvious by the XXX, but need to make sure we can add this back

Yes, logging/metrics needs to be improved in the final version.

howardjohn · 2024-04-18T14:32:59Z

security/pkg/nodeagent/sds/sdsservice.go

+	s.Lock()
+	defer s.Unlock()


Do we need to hold the lock this entire time? seems like a hung PushChannel could block everything forever.

oh nvm, we go func it

howardjohn · 2024-04-18T14:39:59Z

security/pkg/nodeagent/sds/sdsservice.go

+func (c *Context) Watcher() xds.Watcher {
+	return c.w
+}
+func (c *Watch) DeleteWatchedResource(string) {


nit: maybe add some assertions the typeurl is always secret?

Yeah, will do.

howardjohn · 2024-04-18T15:05:19Z

pkg/xds/server.go

+	"istio.io/istio/pkg/util/sets"
+)
+
+var log = istiolog.RegisterScope("xdsserver", "xds debugging")


nit: we might consider using the old ads log scope here... normally I wouldn't care much, but these are the most core logs in Istio, so a lot of people have tooling around processing these. It would also be consistent with delta, which is still all under the delta scope

Sure, I can move the scope here.

howardjohn · 2024-04-18T15:07:11Z

security/pkg/nodeagent/sds/sdsservice.go

+	c.watch = f(c.watch)
+}
+func (c *Watch) GetID() string {
+	return ""


can we return a useful ID here? Maybe we can use the fact we know we only subscribe to 1 resource to make the resource name the log, which would be super handy. Although for the initial connect it would be not yet available

I'm worried that grabbing a lock can cause issues so the list of resources is probably not easy. Not sure what to put here.

pilot/pkg/model/context.go

costinm · 2024-04-19T19:45:21Z

My personal preference remains on simplifying and not having an SDS server in agent just to load a file.

Envoy using SDS for 3P integrations or to get ingress secrets from Istiod is fine - but running an XDS server and all this complexity for a single file seems too much.

kyessenov · 2024-04-19T20:41:38Z

My personal preference remains on simplifying and not having an SDS server in agent just to load a file.

Envoy using SDS for 3P integrations or to get ingress secrets from Istiod is fine - but running an XDS server and all this complexity for a single file seems too much.

Agree, but @howardjohn has strong concerns about changing the code structurally. This is a refactor, so it's much less risky - there's always a risk but the current architecture is understood better.

costinm · 2024-04-19T21:07:32Z

On Fri, Apr 19, 2024, 15:42 Kuat ***@***.***> wrote: My personal preference remains on simplifying and not having an SDS server in agent just to load a file. Envoy using SDS for 3P integrations or to get ingress secrets from Istiod is fine - but running an XDS server and all this complexity for a single file seems too much. Agree, but @howardjohn <https://github.com/howardjohn> has strong concerns about changing the code structurally. This is a refactor, so it's much less risky - there's always a risk but the current architecture is understood better.

I don't know how well understood it is - or if it's 'architecture' or a long series of code changes that may be long obsolete. If I remember correctly we started with files, and switched to SDS because Envoy didn't support reloading files. That's no longer the case. It was also for Ingress with mounted files - but we moved to Istiod dealing with the Secrets. If someone knows any reason still valid for this 'architecture' - would be great if more people undertood it..

…

Message ID: ***@***.***>

costinm · 2024-04-19T21:09:57Z

I'm not against this change - just trying to understand.

…

On Fri, Apr 19, 2024, 16:07 Costin Manolache ***@***.***> wrote: On Fri, Apr 19, 2024, 15:42 Kuat ***@***.***> wrote: > My personal preference remains on simplifying and not having an SDS > server in agent just to load a file. > > Envoy using SDS for 3P integrations or to get ingress secrets from Istiod > is fine - but running an XDS server and all this complexity for a single > file seems too much. > > Agree, but @howardjohn <https://github.com/howardjohn> has strong > concerns about changing the code structurally. This is a refactor, so it's > much less risky - there's always a risk but the current architecture is > understood better. > I don't know how well understood it is - or if it's 'architecture' or a long series of code changes that may be long obsolete. If I remember correctly we started with files, and switched to SDS because Envoy didn't support reloading files. That's no longer the case. It was also for Ingress with mounted files - but we moved to Istiod dealing with the Secrets. If someone knows any reason still valid for this 'architecture' - would be great if more people undertood it.. > > Message ID: ***@***.***> >

kyessenov · 2024-04-19T21:15:01Z

@costinm We should wait for k8s to define the standard mounting point for SPIFFE certs. There's a linked KEP. Once that is adopted, that's a strong reason to use files directly, and integrate MeshCA, etc with k8s directly, without SDS or Istio. Simply changing Istio to write files locally feels like busy work.

costinm · 2024-04-19T21:43:18Z

I suspect writing files ( which we already support through one of the env vars - for proxyless) is far smaller PR than refactoring SDS. And the work is moving us in the right direction. The K8S KEP is not that far away from what CertManager support or is easily emulated - even if it is approved it'll take a long time to make it in all supported k8s versions. By that time we'll probably have ztunnel and sandwiches making it mostly irrelevant.

I'm also not sure k8s mounted certs will be in a form envoy will reload - could you confirm ? I thought reload works with the certs in some json form that looks like sds on disk - maybe that changed.

costinm · 2024-04-19T21:45:00Z

I was going to say it may be relevant for ztunnel only, but of course it may not work with the current architecture....

howardjohn · 2024-04-19T21:47:24Z

Its not just reading files. It handles translating certificate files to envoy formats (envoy doesn't read raw .pem), abstracting "system" certificates which vary per platform, various custom private key providers (cryptomb, etc), merging multiple trust bundles and dynamically updating them, ...

Also, there has been some desire to not store private key material on disk.

istio-testing · 2024-04-20T22:03:19Z

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

costinm · 2024-04-21T00:50:02Z

On Fri, Apr 19, 2024, 16:47 John Howard ***@***.***> wrote: Its not just reading files. It handles translating certificate files to envoy formats (envoy doesn't read raw .pem), abstracting "system" certificates which vary per platform, various custom private key providers (cryptomb, etc), merging multiple trust bundles and dynamically updating them, ...

Sure, but for 99% of users it's about reading a cert from disk. Envoy has some weirdness - the rest is Istio invented complexity. If you look a t proxyless gRPC and most TLS uses outside Istio - it's just files. Not saying we should stop supporting SDS in envoy and as integration point - but we can have a simpler agent for the common case, without the insane complexity for use cases that are rarely needed ( as evidenced by the fact everyone else can use simpler files for certs) Also, there has been some desire to not store private key material on disk.

And that remains possible - as an integration. Envoy can still use SDS - my suggestion is to just remove it from the agent ( or to have an alternative agent that is small and simple). For example Envoy Gateway doesn't even have an agent.

…

.Message ID: ***@***.***>

ramaraochavali · 2024-04-22T10:18:12Z

There are file mounted certs use case that looks at various user configured locations reads/updates when they change, validate them before sending to Envoy etc etc. As @howardjohn mentioned it is not just simple read of files.

ramaraochavali · 2024-04-22T10:19:19Z

It was
also for Ingress with mounted files - but we moved to Istiod dealing with
the Secrets.

This is still a very valid usecase for many companies like us

hzxuzhonghu · 2024-04-22T12:06:58Z

And that remains possible - as an integration. Envoy can still use SDS - my
suggestion is to just remove it from the agent ( or to have an alternative
agent that is small and simple).

For example Envoy Gateway doesn't even have an agent.

That would break many things. EnvoyGateway is simpler compared with istio. I think in EG there is no cert signing, just pushing user provided certificates is enough

kyessenov · 2024-04-22T16:50:36Z

@hzxuzhonghu @ramaraochavali Are we OK with going forward with this refactor? I don't see a problem with adding a simplified file based SDS later on per @costinm, it'd be a new feature anyways. We can do it as soon as the branch is cut.

howardjohn

If the goal is to have a reduced binary size, there doesn't seem to be much point in having more options.

small+large=large in binary size

kyessenov · 2024-04-22T17:40:46Z

@howardjohn It depends, the current proposal that gRPC implemented uses code in Envoy to retrieve certificates directly (without SDS at all). That would reduce the binary size. Again, there's no final design for this right now.

costinm · 2024-04-22T18:46:06Z

Rama - there are many use cases, but they are better addressed using the extensibility mechanisms and a custom provider. CertManager, Spire, new K8S cert proposal, etc do provide support for a large number of use cases. Spire ( and others ) may provide a SDS socket - the others will provide files. If you need something even more custom - it is possible to have a daemon set or extend one of the providers. Istio has a minimal CA - with far from ideal security model and feature set - for the common use case. We added a ton of shadow API ( annotations, env, MeshConfig etc) that allow a mostly undocumented set of use cases - plus a lot of complicated code in the agent to support them. Most of this will probably break anyway with ambient. Having an alternative - where Envoy only loads files ( provided by a K8S CSI-like provider - CertManager, etc ) or by a SDS provider ( Spire, custom ones, etc ) - and no agent-related SDS - would reduce complexity and provide a more clear API surface. With conditional compilation or an alternative clean agent users can avoid a lot of dependencies and risks - it's not about binary size, but complexity, support, stability and interoperability.

…

On Mon, Apr 22, 2024 at 10:41 AM Kuat ***@***.***> wrote: @howardjohn <https://github.com/howardjohn> It depends, the current proposal that gRPC implemented uses code in Envoy to retrieve certificates directly (without SDS at all). That would reduce the binary size. Again, there's no final design for this right now. — Reply to this email directly, view it on GitHub <#50515 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AAAUR2WOMO7B6QKAQK4MQILY6VDTJAVCNFSM6AAAAABGL7B2O2VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDANZQGM4DKOJXHE> . You are receiving this because you were mentioned.Message ID: ***@***.***>

kyessenov added 3 commits April 16, 2024 23:48

wip

2028e85

Change-Id: I4049d536a21f40f4b8e612865a9dbb9bd64bd7d3 Signed-off-by: Kuat Yessenov <kuat@google.com>

Merge remote-tracking branch 'upstream/master' into xdsserver

1288511

Change-Id: Icda9014d7637ea3b408093278a570fb2e0614926

xds: refactor server

4836551

Change-Id: Iafe4ccf1fd7371cc04afba0fc0a85d03acc04a46 Signed-off-by: Kuat Yessenov <kuat@google.com>

kyessenov requested review from a team as code owners April 17, 2024 20:06

istio-testing added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Apr 17, 2024

kyessenov added the release-notes-none Indicates a PR that does not require release notes. label Apr 17, 2024

kyessenov added 2 commits April 17, 2024 20:20

fix

bbd0dba

Change-Id: I60aae2394bc259117aaf7c36ea894f25a08acb86 Signed-off-by: Kuat Yessenov <kuat@google.com>

Merge remote-tracking branch 'upstream/master' into xdsserver

bc42bd6

Change-Id: If826ebd48bcf813c42660eadef13557be9c45e90

howardjohn reviewed Apr 17, 2024

View reviewed changes

test

9824fbe

Change-Id: I2238d9699f97ad267c79d20a92f43ee4aec421ea Signed-off-by: Kuat Yessenov <kuat@google.com>

kyessenov requested a review from a team as a code owner April 18, 2024 00:14

istio-testing added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Apr 18, 2024

fix race

2744ded

Change-Id: Idfee89938763d8c1c053a7ad02b533dea12ec2e2 Signed-off-by: Kuat Yessenov <kuat@google.com>

howardjohn reviewed Apr 18, 2024

View reviewed changes

keithmattix reviewed Apr 18, 2024

View reviewed changes

pilot/pkg/model/context.go Show resolved Hide resolved

istio-testing added the needs-rebase Indicates a PR needs to be rebased before being merged label Apr 20, 2024

howardjohn reviewed Apr 22, 2024

View reviewed changes

kyessenov mentioned this pull request Apr 23, 2024

xDS server: define interfaces #50652

Merged

kyessenov closed this May 2, 2024

		// start := time.Now()
		// XXX defer func() { recordSendTime(time.Since(start)) }()

xds: refactor server code in preparation for a move #50515

xds: refactor server code in preparation for a move #50515

Conversation

kyessenov commented Apr 17, 2024 • edited

kyessenov commented Apr 17, 2024

howardjohn left a comment

Choose a reason for hiding this comment

kyessenov commented Apr 17, 2024

howardjohn commented Apr 17, 2024

kyessenov commented Apr 17, 2024 • edited

howardjohn commented Apr 17, 2024

kyessenov commented Apr 17, 2024

howardjohn commented Apr 17, 2024

kyessenov commented Apr 17, 2024

kyessenov commented Apr 18, 2024

kyessenov commented Apr 18, 2024 • edited

istio-testing commented Apr 18, 2024

ramaraochavali commented Apr 18, 2024

kyessenov commented Apr 18, 2024

howardjohn left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

kyessenov Apr 18, 2024 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

costinm commented Apr 19, 2024

kyessenov commented Apr 19, 2024

costinm commented Apr 19, 2024 via email

costinm commented Apr 19, 2024 via email

kyessenov commented Apr 19, 2024 • edited

costinm commented Apr 19, 2024

costinm commented Apr 19, 2024

howardjohn commented Apr 19, 2024

istio-testing commented Apr 20, 2024

costinm commented Apr 21, 2024 via email

ramaraochavali commented Apr 22, 2024

ramaraochavali commented Apr 22, 2024

hzxuzhonghu commented Apr 22, 2024

kyessenov commented Apr 22, 2024

howardjohn left a comment

Choose a reason for hiding this comment

kyessenov commented Apr 22, 2024

costinm commented Apr 22, 2024 via email

kyessenov commented Apr 17, 2024 •

edited

kyessenov commented Apr 17, 2024 •

edited

kyessenov commented Apr 18, 2024 •

edited

kyessenov Apr 18, 2024 •

edited

kyessenov commented Apr 19, 2024 •

edited