Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

xds: refactor server code in preparation for a move #50515

Closed
wants to merge 7 commits into from

Conversation

kyessenov
Copy link
Contributor

@kyessenov kyessenov commented Apr 17, 2024

Add several interfaces to encapsulate push context, generators, proxy, etc, that are not needed by SDS server.

Issue: #50134

Change-Id: I4049d536a21f40f4b8e612865a9dbb9bd64bd7d3
Signed-off-by: Kuat Yessenov <kuat@google.com>
Change-Id: Icda9014d7637ea3b408093278a570fb2e0614926
Change-Id: Iafe4ccf1fd7371cc04afba0fc0a85d03acc04a46
Signed-off-by: Kuat Yessenov <kuat@google.com>
@kyessenov kyessenov requested review from a team as code owners April 17, 2024 20:06
@istio-testing istio-testing added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Apr 17, 2024
@kyessenov kyessenov added the release-notes-none Indicates a PR that does not require release notes. label Apr 17, 2024
Change-Id: I60aae2394bc259117aaf7c36ea894f25a08acb86
Signed-off-by: Kuat Yessenov <kuat@google.com>
Change-Id: If826ebd48bcf813c42660eadef13557be9c45e90
@kyessenov
Copy link
Contributor Author

/retest

Copy link
Member

@howardjohn howardjohn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In preperation for what?

Its hard to review refactorings like this without a sense of where things are going

@kyessenov
Copy link
Contributor Author

@howardjohn All the modified exported functions will be moved to pkg/xds. That's all the code needed for top-level Stream(). It's easier to review in-place before they are moved to pkg/xds.

@howardjohn
Copy link
Member

What does top level Stream() give us? What dependency are we trying to remove from where?

Is this for isolating SDS or xdsproxy?

@kyessenov
Copy link
Contributor Author

kyessenov commented Apr 17, 2024

@howardjohn This is for SDS - the issue is linked. xDS proxy never really shared code with pilot - that's not relevant. SDS only needs the basic RPC scaffold - the push context and connection tracking is just overhead for SDS where it's only one client.

@howardjohn
Copy link
Member

I had thought you decided against doing the SDS work was the main source of confusion. So you are going to make the existing xDS server able to be used without dependencies on rest of Istio for usage in SDS?

@kyessenov
Copy link
Contributor Author

@howardjohn I didn't decide anything. This is just going to be hard - but doable if we split it right in pilot.

@howardjohn
Copy link
Member

Ack, agree its hard. I thought the previous conclusion was "its hard; we will just disable SDS for our build". If it is instead "its hard; we will do a lot of work to make it happen" that is great.

Have you proofed out that there is a viable path towards doing it? So we don't do a bunch of refactoring only to find a few bits are infeasible to isolate?

@kyessenov
Copy link
Contributor Author

I didn't validate end-to-end - but it looks pretty viable. We can hold on until I prepare the full change, and then stage the PRs.

Change-Id: I2238d9699f97ad267c79d20a92f43ee4aec421ea
Signed-off-by: Kuat Yessenov <kuat@google.com>
@kyessenov kyessenov requested a review from a team as a code owner April 18, 2024 00:14
@istio-testing istio-testing added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Apr 18, 2024
@kyessenov
Copy link
Contributor Author

Pushed the whole thing here.

@kyessenov
Copy link
Contributor Author

kyessenov commented Apr 18, 2024

Obviously, not complete since tests fail :) But you get the idea. The final binary size is 28MB.

Change-Id: Idfee89938763d8c1c053a7ad02b533dea12ec2e2
Signed-off-by: Kuat Yessenov <kuat@google.com>
@istio-testing
Copy link
Collaborator

@kyessenov: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
unit-tests-arm64_istio 2744ded link true /test unit-tests-arm64
lint_istio 2744ded link true /test lint
gencheck_istio 2744ded link true /test gencheck
unit-tests_istio 2744ded link true /test unit-tests
integ-pilot_istio 2744ded link true /test integ-pilot
integ-cni_istio 2744ded link true /test integ-cni

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@ramaraochavali
Copy link
Contributor

May be we should do the end state PR in another PR using these changes to validate and we can review this PR first if every thing is found to be viable?

@kyessenov
Copy link
Contributor Author

May be we should do the end state PR in another PR using these changes to validate and we can review this PR first if every thing is found to be viable?

Yes, that is the plan. Once John confirms, I'll revert back and review them in pieces.

Copy link
Member

@howardjohn howardjohn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks pretty good. A bit harder to follow the code due to the indirections/abstractions but that is pretty unavoidable, and I think its pretty reasonable.

We drop like 7mb off pilot-agent which is great

Since this is probably the most critical part of Istio code want to make sure we get thorough review from @hzxuzhonghu @ramaraochavali @keithmattix as well before moving forward on this (and myself -- I went through most of it but could use a deeper look)

return nil
}
con.ids = ids
con.s = s
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Connection having a reference to the DiscoveryServer seems suspicious

Copy link
Contributor Author

@kyessenov kyessenov Apr 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a way around defining a discovery server interface. The context has all the info. I don't think it adds any circular data structures.

Comment on lines +208 to +209
// start := time.Now()
// XXX defer func() { recordSendTime(time.Since(start)) }()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably obvious by the XXX, but need to make sure we can add this back

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, logging/metrics needs to be improved in the final version.

Comment on lines +153 to +154
s.Lock()
defer s.Unlock()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to hold the lock this entire time? seems like a hung PushChannel could block everything forever.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh nvm, we go func it

func (c *Context) Watcher() xds.Watcher {
return c.w
}
func (c *Watch) DeleteWatchedResource(string) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: maybe add some assertions the typeurl is always secret?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, will do.

"istio.io/istio/pkg/util/sets"
)

var log = istiolog.RegisterScope("xdsserver", "xds debugging")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: we might consider using the old ads log scope here... normally I wouldn't care much, but these are the most core logs in Istio, so a lot of people have tooling around processing these. It would also be consistent with delta, which is still all under the delta scope

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I can move the scope here.

c.watch = f(c.watch)
}
func (c *Watch) GetID() string {
return ""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we return a useful ID here? Maybe we can use the fact we know we only subscribe to 1 resource to make the resource name the log, which would be super handy. Although for the initial connect it would be not yet available

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm worried that grabbing a lock can cause issues so the list of resources is probably not easy. Not sure what to put here.

@costinm
Copy link
Contributor

costinm commented Apr 19, 2024

My personal preference remains on simplifying and not having an SDS server in agent just to load a file.

Envoy using SDS for 3P integrations or to get ingress secrets from Istiod is fine - but running an XDS server and all this complexity for a single file seems too much.

@kyessenov
Copy link
Contributor Author

My personal preference remains on simplifying and not having an SDS server in agent just to load a file.

Envoy using SDS for 3P integrations or to get ingress secrets from Istiod is fine - but running an XDS server and all this complexity for a single file seems too much.

Agree, but @howardjohn has strong concerns about changing the code structurally. This is a refactor, so it's much less risky - there's always a risk but the current architecture is understood better.

@costinm
Copy link
Contributor

costinm commented Apr 19, 2024 via email

@costinm
Copy link
Contributor

costinm commented Apr 19, 2024 via email

@kyessenov
Copy link
Contributor Author

kyessenov commented Apr 19, 2024

@costinm We should wait for k8s to define the standard mounting point for SPIFFE certs. There's a linked KEP. Once that is adopted, that's a strong reason to use files directly, and integrate MeshCA, etc with k8s directly, without SDS or Istio. Simply changing Istio to write files locally feels like busy work.

@costinm
Copy link
Contributor

costinm commented Apr 19, 2024

I suspect writing files ( which we already support through one of the env vars - for proxyless) is far smaller PR than refactoring SDS. And the work is moving us in the right direction. The K8S KEP is not that far away from what CertManager support or is easily emulated - even if it is approved it'll take a long time to make it in all supported k8s versions. By that time we'll probably have ztunnel and sandwiches making it mostly irrelevant.

I'm also not sure k8s mounted certs will be in a form envoy will reload - could you confirm ? I thought reload works with the certs in some json form that looks like sds on disk - maybe that changed.

@costinm
Copy link
Contributor

costinm commented Apr 19, 2024

I was going to say it may be relevant for ztunnel only, but of course it may not work with the current architecture....

@howardjohn
Copy link
Member

Its not just reading files. It handles translating certificate files to envoy formats (envoy doesn't read raw .pem), abstracting "system" certificates which vary per platform, various custom private key providers (cryptomb, etc), merging multiple trust bundles and dynamically updating them, ...

Also, there has been some desire to not store private key material on disk.

@istio-testing
Copy link
Collaborator

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@istio-testing istio-testing added the needs-rebase Indicates a PR needs to be rebased before being merged label Apr 20, 2024
@costinm
Copy link
Contributor

costinm commented Apr 21, 2024 via email

@ramaraochavali
Copy link
Contributor

There are file mounted certs use case that looks at various user configured locations reads/updates when they change, validate them before sending to Envoy etc etc. As @howardjohn mentioned it is not just simple read of files.

@ramaraochavali
Copy link
Contributor

It was
also for Ingress with mounted files - but we moved to Istiod dealing with
the Secrets.

This is still a very valid usecase for many companies like us

@hzxuzhonghu
Copy link
Member

And that remains possible - as an integration. Envoy can still use SDS - my
suggestion is to just remove it from the agent ( or to have an alternative
agent that is small and simple).

For example Envoy Gateway doesn't even have an agent.

That would break many things. EnvoyGateway is simpler compared with istio. I think in EG there is no cert signing, just pushing user provided certificates is enough

@kyessenov
Copy link
Contributor Author

@hzxuzhonghu @ramaraochavali Are we OK with going forward with this refactor? I don't see a problem with adding a simplified file based SDS later on per @costinm, it'd be a new feature anyways. We can do it as soon as the branch is cut.

Copy link
Member

@howardjohn howardjohn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the goal is to have a reduced binary size, there doesn't seem to be much point in having more options.

small+large=large in binary size

@kyessenov
Copy link
Contributor Author

@howardjohn It depends, the current proposal that gRPC implemented uses code in Envoy to retrieve certificates directly (without SDS at all). That would reduce the binary size. Again, there's no final design for this right now.

@costinm
Copy link
Contributor

costinm commented Apr 22, 2024 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
needs-rebase Indicates a PR needs to be rebased before being merged release-notes-none Indicates a PR that does not require release notes. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants