New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
xds: refactor server code in preparation for a move #50515
Conversation
Change-Id: Icda9014d7637ea3b408093278a570fb2e0614926
Change-Id: Iafe4ccf1fd7371cc04afba0fc0a85d03acc04a46 Signed-off-by: Kuat Yessenov <kuat@google.com>
Change-Id: If826ebd48bcf813c42660eadef13557be9c45e90
/retest |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In preperation for what?
Its hard to review refactorings like this without a sense of where things are going
@howardjohn All the modified exported functions will be moved to |
What does top level Stream() give us? What dependency are we trying to remove from where? Is this for isolating SDS or xdsproxy? |
@howardjohn This is for SDS - the issue is linked. xDS proxy never really shared code with pilot - that's not relevant. SDS only needs the basic RPC scaffold - the push context and connection tracking is just overhead for SDS where it's only one client. |
I had thought you decided against doing the SDS work was the main source of confusion. So you are going to make the existing xDS server able to be used without dependencies on rest of Istio for usage in SDS? |
@howardjohn I didn't decide anything. This is just going to be hard - but doable if we split it right in pilot. |
Ack, agree its hard. I thought the previous conclusion was "its hard; we will just disable SDS for our build". If it is instead "its hard; we will do a lot of work to make it happen" that is great. Have you proofed out that there is a viable path towards doing it? So we don't do a bunch of refactoring only to find a few bits are infeasible to isolate? |
I didn't validate end-to-end - but it looks pretty viable. We can hold on until I prepare the full change, and then stage the PRs. |
Pushed the whole thing here. |
Obviously, not complete since tests fail :) But you get the idea. The final binary size is 28MB. |
@kyessenov: The following tests failed, say
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here. |
May be we should do the end state PR in another PR using these changes to validate and we can review this PR first if every thing is found to be viable? |
Yes, that is the plan. Once John confirms, I'll revert back and review them in pieces. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall looks pretty good. A bit harder to follow the code due to the indirections/abstractions but that is pretty unavoidable, and I think its pretty reasonable.
We drop like 7mb off pilot-agent which is great
Since this is probably the most critical part of Istio code want to make sure we get thorough review from @hzxuzhonghu @ramaraochavali @keithmattix as well before moving forward on this (and myself -- I went through most of it but could use a deeper look)
return nil | ||
} | ||
con.ids = ids | ||
con.s = s |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Connection having a reference to the DiscoveryServer seems suspicious
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a way around defining a discovery server interface. The context has all the info. I don't think it adds any circular data structures.
// start := time.Now() | ||
// XXX defer func() { recordSendTime(time.Since(start)) }() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably obvious by the XXX
, but need to make sure we can add this back
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, logging/metrics needs to be improved in the final version.
s.Lock() | ||
defer s.Unlock() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to hold the lock this entire time? seems like a hung PushChannel could block everything forever.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh nvm, we go func
it
func (c *Context) Watcher() xds.Watcher { | ||
return c.w | ||
} | ||
func (c *Watch) DeleteWatchedResource(string) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: maybe add some assertions the typeurl is always secret?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, will do.
"istio.io/istio/pkg/util/sets" | ||
) | ||
|
||
var log = istiolog.RegisterScope("xdsserver", "xds debugging") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: we might consider using the old ads
log scope here... normally I wouldn't care much, but these are the most core logs in Istio, so a lot of people have tooling around processing these. It would also be consistent with delta
, which is still all under the delta
scope
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, I can move the scope here.
c.watch = f(c.watch) | ||
} | ||
func (c *Watch) GetID() string { | ||
return "" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we return a useful ID here? Maybe we can use the fact we know we only subscribe to 1 resource to make the resource name the log, which would be super handy. Although for the initial connect it would be not yet available
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm worried that grabbing a lock can cause issues so the list of resources is probably not easy. Not sure what to put here.
My personal preference remains on simplifying and not having an SDS server in agent just to load a file. Envoy using SDS for 3P integrations or to get ingress secrets from Istiod is fine - but running an XDS server and all this complexity for a single file seems too much. |
Agree, but @howardjohn has strong concerns about changing the code structurally. This is a refactor, so it's much less risky - there's always a risk but the current architecture is understood better. |
On Fri, Apr 19, 2024, 15:42 Kuat ***@***.***> wrote:
My personal preference remains on simplifying and not having an SDS server
in agent just to load a file.
Envoy using SDS for 3P integrations or to get ingress secrets from Istiod
is fine - but running an XDS server and all this complexity for a single
file seems too much.
Agree, but @howardjohn <https://github.com/howardjohn> has strong
concerns about changing the code structurally. This is a refactor, so it's
much less risky - there's always a risk but the current architecture is
understood better.
I don't know how well understood it is - or if it's 'architecture' or a
long series of code changes that may be long obsolete.
If I remember correctly we started with files, and switched to SDS because
Envoy didn't support reloading files. That's no longer the case. It was
also for Ingress with mounted files - but we moved to Istiod dealing with
the Secrets.
If someone knows any reason still valid for this 'architecture' - would be
great if more people undertood it..
…
Message ID: ***@***.***>
|
I'm not against this change - just trying to understand.
…On Fri, Apr 19, 2024, 16:07 Costin Manolache ***@***.***> wrote:
On Fri, Apr 19, 2024, 15:42 Kuat ***@***.***> wrote:
> My personal preference remains on simplifying and not having an SDS
> server in agent just to load a file.
>
> Envoy using SDS for 3P integrations or to get ingress secrets from Istiod
> is fine - but running an XDS server and all this complexity for a single
> file seems too much.
>
> Agree, but @howardjohn <https://github.com/howardjohn> has strong
> concerns about changing the code structurally. This is a refactor, so it's
> much less risky - there's always a risk but the current architecture is
> understood better.
>
I don't know how well understood it is - or if it's 'architecture' or a
long series of code changes that may be long obsolete.
If I remember correctly we started with files, and switched to SDS because
Envoy didn't support reloading files. That's no longer the case. It was
also for Ingress with mounted files - but we moved to Istiod dealing with
the Secrets.
If someone knows any reason still valid for this 'architecture' - would be
great if more people undertood it..
>
> Message ID: ***@***.***>
>
|
@costinm We should wait for k8s to define the standard mounting point for SPIFFE certs. There's a linked KEP. Once that is adopted, that's a strong reason to use files directly, and integrate MeshCA, etc with k8s directly, without SDS or Istio. Simply changing Istio to write files locally feels like busy work. |
I suspect writing files ( which we already support through one of the env vars - for proxyless) is far smaller PR than refactoring SDS. And the work is moving us in the right direction. The K8S KEP is not that far away from what CertManager support or is easily emulated - even if it is approved it'll take a long time to make it in all supported k8s versions. By that time we'll probably have ztunnel and sandwiches making it mostly irrelevant. I'm also not sure k8s mounted certs will be in a form envoy will reload - could you confirm ? I thought reload works with the certs in some json form that looks like sds on disk - maybe that changed. |
I was going to say it may be relevant for ztunnel only, but of course it may not work with the current architecture.... |
Its not just reading files. It handles translating certificate files to envoy formats (envoy doesn't read raw .pem), abstracting "system" certificates which vary per platform, various custom private key providers (cryptomb, etc), merging multiple trust bundles and dynamically updating them, ... Also, there has been some desire to not store private key material on disk. |
PR needs rebase. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
On Fri, Apr 19, 2024, 16:47 John Howard ***@***.***> wrote:
Its not just reading files. It handles translating certificate files to
envoy formats (envoy doesn't read raw .pem), abstracting "system"
certificates which vary per platform, various custom private key providers
(cryptomb, etc), merging multiple trust bundles and dynamically updating
them, ...
Sure, but for 99% of users it's about reading a cert from disk. Envoy has
some weirdness - the rest is Istio invented complexity. If you look a t
proxyless gRPC and most TLS uses outside Istio - it's just files.
Not saying we should stop supporting SDS in envoy and as integration point
- but we can have a simpler agent for the common case, without the insane
complexity for use cases that are rarely needed ( as evidenced by the fact
everyone else can use simpler files for certs)
Also, there has been some desire to not store private key material on disk.
And that remains possible - as an integration. Envoy can still use SDS - my
suggestion is to just remove it from the agent ( or to have an alternative
agent that is small and simple).
For example Envoy Gateway doesn't even have an agent.
… .Message ID: ***@***.***>
|
There are file mounted certs use case that looks at various user configured locations reads/updates when they change, validate them before sending to Envoy etc etc. As @howardjohn mentioned it is not just simple read of files. |
This is still a very valid usecase for many companies like us |
That would break many things. EnvoyGateway is simpler compared with istio. I think in EG there is no cert signing, just pushing user provided certificates is enough |
@hzxuzhonghu @ramaraochavali Are we OK with going forward with this refactor? I don't see a problem with adding a simplified file based SDS later on per @costinm, it'd be a new feature anyways. We can do it as soon as the branch is cut. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the goal is to have a reduced binary size, there doesn't seem to be much point in having more options.
small+large=large
in binary size
@howardjohn It depends, the current proposal that gRPC implemented uses code in Envoy to retrieve certificates directly (without SDS at all). That would reduce the binary size. Again, there's no final design for this right now. |
Rama - there are many use cases, but they are better addressed using the
extensibility mechanisms and a custom provider.
CertManager, Spire, new K8S cert proposal, etc do provide support for a
large number of use cases. Spire ( and others ) may
provide a SDS socket - the others will provide files. If you need something
even more custom - it is possible to have
a daemon set or extend one of the providers.
Istio has a minimal CA - with far from ideal security model and feature set
- for the common use case. We added a ton of
shadow API ( annotations, env, MeshConfig etc) that allow a mostly
undocumented set of use cases - plus a lot
of complicated code in the agent to support them. Most of this will
probably break anyway with ambient.
Having an alternative - where Envoy only loads files ( provided by a K8S
CSI-like provider - CertManager, etc ) or
by a SDS provider ( Spire, custom ones, etc ) - and no agent-related SDS -
would reduce complexity and
provide a more clear API surface. With conditional compilation or an
alternative clean agent users can avoid
a lot of dependencies and risks - it's not about binary size, but
complexity, support, stability and interoperability.
…On Mon, Apr 22, 2024 at 10:41 AM Kuat ***@***.***> wrote:
@howardjohn <https://github.com/howardjohn> It depends, the current
proposal that gRPC implemented uses code in Envoy to retrieve certificates
directly (without SDS at all). That would reduce the binary size. Again,
there's no final design for this right now.
—
Reply to this email directly, view it on GitHub
<#50515 (comment)>, or
unsubscribe
<https://github.com/notifications/unsubscribe-auth/AAAUR2WOMO7B6QKAQK4MQILY6VDTJAVCNFSM6AAAAABGL7B2O2VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDANZQGM4DKOJXHE>
.
You are receiving this because you were mentioned.Message ID:
***@***.***>
|
Add several interfaces to encapsulate push context, generators, proxy, etc, that are not needed by SDS server.
Issue: #50134