Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Global usage of DaemonConfig is volatile #32557

Open
2 of 3 tasks
tommyp1ckles opened this issue May 15, 2024 · 0 comments
Open
2 of 3 tasks

Global usage of DaemonConfig is volatile #32557

tommyp1ckles opened this issue May 15, 2024 · 0 comments
Labels
kind/bug This is a bug in the Cilium logic. sig/agent Cilium agent related.

Comments

@tommyp1ckles
Copy link
Contributor

tommyp1ckles commented May 15, 2024

Is there an existing issue for this?

  • I have searched the existing issues

What happened?

Background

The global instance of *option.DaemonConfig (i.e.option.Config) is our legacy daemon configuration struct that contains many critical config flags. Historically all config was added to this and populated with viper to parse flags and config input.

Since moving to Hive, we've continued to use this for legacy reasons as this is heavily intertwined with many runtime startup processes in newDaemon. To complicate matters more, newDaemon does many parameter overrides and may mutated various fields.

For example:

https://github.com/cilium/cilium/blob/main/daemon/cmd/kube_proxy_replacement.go#L55-L59

Problem

The problem arises from the fact that option.Config is accessed as a global variable, as well as pointer provided to Hive. This latter being misleading in that taking a option.DaemonConfig reference as a dependency does not actually enforce any ordering constraints.

Furthermore, because newDaemon is actually run after the Hive has started, this means that configuration that can be mutated at this stage is not "finalized" until runtime. That is, the only way to synchronize on a finalized version of DaemonConfig is to Await() on the promise.Promise[*Daemon] type.

This is not ideal as it may force moving logic into the "runtime" phase (i.e. after hive.Start()) - essentially re-implementing dependency ordering using promise types.

The ultimate solution would be to break up DaemonConfig and move various bits of configuration into config structs provided by cell.Config. However this is easier said than done, take for example EnableNodeConfig as it can be overridden in at least these three places (specifically, in the order presented):

  1. initKubeProxyReplacement: first pass at initializing kpr config, performs host OS system probing to determine configuration overrides.
  2. device detection: if certain interface devices cannot be detected we disable enable-node-port.
  3. finishInitKubeProxyReplacement: finalizes initialing kpr config.

Thus, to make EnableNodePort stable prior to runtime we'd need to move all the above logic in the Populate phase.
This seems not ideal for the following reasons:

  • Ideally, the "Populate" phase should minimize any IO based tasks, as it's purpose is to create a dependency graph based only on static configuration.
  • As an example, moving things like KPR probing or device detection to the Populate phase may break things like cilium-agent hive this could now fail due to things like host OS state.
  • Errors returned from hive.Populate() are not really intended to ergonomically expose user facing errors, as the error would be prefixed with a bunch of hive dependency context.

Mitigation and Solutions

Proposed mitigation: #32558

Cilium Version

All existing versions

Kernel Version

n/a

Kubernetes Version

n/a

Regression

n/a

Sysdump

No response

Relevant log output

No response

Anything else?

No response

Cilium Users Document

  • Are you a user of Cilium? Please add yourself to the Users doc

Code of Conduct

  • I agree to follow this project's Code of Conduct
@tommyp1ckles tommyp1ckles added the kind/bug This is a bug in the Cilium logic. label May 15, 2024
tommyp1ckles added a commit to tommyp1ckles/cilium that referenced this issue May 15, 2024
As described in this issue: cilium#32557

Access to a subset of fields in *DaemonConfig is dangerous as
they are not finalized until runtime.

This makes access to those fields  explicit by putting them behind
a call to Volatile() - as well as any derived config function (i.e.
anything that uses a 'volatile' field).

This is intended to clearly distinguish daemon config that is not
finalized until newDaemon is complete.

In subsequent commits we will hook functionality into the Volatile
function to check for improper usage.

Signed-off-by: Tom Hadlaw <tom.hadlaw@isovalent.co
@squeed squeed added the sig/agent Cilium agent related. label May 16, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/bug This is a bug in the Cilium logic. sig/agent Cilium agent related.
Projects
None yet
Development

No branches or pull requests

2 participants