Gateway API stops working after restart of single node 'cluster' #32596

Lennie · 2024-05-16T23:02:03Z

Is there an existing issue for this?

I have searched the existing issues

What happened?

On a single-node cluster, Gateway API stops working correctly after the machine is rebooted and everything has started again. In particular, deploying new Gateway resources fails. The cause seems fairly simple: at startup, Cilium tries to fetch the Gateway API CRDs from the Kubernetes API server and gets a timeout. It retries, but never gets the response it needs. Maybe the non-existence of the CRD is cached? It repeats for a while until it gives up.

Describing a new Gateway resource just shows: waiting for controller

Cilium Version

tested with:
1.15.4
1.15.5
1.16.0-pre.2

1.16.0-pre.2 has no output

Kernel Version

tested with:
6.1.0-18-amd64

Kubernetes Version

tested with: v1.29.5

Regression

No response

Sysdump

No response

Relevant log output

level=info msg="Checking for required GatewayAPI resources" requiredGVK="[gateway.networking.k8s.io/v1, Kind=gatewayclasses gateway.networking.k8s.io/v1, Kind=gateways gateway.networking.k8s.io/v1, Kind=httproutes gateway.networking.k8s.io/v1beta1, Kind=referencegrants gateway.networking.k8s.io/v1alpha2, Kind=grpcroutes gateway.networking.k8s.io/v1alpha2, Kind=tlsroutes]" subsys=gateway-api
level=error msg="Required GatewayAPI resources are not found, please refer to docs for installation instructions" error="Get \"https://10.96.0.1:443/apis/apiextensions.k8s.io/v1/customresourcedefinitions/gatewayclasses.gateway.networking.k8s.io\": dial tcp 10.96.0.1:443: i/o timeout" subsys=gateway-api

10.96.0.1 is the ClusterIP of the kube-apiserver.

Later on we do see it retrying, but it does not work. Maybe the non-existence is cached? It repeats for a while until it gives up:

level=error msg="kind must be registered to the Scheme" error="no kind is registered for the type v1.Gateway in scheme \"k8s.io/client-go/kubernetes/scheme/register.go:80\"" logger=controller-runtime.source.EventHandler subsys=controller-runtime
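The first error above reports the CRDs as "not found" even though the underlying error is an i/o timeout, so it may help to treat the two cases differently. Below is a minimal sketch of that idea, not Cilium's actual code: the names `NotFound`, `Unreachable`, `wait_for_crd`, and `make_flaky_check` are all hypothetical, and the API server is simulated rather than contacted.

```python
import time


class NotFound(Exception):
    """The API server answered and the resource does not exist."""


class Unreachable(Exception):
    """The API server could not be reached (for example, a dial timeout)."""


def wait_for_crd(check, attempts=5, base_delay=0.5, sleep=time.sleep):
    """Call check() until it succeeds, retrying only when the server is unreachable.

    Returns the number of calls made. NotFound propagates immediately because
    retrying cannot create a missing CRD. Unreachable is retried with
    exponential backoff and re-raised once the attempts are used up.
    """
    for attempt in range(1, attempts + 1):
        try:
            check()
            return attempt
        except Unreachable:
            if attempt == attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))


def make_flaky_check(failures):
    """Hypothetical stand-in for a CRD lookup: times out `failures` times, then succeeds."""
    state = {"left": failures}

    def check():
        if state["left"] > 0:
            state["left"] -= 1
            raise Unreachable("dial tcp 10.96.0.1:443: i/o timeout")

    return check


if __name__ == "__main__":
    delays = []  # record the backoff delays instead of actually sleeping
    calls = wait_for_crd(make_flaky_check(2), sleep=delays.append)
    print(calls, delays)  # prints: 3 [0.5, 1.0]
```

With this split, a timeout during a reboot would only delay the check, while a CRD that is actually missing would still be reported right away.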

Anything else?

No response

Cilium Users Document

Are you a user of Cilium? Please add yourself to the Users doc

Code of Conduct

I agree to follow this project's Code of Conduct

lmb · 2024-05-21T13:39:14Z

Are you running into this on kind or similar? A way to reproduce this would be great.

Lennie added the kind/bug, kind/community-report, and needs/triage labels on May 16, 2024

lmb added the need-more-info label on May 21, 2024
