
[BUG] azure rke1 node driver not working - websocket: close 1006 (abnormal closure): unexpected EOF #45398

Open
slickwarren opened this issue May 6, 2024 · 1 comment
slickwarren commented May 6, 2024

Rancher Server Setup

  • Rancher version: v2.8-head (196505e)
  • Installation option (Docker install/Helm Chart):
    • If Helm Chart, Kubernetes Cluster and version (RKE1, RKE2, k3s, EKS, etc): k3s 1.28.9+k3s1
  • Proxy/Cert Details: cert-manager

Information about the Cluster

  • Kubernetes version: any (tested on 1.28.9 and 1.27.13 rancher1-1)
  • Cluster Type (Local/Downstream): downstream
    • If downstream, what type of cluster? (Custom/Imported or specify provider for Hosted/Infrastructure Provider): node driver, Azure

User Information

  • What is the role of the user logged in? (Admin/Cluster Owner/Cluster Member/Project Owner/Project Member/Custom)
    Tested with Admin and Standard user, cluster owner

Describe the bug

Provisioning with the Azure node driver is not working as expected: on a new cluster setup, provisioning hangs after the last node has registered with the cluster, and the cluster remains stuck in a waiting state.

To Reproduce

Provision an RKE1 node driver cluster using the default settings for Azure.

Result

All nodes are active, but the cluster is stuck in a waiting state.

Expected Result

The cluster should come to an active state.

Screenshots

Additional context

Logs from the local cluster that may be relevant:

2024/05/06 21:27:26 [INFO] EnsureSecretForServiceAccount: waiting for secret [cattle-impersonation-u-lo7gxkk5cu-token-vw2fj] to be populated with token
2024/05/06 21:27:34 [INFO] Creating system token for u-lo7gxkk5cu, token: agent-u-lo7gxkk5cu
2024/05/06 21:27:36 [INFO] Handling backend connection request [c-fwb8b:m-6b2lb]
2024/05/06 21:27:36 [INFO] error in remotedialer server [400]: websocket: close 1006 (abnormal closure): unexpected EOF
2024/05/06 21:27:40 [INFO] kontainerdriver rancherkubernetesengine listening on address 127.0.0.1:45531
2024/05/06 21:27:41 [INFO] kontainerdriver rancherkubernetesengine stopped
2024/05/06 21:27:41 [INFO] clusterDeploy: redeployAgent: redeploy Rancher agents due to toleration mismatch for [c-fwb8b], was [[]] and will be [[{node-role.kubernetes.io/controlplane true NoSchedule <nil>}]]
2024-05-06T21:27:41.369372608Z 2024/05/06 21:27:41 [INFO] Creating system token for u-lo7gxkk5cu, token: agent-u-lo7gxkk5cu
W0506 21:27:47.369973      39 warnings.go:80] cluster.x-k8s.io/v1alpha3 MachineDeployment is deprecated; use cluster.x-k8s.io/v1beta1 MachineDeployment
2024/05/06 21:27:48 [INFO] kontainerdriver rancherkubernetesengine listening on address 127.0.0.1:32927
2024/05/06 21:27:48 [INFO] kontainerdriver rancherkubernetesengine stopped
W0506 21:27:50.595001      39 reflector.go:458] pkg/mod/github.com/rancher/client-go@v1.28.6-rancher1/tools/cache/reflector.go:229: watch of *v1.ClusterRoleBinding ended with: an error on the server ("unable to decode an event from the watch stream: tunnel disconnect") has prevented the request from succeeding
2024-05-06T21:27:50.595593687Z W0506 21:27:50.595458      39 reflector.go:458] pkg/mod/github.com/rancher/client-go@v1.28.6-rancher1/tools/cache/reflector.go:229: watch of *v1.ClusterRole ended with: an error on the server ("unable to decode an event from the watch stream: tunnel disconnect") has prevented the request from succeeding
2024-05-06T21:27:50.595602727Z W0506 21:27:50.595528      39 reflector.go:458] pkg/mod/github.com/rancher/client-go@v1.28.6-rancher1/tools/cache/reflector.go:229: watch of *v1.RoleBinding ended with: an error on the server ("unable to decode an event from the watch stream: tunnel disconnect") has prevented the request from succeeding
W0506 21:27:50.595807      39 reflector.go:458] pkg/mod/github.com/rancher/client-go@v1.28.6-rancher1/tools/cache/reflector.go:229: watch of *v1.ServiceAccount ended with: an error on the server ("unable to decode an event from the watch stream: tunnel disconnect") has prevented the request from succeeding
2024-05-06T21:27:50.595900252Z W0506 21:27:50.595847      39 reflector.go:458] pkg/mod/github.com/rancher/client-go@v1.28.6-rancher1/tools/cache/reflector.go:229: watch of *v1.Role ended with: an error on the server ("unable to decode an event from the watch stream: tunnel disconnect") has prevented the request from succeeding
2024-05-06T21:27:50.595904052Z W0506 21:27:50.595868      39 reflector.go:458] pkg/mod/github.com/rancher/client-go@v1.28.6-rancher1/tools/cache/reflector.go:229: watch of *v1.Secret ended with: an error on the server ("unable to decode an event from the watch stream: tunnel disconnect") has prevented the request from succeeding
2024-05-06T21:27:50.596362629Z W0506 21:27:50.596095      39 reflector.go:458] pkg/mod/github.com/rancher/client-go@v1.28.6-rancher1/tools/cache/reflector.go:229: watch of *v1.Namespace ended with: an error on the server ("unable to decode an event from the watch stream: tunnel disconnect") has prevented the request from succeeding
2024/05/06 21:27:53 [INFO] kontainerdriver rancherkubernetesengine listening on address 127.0.0.1:36511
2024/05/06 21:27:53 [INFO] kontainerdriver rancherkubernetesengine stopped
I0506 21:28:35.246033      39 trace.go:236] Trace[793466831]: "Reflector ListAndWatch" name:pkg/mod/github.com/rancher/client-go@v1.28.6-rancher1/tools/cache/reflector.go:229 (06-May-2024 21:27:51.855) (total time: 43390ms):
2024-05-06T21:28:35.246183149Z Trace[793466831]: ---"Objects listed" error:<nil> 43390ms (21:28:35.245)
2024-05-06T21:28:35.246186899Z Trace[793466831]: [43.390929326s] [43.390929326s] END

Currently tested using 1 node per role.
The issue does not affect the Linode node driver.
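
For narrowing this down on the downstream side: the remotedialer close 1006 in the local cluster logs suggests the cattle-cluster-agent tunnel to Rancher is being dropped, so (assuming direct kubectl access to the downstream cluster) the agent logs are a reasonable place to check:

  kubectl -n cattle-system logs deployment/cattle-cluster-agent --tail=100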

@slickwarren slickwarren added kind/bug Issues that are defects reported by users or that we know have reached a real release kind/bug-qa Issues that have not yet hit a real release. Bugs introduced by a new feature or enhancement status/release-blocker area/provisioning-rke1 Provisioning issues with RKE1 team/hostbusters The team that is responsible for provisioning/managing downstream clusters + K8s version support team/rke1 labels May 6, 2024
@slickwarren slickwarren self-assigned this May 6, 2024
jiaqiluo commented May 7, 2024

This turns out to be a known issue. The suggested workaround of changing the dnsPolicy on the cattle-cluster-agent deployment from ClusterFirst to Default works and brings the cluster to an active state.
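
For reference, a minimal sketch of applying that workaround on the downstream cluster (assuming the agent deployment lives in the default cattle-system namespace; verify the namespace in your setup):

  kubectl -n cattle-system patch deployment cattle-cluster-agent \
    --type='json' \
    -p='[{"op": "replace", "path": "/spec/template/spec/dnsPolicy", "value": "Default"}]'

The deployment then rolls out new agent pods with dnsPolicy: Default; the same change can also be made by editing the deployment directly with kubectl edit.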
