
[BUG] azure rke1 node driver not working - websocket: close 1006 (abnormal closure): unexpected EOF #45398

Open
slickwarren opened this issue May 6, 2024 · 1 comment
slickwarren commented May 6, 2024

Rancher Server Setup

  • Rancher version: v2.8-head (196505e)
  • Installation option (Docker install/Helm Chart):
    • If Helm Chart, Kubernetes Cluster and version (RKE1, RKE2, k3s, EKS, etc): k3s 1.28.9+k3s1
  • Proxy/Cert Details: cert-manager

Information about the Cluster

  • Kubernetes version: any (tested on 1.28.9 and 1.27.13 rancher1-1)
  • Cluster Type (Local/Downstream): downstream
    • If downstream, what type of cluster? (Custom/Imported or specify provider for Hosted/Infrastructure Provider): node driver, Azure

User Information

  • What is the role of the user logged in? (Admin/Cluster Owner/Cluster Member/Project Owner/Project Member/Custom)
    Tested with Admin and Standard user, cluster owner

Describe the bug

Provisioning with the Azure node driver is not working as expected: on a new cluster setup, provisioning hangs after the last node has registered with the cluster, and the cluster remains stuck in a waiting state.

To Reproduce

Provision an RKE1 node driver cluster using the default settings for Azure.

Result

All nodes are active, but the cluster is stuck in a waiting state.

Expected Result

The cluster should come to an active state.

Screenshots

Additional context

Logs from the local cluster that may be relevant:

2024/05/06 21:27:26 [INFO] EnsureSecretForServiceAccount: waiting for secret [cattle-impersonation-u-lo7gxkk5cu-token-vw2fj] to be populated with token
2024/05/06 21:27:34 [INFO] Creating system token for u-lo7gxkk5cu, token: agent-u-lo7gxkk5cu
2024/05/06 21:27:36 [INFO] Handling backend connection request [c-fwb8b:m-6b2lb]
2024/05/06 21:27:36 [INFO] error in remotedialer server [400]: websocket: close 1006 (abnormal closure): unexpected EOF
2024/05/06 21:27:40 [INFO] kontainerdriver rancherkubernetesengine listening on address 127.0.0.1:45531
2024/05/06 21:27:41 [INFO] kontainerdriver rancherkubernetesengine stopped
2024/05/06 21:27:41 [INFO] clusterDeploy: redeployAgent: redeploy Rancher agents due to toleration mismatch for [c-fwb8b], was [[]] and will be [[{node-role.kubernetes.io/controlplane true NoSchedule <nil>}]]
2024-05-06T21:27:41.369372608Z 2024/05/06 21:27:41 [INFO] Creating system token for u-lo7gxkk5cu, token: agent-u-lo7gxkk5cu
W0506 21:27:47.369973      39 warnings.go:80] cluster.x-k8s.io/v1alpha3 MachineDeployment is deprecated; use cluster.x-k8s.io/v1beta1 MachineDeployment
2024/05/06 21:27:48 [INFO] kontainerdriver rancherkubernetesengine listening on address 127.0.0.1:32927
2024/05/06 21:27:48 [INFO] kontainerdriver rancherkubernetesengine stopped
W0506 21:27:50.595001      39 reflector.go:458] pkg/mod/github.com/rancher/client-go@v1.28.6-rancher1/tools/cache/reflector.go:229: watch of *v1.ClusterRoleBinding ended with: an error on the server ("unable to decode an event from the watch stream: tunnel disconnect") has prevented the request from succeeding
2024-05-06T21:27:50.595593687Z W0506 21:27:50.595458      39 reflector.go:458] pkg/mod/github.com/rancher/client-go@v1.28.6-rancher1/tools/cache/reflector.go:229: watch of *v1.ClusterRole ended with: an error on the server ("unable to decode an event from the watch stream: tunnel disconnect") has prevented the request from succeeding
2024-05-06T21:27:50.595602727Z W0506 21:27:50.595528      39 reflector.go:458] pkg/mod/github.com/rancher/client-go@v1.28.6-rancher1/tools/cache/reflector.go:229: watch of *v1.RoleBinding ended with: an error on the server ("unable to decode an event from the watch stream: tunnel disconnect") has prevented the request from succeeding
W0506 21:27:50.595807      39 reflector.go:458] pkg/mod/github.com/rancher/client-go@v1.28.6-rancher1/tools/cache/reflector.go:229: watch of *v1.ServiceAccount ended with: an error on the server ("unable to decode an event from the watch stream: tunnel disconnect") has prevented the request from succeeding
2024-05-06T21:27:50.595900252Z W0506 21:27:50.595847      39 reflector.go:458] pkg/mod/github.com/rancher/client-go@v1.28.6-rancher1/tools/cache/reflector.go:229: watch of *v1.Role ended with: an error on the server ("unable to decode an event from the watch stream: tunnel disconnect") has prevented the request from succeeding
2024-05-06T21:27:50.595904052Z W0506 21:27:50.595868      39 reflector.go:458] pkg/mod/github.com/rancher/client-go@v1.28.6-rancher1/tools/cache/reflector.go:229: watch of *v1.Secret ended with: an error on the server ("unable to decode an event from the watch stream: tunnel disconnect") has prevented the request from succeeding
2024-05-06T21:27:50.596362629Z W0506 21:27:50.596095      39 reflector.go:458] pkg/mod/github.com/rancher/client-go@v1.28.6-rancher1/tools/cache/reflector.go:229: watch of *v1.Namespace ended with: an error on the server ("unable to decode an event from the watch stream: tunnel disconnect") has prevented the request from succeeding
2024/05/06 21:27:53 [INFO] kontainerdriver rancherkubernetesengine listening on address 127.0.0.1:36511
2024/05/06 21:27:53 [INFO] kontainerdriver rancherkubernetesengine stopped
I0506 21:28:35.246033      39 trace.go:236] Trace[793466831]: "Reflector ListAndWatch" name:pkg/mod/github.com/rancher/client-go@v1.28.6-rancher1/tools/cache/reflector.go:229 (06-May-2024 21:27:51.855) (total time: 43390ms):
2024-05-06T21:28:35.246183149Z Trace[793466831]: ---"Objects listed" error:<nil> 43390ms (21:28:35.245)
2024-05-06T21:28:35.246186899Z Trace[793466831]: [43.390929326s] [43.390929326s] END

Currently tested using 1 node per role.
The issue does not affect the Linode node driver.
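
For narrowing this down on the downstream side: the remotedialer close 1006 in the local cluster logs suggests the cattle-cluster-agent tunnel to Rancher is being dropped, so (assuming direct kubectl access to the downstream cluster) the agent logs are a reasonable place to check:

  kubectl -n cattle-system logs deployment/cattle-cluster-agent --tail=100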

@slickwarren slickwarren added kind/bug Issues that are defects reported by users or that we know have reached a real release kind/bug-qa Issues that have not yet hit a real release. Bugs introduced by a new feature or enhancement status/release-blocker area/provisioning-rke1 Provisioning issues with RKE1 team/hostbusters The team that is responsible for provisioning/managing downstream clusters + K8s version support team/rke1 labels May 6, 2024
@slickwarren slickwarren self-assigned this May 6, 2024
jiaqiluo commented May 7, 2024

This turns out to be a known issue. The suggested workaround of changing the dnsPolicy on the cattle-cluster-agent deployment from ClusterFirst to Default works and brings the cluster to an active state.
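
For reference, a minimal sketch of applying that workaround on the downstream cluster (assuming the agent deployment lives in the default cattle-system namespace; verify the namespace in your setup):

  kubectl -n cattle-system patch deployment cattle-cluster-agent \
    --type='json' \
    -p='[{"op": "replace", "path": "/spec/template/spec/dnsPolicy", "value": "Default"}]'

The deployment then rolls out new agent pods with dnsPolicy: Default; the same change can also be made by editing the deployment directly with kubectl edit.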
