Unable to reach the kube-dns from external workload #32517

Closed
2 of 3 tasks
Timen-GitHub-User opened this issue May 13, 2024 · 4 comments
Labels
info-completed: The GH issue has received a reply from the author
kind/bug: This is a bug in the Cilium logic.
kind/community-report: This was reported by a user in the Cilium community, eg via Slack.
needs/triage: This issue requires triaging to establish severity and next steps.

Comments

@Timen-GitHub-User

Timen-GitHub-User commented May 13, 2024

Is there an existing issue for this?

  • I have searched the existing issues

What happened?

Environment:

K8-node:
Running minikube on my WSL2 Ubuntu instance with a custom-compiled kernel built following this guide: https://wsl.dev/wslcilium/. I stopped at checkpoint 1, as the rest was irrelevant for my setup. I'm running this in WSL2 instead of VirtualBox because of performance issues.

External workload:
Running an Ubuntu VM in VirtualBox with one NAT interface.

OpenVPN:
An OpenVPN server (daemon) set up on the WSL2 Ubuntu instance, using a "tun" interface to connect the external VM directly to the K8-node. See the config file below.

https://docs.cilium.io/en/v1.15/network/external-workloads/ is the guide I followed for the external workload setup.

Spinning up a cluster with these commands:

  • minikube start --network-plugin=cni --enable-default-cni=false
  • cilium install --version 1.15.4 --set routingMode=tunnel
  • cilium clustermesh enable --service-type LoadBalancer --enable-external-workloads
  • minikube tunnel -c --bind-address 172.16.0.1 # IP address of the OpenVPN tun interface. The clustermesh-apiserver LoadBalancer service shows 127.0.0.1 as its external IP, but it is reachable through the VPN.
  • cilium clustermesh vm create external-vm -n default --ipv4-alloc-cidr 192.168.69.0/30
  • cilium clustermesh vm install install-external-workload.sh # change "CLUSTER_ADDR" to 172.16.0.1 after the file is created, then copy it to the external VM (see the sketch below)
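
Roughly how I adjusted and copied the script (a sketch of my steps; it assumes the generated script assigns CLUSTER_ADDR near the top, and the VM hostname is illustrative):

# point the script at the OpenVPN tun IP instead of the LoadBalancer's 127.0.0.1 external IP
sed -i 's/^CLUSTER_ADDR=.*/CLUSTER_ADDR=172.16.0.1/' install-external-workload.sh
# copy it to the external VM
scp install-external-workload.sh user@external-vm:~/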

Command on External Workload:

  • sudo HOST_IP=172.16.0.2 ./install-external-workload.sh

Network:
(network topology diagram attached as an image)

What I tried

I know the 10.96.0.0/24 network isn't known in the routing table of the external workload, so I tried two things (see the sketch after this list):

  • setting the destination of "10.0.0.0/24 via 192.168.69.1 dev cilium_host proto kernel src 192.168.69.1 mtu 1450" to 10.0.0.0/8
  • creating a route directly via the OpenVPN interface.

Neither worked. I also changed my network setup so that there is only one 10.0.0.0/8 IP range in my entire setup (moving away from the default OpenVPN subnet 10.8.0.0/24, etc.), so the 10.0.0.0/8 route could work without conflicting routes.
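
Reconstructed, the two attempts looked roughly like this on the external VM (a sketch; the exact prefixes are from memory):

# attempt 1: widen the cilium_host route so the service range falls under it
sudo ip route replace 10.0.0.0/8 via 192.168.69.1 dev cilium_host src 192.168.69.1 mtu 1450
# attempt 2: send the service range straight over the OpenVPN interface instead
sudo ip route add 10.96.0.0/24 via 172.16.0.1 dev tun0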

Output:

K8-node:

user@ubuntu:~/CiliumEW$ ifconfig
eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 172.28.143.153  netmask 255.255.240.0  broadcast 172.28.143.255
        inet6 fe80::215:5dff:fe86:4629  prefixlen 64  scopeid 0x20<link>
        ether 00:15:5d:86:46:29  txqueuelen 1000  (Ethernet)
        RX packets 20631  bytes 3747594 (3.7 MB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 9772  bytes 2160617 (2.1 MB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
        inet 127.0.0.1  netmask 255.0.0.0
        inet6 ::1  prefixlen 128  scopeid 0x10<host>
        loop  txqueuelen 1000  (Local Loopback)
        RX packets 225155  bytes 178766179 (178.7 MB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 225155  bytes 178766179 (178.7 MB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

tun0: flags=4305<UP,POINTOPOINT,RUNNING,NOARP,MULTICAST>  mtu 1500
        inet 172.16.0.1  netmask 255.255.255.0  destination 172.16.0.1
        inet6 fe80::81e6:5d1f:cea2:bfe4  prefixlen 64  scopeid 0x20<link>
        unspec 00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00  txqueuelen 500  (UNSPEC)
        RX packets 12007  bytes 1387802 (1.3 MB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 7435  bytes 1278344 (1.2 MB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
user@ubuntu:~/CiliumEW$ ip route
default via 172.28.128.1 dev eth0 proto kernel
172.16.0.0/24 dev tun0 proto kernel scope link src 172.16.0.1
172.28.128.0/20 dev eth0 proto kernel scope link src 172.28.143.153
user@ubuntu:~/CiliumEW$ cilium status
    /¯¯\
 /¯¯\__/¯¯\    Cilium:             OK
 \__/¯¯\__/    Operator:           OK
 /¯¯\__/¯¯\    Envoy DaemonSet:    disabled (using embedded mode)
 \__/¯¯\__/    Hubble Relay:       disabled
    \__/       ClusterMesh:        OK

Deployment             clustermesh-apiserver    Desired: 1, Ready: 1/1, Available: 1/1
Deployment             cilium-operator          Desired: 1, Ready: 1/1, Available: 1/1
DaemonSet              cilium                   Desired: 1, Ready: 1/1, Available: 1/1
Containers:            cilium                   Running: 1
                       cilium-operator          Running: 1
                       clustermesh-apiserver    Running: 1
Cluster Pods:          3/2 managed by Cilium
Helm chart version:
Image versions         cilium                   quay.io/cilium/cilium:v1.15.4@sha256:b760a4831f5aab71c711f7537a107b751d0d0ce90dd32d8b358df3c5da385426: 1
                       cilium-operator          quay.io/cilium/operator-generic:v1.15.4@sha256:404890a83cca3f28829eb7e54c1564bb6904708cdb7be04ebe69c2b60f164e9a: 1
                       clustermesh-apiserver    quay.io/cilium/clustermesh-apiserver:v1.15.4@sha256:3fadf85d2aa0ecec09152e7e2d57648bda7e35bdc161b25ab54066dd4c3b299c: 2
user@ubuntu:~/CiliumEW$ k get svc -n kube-system
NAME                            TYPE           CLUSTER-IP      EXTERNAL-IP   PORT(S)                  AGE
clustermesh-apiserver           LoadBalancer   10.104.22.219   127.0.0.1     2379:30921/TCP           4h21m
clustermesh-apiserver-metrics   ClusterIP      None            <none>        9962/TCP,9963/TCP        4h21m
hubble-peer                     ClusterIP      10.109.75.165   <none>        443/TCP                  4h22m
kube-dns                        ClusterIP      10.96.0.10      <none>        53/UDP,53/TCP,9153/TCP   4h22m
user@ubuntu:~/CiliumEW$ kubectl get cep
NAME          SECURITY IDENTITY   ENDPOINT STATE   IPV4         IPV6
external-vm   23062               ready            172.16.0.2   fc00::10ca:1

External Workload:

user@external-vm:/var/www/html$ ifconfig
cilium_host: flags=4291<UP,BROADCAST,RUNNING,NOARP,MULTICAST>  mtu 1500
        inet 192.168.69.1  netmask 255.255.255.255  broadcast 0.0.0.0
        inet6 f00d::a04:0:0:35d4  prefixlen 128  scopeid 0x0<global>
        inet6 fe80::2819:e6ff:fe85:a5ee  prefixlen 64  scopeid 0x20<link>
        ether 2a:19:e6:85:a5:ee  txqueuelen 1000  (Ethernet)
        RX packets 4  bytes 440 (440.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 5  bytes 550 (550.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

cilium_net: flags=4291<UP,BROADCAST,RUNNING,NOARP,MULTICAST>  mtu 1500
        inet6 fe80::280d:4dff:fe91:6b13  prefixlen 64  scopeid 0x20<link>
        ether 2a:0d:4d:91:6b:13  txqueuelen 1000  (Ethernet)
        RX packets 5  bytes 550 (550.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 4  bytes 440 (440.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

cilium_vxlan: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet6 fe80::c43:1bff:fe86:7713  prefixlen 64  scopeid 0x20<link>
        ether 0e:43:1b:86:77:13  txqueuelen 1000  (Ethernet)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 744  bytes 43441 (43.4 KB)
        TX errors 5  dropped 0 overruns 0  carrier 5  collisions 0

docker0: flags=4099<UP,BROADCAST,MULTICAST>  mtu 1500
        inet 172.17.0.1  netmask 255.255.0.0  broadcast 172.17.255.255
        ether 02:42:c4:a3:a7:5b  txqueuelen 0  (Ethernet)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 0  bytes 0 (0.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

enp0s3: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 172.16.2.4  netmask 255.255.255.0  broadcast 172.16.2.255
        inet6 fe80::a00:27ff:fef8:571b  prefixlen 64  scopeid 0x20<link>
        ether 08:00:27:f8:57:1b  txqueuelen 1000  (Ethernet)
        RX packets 19340  bytes 14850446 (14.8 MB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 16383  bytes 2511451 (2.5 MB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
        inet 127.0.0.1  netmask 255.0.0.0
        inet6 ::1  prefixlen 128  scopeid 0x10<host>
        loop  txqueuelen 1000  (Local Loopback)
        RX packets 1761  bytes 114857 (114.8 KB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 1761  bytes 114857 (114.8 KB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

tun0: flags=4305<UP,POINTOPOINT,RUNNING,NOARP,MULTICAST>  mtu 1500
        inet 172.16.0.2  netmask 255.255.255.0  destination 172.16.0.2
        inet6 fe80::fb2f:2e8e:2e54:27fc  prefixlen 64  scopeid 0x20<link>
        unspec 00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00  txqueuelen 500  (UNSPEC)
        RX packets 3303  bytes 557599 (557.5 KB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 5357  bytes 612461 (612.4 KB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
user@external-vm:/var/www/html$ ip route
default via 172.16.2.1 dev enp0s3 proto dhcp src 172.16.2.4 metric 100
1.1.1.1 via 172.16.2.1 dev enp0s3 proto dhcp src 172.16.2.4 metric 100
8.8.8.8 via 172.16.2.1 dev enp0s3 proto dhcp src 172.16.2.4 metric 100
10.0.0.0/24 via 192.168.69.1 dev cilium_host proto kernel src 192.168.69.1 mtu 1450
172.16.0.0/24 dev tun0 proto kernel scope link src 172.16.0.2
172.16.2.0/24 dev enp0s3 proto kernel scope link src 172.16.2.4 metric 100
172.16.2.1 dev enp0s3 proto dhcp scope link src 172.16.2.4 metric 100
172.17.0.0/16 dev docker0 proto kernel scope link src 172.17.0.1 linkdown
192.168.69.0/30 via 192.168.69.1 dev cilium_host proto kernel src 192.168.69.1
192.168.69.1 dev cilium_host proto kernel scope link
user@external-vm:/var/www/html$ sudo cilium-dbg status
KVStore:                 Ok         etcd: 1/1 connected, leases=1, lock leases=1, has-quorum=true: https://clustermesh-apiserver.cilium.io:2379 - 3.5.13 (Leader)
Kubernetes:              Disabled
Host firewall:           Disabled
SRv6:                    Disabled
CNI Chaining:            none
Cilium:                  Ok   1.15.4 (v1.15.4-9b3f9a8c)
NodeMonitor:             Disabled
Cilium health daemon:    Ok
IPAM:                    IPv4: 1/2 allocated from 192.168.69.0/30, IPv6: 1/4294967294 allocated from f00d::a04:0:0:0/96
IPv4 BIG TCP:            Disabled
IPv6 BIG TCP:            Disabled
BandwidthManager:        Disabled
Host Routing:            Legacy
Masquerading:            IPTables [IPv4: Enabled, IPv6: Enabled]
Controller Status:       14/14 healthy
Proxy Status:            OK, ip 192.168.69.1, 0 redirects active on ports 10000-20000, Envoy: embedded
Global Identity Range:   min 256, max 65535
Hubble:                  Disabled
Encryption:              Disabled
Cluster health:                     Probe disabled
user@external-vm:/var/www/html$ nslookup -norecurse clustermesh-apiserver.kube-system.svc.cluster.local
;; communications error to 10.96.0.10#53: timed out
;; communications error to 10.96.0.10#53: timed out
;; communications error to 10.96.0.10#53: timed out
Server:         8.8.8.8
Address:        8.8.8.8#53

** server can't find clustermesh-apiserver.kube-system.svc.cluster.local: NXDOMAIN
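
For reference, the cluster DNS server can also be queried directly, which bypasses the local resolver's fallback to 8.8.8.8 (a sketch; dig is used here for the explicit @server syntax):

# ask kube-dns directly instead of relying on /etc/resolv.conf ordering
dig @10.96.0.10 +time=2 +tries=1 clustermesh-apiserver.kube-system.svc.cluster.local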

OpenVPN server.conf file:

local 172.28.143.153
dev tun
proto udp
port 1194
ca ###
cert ###
key ###
dh none
ecdh-curve ###
topology subnet
server 172.16.0.0 255.255.255.0
client-to-client
client-config-dir /etc/openvpn/ccd
keepalive 15 120
remote-cert-tls client
tls-version-min 1.2
tls-crypt ###
cipher AES-256-CBC
auth SHA256
user openvpn
group openvpn
persist-key
persist-tun
crl-verify ###
status /var/log/openvpn-status.log 20
status-version 3
syslog
verb 3
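
The matching client side is not shown; a minimal client.conf for this server would look roughly like this (paths are placeholders, and the remote address is shown as the server's local address purely for illustration):

client
dev tun
proto udp
remote 172.28.143.153 1194
remote-cert-tls server
tls-version-min 1.2
cipher AES-256-CBC
auth SHA256
ca ###
cert ###
key ###
tls-crypt ###
persist-key
persist-tun
verb 3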

Cilium Version

on minikube node:
cilium-cli: v0.16.4 compiled with go1.22.1 on linux/amd64
cilium image (default): v1.15.3
cilium image (stable): v1.15.4
cilium image (running): 1.15.4

on external workload:
Client: 1.15.4 9b3f9a8 2024-04-11T17:25:42-04:00 go version go1.21.9 linux/amd64
Daemon: 1.15.4 9b3f9a8 2024-04-11T17:25:42-04:00 go version go1.21.9 linux/amd64

Kernel Version

on minikube node:
Linux ubuntu 5.15.153.1-microsoft-standard-WSL2+ #1 SMP Tue Apr 30 09:50:49 CEST 2024 x86_64 x86_64 x86_64 GNU/Linux

on external workload:
Linux external-vm 5.15.0-105-generic #115-Ubuntu SMP Mon Apr 15 09:52:04 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux

Kubernetes Version

on k8-node:
minikube version: v1.33.0
commit: 86fc9d54fca63f295d8737c8eacdbb7987e89c67

Client Version: v1.29.2
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.30.0

on external workload:
Docker version 26.1.1, build 4cf5afa

Regression

No response

Sysdump

cilium-sysdump-20240513-154955.zip

Relevant log output

No response

Anything else?

No response

Cilium Users Document

  • Are you a user of Cilium? Please add yourself to the Users doc

Code of Conduct

  • I agree to follow this project's Code of Conduct
@Timen-GitHub-User Timen-GitHub-User added kind/bug This is a bug in the Cilium logic. kind/community-report This was reported by a user in the Cilium community, eg via Slack. needs/triage This issue requires triaging to establish severity and next steps. labels May 13, 2024
@squeed
Contributor

squeed commented May 16, 2024

Thanks for your thorough explanation! However, we generally shy away from such complex topologies, as things can easily go wrong.

I don't know that any of us have enough experience with WSL and docker on WSL. At the end of the day, our primary use-case is Kubernetes on Linux directly, and virtualized environments are generally best-effort for development.

Do you have any idea where the packets are being dropped?
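
For example, something like this might show whether the datapath on either side is dropping them (a sketch; adjust to your setup):

# on the external workload, watch for datapath drops
sudo cilium-dbg monitor --type drop
# the same from inside the Cilium agent pod on the cluster node
kubectl -n kube-system exec ds/cilium -- cilium-dbg monitor --type drop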

@squeed squeed added the need-more-info More information is required to further debug or fix the issue. label May 16, 2024
@Timen-GitHub-User
Author

Thanks for your thorough explanation! However, we generally shy away from such complex topologies, as things can easily go wrong.

I don't know that any of us have enough experience with WSL and docker on WSL. At the end of the day, our primary use-case is Kubernetes on Linux directly, and virtualized environments are generally best-effort for development.

Do you have any idea where the packets are being dropped?

Installed tshark to see if any packets are sent from the external workload. It looks like the DNS packets are sent to the coredns pod:

user@external-vm:~$ sudo tshark -i any -f 'net 10.0.0.0/24'
 ** (tshark:2313) 13:29:35.730322 [Main MESSAGE] -- Capture started.
 ** (tshark:2313) 13:29:35.732398 [Main MESSAGE] -- File: "/tmp/wireshark_any46XSN2.pcapng"
    5 9.339340948 192.168.160.70 → 10.0.0.52    DNS 113 Standard query 0x61e2 A clustermesh-apiserver.kube-system.svc.cluster.local
    6 14.344125352 192.168.160.70 → 10.0.0.52    DNS 113 Standard query 0x61e2 A clustermesh-apiserver.kube-system.svc.cluster.local
    7 15.462168068 192.168.160.70 → 10.0.0.39    TCP 76 [TCP Retransmission] [TCP Port numbers reused] 44960 → 4240 [SYN] Seq=0 Win=64860 Len=0 MSS=1410 SACK_PERM=1 TSval=154786063 TSecr=0 WS=128
    8 19.348757984 192.168.160.70 → 10.0.0.52    DNS 113 Standard query 0x61e2 A clustermesh-apiserver.kube-system.svc.cluster.local
user@ubuntu:~/CiliumEW$ k get pods -n kube-system coredns-7db6d8ff4d-9rb5d -o wide
NAME                       READY   STATUS    RESTARTS   AGE     IP          NODE       NOMINATED NODE   READINESS GATES
coredns-7db6d8ff4d-9rb5d   1/1     Running   0          9m12s   10.0.0.52   minikube   <none>           <none>

Using tshark on the cluster node doesn't seem to capture any packets, and my attempts at starting tshark in the coredns pod haven't been successful.
In this case the routing does seem to be correct, as the 10.0.0.0/24 subnet has been added to the routing table automatically.
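
What I tried on the node side looked roughly like this (a sketch; the filter assumes Cilium's default VXLAN tunnel port 8472, and tcpdump availability on the minikube node may vary):

# on the WSL2 host, watch the VPN interface for encapsulated or DNS traffic
sudo tshark -i tun0 -f 'udp port 8472 or port 53'
# inside the minikube node, where the tunnel traffic should be decapsulated
minikube ssh "sudo tcpdump -ni any 'udp port 8472 or port 53'"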

Regarding the complexity of the environment: I also tried it in an AWS VPC with 1 node and 1 external workload, both Ubuntu with Docker, and the cluster spun up with minikube. I still experience the same problem.

@github-actions github-actions bot added info-completed The GH issue has received a reply from the author and removed need-more-info More information is required to further debug or fix the issue. labels May 23, 2024
@lmb
Contributor

lmb commented May 24, 2024

I agree with squeed that this setup is probably out of scope of the issue tracker. Can you try building a simple reproducer and then asking for help on Slack?

@lmb lmb added the need-more-info More information is required to further debug or fix the issue. label May 24, 2024
@Timen-GitHub-User
Author

Timen-GitHub-User commented May 29, 2024

I created a setup with a real Kubernetes node (not minikube, k3s, kind, etc.) in an AWS VPC and added a VM in the same network segment as the node. I changed
cilium clustermesh enable --service-type LoadBalancer --enable-external-workloads to
cilium clustermesh enable --service-type NodePort --enable-external-workloads
to simplify the networking setup. It worked instantly.

I then tried to change the LoadBalancer service to a NodePort on my minikube setup, but it wasn't able to connect. I didn't do any further troubleshooting. My conclusion: don't use minikube for testing external workloads.
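
For reference, the working NodePort variant boiled down to roughly this (a sketch; the exact CLUSTER_ADDR value for the script depends on the node IP and the NodePort the service gets):

# expose the clustermesh-apiserver via a NodePort instead of a LoadBalancer
cilium clustermesh enable --service-type NodePort --enable-external-workloads
# look up the port the apiserver was exposed on
kubectl -n kube-system get svc clustermesh-apiserver -o jsonpath='{.spec.ports[0].nodePort}'
# regenerate the install script, point CLUSTER_ADDR at the node IP (and NodePort), then copy it to the VM
cilium clustermesh vm install install-external-workload.sh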

@github-actions github-actions bot removed the need-more-info More information is required to further debug or fix the issue. label May 29, 2024