the cluster updates are not flowing to all the gateway pods #50670

Open
2 tasks done
bseenu opened this issue Apr 25, 2024 · 14 comments

Comments

@bseenu

bseenu commented Apr 25, 2024

Is this the right place to submit this?

  • This is not a security vulnerability or a crashing bug
  • This is not a question about how to use Istio

Bug Description

bash-3.2$  istioctl ps|egrep -i istio-ingressgateway
istio-ingressgateway-5dd4cc58d8-8r9b7.istio-system                                        Kubernetes     NOT SENT     SYNCED     SYNCED       SYNCED       NOT SENT     istiod-5756dc65b7-sw8sc     1.19.7
istio-ingressgateway-5dd4cc58d8-bpnc9.istio-system                                        Kubernetes     NOT SENT     SYNCED     SYNCED       SYNCED       NOT SENT     istiod-5756dc65b7-pl9ns     1.19.7
istio-ingressgateway-5dd4cc58d8-m9z8m.istio-system                                        Kubernetes     SYNCED       SYNCED     SYNCED       SYNCED       NOT SENT     istiod-5756dc65b7-42jmx     1.19.7

Version

bash-3.2$ istioctl version
client version: 1.19.7
control plane version: 1.19.7
data plane version: 1.17.3 (12 proxies), 1.18.3 (28 proxies), 1.18.5 (287 proxies), 1.19.7 (577 proxies)

bash-3.2$ kubectl version
Client Version: v1.29.3
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.27.11-eks-b9c9ed7
WARNING: version difference between client (1.29) and server (1.27) exceeds the supported minor version skew of +/-1

Additional Information

bash-3.2$ kubectl exec istiod-5756dc65b7-pl9ns -n istio-system -- curl -s 'http://localhost:8080/debug/config_dump?proxyID=istio-ingressgateway-5dd4cc58d8-8r9b7.istio-system'|jq -r '.configs[]|keys'
[
  "@type"
]
[
  "@type",
  "versionInfo"
]
[
  "@type",
  "dynamicListeners",
  "versionInfo"
]
[
  "@type"
]
[
  "@type",
  "dynamicRouteConfigs"
]
[
  "@type",
  "dynamicActiveSecrets"
]
[
  "@type"
]


bash-3.2$ kubectl exec istiod-5756dc65b7-pl9ns -n istio-system -- curl -s 'http://localhost:8080/debug/config_dump?proxyID=istio-ingressgateway-5dd4cc58d8-m9z8m.istio-system'|jq -r '.configs[]|keys'
[
  "@type"
]
[
  "@type",
  "dynamicActiveClusters",
  "versionInfo"
]
[
  "@type",
  "dynamicListeners",
  "versionInfo"
]
[
  "@type"
]
[
  "@type",
  "dynamicRouteConfigs"
]
[
  "@type",
  "dynamicActiveSecrets"
]
[
  "@type"
]

I fixed this by restarting (deleting) the gateway pods that did not have the updated cluster info. Looking at the proxy logs, the last CDS update happened about 20 days ago. Why does this happen, and how can it be handled?
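To spot affected gateways without reading the table by eye, the status output can be filtered for rows whose CDS column is not SYNCED. This is only a minimal sketch: the function name `find_unsynced_cds` is hypothetical, and it assumes the column layout shown in this thread for 1.19.7 (NAME, CLUSTER, CDS, ...), with columns separated by at least two spaces.

```python
import re

# Rows copied from the `istioctl ps` output above. Assumed column order
# (1.19.7): NAME, CLUSTER, CDS, LDS, EDS, RDS, ECDS, ISTIOD, VERSION.
SAMPLE = """\
istio-ingressgateway-5dd4cc58d8-8r9b7.istio-system     Kubernetes     NOT SENT     SYNCED     SYNCED       SYNCED       NOT SENT     istiod-5756dc65b7-sw8sc     1.19.7
istio-ingressgateway-5dd4cc58d8-bpnc9.istio-system     Kubernetes     NOT SENT     SYNCED     SYNCED       SYNCED       NOT SENT     istiod-5756dc65b7-pl9ns     1.19.7
istio-ingressgateway-5dd4cc58d8-m9z8m.istio-system     Kubernetes     SYNCED       SYNCED     SYNCED       SYNCED       NOT SENT     istiod-5756dc65b7-42jmx     1.19.7
"""


def find_unsynced_cds(proxy_status_output: str) -> list:
    """Return (proxy name, CDS status) pairs for rows whose CDS is not SYNCED."""
    unsynced = []
    for line in proxy_status_output.splitlines():
        # Split on runs of two or more spaces so that a status such as
        # "NOT SENT" (single space inside) stays in one field.
        fields = re.split(r"\s{2,}", line.strip())
        if len(fields) < 3 or fields[0] == "NAME":
            continue  # skip blank lines and the header row
        name, cds = fields[0], fields[2]
        if cds != "SYNCED":
            unsynced.append((name, cds))
    return unsynced


if __name__ == "__main__":
    for name, cds in find_unsynced_cds(SAMPLE):
        print(f"{name}: CDS is {cds}")
```

If the column order differs in another Istio version, the index used for the CDS field would need to change.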

@ldemailly
Contributor

Were istiod-5756dc65b7-pl9ns and istiod-5756dc65b7-sw8sc (not sent) different in any way from istiod-5756dc65b7-42jmx (synced)?
Any errors or issues in their logs?

Can you reproduce this if everything is running 1.19.10? I see you have 1.17.3 (12 proxies) - that's more than 2 minor versions behind

@bseenu
Author

bseenu commented Apr 25, 2024

Were istiod-5756dc65b7-pl9ns and istiod-5756dc65b7-sw8sc (not sent) different in any way from istiod-5756dc65b7-42jmx (synced)? Any errors or issues in their logs?

Nothing stood out to me apart from some duplicate ServiceEntry resources, which were causing errors on all of the istiod pods:

"message": "Duplicate cluster outbound|15199||localhost.service.entry found while pushing CDS"

Can you reproduce this if everything is running 1.19.10? I see you have 1.17.3 (12 proxies) - that's more than 2 minor versions behind

We do not have control over the proxy pods; they move to the new version only when their deployment is updated. I am not sure I can reproduce this anyway.

@howardjohn
Member

Do you have full Envoy proxy logs for the proxies with the issue?

@bseenu
Author

bseenu commented Apr 25, 2024

Attached (I removed the actual traffic logs and the "connected to upstream XDS server istiod" messages):
log.txt

@howardjohn
Member

Oh so they are rejected:

The table size of maglev must be prime

I didn't know about that constraint, and we don't validate it or document it -- but clearly we should
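As an illustration of the kind of check that could catch this before a DestinationRule is applied, here is a minimal sketch of a primality test on the table size. It is not Istio's validation code; the function names are hypothetical, and the only constraint it encodes is the one from the Envoy error message above (the table size must be prime).

```python
from typing import Optional


def is_prime(n: int) -> bool:
    """Trial division; fine for the table sizes a config would use."""
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    d = 3
    while d * d <= n:
        if n % d == 0:
            return False
        d += 2
    return True


def check_maglev_table_size(table_size: int) -> Optional[str]:
    """Return an error message for an invalid size, or None if it is acceptable.

    Hypothetical helper: it only checks primality, mirroring the Envoy
    message "The table size of maglev must be prime".
    """
    if not is_prime(table_size):
        return f"maglev tableSize {table_size} is not prime"
    return None


if __name__ == "__main__":
    for size in (500, 547):
        print(size, check_maglev_table_size(size) or "ok")
```

With the values from this thread, 500 is flagged and 547 passes.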

@bseenu
Author

bseenu commented Apr 25, 2024

The above problem is fixed. It was in one of the destination rules, whose proxy was stuck (Envoy was not coming up).

@howardjohn
Member

Do you mean you fixed the maglev error and still see the original issue of clusters not sent? Or are both fixed?

@bseenu
Author

bseenu commented Apr 26, 2024

Yes, the maglev issue is fixed now, but the cluster updates are still not happening unless the proxies are restarted

@howardjohn
Member

Ah got it. Is this happening repeatedly? Or do you have a few stuck pods and haven't restarted all of them?

If repeatedly, can you send a new log now that the maglev issue is fixed, to ensure that wasn't somehow messing with things?

@bseenu
Author

bseenu commented Apr 26, 2024

I still have some pods that I have not restarted and that are still broken.

@bseenu
Author

bseenu commented Apr 29, 2024

@howardjohn anything else we can check here?

@howardjohn
Member

@bseenu it's a bit hard to tell, because it could easily just be the maglev issue. It would help if you could reproduce it now that that error isn't present, and include the istiod logs.

@PrabhdeepsGill

Adding a bit more information about this issue and steps to reproduce it:

I looked at two proxies from the same ingress gateway, where one pod (replica jsnck) was not getting CDS updates sent to it.

Info from the /debug/syncz endpoint about the two proxies:

 {
    "cluster_id": "Kubernetes",
    "proxy": "istio-internal-ingressgateway-ff7f456dd-252qg.istio-system",
    "proxy_type": "router",
    "istio_version": "1.19.7",
    "cluster_sent": "224c32c9-6776-4997-8d72-59648893b2c2",
    "cluster_acked": "224c32c9-6776-4997-8d72-59648893b2c2",
    "listener_sent": "3f6b051c-5ca7-4356-92cc-82349d2a785e",
    "listener_acked": "3f6b051c-5ca7-4356-92cc-82349d2a785e",
    "route_sent": "f5ab51b2-25ce-4b43-8928-57c4885391db",
    "route_acked": "f5ab51b2-25ce-4b43-8928-57c4885391db",
    "endpoint_sent": "dc1aa3aa-8672-4eb6-b5fd-3e5172b18a33",
    "endpoint_acked": "dc1aa3aa-8672-4eb6-b5fd-3e5172b18a33"
  },
  {
    "cluster_id": "Kubernetes",
    "proxy": "istio-internal-ingressgateway-ff7f456dd-jsnck.istio-system",
    "proxy_type": "router",
    "istio_version": "1.19.7",
    "listener_sent": "97f4c48a-b8da-4a3c-8f3a-3161401022a5",
    "listener_acked": "97f4c48a-b8da-4a3c-8f3a-3161401022a5",
    "route_sent": "ef2a6589-3f4e-47a7-bffc-cfe2a88d3e92",
    "route_acked": "ef2a6589-3f4e-47a7-bffc-cfe2a88d3e92",
    "endpoint_sent": "b4e7e726-7a18-4191-b8bb-bf791abb8de9",
    "endpoint_acked": "b4e7e726-7a18-4191-b8bb-bf791abb8de9"
  },

Looking at the Envoy config dump from the "stuck" replica, we can see that it still has the incorrect maglev table size of 500, last updated 4/1, even though we had updated the DestinationRule to a maglev table size of 547 (a prime number, as the Envoy documentation requires):

"dynamic_warming_clusters": [
    {
     "version_info": "2024-04-03T22:17:04Z/3068",
     "cluster": {
      "@type": "type.googleapis.com/envoy.config.cluster.v3.Cluster",
      "name": "outbound|8080||svc-a.namespace-a.svc.cluster.local",
      "type": "EDS",
      "eds_cluster_config": {
       "eds_config": {
        "ads": {},
        "initial_fetch_timeout": "0s",
        "resource_api_version": "V3"
       },
       "service_name": "outbound|8080||svc-a.namespace-a.svc.cluster.local"
      },
      "connect_timeout": "10s",
      "lb_policy": "MAGLEV",
      "metadata": {
       "filter_metadata": {
        "istio": {
         "services": [
          {
           "name": "svc-a",
           "host": "svc-a.namespace-a.svc.cluster.local",
           "namespace": "namespace-a"
          }
         ],
         "config": "/apis/networking.istio.io/v1alpha3/namespaces/namespace-a/destination-rule/svc-a-transcode-routing"
        }
       }
      },
      "common_lb_config": {
       "locality_weighted_lb_config": {}
      },
      "filters": [
       {
        "name": "istio.metadata_exchange",
        "typed_config": {
         "@type": "type.googleapis.com/envoy.tcp.metadataexchange.config.MetadataExchange",
         "protocol": "istio-peer-exchange"
        }
       }
      ],
      "transport_socket_matches": [],
      "maglev_lb_config": {
       "table_size": "500"
      }
     },
     "last_updated": "2024-04-03T22:17:05.266Z"
    }
]
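To find such leftovers across many pods, a saved config dump can be scanned for warming clusters that carry a maglev table size. The following is a rough sketch only: the function name is hypothetical, and it assumes the full dump wraps sections in a top-level `configs` list, with the snake_case keys shown in the excerpt above.

```python
import math


def is_prime(n: int) -> bool:
    return n >= 2 and all(n % d for d in range(2, math.isqrt(n) + 1))


def warming_maglev_clusters(config_dump: dict) -> list:
    """List warming clusters that set maglev_lb_config.table_size."""
    findings = []
    for section in config_dump.get("configs", []):
        for entry in section.get("dynamic_warming_clusters", []):
            cluster = entry.get("cluster", {})
            maglev = cluster.get("maglev_lb_config")
            if not maglev or "table_size" not in maglev:
                continue
            # The dump above encodes the size as a JSON string ("500").
            size = int(maglev["table_size"])
            findings.append({
                "name": cluster.get("name"),
                "table_size": size,
                "prime": is_prime(size),
                "last_updated": entry.get("last_updated"),
            })
    return findings


# Trimmed copy of the excerpt above, wrapped in the assumed "configs" list.
SAMPLE_DUMP = {
    "configs": [{
        "dynamic_warming_clusters": [{
            "version_info": "2024-04-03T22:17:04Z/3068",
            "cluster": {
                "name": "outbound|8080||svc-a.namespace-a.svc.cluster.local",
                "lb_policy": "MAGLEV",
                "maglev_lb_config": {"table_size": "500"},
            },
            "last_updated": "2024-04-03T22:17:05.266Z",
        }]
    }]
}

if __name__ == "__main__":
    for finding in warming_maglev_clusters(SAMPLE_DUMP):
        print(finding)
```

A cluster that stays in the warming list with a non-prime size long after the DestinationRule was corrected would match the symptom described here.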

@PrabhdeepsGill

PrabhdeepsGill commented Apr 30, 2024

Steps to reproduce this issue.

  1. Start with all the proxies in the synced state (Istio version 1.19.7; this can also be reproduced in 1.18.x).
istioctl proxy-status  -n istio-system
NAME                                                                       CLUSTER        CDS        LDS        EDS        RDS        ECDS         ISTIOD                      VERSION
alertmanager-kube-prometheus-stack-alertmanager-0.istio-system             Kubernetes     SYNCED     SYNCED     SYNCED     SYNCED     NOT SENT     istiod-55575fbb98-7rk66     1.19.7
grafana-6b7bd6c985-cgvxl.istio-system                                      Kubernetes     SYNCED     SYNCED     SYNCED     SYNCED     NOT SENT     istiod-55575fbb98-7rk66     1.19.7
istio-ingressgateway-774c6b8695-2gz6t.istio-system                         Kubernetes     SYNCED     SYNCED     SYNCED     SYNCED     NOT SENT     istiod-55575fbb98-7rk66     1.19.7
istio-ingressgateway-774c6b8695-lk8jt.istio-system                         Kubernetes     SYNCED     SYNCED     SYNCED     SYNCED     NOT SENT     istiod-55575fbb98-7rk66     1.19.7
...

Relevant snapshot of /debug/syncz from istiod:

{
    "cluster_id": "Kubernetes",
    "proxy": "istio-ingressgateway-774c6b8695-2gz6t.istio-system",
    "proxy_type": "router",
    "istio_version": "1.19.7",
    "cluster_sent": "f3a59072-f389-4f09-876a-c1281254474c",
    "cluster_acked": "f3a59072-f389-4f09-876a-c1281254474c",
    "listener_sent": "4aeb4141-bd97-49f7-a9f3-691fff13bb25",
    "listener_acked": "4aeb4141-bd97-49f7-a9f3-691fff13bb25",
    "route_sent": "838a8e16-8181-4d77-8a64-31f0f9e7eee6",
    "route_acked": "838a8e16-8181-4d77-8a64-31f0f9e7eee6",
    "endpoint_sent": "10d2a8bf-0c70-46a5-8ecb-3f9e8b244089",
    "endpoint_acked": "10d2a8bf-0c70-46a5-8ecb-3f9e8b244089"
  },
  {
    "cluster_id": "Kubernetes",
    "proxy": "istio-ingressgateway-774c6b8695-lk8jt.istio-system",
    "proxy_type": "router",
    "istio_version": "1.19.7",
    "cluster_sent": "752ba83c-e00d-4261-ae0f-27f3b567e4d6",
    "cluster_acked": "752ba83c-e00d-4261-ae0f-27f3b567e4d6",
    "listener_sent": "d1a184af-89ca-4b05-95c5-267ae1118a5e",
    "listener_acked": "d1a184af-89ca-4b05-95c5-267ae1118a5e",
    "route_sent": "705e2514-e807-43e5-ba96-fa9345f5d750",
    "route_acked": "705e2514-e807-43e5-ba96-fa9345f5d750",
    "endpoint_sent": "28d32732-be43-44a0-8f41-f602975786cf",
    "endpoint_acked": "28d32732-be43-44a0-8f41-f602975786cf"
  },
  1. Create a DestinationRule with a Maglev load balancer, for example:
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata: 
  name: fortio-test-routing
  namespace: fortio
spec: 
  host: fortio-client.fortio.svc.cluster.local
  trafficPolicy: 
    loadBalancer: 
      consistentHash: 
        httpQueryParameterName: url
        maglev: 
          tableSize: 500
  1. CDS on all proxies becomes stale. We see the error "The table size of maglev must be prime" on all proxies.
    /debug/syncz info:
{
    "cluster_id": "Kubernetes",
    "proxy": "istio-ingressgateway-774c6b8695-2gz6t.istio-system",
    "proxy_type": "router",
    "istio_version": "1.19.7",
    "cluster_sent": "d39434ea-00eb-4016-a490-873944dfc003",
    "cluster_acked": "f3a59072-f389-4f09-876a-c1281254474c",
    "listener_sent": "b7346660-ae73-4574-865c-e9556730ccb7",
    "listener_acked": "b7346660-ae73-4574-865c-e9556730ccb7",
    "route_sent": "ee5cb966-1e65-440b-af6d-5a8163d4e6bb",
    "route_acked": "ee5cb966-1e65-440b-af6d-5a8163d4e6bb",
    "endpoint_sent": "b68fd7c2-07da-4ee1-bce6-87a1e0efc2a7",
    "endpoint_acked": "b68fd7c2-07da-4ee1-bce6-87a1e0efc2a7"
  },
  {
    "cluster_id": "Kubernetes",
    "proxy": "istio-ingressgateway-774c6b8695-lk8jt.istio-system",
    "proxy_type": "router",
    "istio_version": "1.19.7",
    "cluster_sent": "7954d038-fbb8-499c-8381-88a458efda70",
    "cluster_acked": "752ba83c-e00d-4261-ae0f-27f3b567e4d6",
    "listener_sent": "d00c1755-1edb-411f-98ab-86b85e7c871a",
    "listener_acked": "d00c1755-1edb-411f-98ab-86b85e7c871a",
    "route_sent": "7799a716-784e-46ce-b6bb-ab738ecf3d7d",
    "route_acked": "7799a716-784e-46ce-b6bb-ab738ecf3d7d",
    "endpoint_sent": "704fd12d-3941-4da3-ba78-24d5493dc025",
    "endpoint_acked": "704fd12d-3941-4da3-ba78-24d5493dc025"
  },
  1. Restart istiod. The proxies go into the state STALE (Never Acknowledged).
  {
    "cluster_id": "Kubernetes",
    "proxy": "istio-ingressgateway-774c6b8695-2gz6t.istio-system",
    "proxy_type": "router",
    "istio_version": "1.19.7",
    "cluster_sent": "e4266698-343c-41de-895c-ec390e2f9895",
    "listener_sent": "75f7814f-c4bb-4775-8bff-0affdc375d42",
    "listener_acked": "75f7814f-c4bb-4775-8bff-0affdc375d42",
    "route_sent": "62f4b423-b12c-4a98-83d3-8ad3ea2a2734",
    "route_acked": "62f4b423-b12c-4a98-83d3-8ad3ea2a2734",
    "endpoint_sent": "62ff557f-4bb9-423e-95db-1c0789d2d98c",
    "endpoint_acked": "62ff557f-4bb9-423e-95db-1c0789d2d98c"
  },
  {
    "cluster_id": "Kubernetes",
    "proxy": "istio-ingressgateway-774c6b8695-lk8jt.istio-system",
    "proxy_type": "router",
    "istio_version": "1.19.7",
    "cluster_sent": "ea8157b8-6b65-40b2-a77d-47ddce075284",
    "listener_sent": "bfaf6443-3264-4c91-8da7-d64c09f60b84",
    "listener_acked": "bfaf6443-3264-4c91-8da7-d64c09f60b84",
    "route_sent": "ab1b8361-9f84-4083-ad96-7d45402da957",
    "route_acked": "ab1b8361-9f84-4083-ad96-7d45402da957",
    "endpoint_sent": "acc4e7c0-cd0b-44e4-98b4-2f28a4f66d43",
    "endpoint_acked": "acc4e7c0-cd0b-44e4-98b4-2f28a4f66d43"
  },
  1. After some time (~30 mins), istiod stops sending CDS config to all the proxies.
{
    "cluster_id": "Kubernetes",
    "proxy": "istio-ingressgateway-774c6b8695-2gz6t.istio-system",
    "proxy_type": "router",
    "istio_version": "1.19.7",
    "listener_sent": "68816399-c1b8-490a-a655-703283417d18",
    "listener_acked": "68816399-c1b8-490a-a655-703283417d18",
    "route_sent": "d311ea57-276e-42e5-b93a-749ab8b46b4d",
    "route_acked": "d311ea57-276e-42e5-b93a-749ab8b46b4d",
    "endpoint_sent": "7b6b697f-b6bd-47cc-8207-e7186582dfb2",
    "endpoint_acked": "7b6b697f-b6bd-47cc-8207-e7186582dfb2"
  },
  {
    "cluster_id": "Kubernetes",
    "proxy": "istio-ingressgateway-774c6b8695-lk8jt.istio-system",
    "proxy_type": "router",
    "istio_version": "1.19.7",
    "listener_sent": "ed77d779-6bed-4b4d-aa9e-6cb4325150e5",
    "listener_acked": "ed77d779-6bed-4b4d-aa9e-6cb4325150e5",
    "route_sent": "20f491fd-c194-4255-bbb4-4bafa0730d23",
    "route_acked": "20f491fd-c194-4255-bbb4-4bafa0730d23",
    "endpoint_sent": "44d5669b-6e82-4701-aa73-e1ad1ec1f82e",
    "endpoint_acked": "44d5669b-6e82-4701-aa73-e1ad1ec1f82e"
  }
  1. Now, even if you fix the DestinationRule or delete it, the proxies stay stuck in this state.
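
The progression in the steps above can be read straight off the `cluster_sent` and `cluster_acked` fields. Below is a small sketch that maps one syncz entry to the status labels used in this thread. The function names are hypothetical, and the mapping is inferred from the snapshots above rather than taken from the istioctl source.

```python
from typing import Optional


def xds_status(sent: Optional[str], acked: Optional[str]) -> str:
    """Map a sent/acked nonce pair to a status label (inferred mapping)."""
    if not sent:
        return "NOT SENT"
    if sent == acked:
        return "SYNCED"
    if not acked:
        return "STALE (Never Acknowledged)"
    return "STALE"


def cds_status(entry: dict) -> str:
    """Status of CDS for one entry of the /debug/syncz output."""
    return xds_status(entry.get("cluster_sent"), entry.get("cluster_acked"))


# cluster_* fields for proxy ...-2gz6t, copied from the snapshots above.
STEPS = {
    "initial": {"cluster_sent": "f3a59072-f389-4f09-876a-c1281254474c",
                "cluster_acked": "f3a59072-f389-4f09-876a-c1281254474c"},
    "bad DestinationRule": {"cluster_sent": "d39434ea-00eb-4016-a490-873944dfc003",
                            "cluster_acked": "f3a59072-f389-4f09-876a-c1281254474c"},
    "after istiod restart": {"cluster_sent": "e4266698-343c-41de-895c-ec390e2f9895"},
    "after ~30 mins": {},
}

if __name__ == "__main__":
    for step, entry in STEPS.items():
        print(f"{step}: {cds_status(entry)}")
```

Run against the four snapshots, this gives SYNCED, STALE, STALE (Never Acknowledged), and NOT SENT in that order, which matches the sequence described in the steps.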
