the cluster updates are not flowing to all the gateway pods #50670

bseenu · 2024-04-25T01:25:55Z

Is this the right place to submit this?

This is not a security vulnerability or a crashing bug
This is not a question about how to use Istio

Bug Description

bash-3.2$  istioctl ps|egrep -i istio-ingressgateway
istio-ingressgateway-5dd4cc58d8-8r9b7.istio-system                                        Kubernetes     NOT SENT     SYNCED     SYNCED       SYNCED       NOT SENT     istiod-5756dc65b7-sw8sc     1.19.7
istio-ingressgateway-5dd4cc58d8-bpnc9.istio-system                                        Kubernetes     NOT SENT     SYNCED     SYNCED       SYNCED       NOT SENT     istiod-5756dc65b7-pl9ns     1.19.7
istio-ingressgateway-5dd4cc58d8-m9z8m.istio-system                                        Kubernetes     SYNCED       SYNCED     SYNCED       SYNCED       NOT SENT     istiod-5756dc65b7-42jmx     1.19.7

Version

bash-3.2$ istioctl version
client version: 1.19.7
control plane version: 1.19.7
data plane version: 1.17.3 (12 proxies), 1.18.3 (28 proxies), 1.18.5 (287 proxies), 1.19.7 (577 proxies)

bash-3.2$ kubectl version
Client Version: v1.29.3
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.27.11-eks-b9c9ed7
WARNING: version difference between client (1.29) and server (1.27) exceeds the supported minor version skew of +/-1

Additional Information

bash-3.2$ kubectl exec istiod-5756dc65b7-pl9ns -n istio-system -- curl -s 'http://localhost:8080/debug/config_dump?proxyID=istio-ingressgateway-5dd4cc58d8-8r9b7.istio-system'|jq -r '.configs[]|keys'
[
  "@type"
]
[
  "@type",
  "versionInfo"
]
[
  "@type",
  "dynamicListeners",
  "versionInfo"
]
[
  "@type"
]
[
  "@type",
  "dynamicRouteConfigs"
]
[
  "@type",
  "dynamicActiveSecrets"
]
[
  "@type"
]


bash-3.2$ kubectl exec istiod-5756dc65b7-pl9ns -n istio-system -- curl -s 'http://localhost:8080/debug/config_dump?proxyID=istio-ingressgateway-5dd4cc58d8-m9z8m.istio-system'|jq -r '.configs[]|keys'
[
  "@type"
]
[
  "@type",
  "dynamicActiveClusters",
  "versionInfo"
]
[
  "@type",
  "dynamicListeners",
  "versionInfo"
]
[
  "@type"
]
[
  "@type",
  "dynamicRouteConfigs"
]
[
  "@type",
  "dynamicActiveSecrets"
]
[
  "@type"
]
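
The visible difference between the two dumps is that the second section for the 8r9b7 proxy has no dynamicActiveClusters key. The same comparison can be scripted; this is a minimal Python sketch, assuming the dump is JSON with a top-level `configs` list as the jq filter above implies. The helper names and the trimmed sample dumps are hypothetical.

```python
import json


def section_keys(dump: dict) -> list[list[str]]:
    """Return the sorted key names of every entry under "configs"."""
    return [sorted(cfg.keys()) for cfg in dump.get("configs", [])]


def has_key(dump: dict, key: str) -> bool:
    """True if any entry under "configs" carries the given key."""
    return any(key in cfg for cfg in dump.get("configs", []))


# Hypothetical, heavily trimmed dumps shaped like the jq output above.
stuck = json.loads(
    '{"configs": [{"@type": "a"}, {"@type": "b", "versionInfo": "v"}]}'
)
healthy = json.loads(
    '{"configs": [{"@type": "a"},'
    ' {"@type": "b", "dynamicActiveClusters": [], "versionInfo": "v"}]}'
)

for name, dump in (("stuck", stuck), ("healthy", healthy)):
    print(name, has_key(dump, "dynamicActiveClusters"))
```

Running it against the real dumps would only need the two `json.loads` calls swapped for the saved curl output.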

I worked around this by restarting (deleting) the gateway pods that did not have the updated cluster info. According to the proxy logs, the last CDS update on those pods happened about 20 days ago. Why does this happen, and how can it be handled?


ldemailly · 2024-04-25T02:23:03Z

Were istiod-5756dc65b7-pl9ns and istiod-5756dc65b7-sw8sc (not sent) different in any way from istiod-5756dc65b7-42jmx (synced)?
Any errors or issues in their logs?

Can you reproduce this if everyone is running 1.19.10? I see you have 1.17.3 (12 proxies) - that's more than 2 minor versions behind

bseenu · 2024-04-25T04:49:05Z

Were istiod-5756dc65b7-pl9ns and istiod-5756dc65b7-sw8sc (not sent) different in any way from istiod-5756dc65b7-42jmx (synced)? Any errors or issues in their logs?

Nothing stood out to me apart from some duplicate ServiceEntry resources, which were causing errors on all of the istiod pods:

"message": "Duplicate cluster outbound|15199||localhost.service.entry found while pushing CDS"

Can you reproduce this if everyone is running 1.19.10? I see you have 1.17.3 (12 proxies) - that's more than 2 minor versions behind

We do not have control over the proxy pods; they are brought to the new version when their deployment is updated. I am not sure I can reproduce this anyway.

howardjohn · 2024-04-25T18:05:26Z

Do you have full Envoy proxy logs for the proxies with the issue?

bseenu · 2024-04-25T21:20:08Z

Attached. [I removed the actual traffic logs and the messages about connecting to the upstream XDS server istiod.]
log.txt

howardjohn · 2024-04-25T21:28:51Z

Oh so they are rejected:

The table size of maglev must be prime

I didn't know about that constraint, and we don't validate it or document it -- but clearly we should
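
For anyone who wants to check a table size before applying a DestinationRule, here is a minimal Python sketch of a primality check. It is not Istio's or Envoy's validation code; `is_prime` and `next_prime` are hypothetical helper names, and 500 and 547 are the table sizes that come up later in this thread.

```python
def is_prime(n: int) -> bool:
    """Trial-division primality test; fine for table-size-sized integers."""
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    d = 3
    while d * d <= n:
        if n % d == 0:
            return False
        d += 2
    return True


def next_prime(n: int) -> int:
    """Smallest prime greater than or equal to n."""
    candidate = max(n, 2)
    while not is_prime(candidate):
        candidate += 1
    return candidate


for size in (500, 547):
    print(size, is_prime(size))
```

`next_prime` is only a convenience for picking a nearby valid value; whether a given size is otherwise acceptable to Envoy is outside what this sketch checks.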

bseenu · 2024-04-25T21:54:53Z

The above problem has been fixed. It came from one of the DestinationRules, whose proxy was stuck (Envoy was not coming up).

howardjohn · 2024-04-26T02:20:01Z

Do you mean you fixed the maglev error and still see the original issue of clusters not being sent? Or are both fixed?

bseenu · 2024-04-26T02:52:48Z

Yes, the maglev issue is fixed now, but the cluster updates still do not happen unless the proxies are restarted.

howardjohn · 2024-04-26T02:58:57Z

Ah, got it. Is this happening repeatedly? Or do you have a few stuck pods and haven't restarted all of them?

If repeatedly, can you send a new log now that the maglev issue is fixed, to ensure that wasn't somehow messing with things?

bseenu · 2024-04-26T03:02:41Z

I still have some broken pods that I have not restarted.

bseenu · 2024-04-29T22:01:58Z

@howardjohn Is there anything else we can check here?

howardjohn · 2024-04-29T22:42:28Z

@bseenu It's a bit hard to tell, because it could easily just be because of the maglev issue. It would help if you could reproduce it now that that error isn't present, and include the istiod logs.

PrabhdeepsGill · 2024-04-30T08:27:01Z

Adding a bit more information about this issue and steps to reproduce it:

I looked at two proxies from the same ingress, where one pod (replica jsnck) was not getting CDS updates sent to it.

Info from the /debug/syncz endpoint about the two proxies:

 {
    "cluster_id": "Kubernetes",
    "proxy": "istio-internal-ingressgateway-ff7f456dd-252qg.istio-system",
    "proxy_type": "router",
    "istio_version": "1.19.7",
    "cluster_sent": "224c32c9-6776-4997-8d72-59648893b2c2",
    "cluster_acked": "224c32c9-6776-4997-8d72-59648893b2c2",
    "listener_sent": "3f6b051c-5ca7-4356-92cc-82349d2a785e",
    "listener_acked": "3f6b051c-5ca7-4356-92cc-82349d2a785e",
    "route_sent": "f5ab51b2-25ce-4b43-8928-57c4885391db",
    "route_acked": "f5ab51b2-25ce-4b43-8928-57c4885391db",
    "endpoint_sent": "dc1aa3aa-8672-4eb6-b5fd-3e5172b18a33",
    "endpoint_acked": "dc1aa3aa-8672-4eb6-b5fd-3e5172b18a33"
  },
  {
    "cluster_id": "Kubernetes",
    "proxy": "istio-internal-ingressgateway-ff7f456dd-jsnck.istio-system",
    "proxy_type": "router",
    "istio_version": "1.19.7",
    "listener_sent": "97f4c48a-b8da-4a3c-8f3a-3161401022a5",
    "listener_acked": "97f4c48a-b8da-4a3c-8f3a-3161401022a5",
    "route_sent": "ef2a6589-3f4e-47a7-bffc-cfe2a88d3e92",
    "route_acked": "ef2a6589-3f4e-47a7-bffc-cfe2a88d3e92",
    "endpoint_sent": "b4e7e726-7a18-4191-b8bb-bf791abb8de9",
    "endpoint_acked": "b4e7e726-7a18-4191-b8bb-bf791abb8de9"
  },

Looking at the Envoy config dump from the "stuck" replica, we can see that it still has the incorrect maglev table size of 500, last updated on 2024-04-03 according to the dump, even though we had updated the DestinationRule to a maglev table size of 547 (a prime number, as required by the Envoy docs).

"dynamic_warming_clusters": [
    {
     "version_info": "2024-04-03T22:17:04Z/3068",
     "cluster": {
      "@type": "type.googleapis.com/envoy.config.cluster.v3.Cluster",
      "name": "outbound|8080||svc-a.namespace-a.svc.cluster.local",
      "type": "EDS",
      "eds_cluster_config": {
       "eds_config": {
        "ads": {},
        "initial_fetch_timeout": "0s",
        "resource_api_version": "V3"
       },
       "service_name": "outbound|8080||svc-a.namespace-a.svc.cluster.local"
      },
      "connect_timeout": "10s",
      "lb_policy": "MAGLEV",
      "metadata": {
       "filter_metadata": {
        "istio": {
         "services": [
          {
           "name": "svc-a",
           "host": "svc-a.namespace-a.svc.cluster.local",
           "namespace": "namespace-a"
          }
         ],
         "config": "/apis/networking.istio.io/v1alpha3/namespaces/namespace-a/destination-rule/svc-a-transcode-routing"
        }
       }
      },
      "common_lb_config": {
       "locality_weighted_lb_config": {}
      },
      "filters": [
       {
        "name": "istio.metadata_exchange",
        "typed_config": {
         "@type": "type.googleapis.com/envoy.tcp.metadataexchange.config.MetadataExchange",
         "protocol": "istio-peer-exchange"
        }
       }
      ],
      "transport_socket_matches": [],
      "maglev_lb_config": {
       "table_size": "500"
      }
     },
     "last_updated": "2024-04-03T22:17:05.266Z"
    }
]
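
To spot clusters like this one across a whole dump, a small script can list everything still warming together with its table size and timestamp. This is a rough Python sketch, assuming the fragment above sits inside the clusters section of the Envoy config dump and uses the snake_case field names shown; `warming_clusters` is a hypothetical helper name.

```python
def warming_clusters(clusters_section: dict) -> list[dict]:
    """Summarize clusters that are still in the warming state."""
    summary = []
    for entry in clusters_section.get("dynamic_warming_clusters", []):
        cluster = entry.get("cluster", {})
        maglev = cluster.get("maglev_lb_config", {})
        summary.append({
            "name": cluster.get("name"),
            "lb_policy": cluster.get("lb_policy"),
            # The dump above renders table_size as a string, so convert it.
            "table_size": int(maglev["table_size"]) if "table_size" in maglev else None,
            "last_updated": entry.get("last_updated"),
        })
    return summary


# Trimmed copy of the fragment shown above.
section = {
    "dynamic_warming_clusters": [{
        "version_info": "2024-04-03T22:17:04Z/3068",
        "cluster": {
            "name": "outbound|8080||svc-a.namespace-a.svc.cluster.local",
            "lb_policy": "MAGLEV",
            "maglev_lb_config": {"table_size": "500"},
        },
        "last_updated": "2024-04-03T22:17:05.266Z",
    }]
}

for row in warming_clusters(section):
    print(row)
```

A cluster that stays in this list with an old `last_updated` value is a hint that the proxy never finished applying that update.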

PrabhdeepsGill · 2024-04-30T11:45:23Z

Steps to reproduce this issue:

Start with all the proxies in the synced state (Istio version 1.19.7; also possible to reproduce on 1.18.x).

istioctl proxy-status  -n istio-system
NAME                                                                       CLUSTER        CDS        LDS        EDS        RDS        ECDS         ISTIOD                      VERSION
alertmanager-kube-prometheus-stack-alertmanager-0.istio-system             Kubernetes     SYNCED     SYNCED     SYNCED     SYNCED     NOT SENT     istiod-55575fbb98-7rk66     1.19.7
grafana-6b7bd6c985-cgvxl.istio-system                                      Kubernetes     SYNCED     SYNCED     SYNCED     SYNCED     NOT SENT     istiod-55575fbb98-7rk66     1.19.7
istio-ingressgateway-774c6b8695-2gz6t.istio-system                         Kubernetes     SYNCED     SYNCED     SYNCED     SYNCED     NOT SENT     istiod-55575fbb98-7rk66     1.19.7
istio-ingressgateway-774c6b8695-lk8jt.istio-system                         Kubernetes     SYNCED     SYNCED     SYNCED     SYNCED     NOT SENT     istiod-55575fbb98-7rk66     1.19.7
...

Relevant snapshot of /debug/syncz from istiod:

{
    "cluster_id": "Kubernetes",
    "proxy": "istio-ingressgateway-774c6b8695-2gz6t.istio-system",
    "proxy_type": "router",
    "istio_version": "1.19.7",
    "cluster_sent": "f3a59072-f389-4f09-876a-c1281254474c",
    "cluster_acked": "f3a59072-f389-4f09-876a-c1281254474c",
    "listener_sent": "4aeb4141-bd97-49f7-a9f3-691fff13bb25",
    "listener_acked": "4aeb4141-bd97-49f7-a9f3-691fff13bb25",
    "route_sent": "838a8e16-8181-4d77-8a64-31f0f9e7eee6",
    "route_acked": "838a8e16-8181-4d77-8a64-31f0f9e7eee6",
    "endpoint_sent": "10d2a8bf-0c70-46a5-8ecb-3f9e8b244089",
    "endpoint_acked": "10d2a8bf-0c70-46a5-8ecb-3f9e8b244089"
  },
  {
    "cluster_id": "Kubernetes",
    "proxy": "istio-ingressgateway-774c6b8695-lk8jt.istio-system",
    "proxy_type": "router",
    "istio_version": "1.19.7",
    "cluster_sent": "752ba83c-e00d-4261-ae0f-27f3b567e4d6",
    "cluster_acked": "752ba83c-e00d-4261-ae0f-27f3b567e4d6",
    "listener_sent": "d1a184af-89ca-4b05-95c5-267ae1118a5e",
    "listener_acked": "d1a184af-89ca-4b05-95c5-267ae1118a5e",
    "route_sent": "705e2514-e807-43e5-ba96-fa9345f5d750",
    "route_acked": "705e2514-e807-43e5-ba96-fa9345f5d750",
    "endpoint_sent": "28d32732-be43-44a0-8f41-f602975786cf",
    "endpoint_acked": "28d32732-be43-44a0-8f41-f602975786cf"
  },

Create a DestinationRule with a Maglev load balancer, for example:

apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata: 
  name: fortio-test-routing
  namespace: fortio
spec: 
  host: fortio-client.fortio.svc.cluster.local
  trafficPolicy: 
    loadBalancer: 
      consistentHash: 
        httpQueryParameterName: url
        maglev: 
          tableSize: 500

CDS on all proxies becomes stale. We see the error "The table size of maglev must be prime" on all proxies.
/debug/syncz info:

{
    "cluster_id": "Kubernetes",
    "proxy": "istio-ingressgateway-774c6b8695-2gz6t.istio-system",
    "proxy_type": "router",
    "istio_version": "1.19.7",
    "cluster_sent": "d39434ea-00eb-4016-a490-873944dfc003",
    "cluster_acked": "f3a59072-f389-4f09-876a-c1281254474c",
    "listener_sent": "b7346660-ae73-4574-865c-e9556730ccb7",
    "listener_acked": "b7346660-ae73-4574-865c-e9556730ccb7",
    "route_sent": "ee5cb966-1e65-440b-af6d-5a8163d4e6bb",
    "route_acked": "ee5cb966-1e65-440b-af6d-5a8163d4e6bb",
    "endpoint_sent": "b68fd7c2-07da-4ee1-bce6-87a1e0efc2a7",
    "endpoint_acked": "b68fd7c2-07da-4ee1-bce6-87a1e0efc2a7"
  },
  {
    "cluster_id": "Kubernetes",
    "proxy": "istio-ingressgateway-774c6b8695-lk8jt.istio-system",
    "proxy_type": "router",
    "istio_version": "1.19.7",
    "cluster_sent": "7954d038-fbb8-499c-8381-88a458efda70",
    "cluster_acked": "752ba83c-e00d-4261-ae0f-27f3b567e4d6",
    "listener_sent": "d00c1755-1edb-411f-98ab-86b85e7c871a",
    "listener_acked": "d00c1755-1edb-411f-98ab-86b85e7c871a",
    "route_sent": "7799a716-784e-46ce-b6bb-ab738ecf3d7d",
    "route_acked": "7799a716-784e-46ce-b6bb-ab738ecf3d7d",
    "endpoint_sent": "704fd12d-3941-4da3-ba78-24d5493dc025",
    "endpoint_acked": "704fd12d-3941-4da3-ba78-24d5493dc025"
  },

Restart istiod. The proxies go into the state STALE (Never Acknowledged).

  {
    "cluster_id": "Kubernetes",
    "proxy": "istio-ingressgateway-774c6b8695-2gz6t.istio-system",
    "proxy_type": "router",
    "istio_version": "1.19.7",
    "cluster_sent": "e4266698-343c-41de-895c-ec390e2f9895",
    "listener_sent": "75f7814f-c4bb-4775-8bff-0affdc375d42",
    "listener_acked": "75f7814f-c4bb-4775-8bff-0affdc375d42",
    "route_sent": "62f4b423-b12c-4a98-83d3-8ad3ea2a2734",
    "route_acked": "62f4b423-b12c-4a98-83d3-8ad3ea2a2734",
    "endpoint_sent": "62ff557f-4bb9-423e-95db-1c0789d2d98c",
    "endpoint_acked": "62ff557f-4bb9-423e-95db-1c0789d2d98c"
  },
  {
    "cluster_id": "Kubernetes",
    "proxy": "istio-ingressgateway-774c6b8695-lk8jt.istio-system",
    "proxy_type": "router",
    "istio_version": "1.19.7",
    "cluster_sent": "ea8157b8-6b65-40b2-a77d-47ddce075284",
    "listener_sent": "bfaf6443-3264-4c91-8da7-d64c09f60b84",
    "listener_acked": "bfaf6443-3264-4c91-8da7-d64c09f60b84",
    "route_sent": "ab1b8361-9f84-4083-ad96-7d45402da957",
    "route_acked": "ab1b8361-9f84-4083-ad96-7d45402da957",
    "endpoint_sent": "acc4e7c0-cd0b-44e4-98b4-2f28a4f66d43",
    "endpoint_acked": "acc4e7c0-cd0b-44e4-98b4-2f28a4f66d43"
  },

After some time (~30 minutes), istiod stops sending CDS config to all the proxies.

{
    "cluster_id": "Kubernetes",
    "proxy": "istio-ingressgateway-774c6b8695-2gz6t.istio-system",
    "proxy_type": "router",
    "istio_version": "1.19.7",
    "listener_sent": "68816399-c1b8-490a-a655-703283417d18",
    "listener_acked": "68816399-c1b8-490a-a655-703283417d18",
    "route_sent": "d311ea57-276e-42e5-b93a-749ab8b46b4d",
    "route_acked": "d311ea57-276e-42e5-b93a-749ab8b46b4d",
    "endpoint_sent": "7b6b697f-b6bd-47cc-8207-e7186582dfb2",
    "endpoint_acked": "7b6b697f-b6bd-47cc-8207-e7186582dfb2"
  },
  {
    "cluster_id": "Kubernetes",
    "proxy": "istio-ingressgateway-774c6b8695-lk8jt.istio-system",
    "proxy_type": "router",
    "istio_version": "1.19.7",
    "listener_sent": "ed77d779-6bed-4b4d-aa9e-6cb4325150e5",
    "listener_acked": "ed77d779-6bed-4b4d-aa9e-6cb4325150e5",
    "route_sent": "20f491fd-c194-4255-bbb4-4bafa0730d23",
    "route_acked": "20f491fd-c194-4255-bbb4-4bafa0730d23",
    "endpoint_sent": "44d5669b-6e82-4701-aa73-e1ad1ec1f82e",
    "endpoint_acked": "44d5669b-6e82-4701-aa73-e1ad1ec1f82e"
  }

Now, even if you fix the DestinationRule or delete it, the proxies remain stuck in this state.
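
The states seen in the steps above can be told apart from the sent and acked nonces alone. Here is a rough Python sketch, assuming the field names shown in the /debug/syncz output in this thread; `xds_status` and `proxy_status` are hypothetical helper names, and the mapping is inferred from the outputs above rather than taken from istioctl's source.

```python
XDS_TYPES = ("cluster", "listener", "route", "endpoint")


def xds_status(sent, acked) -> str:
    """Classify one xDS type from its sent/acked nonces."""
    if not sent:
        return "NOT SENT"
    if sent == acked:
        return "SYNCED"
    if not acked:
        return "STALE (Never Acknowledged)"
    return "STALE"


def proxy_status(entry: dict) -> dict:
    """Map each xDS type to its status for one syncz entry."""
    return {
        t: xds_status(entry.get(f"{t}_sent"), entry.get(f"{t}_acked"))
        for t in XDS_TYPES
    }


# Cluster fields only, trimmed from the snapshots above.
stale = {
    "cluster_sent": "d39434ea-00eb-4016-a490-873944dfc003",
    "cluster_acked": "f3a59072-f389-4f09-876a-c1281254474c",
}
never_acked = {"cluster_sent": "e4266698-343c-41de-895c-ec390e2f9895"}
not_sent = {}

for name, entry in (("stale", stale), ("never_acked", never_acked), ("not_sent", not_sent)):
    print(name, proxy_status(entry)["cluster"])
```

Feeding it the full syncz array would make it easy to list only the proxies whose cluster status is not SYNCED while the other types are.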

istio-policy-bot added the area/networking label Apr 25, 2024

howardjohn mentioned this issue Apr 29, 2024

validation: block invalid maglev table sizes #50750

Merged
