
Change in behavior of ServiceEntries merge #50478

Closed
3 of 17 tasks
ymesika opened this issue Apr 16, 2024 · 22 comments

Comments

@ymesika
Member

ymesika commented Apr 16, 2024

Is this the right place to submit this?

  • This is not a security vulnerability or a crashing bug
  • This is not a question about how to use Istio

Bug Description

The clusters created for merged ServiceEntries have changed in recent Istio releases, and the behavior is still broken on current master.

Consider the following two simple ServiceEntries with the same host and port name but different port numbers:

apiVersion: networking.istio.io/v1beta1
kind: ServiceEntry
metadata:
  name: se1
  namespace: test
spec:
  exportTo:
  - .
  hosts:
  - news.google.com
  ports:
  - name: port1
    number: 80
    protocol: HTTP
  resolution: DNS
---
apiVersion: networking.istio.io/v1beta1
kind: ServiceEntry
metadata:
  name: se2
  namespace: test
spec:
  exportTo:
  - .
  hosts:
  - news.google.com
  ports:
  - name: port1
    number: 8080
    protocol: HTTP
  resolution: DNS

The behavior is different in the various releases:

  • Up to 1.19.8, one cluster is created per port, without load balancing between the ports:
outbound|80||news.google.com::172.217.22.110:80
outbound|8080||news.google.com::172.217.22.110:8080
  • In 1.20.4 and 1.21.1 there is a cluster per port, but each one load-balances across both ports:
outbound|80||news.google.com::172.217.22.110:80
outbound|80||news.google.com::172.217.22.110:8080
outbound|8080||news.google.com::172.217.22.110:80
outbound|8080||news.google.com::172.217.22.110:8080
  • In 1.20.0-1.20.3 and on master there is a single cluster, for only one of the ports, which load-balances across both:
outbound|80||news.google.com::172.217.22.110:80
outbound|80||news.google.com::172.217.22.110:8080

I believe the behavior of the older releases (one cluster per port, without cross-port load balancing) is the correct one, unless there was a deliberate decision to change it.

Version

Versions are detailed in the description

Additional Information

No response

Affected product area

  • Ambient
  • Docs
  • Dual Stack
  • Installation
  • Networking
  • Performance and Scalability
  • Extensions and Telemetry
  • Security
  • Test and Release
  • User Experience
  • Developer Infrastructure
  • Upgrade
  • Multi Cluster
  • Virtual Machine
  • Control Plane Revisions
@ramaraochavali
Contributor

@hzxuzhonghu I remember you changed something related to ServiceEntry merging recently?

@j2gg0s
Contributor

j2gg0s commented Apr 16, 2024

Maybe it's a side effect of this PR, https://github.com/istio/istio/pull/49573/files#diff-a8e332e06b003dfbf58c29f37de893fbb0e45f642200ea3ea2faa9847fc7e1e0L1426.

I worked around this by setting different ports[].name values.

@hzxuzhonghu
Member

I will take a look

@hzxuzhonghu
Member

@j2gg0s is right; we can verify this from debug/endpointsz.

@ymesika
Member Author

ymesika commented Apr 18, 2024

Right, with different port names there are no issues (except in 1.20.0-1.20.3, which was already resolved by #49489).

@j2gg0s
Contributor

j2gg0s commented Apr 18, 2024

My understanding of this: the behavior changes are primarily caused by changes within initServiceRegistry.

1.20.4 vs 1.20.3

#49573 modified this part of the code within initServiceRegistry.

                shards, ok := env.EndpointIndex.ShardsForService(string(s.Hostname), s.Attributes.Namespace)
                if ok {
-                       ps.ServiceIndex.instancesByPort[svcKey] = shards.CopyEndpoints(portMap)
+                       instancesByPort := shards.CopyEndpoints(portMap)
+                       for port, instances := range instancesByPort {
+                               ps.ServiceIndex.instancesByPort[svcKey][port] = instances
+                       }
                }

The precondition for the problem is: two ServiceEntries within the same namespace have the same host and port name, but different port numbers.

So in 1.20.3 we can only see one cluster.
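
As a rough, hypothetical Go sketch (not the actual Istio code; the map layout and the key "test/news.google.com" are made up for illustration) of why the old whole-map assignment keeps only one port while the per-port copy from #49573 keeps both:

// Both merged ServiceEntries resolve to the same service key, so the
// 1.20.0-1.20.3 style whole-map assignment keeps only the last entry's port,
// while the 1.20.4+ per-port copy keeps both ports (each still containing the
// cross-port endpoints).
package main

import "fmt"

func main() {
	fromSE1 := map[int][]string{80: {"172.217.22.110:80", "172.217.22.110:8080"}}
	fromSE2 := map[int][]string{8080: {"172.217.22.110:80", "172.217.22.110:8080"}}

	// 1.20.0-1.20.3 style: the second service overwrites the whole map,
	// so only one port survives -> a single cluster.
	instancesByPort := map[string]map[int][]string{}
	instancesByPort["test/news.google.com"] = fromSE1
	instancesByPort["test/news.google.com"] = fromSE2
	fmt.Println(instancesByPort)

	// 1.20.4+ style: copy per port, so both 80 and 8080 survive -> two clusters,
	// each still load-balancing across both endpoint ports.
	instancesByPort = map[string]map[int][]string{"test/news.google.com": {}}
	for _, m := range []map[int][]string{fromSE1, fromSE2} {
		for port, eps := range m {
			instancesByPort["test/news.google.com"][port] = eps
		}
	}
	fmt.Println(instancesByPort)
}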

1.20.0 vs 1.19.8

#46329 Remove service.InstancesByPort

-               svcKey := s.Key()
-               // Precache instances
+               portMap := map[string]int{}
                for _, port := range s.Ports {
-                       if _, ok := ps.ServiceIndex.instancesByPort[svcKey]; !ok {
-                               ps.ServiceIndex.instancesByPort[svcKey] = make(map[int][]*ServiceInstance)
-                       }
-                       instances := make([]*ServiceInstance, 0)
-                       instances = append(instances, env.InstancesByPort(s, port.Port)...)
-                       ps.ServiceIndex.instancesByPort[svcKey][port.Port] = instances
+                       portMap[port.Name] = port.Port
                }

+               svcKey := s.Key()
+               if _, ok := ps.ServiceIndex.instancesByPort[svcKey]; !ok {
+                       ps.ServiceIndex.instancesByPort[svcKey] = make(map[int][]*IstioEndpoint)
+               }
+               shards, ok := env.EndpointIndex.ShardsForService(string(s.Hostname), s.Attributes.Namespace)
+               if ok {
+                       ps.ServiceIndex.instancesByPort[svcKey] = shards.CopyEndpoints(portMap)
+               }

In 1.19.8, InstancesByPort filters by port. So after that filtering logic was removed, we see two endpoints.

// InstancesByPort retrieves instances for a service on the given ports with labels that
// match any of the supplied labels. All instances match an empty tag list.
func (s *Controller) InstancesByPort(svc *model.Service, port int) []*model.ServiceInstance {
	out := make([]*model.ServiceInstance, 0)
	s.mutex.RLock()
	instanceLists := s.serviceInstances.getByKey(instancesKey{svc.Hostname, svc.Attributes.Namespace})
	s.mutex.RUnlock()
	for _, instance := range instanceLists {
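		// portMatchSingle (not shown here) filters instances by the requested service port.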
		if portMatchSingle(instance, port) {
			out = append(out, instance)
		}
	}

	return out
}
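
To make the mechanism concrete, here is a rough, hypothetical Go sketch (not the actual Istio code; endpoint and copyEndpoints are stand-ins for the shard data and the CopyEndpoints call above), showing how matching endpoints only by service port name mixes the two ServiceEntries' ports:

package main

import "fmt"

// endpoint is a stand-in for the shard endpoints keyed by hostname.
type endpoint struct {
	ServicePortName string
	Address         string
	EndpointPort    int
}

// copyEndpoints mimics selecting shard endpoints by service port name and
// keying them under the service port number from portMap (name -> number).
func copyEndpoints(shard []endpoint, portMap map[string]int) map[int][]endpoint {
	out := map[int][]endpoint{}
	for name, number := range portMap {
		for _, ep := range shard {
			if ep.ServicePortName == name {
				out[number] = append(out[number], ep)
			}
		}
	}
	return out
}

func main() {
	// Endpoints for news.google.com contributed by se1 (:80) and se2 (:8080),
	// both carrying the same port name "port1".
	shard := []endpoint{
		{ServicePortName: "port1", Address: "172.217.22.110", EndpointPort: 80},
		{ServicePortName: "port1", Address: "172.217.22.110", EndpointPort: 8080},
	}
	// se1's service exposes port1 -> 80; both endpoints match by name, so the
	// outbound|80 cluster ends up load-balancing across ports 80 and 8080.
	fmt.Println(copyEndpoints(shard, map[string]int{"port1": 80}))
}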

I have not reproduced the issue, so I cannot guarantee that my understanding is correct.

Considering that a ServiceEntry's port.number and port.targetPort may differ, it is difficult to distinguish abnormal situations within initServiceRegistry.

@hzxuzhonghu
Member

We rely on the service port name in many places. In a Kubernetes Service it must be unique within the service, but it is tricky in Istio because we support merging ServiceEntries.

It is not possible to prevent creating two ServiceEntries with the same port name, since they can be created concurrently. But what can be done is to have Istio warn when the same port name is reused, as in this case.
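
A minimal, hypothetical sketch of such a warning (not Istio's actual validation; warnDuplicatePortNames and the port struct are made up for illustration):

package main

import "log"

// port mirrors a ServiceEntry port for this example: a name and a number.
type port struct {
	Name   string
	Number int
}

// warnDuplicatePortNames logs a warning for every port name that appears with
// more than one number among the merged ports of a host.
func warnDuplicatePortNames(host string, mergedPorts []port) {
	byName := map[string]int{}
	for _, p := range mergedPorts {
		if prev, ok := byName[p.Name]; ok && prev != p.Number {
			log.Printf("ServiceEntries for host %q reuse port name %q with different numbers (%d and %d); endpoints may be load-balanced across ports",
				host, p.Name, prev, p.Number)
			continue
		}
		byName[p.Name] = p.Number
	}
}

func main() {
	// Ports contributed by se1 and se2 from the bug report.
	warnDuplicatePortNames("news.google.com", []port{
		{Name: "port1", Number: 80},
		{Name: "port1", Number: 8080},
	})
}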

@zirain
Member

zirain commented Apr 26, 2024

Found an issue: the following configuration works in 1.18 but fails in 1.20+ (not sure about 1.19).

apiVersion: v1
kind: Service
metadata:
  name: httpbin-ext
spec:
  externalName: httpbin.default.svc.cluster.local
  ports:
    - name: http
      port: 8080
      protocol: TCP
      targetPort: 8000
  type: ExternalName
---
apiVersion: networking.istio.io/v1beta1
kind: ServiceEntry
metadata:
  name: httpbin-ext
spec:
  hosts:
    - httpbin.default.svc.cluster.local
  location: MESH_EXTERNAL
  ports:
    - name: http
      number: 8000
      protocol: HTTP
  resolution: DNS

In 1.20+, pilot sends the following EDS, which the proxy rejects with this error:

2024-04-25T12:21:22.487063Z	warning	envoy config external/envoy/source/extensions/config_subscription/grpc/grpc_subscription_impl.cc:138	gRPC config for type.googleapis.com/envoy.config.endpoint.v3.ClusterLoadAssignment rejected: malformed IP address: httpbin.default.svc.cluster.local. Consider setting resolver_name or setting cluster type to 'STRICT_DNS' or 'LOGICAL_DNS'	thread=17
{
    "clusterName": "outbound|8000||httpbin.default.svc.cluster.local",
    "endpoints": [
        {
            "locality": {},
            "lbEndpoints": [
                {
                    "endpoint": {
                        "address": {
                            "socketAddress": {
                                "address": "httpbin.default.svc.cluster.local",
                                "portValue": 8000
                            }
                        }
                    },
                    "metadata": {
                        "filterMetadata": {
                            "istio": {
                                "workload": ";;;;"
                            }
                        }
                    },
                    "loadBalancingWeight": 1
                },
                {
                    "endpoint": {
                        "address": {
                            "socketAddress": {
                                "address": "10.244.0.86",
                                "portValue": 8080
                            }
                        }
                    },
                    "healthStatus": "HEALTHY",
                    "metadata": {
                        "filterMetadata": {
                            "envoy.transport_socket_match": {
                                "tlsMode": "istio"
                            },
                            "istio": {
                                "workload": "httpbin;default;httpbin;v1;Kubernetes"
                            }
                        }
                    },
                    "loadBalancingWeight": 1
                }
            ],
            "loadBalancingWeight": 2
        }
    ]
}

@zirain zirain added this to Release Blocker in Prioritization Apr 26, 2024
@hzxuzhonghu
Member

Let me investigate this regression

@hzxuzhonghu
Member

@howardjohn I kind of remember you changed how the target port of ExternalName services is used, which may be related.

@howardjohn
Member

Definitely not the cause of the original issue; maybe of @zirain's issue.

@zirain
Member

zirain commented Apr 26, 2024

Before (1.18):

"httpbin.default.svc.cluster.local": {
        "default": {
            "Shards": {
                "Kubernetes/Kubernetes": [
                    {
                        "Labels": {
                            "app": "httpbin",
                            "kubernetes.io/hostname": "envoy-gateway-control-plane",
                            "pod-template-hash": "86b8ffc5ff",
                            "security.istio.io/tlsMode": "istio",
                            "service.istio.io/canonical-name": "httpbin",
                            "service.istio.io/canonical-revision": "v1",
                            "topology.istio.io/cluster": "Kubernetes",
                            "topology.istio.io/network": "",
                            "version": "v1"
                        },
                        "Address": "10.244.0.121",
                        "ServicePortName": "http",
                        "ServiceAccount": "spiffe://cluster.local/ns/default/sa/httpbin",
                        "Network": "",
                        "Locality": {
                            "Label": "",
                            "ClusterID": "Kubernetes"
                        },
                        "EndpointPort": 8080,
                        "LbWeight": 0,
                        "TLSMode": "istio",
                        "Namespace": "default",
                        "WorkloadName": "httpbin",
                        "HostName": "",
                        "SubDomain": "",
                        "HealthStatus": 1,
                        "NodeName": "envoy-gateway-control-plane"
                    }
                ]
            },
            "ServiceAccounts": {
                "spiffe://cluster.local/ns/default/sa/httpbin": {}
            }
        }
    }

After (1.21):

"httpbin.default.svc.cluster.local": {
        "default": {
            "Shards": {
                "External/Kubernetes": [
                    {
                        "Labels": null,
                        "Address": "httpbin.default.svc.cluster.local",
                        "ServicePortName": "http",
                        "ServiceAccount": "",
                        "Network": "",
                        "Locality": {
                            "Label": "",
                            "ClusterID": ""
                        },
                        "EndpointPort": 8000,
                        "LbWeight": 0,
                        "TLSMode": "disabled",
                        "Namespace": "",
                        "WorkloadName": "",
                        "HostName": "",
                        "SubDomain": "",
                        "HealthStatus": 0,
                        "NodeName": ""
                    }
                ],
                "Kubernetes/Kubernetes": [
                    {
                        "Labels": {
                            "app": "httpbin",
                            "kubernetes.io/hostname": "envoy-gateway-control-plane",
                            "pod-template-hash": "86b8ffc5ff",
                            "security.istio.io/tlsMode": "istio",
                            "service.istio.io/canonical-name": "httpbin",
                            "service.istio.io/canonical-revision": "v1",
                            "topology.istio.io/cluster": "Kubernetes",
                            "topology.istio.io/network": "",
                            "version": "v1"
                        },
                        "Address": "10.244.0.117",
                        "ServicePortName": "http",
                        "ServiceAccount": "spiffe://cluster.local/ns/default/sa/httpbin",
                        "Network": "",
                        "Locality": {
                            "Label": "",
                            "ClusterID": "Kubernetes"
                        },
                        "EndpointPort": 8080,
                        "LbWeight": 0,
                        "TLSMode": "istio",
                        "Namespace": "default",
                        "WorkloadName": "httpbin",
                        "HostName": "",
                        "SubDomain": "",
                        "HealthStatus": 1,
                        "NodeName": "envoy-gateway-control-plane"
                    }
                ]
            },
            "ServiceAccounts": {
                "spiffe://cluster.local/ns/default/sa/httpbin": {}
            }
        }
    }

@ymesika
Member Author

ymesika commented Apr 26, 2024

@zirain doesn't it deserve its own separate issue?

@zirain
Member

zirain commented Apr 26, 2024

@ymesika separated into #50688

@hzxuzhonghu
Member

hzxuzhonghu commented Apr 26, 2024

There may be some bug with the delta cluster builder.

If I create the ServiceEntries after the sidecar has started, it only sees one cluster:

root@kurator-linux-0001:~/go/src/istio.io/istio# istioctl pc cluster sleep-7656cf8794-srf59   |grep google
news.google.com                                         8080      -          outbound      STRICT_DNS    

But if I create the ServiceEntries before the sidecar starts, it gets two clusters:

root@kurator-linux-0001:~/go/src/istio.io/istio# k get pod
NAME                       READY   STATUS    RESTARTS   AGE
httpbin-86b8ffc5ff-b5cx9   2/2     Running   0          91m
sleep-7656cf8794-srf59     2/2     Running   0          35s

@howardjohn
Member

There may be some bug with the delta cluster builder.

If I create the ServiceEntries after the sidecar has started, it only sees one cluster:

root@kurator-linux-0001:~/go/src/istio.io/istio# istioctl pc cluster sleep-7656cf8794-srf59   |grep google
news.google.com                                         8080      -          outbound      STRICT_DNS    

But if I create the ServiceEntries before the sidecar starts, it gets two clusters:

root@kurator-linux-0001:~/go/src/istio.io/istio# k get pod
NAME                       READY   STATUS    RESTARTS   AGE
httpbin-86b8ffc5ff-b5cx9   2/2     Running   0          91m
sleep-7656cf8794-srf59     2/2     Running   0          35s

👍 I see the same

howardjohn added a commit to howardjohn/istio that referenced this issue Apr 26, 2024
@howardjohn
Member

There may be some bug with the delta cluster builder.

Good find, this is a regression in 1.22. Fixed it up in #50712

@zirain
Member

zirain commented Apr 29, 2024

@howardjohn @hzxuzhonghu is this fixed by #50691?

@howardjohn
Member

No, by #50711. I am finishing up tests, will be done in an hour

@zirain
Member

zirain commented May 9, 2024

@howardjohn I think this is fixed?

@howardjohn
Member

Yep, should be good.

Prioritization automation moved this from Release Blocker to Done May 9, 2024