Traffic initiated from the gateway(routers/switches) may got lead to wrong nodes in layer2 mode #359

Open
lwabish opened this issue Sep 21, 2023 · 5 comments

Comments

@lwabish
lwabish commented Sep 21, 2023

Describe the bug

After creating an LB service and running the service's reconcile function, the ARP table on the switch is correct and pinging the LB IP works fine.

But after about 20 minutes without use, the ARP entry in the switch's ARP table ages out.

If we then ping the LB IP again, the openelb-manager pod sends an ARP reply carrying the correct MAC address, but the switch's new ARP entry points at the port of the node where the openelb-manager pod is running, rather than the port of the node annotated in the service.

For example, the ARP entry looks like the following. The MAC address xxxx-1b3d-832c is the correct one, but switch port BAGG21 is connected to a different node, the one running openelb-manager, and is not the switch port that the MAC address belongs to.

172.31.11.2 xxxx-1b3d-832c 2002 BAGG21 362 D
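
A minimal sketch of how such a row could be checked automatically. The column meanings (IP, MAC, VLAN, interface, aging, type) are an assumption about the switch's output format, and the MAC-to-port inventory and the port name BAGG7 are hypothetical, introduced only for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// ARPEntry holds the fields of one ARP table row. The column order
// (IP, MAC, VLAN, interface, aging, type) is an assumption about the
// switch's output, not something stated in this issue.
type ARPEntry struct {
	IP, MAC, VLAN, Port, Aging, Type string
}

// parseARPLine splits a whitespace-separated ARP table row into fields.
func parseARPLine(line string) (ARPEntry, error) {
	f := strings.Fields(line)
	if len(f) != 6 {
		return ARPEntry{}, fmt.Errorf("expected 6 fields, got %d", len(f))
	}
	return ARPEntry{IP: f[0], MAC: f[1], VLAN: f[2], Port: f[3], Aging: f[4], Type: f[5]}, nil
}

// portMismatch reports whether the entry points at a port other than the
// one the MAC address is known to be cabled to. expectedPort is a
// hypothetical, operator-maintained MAC-to-port inventory.
func portMismatch(e ARPEntry, expectedPort map[string]string) bool {
	want, ok := expectedPort[e.MAC]
	return ok && want != e.Port
}

func main() {
	e, err := parseARPLine("172.31.11.2 xxxx-1b3d-832c 2002 BAGG21 362 D")
	if err != nil {
		panic(err)
	}
	// BAGG7 is a made-up port name standing in for the annotated node's port.
	inventory := map[string]string{"xxxx-1b3d-832c": "BAGG7"}
	fmt.Println(e.Port, portMismatch(e, inventory))
}
```

With the made-up inventory above, the entry is flagged as a mismatch because the MAC is expected on BAGG7 but the row says BAGG21.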

To Reproduce
After some digging into this problem, and learning more about the ARP protocol, I finally understand why this problem occurs.

To reproduce this bug:

  1. Set up and configure OpenELB in layer2 mode according to the docs, and make sure LB services can obtain an LB IP.
  2. Pick an LB service and edit its annotation layer2.openelb.kubesphere.io/v1alpha1 so that it names a node other than the one where the openelb-manager leader is running. For example, if the openelb-manager leader is running on node master1, make sure the value of the annotation layer2.openelb.kubesphere.io/v1alpha1 is not master1.
  3. Log in to the router/switch, check the ARP table, and wait until the dynamic ARP entry ages out and is deleted. For example, on H3C 6900 series switches, the dynamic ARP entry learned from OpenELB's gratuitous ARP expires after 20 minutes if it is not used.
  4. Once the dynamic ARP entry has expired, initiate some traffic from the router/switch to the LB service, for example ping 172.31.11.2.
  5. The request from step 4 should fail. Inspecting the ARP table on the switch should show that the port and the MAC address are mismatched, as described above.

Expected behaviour

Output

Version Info

  • Version of Kubernetes: 1.21
  • Version of OpenELB: 0.4
@renyunkang
Member

it uses the node where openelb pod is running as the port to switch packet to, rather than the node annotated in the service.

Which annotation are you using?

@lwabish
Author

lwabish commented Sep 22, 2023

it uses the node where openelb pod is running as the port to switch packet to, rather than the node annotated in the service.

Which annotation are you using?

As the code shows, openelb-manager selects one of the ready nodes and annotates the service with the annotation OpenELBLayer2Annotation. That value is reused on later reconciles unless the user changes it.

preNode, ok := svc.Annotations[constant.OpenELBLayer2Annotation]

In my case, I never changed the annotation manually. The node was selected by openelb-manager, and the annotation was added by openelb-manager too.
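
The reuse behaviour can be sketched roughly as follows. This is not the actual openelb-manager code: pickNode and the fallback to the first ready node are assumptions made for illustration, and the annotation key is taken from the reproduction steps.

```go
package main

import "fmt"

// layer2Annotation stands in for constant.OpenELBLayer2Annotation. The key
// is the one named in the reproduction steps; treating it as the value of
// that constant is an assumption.
const layer2Annotation = "layer2.openelb.kubesphere.io/v1alpha1"

// pickNode (hypothetical helper) returns the node that should answer ARP
// for the service. It keeps the previously annotated node while that node
// is still ready and otherwise falls back to the first ready node. The
// fallback rule is an assumption of this sketch.
func pickNode(annotations map[string]string, readyNodes []string) (string, bool) {
	if preNode, ok := annotations[layer2Annotation]; ok {
		for _, n := range readyNodes {
			if n == preNode {
				return preNode, true
			}
		}
	}
	if len(readyNodes) == 0 {
		return "", false
	}
	return readyNodes[0], true
}

func main() {
	ann := map[string]string{}
	node, _ := pickNode(ann, []string{"worker1", "master1"})
	ann[layer2Annotation] = node // the manager writes the annotation itself

	// A later reconcile reuses the annotated node even if the order of
	// ready nodes has changed.
	again, _ := pickNode(ann, []string{"master1", "worker1"})
	fmt.Println(node, again)
}
```

The point of the sketch is that the annotated node and the node running the openelb-manager leader are chosen independently, so they can easily differ without any manual edit.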

@lwabish
Author

lwabish commented Sep 22, 2023

After talking with some colleagues who work on switches, it seems reasonable that the switch records the port on which the ARP reply arrived as the target port in its ARP table, even if the MAC address inside the ARP reply does not belong to that port.

But why does layer2 mode work fine with many other switches? This is confusing.
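
A toy model of the behaviour described above, assuming the switch stores the MAC from the ARP payload together with the port the reply arrived on. The type names, port names, and MAC strings are all made up for illustration. Real switches differ in how they resolve the egress port, which this sketch does not try to capture.

```go
package main

import "fmt"

// arpReply models only the parts of an ARP reply that matter here.
type arpReply struct {
	senderIP    string // IP being announced (the LB IP)
	senderMAC   string // MAC carried inside the ARP payload
	ingressPort string // switch port the frame physically arrived on
}

type arpEntry struct{ mac, port string }

// gateway is a toy switch that stores, per IP, the MAC from the ARP
// payload together with the port the reply arrived on (assumed behaviour).
type gateway struct{ table map[string]arpEntry }

func (g *gateway) learn(r arpReply) {
	g.table[r.senderIP] = arpEntry{mac: r.senderMAC, port: r.ingressPort}
}

func main() {
	// Hypothetical cabling: which port each node's MAC is attached to.
	cabling := map[string]string{"mac-worker1": "port-A", "mac-master1": "port-B"}

	g := &gateway{table: map[string]arpEntry{}}
	// The leader on master1 answers for an LB IP annotated to worker1,
	// so the reply enters on port-B but announces worker1's MAC.
	g.learn(arpReply{senderIP: "172.31.11.2", senderMAC: "mac-worker1", ingressPort: "port-B"})

	e := g.table["172.31.11.2"]
	// The last value is false: the learned port is not where the MAC lives.
	fmt.Println(e.mac, e.port, cabling[e.mac] == e.port)
}
```

In this model, traffic from the gateway to 172.31.11.2 is addressed to worker1's MAC but sent out of the port leading to master1, which matches the mismatch seen in the ARP table.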

lwabish changed the title from "Traffic got lead to wrong nodes with h3c switch in layer2 mode" to "Traffic initiated from the gateway(routers/switches) may got lead to wrong nodes in layer2 mode" on Sep 25, 2023
@lwabish
Author

lwabish commented Sep 25, 2023

Given how openelb-manager, ARP, and switches work, the key to this problem is accessing LB services from the gateway after the entry learned from the gratuitous ARP has expired.

I have worked out and updated the procedure to reproduce this problem.

@lwabish
Author

lwabish commented Sep 25, 2023

Also, my colleague has proposed a possible solution to this problem: #360
