
Round Robin Load Balancing not working as expected #2151

Open
brendanalexdr opened this issue May 31, 2023 · 15 comments
Labels
Type: Bug Something isn't working

@brendanalexdr

brendanalexdr commented May 31, 2023

Background

I am attempting to deploy 2 replicas of a simple MyTestWebApp in a Docker Swarm environment using the RoundRobin config. The purpose is to gain experience with the RoundRobin config before deploying to production. (Each deployment of MyTestWebApp generates a unique app ID, and when a request hits the controller it is logged to the console.)

Expected Behavior

For each request to the endpoint, my YARP implementation will hit one instantiation of MyTestWebApp, then the second instantiation, then back to the first, and so on, in a round-robin fashion.

The (Possible) Bug

For each request to the endpoint, my YARP implementation hits only one instantiation of MyTestWebApp; no requests hit the second instantiation. If I pause making requests for a period of time (maybe 5 minutes or so), the second instantiation may start being hit, but then the first no longer is.

My Config

"ReverseProxy": {
"Routes": {
  "route1": {
    "ClusterId": "mytestwebapp",
    "Match": {
      "Path": "{**catch-all}",
      "Hosts": [ "mytestwebapp.dev" ]
    }
  }

},
"Clusters": {
  "mytestwebapp": {
    "LoadBalancingPolicy": "RoundRobin",
    "Destinations": {
      "destination1": {
        "Address": "http://mytestwebapp:5023/"
      }
    }
  }
}

Here is my docker compose file:

version: '3.8'

services:
  tempwebapp:
    image: localstore/tempwebapp:1.3
    environment:
      - ASPNETCORE_URLS=http://*:5023
      - ASPNETCORE_ENVIRONMENT=Production
    ports:
      - 5023:5023
    deploy:
      replicas: 2
      restart_policy:
        condition: on-failure
      placement:
        constraints: [node.role == manager] 
    networks:
      - localnet
  yarpreverseproxydev:
    image: localstore/yarpreverseproxydev:1.0
    ports:
      - 80:80
      - 443:443
    environment:
      - ASPNETCORE_ENVIRONMENT=Production
    deploy:
      replicas: 1
      restart_policy:
        condition: on-failure
      placement:
        constraints: [node.role == manager]  
    networks:
      - localnet
networks:
  localnet:
    driver: overlay
    attachable: true
    name: localnet  

Console Logs from each instantiation

From tempwebapp in Container 1:
From tempwebapp in Container 1

From tempwebapp in Container 2:
From tempwebapp in Container 2

@brendanalexdr brendanalexdr added the Type: Bug Something isn't working label May 31, 2023
@Tratcher
Member

RoundRobin operates on Destinations, and you've only supplied one. It sounds like another component is doing DNS or TCP load balancing underneath?

    "LoadBalancingPolicy": "RoundRobin",
    "Destinations": {
      "destination1": {
        "Address": "http://mytestwebapp:5023/"
      }

@brendanalexdr
Author

brendanalexdr commented May 31, 2023

RoundRobin operates on Destinations, and you've only supplied one. It sounds like another component is doing DNS or TCP load balancing underneath?

    "LoadBalancingPolicy": "RoundRobin",
    "Destinations": {
      "destination1": {
        "Address": "http://mytestwebapp:5023/"
      }

OK, this is precisely why I was testing. But in a typical clustered environment, across many nodes and with changing replica counts, how do you configure destinations? So YARP can't do load balancing in a dynamic clustered environment?

FYI, DNS is being handled by Windows 11 on my dev box; there's no underlying load balancing going on under the hood. I was thinking YARP would handle this.

@samsp-msft
Contributor

OK, this is precisely why I was testing. But in a typical clustered environment, across many nodes and with changing replica counts, how do you configure destinations? So YARP can't do load balancing in a dynamic clustered environment?

You need a mechanism to resolve the destinations by talking to whatever is doing the dynamic clustering - such as Kubernetes, for which there is a YARP ingress controller. One of the reasons we have the extensibility in YARP is to enable customers to write configuration management that pulls the data from their backend systems.
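
As a rough sketch of that extensibility (not an official sample; the addresses and names below are placeholders), destinations can also be supplied programmatically via the in-memory config provider and refreshed whenever your orchestrator reports a change:

    using Yarp.ReverseProxy.Configuration;

    var builder = WebApplication.CreateBuilder(args);

    // Destinations pulled from your orchestrator (Swarm/k8s API, service registry, etc.).
    // The replica addresses below are placeholders.
    var routes = new[]
    {
        new RouteConfig
        {
            RouteId = "route1",
            ClusterId = "mytestwebapp",
            Match = new RouteMatch { Path = "{**catch-all}" }
        }
    };

    var clusters = new[]
    {
        new ClusterConfig
        {
            ClusterId = "mytestwebapp",
            LoadBalancingPolicy = "RoundRobin",
            Destinations = new Dictionary<string, DestinationConfig>
            {
                ["replica1"] = new DestinationConfig { Address = "http://10.0.0.5:5023/" },
                ["replica2"] = new DestinationConfig { Address = "http://10.0.0.6:5023/" }
            }
        }
    };

    builder.Services.AddReverseProxy().LoadFromMemory(routes, clusters);

    var app = builder.Build();
    app.MapReverseProxy();
    app.Run();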

@brendanalexdr
Author

OK, I got it. So, basically, if I understand correctly, in the case of my Docker Swarm test environment, I would need to use something like HAProxy to mediate the round-robin with the microservices.

@samsp-msft
Contributor

samsp-msft commented Jun 6, 2023

Docker Swarm is similar to Kubernetes in that it manages where the service instances live and how to route to them. You can either use its built-in routing or configure it to export that data via DNS.

The part that is missing from YARP is a DNS provider that will resolve a DNS name to its addresses and regularly poll DNS to check those addresses. YARP's config is a little confusing in that you can specify a destination via a hostname, but we expect that to resolve to a single host.

We need a DNS provider similar to HAProxy's, where you can configure the DNS server and the names to be resolved. YARP would then actively poll DNS to update the host list. AFAIK there is no notification system for DNS, so you need to poll, which means the list will always be a little out of date, depending on how often instances are created and destroyed.

@samsp-msft
Contributor

Keep this open in case #2154 doesn't resolve all the issues

@samsp-msft samsp-msft added this to the Backlog milestone Jun 20, 2023
@MayTakeUpTo8Hours

Similar issue here.

Background

  • Services running on a kubernetes cluster (AKS)
  • k8s deployment of app has 2 replicas (= 2 pods)
  • k8s service already hides the 2 instances (pods) behind one interface
  • k8s service already does balancing per round-robin

Sketch of the environment:
(diagram "Zeichnung1" not reproduced here)

Expected Behavior:

  • YARP forwards traffic to k8s service
  • k8s service takes care of the load-balancing (does round-robin)

Actual Behavior:

  • Requests only hit one pod
  • After some time of inactivity (something like 5-10 mins) it switches over and hits only the other pod
  • No load balancing happens: no matter the load, only one instance receives the requests

When I skip YARP by adding an nginx ingress that goes directly to the service, it works just as expected!
Due to architectural reasons, this sadly does not suffice as a workaround in my case.

The (Possible) Bug

Maybe some sort of keep-alive added by YARP makes the k8s service forward the requests to the same pod all the time?
Sadly, I'm currently not able to properly capture traffic between YARP and the k8s service.

@Tratcher
Member

The k8s Service load balancing is TCP connection based, not HTTP request based, right?

YARP will reuse connections as much as possible, so you'll only get new connections when there is high concurrency. Once there are multiple connections, I assume it still prefers the first one when it's available. This can't really be fixed without moving the load balancing to YARP. The other way is to disable connection re-use but that would cause a number of issues.
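
If someone does want to experiment with that trade-off, connection pooling can be shortened (rather than fully disabled) through a custom forwarder HTTP client factory. This is only a sketch, and the lifetimes below are arbitrary:

    using Yarp.ReverseProxy.Forwarder;

    // Shortens how long outbound connections stay in the pool so the k8s
    // Service gets a chance to spread new connections across pods.
    public class ShortLivedConnectionFactory : ForwarderHttpClientFactory
    {
        protected override void ConfigureHandler(ForwarderHttpClientContext context, SocketsHttpHandler handler)
        {
            base.ConfigureHandler(context, handler);
            handler.PooledConnectionLifetime = TimeSpan.FromSeconds(30);
            handler.PooledConnectionIdleTimeout = TimeSpan.FromSeconds(10);
        }
    }

    // Registration:
    // builder.Services.AddSingleton<IForwarderHttpClientFactory, ShortLivedConnectionFactory>();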

@bford1988

Hey @Tratcher, we're running into this exact issue using YARP as our API Gateway with destinations pointed to k8s services.

We're about to test disabling the connection re-use in YARP, and I was hoping you could expand on what types of issues we may encounter. Thanks for your attention to this issue.

@Tratcher
Member

Disabling connection reuse will cause higher latency, resource usage, and potentially port exhaustion when under heavy load.

@bford1988

Thanks a lot for the response @Tratcher, much appreciated. We will try to test with the connection re-use disabled, but as you previously said, this does not seem like a viable option for a prod environment under heavy load.

If we're unable to access the k8s pods directly from YARP to make use of YARP's load balancing, it appears we may be out of options to resolve this k8s service load balancing issue.

Do you know if there are ongoing plans/efforts to release the Yarp.Kubernetes.Controller project, or has it been abandoned? https://github.com/microsoft/reverse-proxy/blob/main/docs/docfx/articles/kubernetes-ingress.md

Thanks again!

@Tratcher
Member

That's a question for @MihaZupan.

@MihaZupan
Member

Have you tried using the new destination resolvers feature?

services.AddReverseProxy()
+   .AddDnsDestinationResolver() // You may have to lower the frequency - default is 5 min

This would expand the list of destinations YARP sees from the hostnames (service names) to all the addresses returned by DNS. If that returned multiple available pods, YARP's round-robin load balancing should rotate between those.
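
Spelled out a bit more, the registration could look roughly like this (the RefreshPeriod option name reflects my understanding of the service-discovery API; double-check it against the package version you're using):

    builder.Services.AddReverseProxy()
        .LoadFromConfig(builder.Configuration.GetSection("ReverseProxy"))
        .AddDnsDestinationResolver(options =>
        {
            // Assumed option: how often the resolver re-queries DNS (default is ~5 minutes).
            options.RefreshPeriod = TimeSpan.FromSeconds(30);
        });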

@bford1988

bford1988 commented Feb 27, 2024

@MihaZupan Thanks! That's getting us very close. I've added the DestinationResolver, and we also had to add a k8s headless service instead of using our "normal" service as a destination. Now the pod IPs are discoverable and getting set as destinations (as seen from logs added to the DnsDestinationResolver).

It looks like the last hurdle is that the requests are being routed to "PodIpAddress:443" instead of "PodIpAddress:5001". I'm working on resolving this if you have any advice, and then I think we'll have a complete solution. Thank you for the help!

Update: We'll most likely move forward with simply updating the port to 5001 for the pod IP discovered from the k8s headless service hosts. More testing needed, but so far this solution is working.
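
In case it helps others, my understanding is that the resolver keeps the scheme and port of the configured address and only expands the host, so putting the port directly on the headless-service destination should work (the service name below is made up):

    "Destinations": {
      "destination1": {
        "Address": "http://my-headless-service:5001/"
      }
    }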

@bford1988

@MihaZupan @Tratcher Thanks for the help. Using the DnsDestinationResolver and k8s headless services as our destinations, we're successfully able to discover and load balance traffic to our k8s pods.

However, when the gateway is under load and a pod is restarting, we receive many 502/504 errors. In testing, we sent 2 requests per 100ms and received roughly 20-30 502/504 errors during a rolling pod restart. The system will be under a much heavier load in production.

We've tried configuring the health checks using a combination of passive and active checks, and we also tried a "FirstFail" health check policy (as outlined in the documentation). No matter how tight we make these policies, it seems it won't be possible to handle pods cycling as well as our original k8s service without YARP (which only receives one or two 504 errors during a pod cycle).
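
For reference, the shape of config we've been tuning looks roughly like this (paths, intervals, and thresholds are illustrative, not the exact values we used):

    "HealthCheck": {
      "Active": {
        "Enabled": "true",
        "Interval": "00:00:05",
        "Timeout": "00:00:02",
        "Policy": "ConsecutiveFailures",
        "Path": "/healthz"
      },
      "Passive": {
        "Enabled": "true",
        "Policy": "TransportFailureRate",
        "ReactivationPeriod": "00:00:30"
      }
    },
    "Metadata": {
      "ConsecutiveFailuresHealthPolicy.Threshold": "2"
    }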

Do you have any recommendations on how to resolve or improve the problem we're seeing? Also, are there any plans to continue working on or release the Yarp.Kubernetes.Controller package? Thanks for the help, much appreciated.
