Traefik healthcheck should not pass by default for newly discovered backends #4544

Open
andyspiers opened this issue Feb 28, 2019 · 17 comments · May be fixed by #10494

@andyspiers

Do you want to request a feature or report a bug?

Bug

What did you do?

Background: we deploy changes to our marathon masters and mesos agents by replacing the AWS instances with CloudFormation rolling updates. We co-locate Traefik on each of the mesos agents. Production environment is about 200 marathon apps, 800 containers over 25 EC2 instances, serving generally around 1K to 4K req/s.

During a rolling update to production, we drop almost 1% of requests. Our aim is to not drop any requests (but we might settle for 0.01%), so in a test environment we're working through various points in the lifecycle of marathon tasks and Traefik configuration to analyse the reasons for dropped requests.

With the help of the great manual - https://docs.traefik.io/user-guide/marathon/ - we have made some progress:

  • we have enabled readiness checks
  • we have patched the mesos containerizer source to implement an extra shutdown delay between entering TASK_KILLING and actually running "docker stop" (as an alternative to modifying the signal handling in 200 microservices; a rough sketch of that in-service alternative appears after this list)
  • this has given perfect results of zero dropped requests when scaling or deploying a marathon app.
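
For reference, here is a rough sketch (not our actual code, and the 5s/30s delays are only illustrative) of the kind of per-service signal handling the containerizer patch lets us avoid rolling out everywhere: on SIGTERM the service keeps serving briefly so load balancers can take it out of rotation, then drains in-flight requests before exiting.

package main

import (
	"context"
	"log"
	"net/http"
	"os"
	"os/signal"
	"syscall"
	"time"
)

func main() {
	srv := &http.Server{Addr: ":8080", Handler: http.DefaultServeMux}

	go func() {
		if err := srv.ListenAndServe(); err != nil && err != http.ErrServerClosed {
			log.Fatalf("listen: %v", err)
		}
	}()

	// Wait for the SIGTERM that "docker stop" sends once the task enters TASK_KILLING.
	stop := make(chan os.Signal, 1)
	signal.Notify(stop, syscall.SIGTERM, os.Interrupt)
	<-stop

	// Keep serving briefly so the load balancer can observe failing checks and
	// stop routing to us, then drain in-flight requests before exiting.
	time.Sleep(5 * time.Second)
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()
	if err := srv.Shutdown(ctx); err != nil {
		log.Printf("shutdown: %v", err)
	}
}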

However, when doing a rolling update of the mesos agents the task lifecycle is different.

Tasks are evacuated from each instance and recreated by marathon on other agents without a deployment occurring ... which means that there are no readiness checks to prevent new tasks from going live in Traefik before they are ready. As a result, we receive a lot of 502 errors when Traefik tries to send requests to the newly added backends.

To try and deal with this cause of dropped requests and 502 errors we have enabled Traefik healthchecks on our test service. This gives a significant improvement but we're still dropping some requests.

It seems that this is because when Traefik is triggered (by a marathon event) to reconfigure itself and discovers a new backend, it initially puts it in active rotation; in other words, the initial state of the Traefik healthcheck defaults to "passing".

I think this is arguably a bug and that any newly discovered backend with a healthcheck should initially be presumed to be failing its healthcheck until such time as the healthcheck actually passes.

Output of traefik version: (What version of Traefik are you using?)

Version:      v1.7.2
Codename:     maroilles
Go version:   go1.11.1
Built:        2018-10-04_01:44:36PM
OS/Arch:      linux/amd64

What is your environment & configuration (arguments, toml, provider, platform, ...)?

Some uninteresting stuff like log headers removed

logLevel = "INFO"
defaultEntryPoints = ["http"]

[entryPoints]
    [entryPoints.http]
    address = ":80"

[traefikLog]
  format = "json"

[accessLog]
  format = "json"

# API definition
[api]
  entryPoint = "traefik"
  dashboard = true
  debug = false

  [api.statistics]
    recentErrors = 10

# Metrics definition
[metrics]
  [metrics.datadog]
    address = "xxx.yyy.50.75:8125"
    pushInterval = "10s"

[ping]
entryPoint = "traefik"

[marathon]
endpoint = "http://mesos.our.domain:8080"
watch = true
domain = "our.domain"
exposedByDefault = false
respectReadinessChecks = true

If applicable, please paste the log output in DEBUG level (--logLevel=DEBUG switch)

Test scenario

  • one test marathon application mock-webapp with 3 marathon tasks on 3 out of 5 different mesos agents
  • mock-webapp has an artificial start-up delay of 15s before it opens the socket to listen to simulate java app startup
  • mock-webapp then answers requests with an artificial delay of 100-200ms to simulate real world service latency
  • AWS ALB distributing HTTP requests over 5 Traefik instances, one on each mesos agent
  • we are firing load at the ALB -> Traefik at 75 req/s
  • we enabled debug logging on one Traefik process and here is an extract excluding the successful requests (the whole thing is a few thousand lines)
  • traefik healthchecks are enabled every 10s
<marathon event received>
12:32:53.319  Creating server server-mock-webapp-mock-webapp-1809d998-3b4d-11e9-a7e1-2ef24d6dcd47 at http://xxx.yyy.52.15:31658 with weight 1
12:32:53.319  Creating server server-mock-webapp-mock-webapp-f726fe2d-3b54-11e9-bc7d-4a6ed5e7b0d0 at http://xxx.yyy.50.75:31518 with weight 1
12:32:53.319  Creating server server-mock-webapp-mock-webapp-65eecc6b-3b4d-11e9-a7e1-2ef24d6dcd47 at http://xxx.yyy.52.119:31928 with weight 1
12:32:53.319  Creating backend backend-mock-webapp
12:32:53.319  Setting up backend health check [Hostname:  Headers: map[] Path: /?max_latency=0.1 Port: 0 Interval: 10s]
12:32:53.319  Wiring frontend frontend-mock-webapp to entryPoint http
12:32:53.320  Creating route route-host-mock-webapp Host: mock-webapp.domain
12:32:53.326  Stopping current health check goroutines of backend: backend-mock-webapp
12:32:53.326  Initial health check for backend: "backend-mock-webapp"
12:32:53.586  '502 Bad Gateway' caused by: dial tcp xxx.yyy.50.75:31518: connect: connection refused
<there are also a few more 502 errors>

12:32:54.000	Health check failed: Remove from server list. Backend: "backend-mock-webapp" URL: "http://xxx.yyy.50.75:31518" Reason: HTTP request failed: Get http://xxx.yyy.50.75:31518/%3Fmax_latency=0.1: dial tcp xxx.yyy.50.75:31518: connect: connection refused
12:33:04.000  Refreshing health check for backend: backend-mock-webapp
12:33:04.001  Health check still failing. Backend: "backend-mock-webapp" URL: "http://xxx.yyy.50.75:31518" Reason: HTTP request failed: Get http://xxx.yyy.50.75:31518/%3Fmax_latency=0.1: dial tcp xxx.yyy.50.75:31518: connect: connection refused
12:33:09.007  Received provider event type: health_status_changed_event, event: &{health_status_changed_event 2019-02-28T12:33:09.005Z /paas/mock-webapp  2019-02-28T11:36:31.271Z %!s(bool=true)}
<traefik reconfigures itself again 2 times, no changes>

12:33:14.000  Refreshing health check for backend: backend-mock-webapp
12:33:14.571  Health check up: Returning to server list. Backend: "backend-mock-webapp" URL: "http://xxx.yyy.50.75:31518"

So you can see that for almost 700ms between 12:32:53.319 and 12:32:54.000 the backend xxx.yyy.50.75:31518 is presumed to be healthy and is being sent requests, until such time as the first healthcheck request fails, and then it is marked unhealthy.

During this period, we unavoidably get 502 errors because Traefik is sending requests to a marathon task which is not yet listening on the port.

At 12:33:09 you can see the task go healthy in marathon.
At 12:33:14 this Traefik node sees the service as healthy and puts it back in rotation.

Here's a log of all the 502 errors received across all 5 Traefik + mesos-agent nodes:

StartUTC      OriginStatus  Host                   Duration
12:32:53.316  502           mesos-agent-private-4  1074858
12:32:53.364  502           mesos-agent-private-2  957254
12:32:53.396  502           mesos-agent-private-3  1081129
12:32:53.421  502           mesos-agent-private-1  588835
12:32:53.476  502           mesos-agent-private-3  1041597
12:32:53.499  502           mesos-agent-private-1  753561
12:32:53.583  502           mesos-agent-private-6  2501728
12:32:53.621  502           mesos-agent-private-2  1017768
12:32:53.633  502           mesos-agent-private-4  1569354
12:32:53.653  502           mesos-agent-private-4  669198
12:32:53.729  502           mesos-agent-private-2  935705
12:32:53.754  502           mesos-agent-private-6  2740235
12:32:53.757  502           mesos-agent-private-3  1081155
12:32:53.869  502           mesos-agent-private-3  1064540
12:32:53.933  502           mesos-agent-private-6  1709245
12:32:53.968  502           mesos-agent-private-4  712834
12:32:54.032  502           mesos-agent-private-2  1064561
12:32:54.033  502           mesos-agent-private-4  772085
12:32:54.073  502           mesos-agent-private-2  984483
12:32:54.140  502           mesos-agent-private-3  1175981
12:32:54.376  502           mesos-agent-private-3  1141847
12:32:54.461  502           mesos-agent-private-3  1017805

You can see that the total period of errors is around 1145ms - presumably each individual node received the marathon event at a slightly different time and took a different amount of time to reconfigure and schedule the first healthcheck.

Before and after this period, we receive only 200 responses.

What could be done differently?

I think that the ideal behaviour here would be that the new backend that has not yet been healthchecked is marked as unhealthy until the first successful healthcheck result is processed.

  • would you agree?
  • do you think this is feasible to fix?
  • is there any other way that we could work around this issue?

Thanks and sorry for the very long bug report!

@timoreimann
Contributor

Hi @andyspiers,

thanks for your bug report.

Marking Marathon tasks for which no health check result has been collected yet as unhealthy has been requested a few times in the past. The major problem is that certain scenarios call for the inverted logic: specifically, when a new leader is elected, all health check results are reset and need to be recomputed first. If the default health state were unhealthy in such cases, all running tasks would immediately start to generate errors until the health check states were recovered. In fact, that's the behavior Traefik exhibited some time ago, but it was changed to address the associated uptime concerns. A section in the Marathon guide also touches on the subject.

Now, I should explain that I stopped working with Marathon/Mesos at least a year ago. I'm not sure whether the situation has changed on Marathon's end since then (for instance, whether the cluster shares health check results across all masters). Also, I only have experience with Marathon-based health checks, not Mesos-based ones, which may or may not make a difference.

What I do remember is that we frequently rolling-restarted our Mesos clusters for upgrades and other reasons, and managed to maintain full service uptime (at least for those applications that implemented proper graceful termination handlers and readiness checks). I can't recall exactly how we did it anymore, but I believe we never touched agents directly and instead replaced nodes as a whole (immutable infrastructure). @gottwald, could you chime in if your memory is better than mine on this one?

Either way, I'd be very open to discussing ideas you may have on how we could address your case without regressing on the scenarios described above.

Thanks!

@andyspiers
Author

Hi Timo,

Thanks for your response and apologies for the long time in replying to you.

To clarify, I'm not talking about modifying the behaviour of the Marathon provider. We have read the Traefik docs and understand that on Marathon leader election the healthcheck status becomes unknown. We did a test to confirm if this is still the case and yes, it is.

During a marathon master re-election, healthCheckResults[] went empty (unknown)

  • for approx 10 seconds for Marathon (HTTP) checks
  • for approx 1 second for Mesos (MESOS_HTTP) checks

This means they are still unusable for Traefik, as expected.

The issue/bug report is about Traefik healthchecks not Marathon healthchecks:

Is there a way that, when a new backend server is discovered and populated into the Traefik config AND it has a Traefik healthcheck configured, the server is added but its initial Traefik healthcheck state is presumed unhealthy, so it only starts receiving requests after a successful Traefik healthcheck result?
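
Something like the following standalone sketch captures the behaviour we're after (plain Go with made-up names like "pool" and "probe", not Traefik's actual code): a newly discovered server starts out of rotation and is only put in rotation once its first health probe succeeds.

package main

import (
	"fmt"
	"net/http"
	"sync"
	"time"
)

type server struct {
	url     string
	healthy bool // starts false: presumed unhealthy until the first probe passes
}

type pool struct {
	mu      sync.Mutex
	servers []*server
}

// add registers a newly discovered backend in the "unhealthy" state.
func (p *pool) add(url string) {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.servers = append(p.servers, &server{url: url})
}

// inRotation returns only the servers whose last probe succeeded.
func (p *pool) inRotation() []string {
	p.mu.Lock()
	defer p.mu.Unlock()
	var out []string
	for _, s := range p.servers {
		if s.healthy {
			out = append(out, s.url)
		}
	}
	return out
}

// probe marks each server healthy or unhealthy based on an HTTP GET of path.
func (p *pool) probe(path string, timeout time.Duration) {
	client := &http.Client{Timeout: timeout}
	p.mu.Lock()
	defer p.mu.Unlock()
	for _, s := range p.servers {
		resp, err := client.Get(s.url + path)
		s.healthy = err == nil && resp.StatusCode < 400
		if resp != nil {
			resp.Body.Close()
		}
	}
}

func main() {
	p := &pool{}
	p.add("http://127.0.0.1:31518") // newly discovered: receives no traffic yet
	fmt.Println("in rotation before first check:", p.inRotation()) // empty
	p.probe("/?max_latency=0.1", 2*time.Second)
	// Still empty unless something is listening there and answers with < 400.
	fmt.Println("in rotation after first check:", p.inRotation())
}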

Hope this clarifies the issue

Thanks,
Andy

@andyspiers
Author

... possibly I shouldn't have mentioned Marathon :) - this may be a generalised issue for Traefik healthchecks with any type of provider

Thanks,
Andy

@a-nldisr

a-nldisr commented May 27, 2019

@andyspiers Have you tried setting the following label in Marathon:
traefik.backend.healthcheck.path

It could be that it does not work in the current latest version because of this report:
#3417

@timoreimann
Contributor

Oh darn, this one fell completely off my radar, so sorry!

@andyspiers sorry for the confusion on my end, I believe you have a good point. Happy to see if adjustments on Traefik's end are reasonable to make if the matter is still important to you.

@andyspiers
Author

@a-nldisr thanks for the input - I'll bear that in mind but that's not the issue in our particular case as a healthcheck to / is valid.

@timoreimann - thanks for getting back to me and no worries, I know how it is :)

I've also been very busy with other stuff so I haven't chased this up but it is still relevant to us, and I would still consider it a bug that a new backend (with a healthcheck defined) is populated and presumed to be healthy before the healthcheck has actually run for the first time.

If you're able to look into adjusting Traefik's behaviour in this area that would be awesome, thanks again.

Andy

@timoreimann
Contributor

timoreimann commented Jun 13, 2019

@andyspiers based on our discussion and additional ones I've had with other maintainers, it seems reasonable to allow the default health check state to be unhealthy, as you requested. Popular management platforms (like Kubernetes and Docker) also default to that state.

Additionally, there doesn't seem to be any harm in changing the default behavior (as opposed to making it configurable) except for a slight increase in delay. If that's really problematic to anybody, we can contemplate adding parameters to tune health check behavior (e.g., periodicity and initial delay) where we do not have them yet. This would be beneficial beyond the issue at hand.

I can drive the necessary change myself when I have some free bandwidth. If anyone is happy to pick up the PR effort, however, I won't resist much. :)

Thanks!

@timoreimann timoreimann changed the title healthcheck should not pass by default for newly discovered marathon backends Traefik healthcheck should not pass by default for newly discovered backends Jun 13, 2019
@andyspiers
Author

cool - thanks so much @timoreimann for digging into this :)

@tennox

tennox commented Dec 6, 2019

Is there any way to configure this now?
A flag to set assume-healthy-at-start=false as long as you don't change default behaviour?

Or do you have hints for me to submit a PR? 😅

@dduportal dduportal added this to issues in v2 via automation Dec 6, 2019
@dduportal dduportal removed this from issues in v2 Dec 6, 2019
@andyspiers
Author

AFAIK @tennox nobody has started work on this

@jasonhildebrand

I am testing a blue/green deployment scenario using the docker backend, aiming for zero dropped requests, and am hitting this same issue. The new docker container is immediately considered healthy by traefik and is sent requests before the container is fully initialized and able to process requests.

The proposed fix (allowing for the default health check state to be unhealthy) would solve this problem for me as well.

@timoreimann
Contributor

I won't have time to tackle this one in the foreseeable future myself -- whoever wants to drive this is more than welcome!

@diegows

diegows commented Jun 9, 2020

Hi @timoreimann!

I'd like to contribute here. I need to fix this for 1.7, because I need support for dynamodb and ecs; I might try to fix 2.2 later. Could you help me confirm some assumptions and share some ideas, please? :)

If I understand correctly, every time there is a dynamic configuration update, all of the configuration is replaced with the information from ConfigMessage. So Traefik does not store any information from the previous configuration; it just parses the new message and replaces the old config.

To implement the "default false" behavior, I'm thinking of adding all the backend URLs to backend.disabledURLs by default at start time; the healthcheck should then move them to the Servers list. The tricky part is storing the healthy URLs so we keep them in the Servers list on config reload. I don't see the right place to store this if we are replacing the whole config on each config update. Do you have any hints? I'm not sure what the right global place to store this info is.
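
Roughly what I have in mind, as a sketch with made-up types rather than Traefik 1.7's real structures: the set of known-healthy URLs lives outside the per-reload config (so it survives config replacement), and on every reload any URL not in that set goes into disabledURLs until a check passes.

package healthsketch

import "sync"

// knownHealthy is the state that would have to live outside the per-reload
// configuration so it survives when the whole config is replaced.
type knownHealthy struct {
	mu   sync.Mutex
	urls map[string]bool // backendName + "|" + serverURL -> last check passed
}

func (k *knownHealthy) isHealthy(backend, url string) bool {
	k.mu.Lock()
	defer k.mu.Unlock()
	return k.urls[backend+"|"+url]
}

func (k *knownHealthy) set(backend, url string, healthy bool) {
	k.mu.Lock()
	defer k.mu.Unlock()
	if k.urls == nil {
		k.urls = map[string]bool{}
	}
	k.urls[backend+"|"+url] = healthy
}

// backendConfig stands in for a backend parsed from one configuration update.
type backendConfig struct {
	name         string
	servers      []string // URLs eligible to receive traffic
	disabledURLs []string // URLs excluded until a health check passes
}

// applyDefaultUnhealthy splits freshly parsed server URLs so that only URLs
// already known to be healthy start in rotation; everything else starts in
// disabledURLs and is promoted by the health checker (via knownHealthy.set)
// once a probe succeeds.
func applyDefaultUnhealthy(cfg *backendConfig, state *knownHealthy, urls []string) {
	cfg.servers = cfg.servers[:0]
	cfg.disabledURLs = cfg.disabledURLs[:0]
	for _, u := range urls {
		if state.isHealthy(cfg.name, u) {
			cfg.servers = append(cfg.servers, u)
		} else {
			cfg.disabledURLs = append(cfg.disabledURLs, u)
		}
	}
}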

@nokjuh

nokjuh commented Sep 9, 2021

This is also a problem with Azure Service Fabric. Service Fabric publishes container endpoints immediately as the container starts, and we have not found any way to prevent it from doing this. This leads to Traefik routing traffic to the endpoints until the Traefik healthcheck runs, recognizes they are down, and removes them.

It would be great if Traefik would treat the endpoints as unhealthy until a health check has passed, or would at least provide an option in health checks to choose this behavior.

@bswinnerton

👋 Curious if anyone knows any workarounds to resolve brief moments of requests being routed to unhealthy servers before the health check runs?

@bswinnerton

Is there appetite to change the default behavior here from up to down? Looking through the load balancer code, I think it may be derived from this line:

Alternatively, this could be implemented closer to the Docker provider

bswinnerton added a commit to bswinnerton/traefik that referenced this issue Mar 3, 2024
By default health checks are up, which is problematic for new backends that may not yet be healthy. This commit updates the health check code to default to down. 

Fixes traefik#4544.
@bswinnerton bswinnerton linked a pull request Mar 3, 2024 that will close this issue
@xebug

xebug commented Mar 4, 2024

Any workaround?
