Traefik healthcheck should not pass by default for newly discovered backends #4544

Open
andyspiers opened this issue Feb 28, 2019 · 17 comments · May be fixed by #10494

@andyspiers

Do you want to request a feature or report a bug?

Bug

What did you do?

Background: we deploy changes to our marathon masters and mesos agents by replacing the AWS instances with CloudFormation rolling updates. We co-locate Traefik on each of the mesos agents. Production environment is about 200 marathon apps, 800 containers over 25 EC2 instances, serving generally around 1K to 4K req/s.

During a rolling update to production, we drop almost 1% of requests. Our aim is to not drop any requests (but we might settle for 0.01%), so in a test environment we're working through various points in the lifecycle of marathon tasks and Traefik configuration to analyse the reasons for dropped requests.

With the help of the great manual - https://docs.traefik.io/user-guide/marathon/ - we have made some progress:

  • we have enabled readiness checks
  • we have patched the mesos containerizer source to implement an extra shutdown delay between entering TASK_KILLING and actually running "docker stop" (as an alternative to modifying the signal handling in 200 microservices; a rough sketch of that in-service alternative appears after this list)
  • this has given perfect results of zero dropped requests when scaling or deploying a marathon app.
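
For reference, here is a rough sketch (not our actual code, and the 5s/30s delays are only illustrative) of the kind of per-service signal handling the containerizer patch lets us avoid rolling out everywhere: on SIGTERM the service keeps serving briefly so load balancers can take it out of rotation, then drains in-flight requests before exiting.

package main

import (
	"context"
	"log"
	"net/http"
	"os"
	"os/signal"
	"syscall"
	"time"
)

func main() {
	srv := &http.Server{Addr: ":8080", Handler: http.DefaultServeMux}

	go func() {
		if err := srv.ListenAndServe(); err != nil && err != http.ErrServerClosed {
			log.Fatalf("listen: %v", err)
		}
	}()

	// Wait for the SIGTERM that "docker stop" sends once the task enters TASK_KILLING.
	stop := make(chan os.Signal, 1)
	signal.Notify(stop, syscall.SIGTERM, os.Interrupt)
	<-stop

	// Keep serving briefly so the load balancer can observe failing checks and
	// stop routing to us, then drain in-flight requests before exiting.
	time.Sleep(5 * time.Second)
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()
	if err := srv.Shutdown(ctx); err != nil {
		log.Printf("shutdown: %v", err)
	}
}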

However, when doing a rolling update of the mesos agents the task lifecycle is different.

Tasks are evacuated from each instance and recreated by marathon on other agents without a deployment occurring ... which means that there are no readiness checks to prevent new tasks from going live in Traefik before they are ready. As a result, we receive a lot of 502 errors when Traefik tries to send requests to the newly added backends.

To try and deal with this cause of dropped requests and 502 errors we have enabled Traefik healthchecks on our test service. This gives a significant improvement but we're still dropping some requests.

It seems that this is because when Traefik is triggered (by a marathon event) to reconfigure itself and discovers a new backend, it initially puts it in active rotation; in other words, the initial state of the Traefik healthcheck defaults to "passing".

I think this is arguably a bug and that any newly discovered backend with a healthcheck should initially be presumed to be failing its healthcheck until such time as the healthcheck actually passes.

Output of traefik version: (What version of Traefik are you using?)

Version:      v1.7.2
Codename:     maroilles
Go version:   go1.11.1
Built:        2018-10-04_01:44:36PM
OS/Arch:      linux/amd64

What is your environment & configuration (arguments, toml, provider, platform, ...)?

Some uninteresting stuff like log headers removed

logLevel = "INFO"
defaultEntryPoints = ["http"]

[entryPoints]
    [entryPoints.http]
    address = ":80"

[traefikLog]
  format = "json"

[accessLog]
  format = "json"

# API definition
[api]
  entryPoint = "traefik"
  dashboard = true
  debug = false

  [api.statistics]
    recentErrors = 10

# Metrics definition
[metrics]
  [metrics.datadog]
    address = "xxx.yyy.50.75:8125"
    pushInterval = "10s"

[ping]
entryPoint = "traefik"

[marathon]
endpoint = "http://mesos.our.domain:8080"
watch = true
domain = "our.domain"
exposedByDefault = false
respectReadinessChecks = true

If applicable, please paste the log output in DEBUG level (--logLevel=DEBUG switch)

Test scenario

  • one test marathon application mock-webapp with 3 marathon tasks on 3 out of 5 different mesos agents
  • mock-webapp has an artificial start-up delay of 15s before it opens the socket to listen to simulate java app startup
  • mock-webapp then answers requests with an artificial delay of 100-200ms to simulate real world service latency
  • AWS ALB distributing HTTP requests over 5 Traefik instances, one on each mesos agent
  • we are firing load at the ALB -> Traefik at 75 req/s
  • we enabled debug logging on one Traefik process and here is an extract excluding the successful requests (the whole thing is a few thousand lines)
  • traefik healthchecks are enabled every 10s
<marathon event received>
12:32:53.319  Creating server server-mock-webapp-mock-webapp-1809d998-3b4d-11e9-a7e1-2ef24d6dcd47 at http://xxx.yyy.52.15:31658 with weight 1
12:32:53.319  Creating server server-mock-webapp-mock-webapp-f726fe2d-3b54-11e9-bc7d-4a6ed5e7b0d0 at http://xxx.yyy.50.75:31518 with weight 1
12:32:53.319  Creating server server-mock-webapp-mock-webapp-65eecc6b-3b4d-11e9-a7e1-2ef24d6dcd47 at http://xxx.yyy.52.119:31928 with weight 1
12:32:53.319  Creating backend backend-mock-webapp
12:32:53.319  Setting up backend health check [Hostname:  Headers: map[] Path: /?max_latency=0.1 Port: 0 Interval: 10s]
12:32:53.319  Wiring frontend frontend-mock-webapp to entryPoint http
12:32:53.320  Creating route route-host-mock-webapp Host: mock-webapp.domain
12:32:53.326  Stopping current health check goroutines of backend: backend-mock-webapp
12:32:53.326  Initial health check for backend: "backend-mock-webapp"
12:32:53.586  '502 Bad Gateway' caused by: dial tcp xxx.yyy.50.75:31518: connect: connection refused
<there are also a few more 502 errors>

12:32:54.000	Health check failed: Remove from server list. Backend: "backend-mock-webapp" URL: "http://xxx.yyy.50.75:31518" Reason: HTTP request failed: Get http://xxx.yyy.50.75:31518/%3Fmax_latency=0.1: dial tcp xxx.yyy.50.75:31518: connect: connection refused
12:33:04.000  Refreshing health check for backend: backend-mock-webapp
12:33:04.001  Health check still failing. Backend: "backend-mock-webapp" URL: "http://xxx.yyy.50.75:31518" Reason: HTTP request failed: Get http://xxx.yyy.50.75:31518/%3Fmax_latency=0.1: dial tcp xxx.yyy.50.75:31518: connect: connection refused
12:33:09.007  Received provider event type: health_status_changed_event, event: &{health_status_changed_event 2019-02-28T12:33:09.005Z /paas/mock-webapp  2019-02-28T11:36:31.271Z %!s(bool=true)}
<traefik reconfigures itself again 2 times, no changes>

12:33:14.000  Refreshing health check for backend: backend-mock-webapp
12:33:14.571  Health check up: Returning to server list. Backend: "backend-mock-webapp" URL: "http://xxx.yyy.50.75:31518"

So you can see that for almost 700ms between 12:32:53.319 and 12:32:54.000 the backend xxx.yyy.50.75:31518 is presumed to be healthy and is being sent requests, until such time as the first healthcheck request fails, and then it is marked unhealthy.

During this period, we unavoidably get 502 errors because Traefik is sending requests to a marathon task which is not yet listening on the port.

At 12:33:09 you can see the task go healthy in marathon.
At 12:33:14 this Traefik node sees the service as healthy and puts it back in rotation.

Here's a log of all the 502 errors received across all 5 Traefik + mesos-agent nodes:

StartUTC      OriginStatus  Host                   Duration
12:32:53.316  502           mesos-agent-private-4  1074858
12:32:53.364  502           mesos-agent-private-2  957254
12:32:53.396  502           mesos-agent-private-3  1081129
12:32:53.421  502           mesos-agent-private-1  588835
12:32:53.476  502           mesos-agent-private-3  1041597
12:32:53.499  502           mesos-agent-private-1  753561
12:32:53.583  502           mesos-agent-private-6  2501728
12:32:53.621  502           mesos-agent-private-2  1017768
12:32:53.633  502           mesos-agent-private-4  1569354
12:32:53.653  502           mesos-agent-private-4  669198
12:32:53.729  502           mesos-agent-private-2  935705
12:32:53.754  502           mesos-agent-private-6  2740235
12:32:53.757  502           mesos-agent-private-3  1081155
12:32:53.869  502           mesos-agent-private-3  1064540
12:32:53.933  502           mesos-agent-private-6  1709245
12:32:53.968  502           mesos-agent-private-4  712834
12:32:54.032  502           mesos-agent-private-2  1064561
12:32:54.033  502           mesos-agent-private-4  772085
12:32:54.073  502           mesos-agent-private-2  984483
12:32:54.140  502           mesos-agent-private-3  1175981
12:32:54.376  502           mesos-agent-private-3  1141847
12:32:54.461  502           mesos-agent-private-3  1017805

You can see that the total period of errors is around 1145ms - presumably each individual node received the marathon event at a slightly different time and took a different amount of time to reconfigure and schedule the first healthcheck.

Before and after this period, we receive only 200 responses.

What could be done differently?

I think that the ideal behaviour here would be that the new backend that has not yet been healthchecked is marked as unhealthy until the first successful healthcheck result is processed.

  • would you agree?
  • do you think this is feasible to fix?
  • is there any other way that we could work around this issue?

Thanks and sorry for the very long bug report!

@timoreimann
Contributor

Hi @andyspiers,

thanks for your bug report.

Marking Marathon tasks for which no health check result has been collected yet as unhealthy has been requested a few times in the past. The major problem is that certain scenarios call for the inverted logic: specifically, when a new leader is elected, all health check results are reset and need to be recomputed first. If the default health state were unhealthy in such cases, all running tasks would immediately start to generate errors until the health check states were recovered. In fact, that's the behavior Traefik exhibited some time ago, but it was changed to address the associated uptime concerns. A section in the Marathon guide also touches on the subject.

Now, I should explain that I stopped working with Marathon/Mesos at least a year ago. I'm not sure whether the situation has changed on Marathon's end since then (for instance, whether the cluster shares health check results across all masters). Also, I only have experience with Marathon-based health checks, not Mesos-based ones, which may or may not make a difference.

What I do remember is that we frequently rolling-restarted our Mesos clusters for upgrades and other reasons, and managed to maintain full service uptime (at least for those applications that implemented proper graceful termination handlers and readiness checks). I can't recall exactly how we did it anymore, but I believe we never touched agents directly and instead replaced nodes as a whole (immutable infrastructure). @gottwald, could you chime in if your memory is better than mine on this one?

Either way, I'd be very open to discussing ideas you may have on how we could address your case without regressing on the scenarios described above.

Thanks!

@andyspiers
Author

Hi Timo,

Thanks for your response and apologies for the long time in replying to you.

To clarify, I'm not talking about modifying the behaviour of the Marathon provider. We have read the Traefik docs and understand that on Marathon leader election the healthcheck status becomes unknown. We did a test to confirm if this is still the case and yes, it is.

During a marathon master re-election, healthCheckResults[] went empty (unknown)

  • for approx 10 seconds for Marathon (HTTP) checks
  • for approx 1 second for Mesos (MESOS_HTTP) checks

This means they are still unusable for Traefik, as expected.

The issue/bug report is about Traefik healthchecks not Marathon healthchecks:

Is there a way that, when a new backend server is discovered and populated into the Traefik config AND it has a Traefik healthcheck configured, the server is added but its initial Traefik healthcheck state is presumed unhealthy, so it only starts receiving requests after a successful Traefik healthcheck result?
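
Something like the following standalone sketch captures the behaviour we're after (plain Go with made-up names like "pool" and "probe", not Traefik's actual code): a newly discovered server starts out of rotation and is only put in rotation once its first health probe succeeds.

package main

import (
	"fmt"
	"net/http"
	"sync"
	"time"
)

type server struct {
	url     string
	healthy bool // starts false: presumed unhealthy until the first probe passes
}

type pool struct {
	mu      sync.Mutex
	servers []*server
}

// add registers a newly discovered backend in the "unhealthy" state.
func (p *pool) add(url string) {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.servers = append(p.servers, &server{url: url})
}

// inRotation returns only the servers whose last probe succeeded.
func (p *pool) inRotation() []string {
	p.mu.Lock()
	defer p.mu.Unlock()
	var out []string
	for _, s := range p.servers {
		if s.healthy {
			out = append(out, s.url)
		}
	}
	return out
}

// probe marks each server healthy or unhealthy based on an HTTP GET of path.
func (p *pool) probe(path string, timeout time.Duration) {
	client := &http.Client{Timeout: timeout}
	p.mu.Lock()
	defer p.mu.Unlock()
	for _, s := range p.servers {
		resp, err := client.Get(s.url + path)
		s.healthy = err == nil && resp.StatusCode < 400
		if resp != nil {
			resp.Body.Close()
		}
	}
}

func main() {
	p := &pool{}
	p.add("http://127.0.0.1:31518") // newly discovered: receives no traffic yet
	fmt.Println("in rotation before first check:", p.inRotation()) // empty
	p.probe("/?max_latency=0.1", 2*time.Second)
	// Still empty unless something is listening there and answers with < 400.
	fmt.Println("in rotation after first check:", p.inRotation())
}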

Hope this clarifies the issue

Thanks,
Andy

@andyspiers
Author

... possibly I shouldn't have mentioned Marathon :) - this may be a generalised issue for Traefik healthchecks with any type of provider

Thanks,
Andy

@a-nldisr

a-nldisr commented May 27, 2019

@andyspiers Have you tried setting the following label in Marathon:
traefik.backend.healthcheck.path

It could be that it does not work in the current latest version because of this report:
#3417

@timoreimann
Contributor

Oh darn, this one fell completely off my radar, so sorry!

@andyspiers sorry for the confusion on my end, I believe you have a good point. Happy to see if adjustments on Traefik's end are reasonable to make if the matter is still important to you.

@andyspiers
Author

@a-nldisr thanks for the input - I'll bear that in mind but that's not the issue in our particular case as a healthcheck to / is valid.

@timoreimann - thanks for getting back to me and no worries, I know how it is :)

I've also been very busy with other stuff so I haven't chased this up but it is still relevant to us, and I would still consider it a bug that a new backend (with a healthcheck defined) is populated and presumed to be healthy before the healthcheck has actually run for the first time.

If you're able to look into adjusting Traefik's behaviour in this area that would be awesome, thanks again.

Andy

@timoreimann
Contributor

timoreimann commented Jun 13, 2019

@andyspiers based on our discussion and additional ones I've had with other maintainers, it seems reasonable to allow the default health check state to be unhealthy, as you requested. Popular management platforms (like Kubernetes and Docker) also default to that state.

Additionally, there doesn't seem to be any harm in changing the default behavior (as opposed to making it configurable) except for a slight increase in delay. If that's really problematic to anybody, we can contemplate adding parameters to tune health check behavior (e.g., periodicity and initial delay) where we do not have them yet. This would be beneficial beyond the issue at hand.

I can drive the necessary change myself when I have some free bandwidth. If anyone is happy to pick up the PR effort, however, I won't resist much. :)

Thanks!

@timoreimann timoreimann changed the title healthcheck should not pass by default for newly discovered marathon backends Traefik healthcheck should not pass by default for newly discovered backends Jun 13, 2019
@andyspiers
Author

cool - thanks so much @timoreimann for digging into this :)

@tennox

tennox commented Dec 6, 2019

Is there any way to configure this now?
A flag to set assume-healthy-at-start=false as long as you don't change default behaviour?

Or do you have hints for me to submit a PR? 😅

@dduportal dduportal added this to issues in v2 via automation Dec 6, 2019
@dduportal dduportal removed this from issues in v2 Dec 6, 2019
@andyspiers
Author

AFAIK @tennox nobody has started work on this

@jasonhildebrand

I am testing a blue/green deployment scenario using the docker backend, aiming for zero dropped requests, and am hitting this same issue. The new docker container is immediately considered healthy by traefik and is sent requests before the container is fully initialized and able to process requests.

The proposed fix (allowing for the default health check state to be unhealthy) would solve this problem for me as well.

@timoreimann
Contributor

I won't have time to tackle this one in the foreseeable future myself -- whoever wants to drive this is more than welcome!

@diegows

diegows commented Jun 9, 2020

Hi @timoreimann!

I'd like to contribute here. I need to fix this for 1.7, because I need support for dynamodb and ecs; I might try to fix 2.2 later. Could you help me confirm some assumptions and share some ideas, please? :)

If I understand correctly, every time there is a dynamic configuration update, all of the configuration is replaced with the information from ConfigMessage. So Traefik does not store any information from the previous configuration; it just parses the new message and replaces the old config.

To implement the "default false" behavior, I'm thinking of adding all the backend URLs to backend.disabledURLs by default at start time; the healthcheck should then move them to the Servers list. The tricky part is storing the healthy URLs so we keep them in the Servers list on config reload. I don't see the right place to store this if we are replacing the whole config on each config update. Do you have any hints? I'm not sure what the right global place to store this info is.
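
Roughly what I have in mind, as a sketch with made-up types rather than Traefik 1.7's real structures: the set of known-healthy URLs lives outside the per-reload config (so it survives config replacement), and on every reload any URL not in that set goes into disabledURLs until a check passes.

package healthsketch

import "sync"

// knownHealthy is the state that would have to live outside the per-reload
// configuration so it survives when the whole config is replaced.
type knownHealthy struct {
	mu   sync.Mutex
	urls map[string]bool // backendName + "|" + serverURL -> last check passed
}

func (k *knownHealthy) isHealthy(backend, url string) bool {
	k.mu.Lock()
	defer k.mu.Unlock()
	return k.urls[backend+"|"+url]
}

func (k *knownHealthy) set(backend, url string, healthy bool) {
	k.mu.Lock()
	defer k.mu.Unlock()
	if k.urls == nil {
		k.urls = map[string]bool{}
	}
	k.urls[backend+"|"+url] = healthy
}

// backendConfig stands in for a backend parsed from one configuration update.
type backendConfig struct {
	name         string
	servers      []string // URLs eligible to receive traffic
	disabledURLs []string // URLs excluded until a health check passes
}

// applyDefaultUnhealthy splits freshly parsed server URLs so that only URLs
// already known to be healthy start in rotation; everything else starts in
// disabledURLs and is promoted by the health checker (via knownHealthy.set)
// once a probe succeeds.
func applyDefaultUnhealthy(cfg *backendConfig, state *knownHealthy, urls []string) {
	cfg.servers = cfg.servers[:0]
	cfg.disabledURLs = cfg.disabledURLs[:0]
	for _, u := range urls {
		if state.isHealthy(cfg.name, u) {
			cfg.servers = append(cfg.servers, u)
		} else {
			cfg.disabledURLs = append(cfg.disabledURLs, u)
		}
	}
}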

@nokjuh

nokjuh commented Sep 9, 2021

This is also a problem with Azure Service Fabric. Service Fabric publishes container endpoints immediately as the container starts, and we have not found any way to prevent it from doing this. This leads to Traefik routing traffic to the endpoints until the Traefik healthcheck runs, recognizes they are down, and removes them.

It would be great if Traefik would treat the endpoints as unhealthy until a health check has passed, or would at least provide an option in health checks to choose this behavior.

@bswinnerton

👋 Curious if anyone knows any workarounds to resolve brief moments of requests being routed to unhealthy servers before the health check runs?

@bswinnerton

Is there appetite to change the default behavior here from up to down? Looking through the load balancer code, I think it may be derived from this line:

Alternatively, this could be implemented closer to the Docker provider

bswinnerton added a commit to bswinnerton/traefik that referenced this issue Mar 3, 2024
By default health checks are up, which is problematic for new backends that may not yet be healthy. This commit updates the health check code to default to down. 

Fixes traefik#4544.
@bswinnerton bswinnerton linked a pull request Mar 3, 2024 that will close this issue
@xebug

xebug commented Mar 4, 2024

Any workaround?
