[Fargate] [request]: Service Connect Health Checks #2334

Open
muzfuz opened this issue Apr 22, 2024 · 3 comments
Labels
ECS (Amazon Elastic Container Service), Proposed (Community submitted issue)

Comments

muzfuz commented Apr 22, 2024

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Tell us about your request
Service Connect does not support application health checks. This means it attempts to route traffic to containers before they're ready.

We would like Service Connect to have configurable health checks similar to ALBs, or to respect the Docker healthchecks which are configured in the task definition.

Which service(s) is this request for?
Fargate - specifically Service Connect options.

Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard?
We run several "big" services which have a long startup time (10 to 60 seconds). These services communicate privately using Service Connect.

We noticed that we were getting served 503s during deploys or container restarts.

After some back and forth with AWS Support we were able to establish the following sequence of events:

  1. Deployment starts
  2. All containers in the task start booting
  3. Service Connect sidecar is marked "HEALTHY" in the task
  4. Clients start receiving 503 responses
  5. The main app container finishes booting and is also marked "HEALTHY"
  6. 503s stop

I received the following guidance on this from AWS Support:

Service Connect registers the task to CloudMap during the ACTIVATING stage of the task lifecycle [1], and from my testing traffic is sent to the new task as soon as the task enters RUNNING status. ... Based on my testing, it appears that Service Connect does not wait for the container to enter into "HEALTHY" status before sending traffic.

From our POV we would like one of two things to be true here.

  1. Service Connect waits for the app container to be marked as "HEALTHY" by the task before routing traffic to it.
    OR
  2. Service Connect provides a way of configuring a health check endpoint.

The fact that it is currently simply routing traffic to a task as soon as the Envoy sidecar becomes healthy means we need to do some pretty aggressive retries in the client applications, which works to paper over the cracks but can still lead to failure.

Are you currently working around this issue?
Yes. A combination of aggressive retries and long Docker health checks has proven effective.
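For anyone wanting a starting point for the client-side retries, here is a minimal sketch in Python, assuming the client only needs to retry on 503 and that exponential backoff with jitter is acceptable. The function names, attempt count, and delays are illustrative assumptions, not what we run in production, and connection errors are deliberately left unhandled.

```python
import random
import time
import urllib.error
import urllib.request


def fetch_status(url, timeout=2.0):
    """Return the HTTP status code for a GET request to url."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # urllib raises HTTPError for 4xx/5xx responses; the code is on the error.
        return err.code


def get_with_retries(url, fetch=fetch_status, max_attempts=5,
                     base_delay=0.1, sleep=time.sleep):
    """Retry while the response is 503; return the last status seen."""
    status = None
    for attempt in range(max_attempts):
        status = fetch(url)
        if status != 503:
            return status
        if attempt < max_attempts - 1:
            # Full jitter: wait a random time up to base_delay * 2^attempt.
            sleep(random.uniform(0, base_delay * 2 ** attempt))
    return status
```

The `fetch` and `sleep` parameters exist only so the retry loop can be exercised without a network or real delays.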

We received the following guidance from AWS Support:

Configure a container health check in the ECS task definition with startPeriod 60. Due to the startPeriod setting, although the new task will start in UNHEALTHY, ECS does not replace the task for 60 seconds. At the same time, the old task is kept alive. Service Connect has both tasks registered in CloudMap and will send traffic to both using round-robin.
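As a rough illustration of that guidance, the health check portion of a container definition might look like the sketch below, built as a Python dict and printed as JSON. Only `startPeriod: 60` comes from the guidance; the container name, port, health endpoint, and the other timing values are assumptions, and the `curl` command presumes `curl` is present in the image.

```python
import json

# Hypothetical container name, port, and endpoint. Only startPeriod=60
# is taken from the AWS Support guidance quoted above.
container_definition = {
    "name": "app",
    "essential": True,
    "healthCheck": {
        "command": ["CMD-SHELL", "curl -f http://localhost:8080/health || exit 1"],
        "interval": 10,     # seconds between checks (assumed value)
        "timeout": 5,       # seconds before a single check counts as failed (assumed value)
        "retries": 3,       # consecutive failures before UNHEALTHY (assumed value)
        "startPeriod": 60,  # grace period during which failures are not counted
    },
}

print(json.dumps(container_definition, indent=2))
```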

This solution "works" but is merely a sticking plaster - it can still lead to failed requests and needlessly extends deploy / restart times.
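To make the "can still lead to failed requests" point concrete, here is a toy model of the situation, not a measurement. It assumes one old task that is always ready, one new task that answers 503 until it finishes booting, and strict alternation between the two. The boot time, request rate, and window are made-up numbers.

```python
from itertools import cycle


def simulate(boot_seconds, requests_per_second, duration_seconds):
    """Count first-attempt failures when requests alternate between an old
    (ready) task and a new task that is not ready until boot_seconds."""
    targets = cycle(["old", "new"])
    total = duration_seconds * requests_per_second
    failures = 0
    for i in range(total):
        now = i / requests_per_second  # seconds since the new task was registered
        if next(targets) == "new" and now < boot_seconds:
            failures += 1
    return failures, total


failures, total = simulate(boot_seconds=30, requests_per_second=10, duration_seconds=60)
print(f"{failures} of {total} first attempts would fail")
```

Under these assumptions, half of all requests sent during the boot window land on the task that is not ready, which is why retries are still needed alongside the health check.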

@muzfuz muzfuz added the Proposed Community submitted issue label Apr 22, 2024
@herrhound herrhound added the ECS Amazon Elastic Container Service label Apr 29, 2024

kshivaz commented Apr 30, 2024

Thanks for this request. We’d like to check into the behavior you saw more thoroughly. Could you share your support case ID so we can look at your specific setup?

muzfuz commented May 10, 2024

@kshivaz thank you for looking at this. The case ID is 171276418200173.

kshivaz commented May 10, 2024

Thanks @muzfuz.
