Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We鈥檒l occasionally send you account related emails.

Already on GitHub? Sign in to your account

[EKS][GPU Failure Handling]: Nvidia EC2 Instance Failure Handling in Managed Node Group #2324

Open
yimipeng opened this issue Apr 6, 2024 · 0 comments
Labels
EKS Managed Nodes EKS Managed Nodes EKS Amazon Elastic Kubernetes Service Proposed Community submitted issue

Comments

@yimipeng
Copy link
Member

yimipeng commented Apr 6, 2024

Community Note

  • Please vote on this issue by adding a 馃憤 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Tell us about your request
What do you want us to build?

(Out-of-Box) NativeAWS Experience for Nvidia EC2 Instance Failure Handling on EKS include:

  • NIX etc. failure metrics monitoring (currently latest AWS CloudWatch Observability Add-on does not have that)
  • Unhealth instance detection, termination and replacement managed by EKS ManagedNodeGroup

Which service(s) is this request for?
This could be Fargate, ECS, EKS, ECR
EKS
Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard?
What outcome are you trying to achieve, ultimately, and why is it hard/impossible to do right now? What is the impact of not having this problem solved? The more details you can provide, the better we'll be able to understand and solve the problem.

Are you currently working around this issue?
How are you currently solving this problem?

Additional context
Anything else we should know?

Attachments
If you think you might have additional information that you'd like to include via an attachment, please do - we'll take a look. (Remember to remove any personally-identifiable information.)

@yimipeng yimipeng added the Proposed Community submitted issue label Apr 6, 2024
@mikestef9 mikestef9 added EKS Amazon Elastic Kubernetes Service EKS Managed Nodes EKS Managed Nodes labels Apr 6, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
EKS Managed Nodes EKS Managed Nodes EKS Amazon Elastic Kubernetes Service Proposed Community submitted issue
Projects
None yet
Development

No branches or pull requests

2 participants