[k8s] GPU Feature discovery label formatter #3493

Open
wants to merge 23 commits into
base: master

Contributor

@asaiacai asaiacai commented Apr 27, 2024

Resolves #2460

This allows k8s to consume the node label nvidia.com/gpu.product, which is created by GPU Feature Discovery (GFD). GFD is commonly deployed through the NVIDIA GPU Operator.
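As an illustration of what consuming this label involves, here is a minimal sketch that pulls the nvidia.com/gpu.product value for each node out of the JSON printed by `kubectl get nodes -o json`. The function name `gpu_products_by_node` and the sample node data are hypothetical and are not taken from this PR; only the label key and the standard `items[].metadata.labels` layout of the node list are assumed.

```python
import json
from typing import Dict

# Label key written by GPU Feature Discovery (from the PR description).
GFD_LABEL_KEY = 'nvidia.com/gpu.product'


def gpu_products_by_node(nodes_json: str) -> Dict[str, str]:
    """Map node name -> GFD product label value.

    `nodes_json` is expected to look like the output of
    `kubectl get nodes -o json`. Nodes without the label are skipped.
    This helper is hypothetical, not part of SkyPilot.
    """
    items = json.loads(nodes_json).get('items', [])
    products: Dict[str, str] = {}
    for node in items:
        metadata = node.get('metadata', {})
        labels = metadata.get('labels') or {}
        if GFD_LABEL_KEY in labels:
            products[metadata.get('name', '')] = labels[GFD_LABEL_KEY]
    return products


# Illustrative input; the node names and label value are made up.
SAMPLE_NODES = json.dumps({
    'items': [
        {'metadata': {'name': 'gpu-node-1',
                      'labels': {GFD_LABEL_KEY: 'Tesla-T4'}}},
        {'metadata': {'name': 'cpu-node-1', 'labels': {}}},
    ]
})
```

Calling `gpu_products_by_node(SAMPLE_NODES)` on the sample above keeps only the labeled node, so CPU-only nodes never reach the accelerator-matching step.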

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • Manual test: test against GKE labels (tested against T4)
  • Manual test: test against skypilot labeler script labels on EKS deployed via eks_test_cluster.yaml
  • Manual tests: deploy k3s with gpu-operator using deploy_k3s.sh, modified to exclude the skypilot k8s labeler, then ensure the following can run:
# check nvidia-smi and nvidia.com/gpu.product info
nvidia-smi --query-gpu=name --format=csv,noheader,nounits
kubectl describe node | grep nvidia.com/gpu.product
# test skypilot against gpu type
sky show-gpus --cloud kubernetes
sky launch --cloud kubernetes --gpus <GPU_TYPE>
  • A100-80GB
  • A100
  • H100
  • T4
  • V100
  • A10G
  • P100
  • P4
  • L4

@asaiacai asaiacai marked this pull request as ready for review May 7, 2024 00:24
Collaborator

@Michaelvll Michaelvll left a comment

This is awesome @asaiacai! It looks very reasonable to me. @romilbhardwaj for another look to make sure it does not break our other formatters : )

sky/provision/kubernetes/utils.py (resolved)
Co-authored-by: Zhanghao Wu <zhanghao.wu@outlook.com>
Collaborator

@romilbhardwaj romilbhardwaj left a comment

Thanks @asaiacai!

sky/provision/kubernetes/utils.py (outdated, resolved)
return cls.LABEL_KEY

@classmethod
def get_accelerator_from_label_value(cls, value: str) -> str:
Collaborator

Do we also need to implement get_label_value?

Contributor Author

I don't think we need to. It's also tricky to implement, since a single accelerator type can map to many GFD label values (one -> many). One case I've run into is the PCIe vs. SXM variants of the A100 and H100 GPUs. This is the main reason I changed the check from (acc_type --> label value) to (label value --> acc_type). Outside of that, get_label_value is only used for autoscaling, where GFDLabeler wouldn't be used.
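To make the (label value --> acc_type) direction concrete, here is a minimal sketch of one way such a lookup could work. It is not the code from this PR: the function name `accelerator_from_label_value`, the token-matching rule, and the example label strings are all assumptions for illustration. The list of canonical names is the set of GPU types listed as tested in the PR description.

```python
from typing import Optional

# GPU types listed as tested in the PR description. 'A100-80GB' comes
# before 'A100' so the more specific name is tried first.
CANONICAL_GPU_NAMES = [
    'A100-80GB', 'A100', 'H100', 'T4', 'V100', 'A10G', 'P100', 'P4', 'L4'
]


def accelerator_from_label_value(value: str) -> Optional[str]:
    """Return the first canonical name whose dash-separated tokens all
    appear in the label value, or None if nothing matches.

    Hypothetical helper: matching on whole tokens (rather than
    substrings) keeps 'P4' from matching a 'P40' label.
    """
    label_tokens = set(value.upper().split('-'))
    for name in CANONICAL_GPU_NAMES:
        if set(name.split('-')) <= label_tokens:
            return name
    return None


# Example label strings below are assumed, not taken from the PR.
for label in ('NVIDIA-A100-SXM4-80GB', 'NVIDIA-A100-PCIE-40GB',
              'NVIDIA-H100-PCIe', 'Tesla-T4'):
    print(label, '->', accelerator_from_label_value(label))
```

Going in this direction means the PCIe and SXM variants need no special handling: several label values can collapse to the same accelerator type, whereas the reverse lookup would have to pick one label value per type.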

sky/provision/kubernetes/utils.py (outdated, resolved)
tests/kubernetes/scripts/deploy_k3s.sh (resolved)
Successfully merging this pull request may close these issues.

[k8s] Support Nvidia GFD Labels for GPU type detection
3 participants