[k8s] GPU Feature discovery label formatter #3493
base: master
This is awesome @asaiacai! It looks very reasonable to me. @romilbhardwaj for another look to make sure it does not break our other formatters : )
Co-authored-by: Zhanghao Wu <zhanghao.wu@outlook.com>
Thanks @asaiacai!
        return cls.LABEL_KEY

    @classmethod
    def get_accelerator_from_label_value(cls, value: str) -> str:
Do we also need to implement get_label_value?
I don't think we need to. It's also tricky to implement, since a single accelerator type can map to many GFD label values. One case I've run into is the PCIe vs. SXM variants of the A100 and H100 GPUs. This is the main reason I changed the lookup direction from (acc_type --> label value) to (label value --> acc_type). Beyond that, get_label_value is only used for autoscaling, where GFDLabeler wouldn't be used.
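To illustrate why the (label value --> acc_type) direction is easier, here is a minimal sketch, not the code in this PR. The function name `accelerator_from_gfd_label` and the `CANONICAL_NAMES` list are hypothetical, and the sample label values are assumed examples of what `nvidia.com/gpu.product` might contain. Several label values collapse to one accelerator type, so the reverse lookup is a simple substring match, while the forward direction would have to guess which variant a node carries.

```python
from typing import List

# Hypothetical list of canonical accelerator names (an assumption for this
# sketch, not taken from the PR). Order matters: 'A100' must precede 'A10'
# so that an A100 label is not matched by the shorter name.
CANONICAL_NAMES: List[str] = ['H100', 'A100', 'A10G', 'A10', 'V100', 'T4']


def accelerator_from_gfd_label(value: str) -> str:
    """Map a GFD product label value to a canonical accelerator name."""
    upper = value.upper()
    for name in CANONICAL_NAMES:
        if name in upper:
            return name
    # Fallback for unknown products: strip common vendor prefixes.
    return upper.replace('NVIDIA-', '').replace('TESLA-', '')


# Both the SXM and PCIe variants resolve to the same accelerator type.
print(accelerator_from_gfd_label('NVIDIA-A100-SXM4-80GB'))
print(accelerator_from_gfd_label('NVIDIA-A100-PCIE-40GB'))
```

Going the other way would require choosing one of several possible label values for "A100", which is the one-to-many problem described above.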
Resolves #2460
This allows k8s to consume the node label nvidia.com/gpu.product created by GPU feature discovery, which is commonly deployed through the NVIDIA GPU operator.
Tested (run the relevant ones):
bash format.sh
eks_test_cluster.yaml
k3s with gpu-operator, using deploy_k3s.sh modified to exclude the skypilot k8s labeler; ensure the following can run