[k8s] GPU Feature discovery label formatter #3493

asaiacai · 2024-04-27T02:25:27Z

Resolves #2460

This allows the k8s integration to consume the node label nvidia.com/gpu.product created by GPU Feature Discovery (GFD), which is commonly deployed through the NVIDIA GPU Operator.

Tested (run the relevant ones):

Code formatting: bash format.sh
Manual test: test against GKE labels (tested against T4)
Manual test: test against labels from the SkyPilot labeler script on an EKS cluster deployed via eks_test_cluster.yaml
Manual test: deploy k3s with gpu-operator using deploy_k3s.sh, modified to exclude the SkyPilot k8s labeler, and ensure the following commands run:

# check nvidia-smi and nvidia.com/gpu.product info
nvidia-smi --query-gpu=name --format=csv,noheader,nounits
kubectl describe node | grep nvidia.com/gpu.product
# test skypilot against gpu type
sky show-gpus --cloud kubernetes
sky launch --cloud kubernetes --gpus <GPU_TYPE>
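
For anyone repeating the manual check above across several nodes, here is a minimal sketch that pulls the nvidia.com/gpu.product label out of the JSON printed by `kubectl get nodes -o json`. The function name `gpu_products_by_node` and the sample node are hypothetical and only for illustration; this is not part of the PR.

```python
import json
from typing import Dict

# Label key written by GPU Feature Discovery (from the PR description).
GFD_LABEL_KEY = 'nvidia.com/gpu.product'


def gpu_products_by_node(nodes_json: str) -> Dict[str, str]:
    """Map node name -> GFD product label, skipping unlabeled nodes.

    `nodes_json` is expected to be the text printed by
    `kubectl get nodes -o json` (a NodeList with an `items` array).
    Hypothetical helper, not SkyPilot code.
    """
    node_list = json.loads(nodes_json)
    products: Dict[str, str] = {}
    for node in node_list.get('items', []):
        metadata = node.get('metadata', {})
        labels = metadata.get('labels') or {}
        if GFD_LABEL_KEY in labels:
            products[metadata['name']] = labels[GFD_LABEL_KEY]
    return products


# Made-up sample standing in for real kubectl output.
sample = json.dumps({
    'items': [
        {'metadata': {'name': 'gpu-node',
                      'labels': {GFD_LABEL_KEY: 'Tesla-T4'}}},
        {'metadata': {'name': 'cpu-node', 'labels': {}}},
    ]
})
print(gpu_products_by_node(sample))
```

On a real cluster the input would come from something like `kubectl get nodes -o json > nodes.json`, read back as text.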

Michaelvll

This is awesome @asaiacai! It looks very reasonable to me. Requesting another look from @romilbhardwaj to make sure it does not break our other formatters :)

romilbhardwaj

Thanks @asaiacai!

romilbhardwaj · 2024-05-15T09:20:00Z

sky/provision/kubernetes/utils.py

+        return cls.LABEL_KEY
+
+    @classmethod
+    def get_accelerator_from_label_value(cls, value: str) -> str:


Do we also need to implement get_label_value?

I don't think we need to. It would also be tricky to implement, since a single accelerator type can map to many GFD label values (one -> many). One case I've run into is the PCIe vs. SXM variants of the A100 and H100 GPUs. This is the main reason I changed the matching logic from (acc_type --> label value) to (label value --> acc_type). Beyond that, get_label_value is only used for autoscaling, where GFDLabeler wouldn't be used.
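
To make the direction of the mapping concrete, here is a rough sketch of going from (label value --> acc_type). It is not the code merged in this PR: the class name `GFDLabelFormatterSketch`, the `_KNOWN_ACCELERATORS` list, and the exact regex are assumptions for illustration. Only `LABEL_KEY` and the `get_accelerator_from_label_value` signature come from the diff above, and `get_label_key` is an assumed name for the method that returns it.

```python
import re


class GFDLabelFormatterSketch:
    """Hypothetical illustration of label value -> accelerator matching."""

    LABEL_KEY = 'nvidia.com/gpu.product'

    # Assumed, deliberately short list of canonical accelerator names.
    _KNOWN_ACCELERATORS = ['H100', 'A100', 'V100', 'T4', 'P4000']

    @classmethod
    def get_label_key(cls) -> str:
        return cls.LABEL_KEY

    @classmethod
    def get_accelerator_from_label_value(cls, value: str) -> str:
        # Upper-case first so the comparison is case-insensitive.
        normalized = value.upper()
        for name in cls._KNOWN_ACCELERATORS:
            # Match the name only as a whole token, so that e.g. 'T4'
            # is not found inside a longer alphanumeric run.
            pattern = rf'(?<![A-Z0-9]){re.escape(name)}(?![A-Z0-9])'
            if re.search(pattern, normalized):
                return name
        # Unknown product: fall back to the normalized label value.
        return normalized


# Several label values collapse onto one accelerator type, which is why
# the reverse direction (acc_type -> a single label value) is ill-defined.
for label in ['NVIDIA-A100-SXM4-40GB', 'NVIDIA-A100-PCIE-40GB', 'Tesla-T4']:
    print(label, '->',
          GFDLabelFormatterSketch.get_accelerator_from_label_value(label))
```

Both A100 label values resolve to the same accelerator name, so there is no unique label value to return for a given accelerator type.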

asaiacai added 17 commits April 27, 2024 02:09

GFDLabel formatter for k8s

4e71e6f

update comment

b91a576

format

95302b4

substring match against k8s labels instead of strict matching

5d3b360

cleanup

acf9968

use k8s label

e3bbde9

map k8s label value to accelerator instead of accelerator to label value

c87a401

remove unused get_gke_accelerator_name

d1b7b4c

remove get acc from value func

db9a091

pattern match against A100'

3da382c

pattern match against A100'

4be3589

format

57e0f14

fix typo

2c06136

format

4103455

re.search

ac6a51e

compare strings

d703095

add P4000

f82c2a1

asaiacai marked this pull request as ready for review May 7, 2024 00:24

asaiacai added 2 commits May 10, 2024 01:32

merge

54035d0

format

bc8fbd7

Michaelvll reviewed May 10, 2024

Michaelvll requested a review from romilbhardwaj May 10, 2024 18:10

lower case for check

0319215

Co-authored-by: Zhanghao Wu <zhanghao.wu@outlook.com>

romilbhardwaj reviewed May 15, 2024

asaiacai added 3 commits May 16, 2024 14:03

force upper case

75e4396

match skypilot labeler logic

7d4d6bb

format.sh

a85014a

asaiacai May 16, 2024
