[k8s] [GKE] Fail to request T4 instance #3506

Open
asaiacai opened this issue May 2, 2024 · 4 comments · May be fixed by #3590
Labels
k8s Kubernetes related items

Comments

@asaiacai
Contributor

asaiacai commented May 2, 2024

I'm running a single T4 node on GKE. The nodes are properly labeled, as shown below, and sky show-gpus --cloud kubernetes correctly lists the T4, but sky launch fails.

(sky) gcpuser@gfd-ebd1-head-evggxnxq-compute:~/skypilot$ kubectl describe nodes | grep cloud.google.com/gke-accelerator
                    cloud.google.com/gke-accelerator=nvidia-tesla-t4
                      cloud.google.com/gke-accelerator=nvidia-tesla-t4,cloud.google.com/gke-boot-disk=pd-balanced,cloud.google.com/gke-container-runtime=contain...
(sky) gcpuser@gfd-ebd1-head-evggxnxq-compute:~/skypilot$ sky show-gpus --cloud kubernetes
COMMON_GPU  AVAILABLE_QUANTITIES  
T4          1                     

Hint: use -a/--all to see all accelerators (including non-common ones) and pricing.
(sky) gcpuser@gfd-ebd1-head-evggxnxq-compute:~/skypilot$ sky launch --cloud kubernetes --gpus T4
I 05-02 08:48:17 optimizer.py:1209] No resource satisfying Kubernetes({'T4': 1}) on Kubernetes.
sky.exceptions.ResourcesUnavailableError: Kubernetes cluster does not contain any instances satisfying the request:
Task<name=sky-cmd>(run=<empty>)
  resources: Kubernetes({'T4': 1}).

To fix: relax or change the resource requirements.

Hint: sky show-gpus to list available accelerators.
      sky check to check the enabled clouds.

Version & Commit info:

  • sky -v: skypilot, version 1.0.0-dev0
  • sky -c: skypilot, commit 889adce65602b76e31f60534ce25c264bad7cb83
@romilbhardwaj
Collaborator

Thanks for the report @asaiacai - I'm unable to reproduce this on d27e0ff. Can you share a reproduction script, a bit more about how you created the cluster and the full output of sky launch --cloud kubernetes --gpus T4?

Here's how I created my cluster:

$ PROJECT_ID=$(gcloud config get-value project)
$ CLUSTER_NAME=gkeusc4

$ gcloud beta container --project "${PROJECT_ID}" clusters create "${CLUSTER_NAME}" --zone "us-central1-c" --no-enable-basic-auth --cluster-version "1.27.12-gke.1115000" --release-channel "regular" --machine-type "n1-standard-16" --accelerator "type=nvidia-tesla-t4,count=1" --image-type "COS_CONTAINERD" --disk-type "pd-balanced" --disk-size "100" --metadata disable-legacy-endpoints=true --scopes "https://www.googleapis.com/auth/devstorage.read_only","https://www.googleapis.com/auth/logging.write","https://www.googleapis.com/auth/monitoring","https://www.googleapis.com/auth/servicecontrol","https://www.googleapis.com/auth/service.management.readonly","https://www.googleapis.com/auth/trace.append" --num-nodes "3" --logging=SYSTEM,WORKLOAD --monitoring=SYSTEM --enable-ip-alias --network "projects/${PROJECT_ID}/global/networks/default" --subnetwork "projects/${PROJECT_ID}/regions/us-central1/subnetworks/default" --no-enable-intra-node-visibility --default-max-pods-per-node "110" --security-posture=standard --workload-vulnerability-scanning=disabled --no-enable-master-authorized-networks --addons HorizontalPodAutoscaling,HttpLoadBalancing,GcePersistentDiskCsiDriver --enable-autoupgrade --enable-autorepair --max-surge-upgrade 1 --max-unavailable-upgrade 0 --enable-managed-prometheus --enable-shielded-nodes --node-locations "us-central1-c"

$ kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded-latest.yaml

$ kubectl describe nodes | grep cloud.google.com/gke-accelerator
                    cloud.google.com/gke-accelerator=nvidia-tesla-t4

$ sky show-gpus --cloud kubernetes
COMMON_GPU  AVAILABLE_QUANTITIES
T4          1

$ sky launch --cloud kubernetes --gpus T4
# My cluster ran as expected.

@asaiacai
Contributor Author

asaiacai commented May 2, 2024

I have an existing GKE cluster, cluster-1, to which I added a new node pool with one T4 instance. I shouldn't need to purge ~/.sky, right?

$ PROJECT_ID=$(gcloud config get-value project)
$ CLUSTER_NAME=cluster-1

$ gcloud beta container node-pools create t4-nodepool  --cluster=${CLUSTER_NAME}  --zone=us-central1-c  --node-locations=us-central1-c     --num-nodes=1     --total-min-nodes=1     --total-max-nodes=1     --reservation-affinity=none     --no-enable-autorepair     --location-policy=ANY   --machine-type=n1-standard-2     --accelerator type=nvidia-tesla-t4,count=1,gpu-driver-version=latest
Note: Machines with GPUs have certain limitations which may affect your workflow. Learn more at https://cloud.google.com/kubernetes-engine/docs/how-to/gpus
Note: Starting in GKE 1.30, if you don't specify a driver version, GKE installs the default GPU driver for your node's GKE version.
Creating node pool t4-nodepool...done.                                                                                             
Created [https://container.googleapis.com/v1beta1/projects/trainy-test/zones/us-central1-c/clusters/cluster-1/nodePools/t4-nodepool].
NAME         MACHINE_TYPE   DISK_SIZE_GB  NODE_VERSION
t4-nodepool  n1-standard-2  100           1.28.7-gke.1026000

$ kubectl describe nodes | grep cloud.google.com/gke-accelerator
                    cloud.google.com/gke-accelerator=nvidia-tesla-t4
                      cloud.google.com/gke-accelerator=nvidia-tesla-t4,cloud.google.com/gke-boot-disk=pd-balanced,cloud.google.com/gke-container-runtime=contain...

$ sky show-gpus --cloud kubernetes
COMMON_GPU  AVAILABLE_QUANTITIES  
T4          1                     

Hint: use -a/--all to see all accelerators (including non-common ones) and pricing.
$ sky launch --cloud kubernetes --gpus T4
I 05-02 18:08:39 optimizer.py:1209] No resource satisfying Kubernetes({'T4': 1}) on Kubernetes.
sky.exceptions.ResourcesUnavailableError: Kubernetes cluster does not contain any instances satisfying the request:
Task<name=sky-cmd>(run=<empty>)
  resources: Kubernetes({'T4': 1}).

To fix: relax or change the resource requirements.

$ sky -c
skypilot, commit d27e0ff83c56983920a655fbeaddc96b2758752e

@romilbhardwaj
Collaborator

Ah, it looks like your instance does not have enough memory to satisfy the default resource request of 2 CPUs and 8GB memory. Note that some CPU millicores and memory go to k8s components, so an n1-standard-2 with 2 CPUs and 7.5GB memory cannot fit the default resources requested by SkyPilot.

This is surfaced in debug logs (export SKYPILOT_DEBUG=1):

D 05-02 13:40:51 kubernetes.py:344] Instance type 2CPU--8GB--1T4 does not fit in the Kubernetes cluster. Reason: GPU nodes with T4 do not have enough CPU and/or memory. Maximum resources found on a single node: 2.0 CPUs, 7.3G Memory

Explicitly specifying a lower CPU/mem request (e.g., sky launch --cloud kubernetes --gpus T4 --cpus 1 --memory 2) should work.
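
To illustrate why the default request fails while the reduced one should fit, here is a minimal sketch of a per-node fit check. It is not SkyPilot's actual implementation: the names Request, NodeAllocatable, and fits are hypothetical, and the node figures are simply the ones reported in the debug log above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    """Resources a task asks for (hypothetical type, for illustration only)."""
    cpus: float
    memory_gb: float
    gpus: int


@dataclass(frozen=True)
class NodeAllocatable:
    """Resources left on a node after k8s components take their share."""
    cpus: float
    memory_gb: float
    gpus: int


def fits(request: Request, node: NodeAllocatable) -> bool:
    """Return True only if every requested resource fits on the node."""
    return (request.cpus <= node.cpus
            and request.memory_gb <= node.memory_gb
            and request.gpus <= node.gpus)


# Figures from the debug log: the largest T4 node offers 2.0 CPUs, 7.3G memory.
t4_node = NodeAllocatable(cpus=2.0, memory_gb=7.3, gpus=1)

default_request = Request(cpus=2, memory_gb=8, gpus=1)  # 2CPU--8GB--1T4
reduced_request = Request(cpus=1, memory_gb=2, gpus=1)  # --cpus 1 --memory 2

print(fits(default_request, t4_node))  # False: 8 GB requested, 7.3 GB available
print(fits(reduced_request, t4_node))  # True
```

The point of the sketch is that the GPU count alone matches in both cases; it is the memory term that rejects the default request, which is why the error message showing only {'T4': 1} is misleading.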

TODO for us is to make the log messages better - perhaps resources: Kubernetes({'T4': 1}) should have shown the CPUs and memory requested. Leaving the issue open for us to fix logging. Thanks for the report!

@romilbhardwaj romilbhardwaj added the k8s Kubernetes related items label May 2, 2024
@romilbhardwaj
Collaborator

Bumping the priority for this - another user ran into issues with SkyPilot being unable to use resources on k8s and had to set SKYPILOT_DEBUG=1 to surface the error. This should be logged at the info level.
