[k8s] [GKE] Fail to request T4 instance #3506
Comments
Thanks for the report @asaiacai - I'm unable to reproduce this on d27e0ff. Can you share a reproduction script, a bit more about how you created the cluster and the full output of
Here's how I created my cluster:
I have an existing GKE cluster:
$ PROJECT_ID=$(gcloud config get-value project)
$ CLUSTER_NAME=cluster-1
$ gcloud beta container node-pools create t4-nodepool --cluster=${CLUSTER_NAME} --zone=us-central1-c --node-locations=us-central1-c --num-nodes=1 --total-min-nodes=1 --total-max-nodes=1 --reservation-affinity=none --no-enable-autorepair --location-policy=ANY --machine-type=n1-standard-2 --accelerator type=nvidia-tesla-t4,count=1,gpu-driver-version=latest
Note: Machines with GPUs have certain limitations which may affect your workflow. Learn more at https://cloud.google.com/kubernetes-engine/docs/how-to/gpus
Note: Starting in GKE 1.30, if you don't specify a driver version, GKE installs the default GPU driver for your node's GKE version.
Creating node pool t4-nodepool...done.
Created [https://container.googleapis.com/v1beta1/projects/trainy-test/zones/us-central1-c/clusters/cluster-1/nodePools/t4-nodepool].
NAME MACHINE_TYPE DISK_SIZE_GB NODE_VERSION
t4-nodepool n1-standard-2 100 1.28.7-gke.1026000
$ kubectl describe nodes | grep cloud.google.com/gke-accelerator
cloud.google.com/gke-accelerator=nvidia-tesla-t4
cloud.google.com/gke-accelerator=nvidia-tesla-t4,cloud.google.com/gke-boot-disk=pd-balanced,cloud.google.com/gke-container-runtime=contain...
$ sky show-gpus --cloud kubernetes
COMMON_GPU AVAILABLE_QUANTITIES
T4 1
Hint: use -a/--all to see all accelerators (including non-common ones) and pricing.
$ sky launch --cloud kubernetes --gpus T4
I 05-02 18:08:39 optimizer.py:1209] No resource satisfying Kubernetes({'T4': 1}) on Kubernetes.
sky.exceptions.ResourcesUnavailableError: Kubernetes cluster does not contain any instances satisfying the request:
Task<name=sky-cmd>(run=<empty>)
resources: Kubernetes({'T4': 1}).
To fix: relax or change the resource requirements.
$ sky -c
skypilot, commit d27e0ff83c56983920a655fbeaddc96b2758752e
Ah, it looks like your instance does not have enough memory to satisfy the default resource request of 2 CPUs and 8GB memory. Note that some CPU millicores and memory go to k8s components, so an n1-standard-2 with 2 CPUs and 7.5GB memory cannot fit the default resources requested by SkyPilot. This is surfaced in debug logs (
Explicitly specifying a lower CPU/mem request (e.g.,
TODO for us is to make the log messages better - perhaps
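The arithmetic behind this explanation can be sketched in a few lines of shell. This is only an illustration, not SkyPilot's actual scheduling logic: `fits` is a hypothetical helper, the reserved amounts are optional placeholders, and the GB figures from the comment are treated as GiB (8192 MiB requested, 7680 MiB on the node), which is an assumption.

```shell
# Hypothetical helper (not part of SkyPilot or kubectl).
# fits REQ_CPU_M REQ_MEM_MIB NODE_CPU_M NODE_MEM_MIB [RESERVED_CPU_M] [RESERVED_MEM_MIB]
# CPU is in millicores, memory in MiB. The reserved values model what goes to
# k8s components and default to 0 when omitted.
fits() {
  free_cpu=$(( $3 - ${5:-0} ))
  free_mem=$(( $4 - ${6:-0} ))
  if [ "$1" -le "$free_cpu" ] && [ "$2" -le "$free_mem" ]; then
    echo "fits"
  else
    echo "does not fit (free: ${free_cpu}m CPU, ${free_mem}Mi memory)"
    return 1
  fi
}

# Default request from the thread (2 CPUs, 8GB) against an n1-standard-2
# (2 CPUs, 7.5GB). Memory already falls short before any reservation.
fits 2000 8192 2000 7680 || true
```

If the check fails, one option consistent with the comment above is to pass a smaller request to `sky launch` through its `--cpus` and `--memory` options, for example `--cpus 1 --memory 4`. Those values are illustrative and do not come from the thread.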
Bumping the priority for this - another user ran into issues with SkyPilot being unable to use resources on k8s and had to set SKYPILOT_DEBUG=1 to surface the error. This should be logged at the info level.
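Until the message is promoted to info level, the workaround mentioned here amounts to setting the environment variable for a single command. A minimal sketch follows; `with_sky_debug` is a hypothetical wrapper name, and only `SKYPILOT_DEBUG=1` itself comes from the thread.

```shell
# Hypothetical wrapper: run one command with SkyPilot debug logging enabled,
# without exporting the variable into the rest of the shell session.
with_sky_debug() {
  SKYPILOT_DEBUG=1 "$@"
}

# Usage (requires SkyPilot and a configured Kubernetes context):
#   with_sky_debug sky launch --cloud kubernetes --gpus T4
```

Scoping the variable to one invocation keeps later `sky` commands at their normal verbosity.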
I'm running a single T4 node on GKE. Nodes are properly labeled as shown below and the output of sky show-gpus --cloud kubernetes is also correct, but the launch fails.
Version & Commit info:
sky -v: skypilot, version 1.0.0-dev0
sky -c: skypilot, commit 889adce65602b76e31f60534ce25c264bad7cb83