Too-Small Clusters Cause Jobs to Fail Without Logs #3541

Open
time-less-ness opened this issue May 10, 2024 · 0 comments

To reproduce, create a cluster of t2.nano (or otherwise very small) instances and try to run any simple SkyPilot job that just does echo "thing". All such jobs fail, and the log files are empty.

I'm uncertain how to proceed. My best guess: when running sky launch, check the RAM size of the suggested nodes, and if it is less than 2GB (or so; I'm not sure of the actual cutoff, but 1GB is definitely too small), output a very loud WARNING that cluster jobs may fail without output because the nodes are too RAM-constrained for the underlying cluster orchestration to run properly.
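
A rough sketch of the kind of check I mean, in plain Python. Nothing here is SkyPilot's actual code or API: the function names, the 2 GiB threshold, and the way the memory size is passed in are all assumptions for illustration, and the real cutoff would need to be measured.

```python
import warnings
from typing import Optional

# Assumed cutoff taken from the guess above; the real minimum is unknown.
MIN_RECOMMENDED_MEMORY_GIB = 2.0


def low_memory_warning(
    instance_type: str,
    memory_gib: float,
    min_memory_gib: float = MIN_RECOMMENDED_MEMORY_GIB,
) -> Optional[str]:
    """Return a warning message if the node has too little RAM, else None.

    Hypothetical helper: the caller is assumed to already know the
    instance's memory size (for example from a cloud catalog lookup).
    """
    if memory_gib >= min_memory_gib:
        return None
    return (
        f"WARNING: instance type {instance_type} has only {memory_gib} GiB of RAM "
        f"(recommended minimum: {min_memory_gib} GiB). Cluster jobs may fail "
        "without any log output because the nodes are too RAM-constrained "
        "for the underlying cluster orchestration to run properly."
    )


def check_before_launch(instance_type: str, memory_gib: float) -> bool:
    """Emit a loud warning for undersized nodes; return True if the size looks OK."""
    message = low_memory_warning(instance_type, memory_gib)
    if message is not None:
        warnings.warn(message, stacklevel=2)
        return False
    return True


if __name__ == "__main__":
    # t2.nano has 0.5 GiB of memory, so this emits the warning.
    check_before_launch("t2.nano", 0.5)
```

The check only warns and does not block the launch, since the exact cutoff is a guess and some small instances may still work.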
