I'm trying to run the training operator standalone on an OpenShift cluster with Katib. When I apply a PyTorchJob, the worker pods are created, but for some reason the master pods are never started.
Here is the events log of the worker pod:
Events:
Type     Reason          Age                    From               Message
----     ------          ----                   ----               -------
Normal   Scheduled       9m35s                  default-scheduler  Successfully assigned sampler/random-exp-jw6qxmrm-worker-0 to acorvin-hpo-poc-jfrlm-worker-0-twvtz
Normal   AddedInterface  9m33s                  multus             Add eth0 [10.131.5.61/23] from openshift-sdn
Normal   Pulling         9m33s                  kubelet            Pulling image "quay.io/bharathappali/alpine:3.10"
Normal   Pulled          9m32s                  kubelet            Successfully pulled image "quay.io/bharathappali/alpine:3.10" in 1.065165424s (1.065174057s including waiting)
Warning  BackOff         2m49s                  kubelet            Back-off restarting failed container init-pytorch in pod random-exp-jw6qxmrm-worker-0_sampler(8d6860a7-204d-45c8-bb57-8d84a6cf8e66)
Normal   Created         2m34s (x3 over 9m31s)  kubelet            Created container init-pytorch
Normal   Started         2m34s (x3 over 9m31s)  kubelet            Started container init-pytorch
Normal   Pulled          2m34s (x2 over 6m11s)  kubelet            Container image "quay.io/bharathappali/alpine:3.10" already present on machine
I changed the init container image because of Docker pull limits.
Here is the pod log:
nslookup: can't resolve 'random-exp-jw6qxmrm-master-0': Name does not resolve
waiting for master
nslookup: can't resolve '(null)': Name does not resolve
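The log above is consistent with an init container that keeps trying to resolve the master's hostname and prints "waiting for master" on each failure. The following is only a minimal Python sketch of that kind of wait loop, not the training operator's actual code: `wait_for_master`, its parameters, and the injectable `resolve` argument are hypothetical names chosen for illustration.

```python
import socket
import time


def wait_for_master(host, retries=5, delay=2.0, resolve=socket.gethostbyname):
    """Poll DNS until `host` resolves.

    Returns True as soon as the name resolves, or False once `retries`
    attempts have failed. Hypothetical helper: it mirrors what the init
    container log suggests, not the operator's real implementation.
    """
    for _ in range(retries):
        try:
            resolve(host)
            return True
        except OSError:
            # socket.gaierror (name does not resolve) is a subclass of OSError
            print("waiting for master")
            time.sleep(delay)
    return False


if __name__ == "__main__":
    # Hostname taken from the pod log above; it only resolves inside the cluster.
    ok = wait_for_master("random-exp-jw6qxmrm-master-0", retries=3, delay=1.0)
    print("master resolvable:", ok)
```

If the master pod is never created, a loop like this cannot succeed, which would match the worker's init container failing and being restarted with back-off.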
Here is the PyTorch experiment I'm deploying: