-
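Just an update: even if I called hvd.shutdown() before exiting, it still gives
ERROR:horovod.ray.elastic:10.128.0.85[1]:The actor died unexpectedly before finishing this task.
and resets state.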
-
How did you use the TestDiscovery class?
On Thu, Jan 27, 2022 at 1:38 AM Mohamed Yousef ***@***.***> wrote:
just an update, even if I called hvd.shutdown() before exiting it still gives
ERROR:horovod.ray.elastic:10.128.0.85[1]:The actor died unexpectedly before finishing this task.
and resets state
-
Hi,
I understand that elastic Horovod supports two modes of resuming operation after a node failure: graceful (HostsUpdatedInterrupt) and non-graceful (HorovodInternalError).
My question is: on a public cloud (e.g. GCP), using Horovod on Ray, if I can catch the notification of a node eviction/preemption locally (on the node), how can I raise this to the rest of the world before a failure happens? I tried running ray stop on that node and many other things, but a failure always happens first, followed by a non-graceful rollback plus host blacklisting.
The only way I could test the graceful behavior was by using the TestDiscovery class provided in the examples.
Thanks!
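To make the question concrete, here is a minimal sketch of the kind of mechanism I have in mind, not a confirmed solution. It assumes the graceful path is driven by the host discovery object reporting a changed host set, as TestDiscovery appears to simulate. PreemptionAwareDiscovery, mark_draining and StaticDiscovery are hypothetical names; only the method name find_available_hosts_and_slots mirrors Horovod's host discovery interface. How the preemption notice would reach the driver, and how such an object would be plugged into Horovod on Ray, is not shown and would need to be checked against the examples.

```python
import threading


class StaticDiscovery:
    """Hypothetical stand-in for a real discovery object (fixed host list)."""

    def __init__(self, hosts):
        self._hosts = dict(hosts)

    def find_available_hosts_and_slots(self):
        # Returns a mapping of hostname -> number of slots.
        return dict(self._hosts)


class PreemptionAwareDiscovery:
    """Hypothetical wrapper that hides hosts flagged as about to be preempted.

    `inner` is any object exposing find_available_hosts_and_slots().
    """

    def __init__(self, inner):
        self._inner = inner
        self._draining = set()
        self._lock = threading.Lock()

    def mark_draining(self, hostname):
        # Assumed to be called once a preemption notice for `hostname`
        # has been relayed to the driver (the relay itself is not shown).
        with self._lock:
            self._draining.add(hostname)

    def find_available_hosts_and_slots(self):
        hosts = self._inner.find_available_hosts_and_slots()
        with self._lock:
            # Drop draining hosts so the reported host set changes
            # before the node actually disappears.
            return {h: s for h, s in hosts.items() if h not in self._draining}


if __name__ == "__main__":
    discovery = PreemptionAwareDiscovery(
        StaticDiscovery({"10.128.0.85": 1, "10.128.0.86": 1})
    )
    discovery.mark_draining("10.128.0.85")
    print(discovery.find_available_hosts_and_slots())
```

The idea is that the host about to be evicted disappears from the discovered set while it is still alive, so the remaining workers would see a host update rather than a crashed actor. Whether Horovod on Ray actually reacts this way is exactly what I am asking.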