RuntimeError: ('Put data input KVStore server failed.', URLError(error(104, 'Connection reset by peer'),)) #3388
Unanswered
xiaogengyaokeyan
asked this question in
Q&A
Replies: 0 comments
Hi guys, I've recently been using the horovod.run.runner.run() Python entry point for training on one machine with 8 GPUs,
and I ran into the RuntimeError shown in the title,
which looks like an HTTP error. However, out of 10 training runs, about 4 hit this error, while the others ran fine without any error...
By the way, when I use the shell command:
horovodrun -np 8 python distributed_train.py
I have never run into this error. If it is caused by HTTP, how can I avoid it when using the Python entry point?
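Since the failure is intermittent (roughly 4 runs in 10), one possible stopgap is to retry the launch whenever this particular RuntimeError shows up. The following is only a minimal sketch, not a fix for the underlying cause: `run_with_retries`, `launch`, `max_attempts` and `delay_seconds` are hypothetical names, not part of Horovod, and it assumes the error is raised at startup, before any training progress would be lost.

```python
import time


def run_with_retries(launch, max_attempts=3, delay_seconds=5.0, sleep=time.sleep):
    """Call launch() and retry only when the KVStore RuntimeError is raised.

    launch is a hypothetical zero-argument callable that wraps whatever
    launcher call is already in use; nothing here is Horovod API.
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return launch()
        except RuntimeError as err:
            # Retry only the intermittent KVStore failure; re-raise anything else.
            if "KVStore" not in str(err):
                raise
            last_error = err
            print(f"attempt {attempt}/{max_attempts} failed: {err}")
            if attempt < max_attempts:
                sleep(delay_seconds)
    # Every attempt hit the KVStore error, so surface the last one.
    raise last_error
```

Here `launch` would be something like a small function or lambda around the existing horovod.run.runner.run(...) call, with the same arguments as before. This only masks the symptom; it does not explain why the Python entry point behaves differently from horovodrun.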