DLRover currently provides the following features for recovering training from failures:
Automatic diagnosis of the status of GPU and Ascend NPU machines. (Blog)
Flash Checkpoint, which asynchronously persists checkpoints from shared memory to storage. (Blog)
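For readers unfamiliar with the idea behind Flash Checkpoint, here is a minimal sketch of the general pattern: stage the state in shared memory, then persist it to storage in the background. It uses only the Python standard library and is not DLRover's API; the function names `stage_in_shared_memory`, `persist_async`, and `demo` are hypothetical, and `pickle` stands in for whatever serialization a real trainer would use.

```python
import os
import pickle
import tempfile
import threading
from multiprocessing import shared_memory


def stage_in_shared_memory(state):
    """Serialize `state` into a new shared-memory segment; return (segment, size)."""
    payload = pickle.dumps(state)
    shm = shared_memory.SharedMemory(create=True, size=len(payload))
    # The segment may be rounded up to a page size, so track the real length.
    shm.buf[:len(payload)] = payload
    return shm, len(payload)


def persist_async(shm, size, path):
    """Write the staged bytes to `path` in a background thread; return the thread."""

    def _write():
        tmp_path = path + ".tmp"
        with open(tmp_path, "wb") as f:
            f.write(bytes(shm.buf[:size]))
            f.flush()
            os.fsync(f.fileno())
        # Rename into place so a reader never sees a partially written file.
        os.replace(tmp_path, path)

    writer = threading.Thread(target=_write)
    writer.start()
    return writer


def demo():
    # Hypothetical training state, for illustration only.
    state = {"step": 100, "weights": [0.1, 0.2, 0.3]}
    shm, size = stage_in_shared_memory(state)  # training blocks only for this copy
    path = os.path.join(tempfile.mkdtemp(), "checkpoint.pkl")
    writer = persist_async(shm, size, path)
    # ... training would continue here while the write is in flight ...
    writer.join()
    shm.close()
    shm.unlink()
    with open(path, "rb") as f:
        return pickle.load(f)
```

In this sketch the training loop pauses only for the in-memory copy, while the slower write to storage runs in a separate thread. A real implementation would presumably keep the shared-memory copy available for fast restore after a process restart, which this example does not attempt.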
Welcome to comment if you want to use DLRover on your chips.
workingloong changed the title from "How to use DLRover on AI chips" to "[RFC] How to use DLRover on nodes with AI chips." on Mar 6, 2024
workingloong changed the title from "[RFC] How to use DLRover on nodes with AI chips." to "[RFC] Welcome to give requirements to use DLRover on nodes with AI chips." on Mar 6, 2024