{"payload":{"feedbackUrl":"https://github.com/orgs/community/discussions/53140","repo":{"id":506953078,"defaultBranch":"master","name":"dlrover","ownerLogin":"intelligent-machine-learning","currentUserCanPush":false,"isFork":false,"isEmpty":false,"createdAt":"2022-06-24T09:31:07.000Z","ownerAvatar":"https://avatars.githubusercontent.com/u/107632618?v=4","public":true,"private":false,"isOrgOwned":true},"refInfo":{"name":"","listCacheKey":"v0:1715579791.0","currentOid":""},"activityList":{"items":[{"before":"9104f8061a64da53cd082be82725ae50e59bac3a","after":"86dcb5dbaf5717d57f5d73439d1ac2d955e8d374","ref":"refs/heads/master","pushedAt":"2024-06-03T17:47:28.000Z","pushType":"pr_merge","commitsCount":13,"pusher":{"login":"samplise","name":"Bo Sang","path":"/samplise","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/31138775?s=80&v=4"},"commit":{"message":"Merge pull request #1153 from workingloong/backup-ckpt\n\nRestore checkpoint replica from the shared memory of other nodes.","shortMessageHtmlLink":"Merge pull request #1153 from workingloong/backup-ckpt"}},{"before":"b74c2a424814a87f02cb8db9314b9ff433b605f8","after":"9104f8061a64da53cd082be82725ae50e59bac3a","ref":"refs/heads/master","pushedAt":"2024-06-03T03:22:12.000Z","pushType":"pr_merge","commitsCount":1,"pusher":{"login":"workingloong","name":"Qinlong Wang","path":"/workingloong","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/18071380?s=80&v=4"},"commit":{"message":"Remove codes to get master pod. (#1154)\n\n* Remove codes to get master pod.\r\n\r\n* Format codes.\r\n\r\n* Set the dlrover version.","shortMessageHtmlLink":"Remove codes to get master pod. (#1154)"}},{"before":"c7013fa3ab383ec7229b0157ba1d8d9fa117c01f","after":"b74c2a424814a87f02cb8db9314b9ff433b605f8","ref":"refs/heads/master","pushedAt":"2024-06-03T03:16:48.000Z","pushType":"pr_merge","commitsCount":1,"pusher":{"login":"workingloong","name":"Qinlong Wang","path":"/workingloong","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/18071380?s=80&v=4"},"commit":{"message":"Update async-checkpoint.md (#1155)\n\nfix the mistake of \"manager\"","shortMessageHtmlLink":"Update async-checkpoint.md (#1155)"}},{"before":"f02a6b82090406e1a9065b51a76b36ea666d65d6","after":"c7013fa3ab383ec7229b0157ba1d8d9fa117c01f","ref":"refs/heads/master","pushedAt":"2024-05-31T06:35:27.000Z","pushType":"pr_merge","commitsCount":1,"pusher":{"login":"workingloong","name":"Qinlong Wang","path":"/workingloong","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/18071380?s=80&v=4"},"commit":{"message":"Polish the log of timeout to wait nodes. (#1152)","shortMessageHtmlLink":"Polish the log of timeout to wait nodes. (#1152)"}},{"before":"27c965f0e7b636feb821c4891ffad031409aa777","after":"f02a6b82090406e1a9065b51a76b36ea666d65d6","ref":"refs/heads/master","pushedAt":"2024-05-31T06:19:08.000Z","pushType":"pr_merge","commitsCount":1,"pusher":{"login":"workingloong","name":"Qinlong Wang","path":"/workingloong","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/18071380?s=80&v=4"},"commit":{"message":"fix the rdzv name to check node. (#1151)\n\n* fix the rdzv name to check node.\r\n\r\n* Skip setting env if nccl configuration is empty.","shortMessageHtmlLink":"fix the rdzv name to check node. (#1151)"}},{"before":"0dfeadcadc0b10260e99cc97a05e3c617e48c15e","after":"27c965f0e7b636feb821c4891ffad031409aa777","ref":"refs/heads/master","pushedAt":"2024-05-30T11:58:53.000Z","pushType":"pr_merge","commitsCount":1,"pusher":{"login":"workingloong","name":"Qinlong Wang","path":"/workingloong","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/18071380?s=80&v=4"},"commit":{"message":"Set the nccl env to execute gpu test task. (#1148)\n\n* Set the nccl env to execute gpu test task.\r\n\r\n* Fix test cases.\r\n\r\n* Fix by comments.","shortMessageHtmlLink":"Set the nccl env to execute gpu test task. (#1148)"}},{"before":"aa5e6b6ffdb2cb6e3238ce912d8c9a0538b87ab7","after":"0dfeadcadc0b10260e99cc97a05e3c617e48c15e","ref":"refs/heads/master","pushedAt":"2024-05-30T11:18:37.000Z","pushType":"pr_merge","commitsCount":1,"pusher":{"login":"workingloong","name":"Qinlong Wang","path":"/workingloong","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/18071380?s=80&v=4"},"commit":{"message":"Remove the codes to remove the checkpoint directory. (#1149)","shortMessageHtmlLink":"Remove the codes to remove the checkpoint directory. (#1149)"}},{"before":"93df4afca02974ee25b4d4b29195647415857716","after":"aa5e6b6ffdb2cb6e3238ce912d8c9a0538b87ab7","ref":"refs/heads/master","pushedAt":"2024-05-30T08:00:10.000Z","pushType":"pr_merge","commitsCount":1,"pusher":{"login":"workingloong","name":"Qinlong Wang","path":"/workingloong","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/18071380?s=80&v=4"},"commit":{"message":"The restarted node can acquire the full checkpoint in the peer node. (#1145)\n\n* The restarted node can acquire the checkpoint in the peer node.\r\n\r\n* Add test cases.\r\n\r\n* Fix by comments.\r\n\r\n* Fix to call super init.","shortMessageHtmlLink":"The restarted node can acquire the full checkpoint in the peer node. (#…"}},{"before":"aff7c112b7667fa62c9344b87ac550ac7905cec3","after":"93df4afca02974ee25b4d4b29195647415857716","ref":"refs/heads/master","pushedAt":"2024-05-28T11:48:28.000Z","pushType":"pr_merge","commitsCount":1,"pusher":{"login":"workingloong","name":"Qinlong Wang","path":"/workingloong","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/18071380?s=80&v=4"},"commit":{"message":"more retrying (#1143)","shortMessageHtmlLink":"more retrying (#1143)"}},{"before":"2ca451d9638efbe307cf1d99180134ab3c945ea8","after":"aff7c112b7667fa62c9344b87ac550ac7905cec3","ref":"refs/heads/master","pushedAt":"2024-05-28T03:07:37.000Z","pushType":"pr_merge","commitsCount":1,"pusher":{"login":"workingloong","name":"Qinlong Wang","path":"/workingloong","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/18071380?s=80&v=4"},"commit":{"message":"Fix the tutorial to run the example of auto scaling the TF job. (#1142)","shortMessageHtmlLink":"Fix the tutorial to run the example of auto scaling the TF job. (#1142)"}},{"before":"fa73b88d1382f530511c050a12908bd310195150","after":"2ca451d9638efbe307cf1d99180134ab3c945ea8","ref":"refs/heads/master","pushedAt":"2024-05-27T10:37:38.000Z","pushType":"pr_merge","commitsCount":1,"pusher":{"login":"hxdtest","name":null,"path":"/hxdtest","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/107838921?s=80&v=4"},"commit":{"message":"Fix the example of deeprec. (#1141)","shortMessageHtmlLink":"Fix the example of deeprec. (#1141)"}},{"before":"b9c14c5fc8878ebc0801e04a926b2ac422834e25","after":"fa73b88d1382f530511c050a12908bd310195150","ref":"refs/heads/master","pushedAt":"2024-05-27T04:38:09.000Z","pushType":"pr_merge","commitsCount":4,"pusher":{"login":"samplise","name":"Bo Sang","path":"/samplise","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/31138775?s=80&v=4"},"commit":{"message":"Merge pull request #1140 from workingloong/backup-ckpt\n\nThe nodes backup the checkpoint data in the shared meory.","shortMessageHtmlLink":"Merge pull request #1140 from workingloong/backup-ckpt"}},{"before":"a9359aa30b0d6f0554f2f809e2cae3e54e1ad342","after":"b9c14c5fc8878ebc0801e04a926b2ac422834e25","ref":"refs/heads/master","pushedAt":"2024-05-24T02:21:02.000Z","pushType":"pr_merge","commitsCount":1,"pusher":{"login":"workingloong","name":"Qinlong Wang","path":"/workingloong","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/18071380?s=80&v=4"},"commit":{"message":"collect error content (#1139)\n\n* collect error content\r\n\r\n* lint","shortMessageHtmlLink":"collect error content (#1139)"}},{"before":"77977a794f999f7d9deb169d1734eed11b2635c1","after":"a9359aa30b0d6f0554f2f809e2cae3e54e1ad342","ref":"refs/heads/master","pushedAt":"2024-05-23T02:42:47.000Z","pushType":"pr_merge","commitsCount":1,"pusher":{"login":"workingloong","name":"Qinlong Wang","path":"/workingloong","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/18071380?s=80&v=4"},"commit":{"message":"Design to backup checkpoint shards between nodes. (#1137)\n\n* Desigo to backup checkpoint shards between nodes.\r\n\r\n* Format markdown.\r\n\r\n* Polish the design to checkpoint in memory.\r\n\r\n* Format markdown files.\r\n\r\n* Fix by comments.","shortMessageHtmlLink":"Design to backup checkpoint shards between nodes. (#1137)"}},{"before":"bdd87369a292e73c1efac62bae0ed1dddd4a112a","after":"77977a794f999f7d9deb169d1734eed11b2635c1","ref":"refs/heads/master","pushedAt":"2024-05-21T01:49:56.000Z","pushType":"pr_merge","commitsCount":1,"pusher":{"login":"workingloong","name":"Qinlong Wang","path":"/workingloong","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/18071380?s=80&v=4"},"commit":{"message":"List Go version 1.18 in the docs. (#1134)","shortMessageHtmlLink":"List Go version 1.18 in the docs. (#1134)"}},{"before":"1c19f42ed780a7d1d2f4ed77ef1fc8b0395e5e1e","after":"bdd87369a292e73c1efac62bae0ed1dddd4a112a","ref":"refs/heads/master","pushedAt":"2024-05-20T07:15:02.000Z","pushType":"pr_merge","commitsCount":1,"pusher":{"login":"workingloong","name":"Qinlong Wang","path":"/workingloong","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/18071380?s=80&v=4"},"commit":{"message":"Refactor the test cases of SimpleStrategyGenerator (#1133)","shortMessageHtmlLink":"Refactor the test cases of SimpleStrategyGenerator (#1133)"}},{"before":"b15f5997aa028bc5e09e797a9b08c9f902401ab6","after":"1c19f42ed780a7d1d2f4ed77ef1fc8b0395e5e1e","ref":"refs/heads/master","pushedAt":"2024-05-20T03:25:00.000Z","pushType":"pr_merge","commitsCount":1,"pusher":{"login":"yzlnew","name":"黄石","path":"/yzlnew","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/4904877?s=80&v=4"},"commit":{"message":"Add Bayesian Optimization for DLRover Brain (#1119)","shortMessageHtmlLink":"Add Bayesian Optimization for DLRover Brain (#1119)"}},{"before":"2892c05f23078b3db0e513d75bc4173bb97a530a","after":"b15f5997aa028bc5e09e797a9b08c9f902401ab6","ref":"refs/heads/master","pushedAt":"2024-05-17T06:06:51.000Z","pushType":"pr_merge","commitsCount":1,"pusher":{"login":"workingloong","name":"Qinlong Wang","path":"/workingloong","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/18071380?s=80&v=4"},"commit":{"message":"optimize master retrieving (#1131)\n\n* optimize master retrieving\r\n\r\n* revert\r\n\r\n* lint","shortMessageHtmlLink":"optimize master retrieving (#1131)"}},{"before":"3a28daf6a4885073f95549851d506e091548edbf","after":"2892c05f23078b3db0e513d75bc4173bb97a530a","ref":"refs/heads/master","pushedAt":"2024-05-17T03:47:11.000Z","pushType":"pr_merge","commitsCount":1,"pusher":{"login":"workingloong","name":"Qinlong Wang","path":"/workingloong","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/18071380?s=80&v=4"},"commit":{"message":"Support torch>=2.3.0 (#1130)\n\n* Support torch>=2.3.0\r\n\r\n* Remove unused imports\r\n\r\n* Fix test case.\r\n\r\n* Refactor to add test cases.","shortMessageHtmlLink":"Support torch>=2.3.0 (#1130)"}},{"before":"39a0a472fe6a5fe42f3063c1d1c1586b0f7c1f88","after":"3a28daf6a4885073f95549851d506e091548edbf","ref":"refs/heads/master","pushedAt":"2024-05-17T02:14:16.000Z","pushType":"pr_merge","commitsCount":1,"pusher":{"login":"workingloong","name":"Qinlong Wang","path":"/workingloong","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/18071380?s=80&v=4"},"commit":{"message":"megatron-lm new api (#1128)\n\n* megatron-lm new api\r\n\r\n* keep backward compatibility","shortMessageHtmlLink":"megatron-lm new api (#1128)"}},{"before":"573af1dd59e4be225537f0294a709b4a9dbb1e89","after":"39a0a472fe6a5fe42f3063c1d1c1586b0f7c1f88","ref":"refs/heads/master","pushedAt":"2024-05-16T06:30:59.000Z","pushType":"pr_merge","commitsCount":1,"pusher":{"login":"workingloong","name":"Qinlong Wang","path":"/workingloong","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/18071380?s=80&v=4"},"commit":{"message":"Tune up the sleep interval to query the new rdzv. (#1127)\n\n* Tune up the sleep interval to query the new rdzv.\r\n\r\n* Simplify logs.\r\n\r\n* Add test cases.\r\n\r\n* Format codes.","shortMessageHtmlLink":"Tune up the sleep interval to query the new rdzv. (#1127)"}},{"before":"c1c76d99c03fc8ae58d22301e7b4405a1edf0355","after":"573af1dd59e4be225537f0294a709b4a9dbb1e89","ref":"refs/heads/master","pushedAt":"2024-05-16T02:17:03.000Z","pushType":"pr_merge","commitsCount":1,"pusher":{"login":"workingloong","name":"Qinlong Wang","path":"/workingloong","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/18071380?s=80&v=4"},"commit":{"message":"optimize MODIFIED event (#1126)","shortMessageHtmlLink":"optimize MODIFIED event (#1126)"}},{"before":"557a56ae931e71865f33ad86e0671107fc1ce99b","after":"c1c76d99c03fc8ae58d22301e7b4405a1edf0355","ref":"refs/heads/master","pushedAt":"2024-05-15T11:38:54.000Z","pushType":"pr_merge","commitsCount":2,"pusher":{"login":"BalaBalaYi","name":"Tianyi Chen","path":"/BalaBalaYi","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/23303886?s=80&v=4"},"commit":{"message":"Merge pull request #1125 from BalaBalaYi/fix_pod_event_convert\n\nfix pod event converting invocation","shortMessageHtmlLink":"Merge pull request #1125 from BalaBalaYi/fix_pod_event_convert"}},{"before":"27cb42af1900ee3c6e40aa58045b1c6d94c0f857","after":"557a56ae931e71865f33ad86e0671107fc1ce99b","ref":"refs/heads/master","pushedAt":"2024-05-15T07:27:17.000Z","pushType":"pr_merge","commitsCount":5,"pusher":{"login":"BalaBalaYi","name":"Tianyi Chen","path":"/BalaBalaYi","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/23303886?s=80&v=4"},"commit":{"message":"Merge pull request #1124 from BalaBalaYi/add_log\n\nadd log for pod event watcher","shortMessageHtmlLink":"Merge pull request #1124 from BalaBalaYi/add_log"}},{"before":"8c058a8611ff9fb673508d081d6c9f421360d575","after":"27cb42af1900ee3c6e40aa58045b1c6d94c0f857","ref":"refs/heads/master","pushedAt":"2024-05-14T06:05:53.000Z","pushType":"pr_merge","commitsCount":7,"pusher":{"login":"BalaBalaYi","name":"Tianyi Chen","path":"/BalaBalaYi","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/23303886?s=80&v=4"},"commit":{"message":"Merge pull request #1122 from BalaBalaYi/ignore_pod_failover_by_tkp\n\nIgnore pod deleted event if pod failover is already done by k8s","shortMessageHtmlLink":"Merge pull request #1122 from BalaBalaYi/ignore_pod_failover_by_tkp"}},{"before":"807d3650840e63f5835167c392f903ab802b0431","after":"8c058a8611ff9fb673508d081d6c9f421360d575","ref":"refs/heads/master","pushedAt":"2024-05-14T05:59:16.000Z","pushType":"pr_merge","commitsCount":2,"pusher":{"login":"samplise","name":"Bo Sang","path":"/samplise","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/31138775?s=80&v=4"},"commit":{"message":"Merge pull request #1120 from workingloong/simplify-logs\n\nRemove unused logs of checking gpu.","shortMessageHtmlLink":"Merge pull request #1120 from workingloong/simplify-logs"}},{"before":"0041da243d08d2bbb9f95985f2e013ab707bde8f","after":"807d3650840e63f5835167c392f903ab802b0431","ref":"refs/heads/master","pushedAt":"2024-05-13T23:31:03.000Z","pushType":"pr_merge","commitsCount":5,"pusher":{"login":"samplise","name":"Bo Sang","path":"/samplise","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/31138775?s=80&v=4"},"commit":{"message":"Merge pull request #1118 from workingloong/test-allreduce-perf\n\nTest allreduce performance after checking node health.","shortMessageHtmlLink":"Merge pull request #1118 from workingloong/test-allreduce-perf"}},{"before":"87e587290f3f920c37fd24a1ee59c968a39c3116","after":"0041da243d08d2bbb9f95985f2e013ab707bde8f","ref":"refs/heads/master","pushedAt":"2024-05-13T05:52:20.000Z","pushType":"pr_merge","commitsCount":1,"pusher":{"login":"workingloong","name":"Qinlong Wang","path":"/workingloong","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/18071380?s=80&v=4"},"commit":{"message":"Polish the log of executing matmul ops. (#1117)","shortMessageHtmlLink":"Polish the log of executing matmul ops. (#1117)"}},{"before":"83047b2dbae40c356b157dea02ca88db359d768b","after":"87e587290f3f920c37fd24a1ee59c968a39c3116","ref":"refs/heads/master","pushedAt":"2024-05-13T05:20:25.000Z","pushType":"pr_merge","commitsCount":1,"pusher":{"login":"workingloong","name":"Qinlong Wang","path":"/workingloong","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/18071380?s=80&v=4"},"commit":{"message":"Remove empty.proto in the elastic_training.proto (#1116)\n\n* Remove empty.proto in the elastic_training.proto\r\n\r\n* Set the rdzv backend and timeout.","shortMessageHtmlLink":"Remove empty.proto in the elastic_training.proto (#1116)"}},{"before":"f6e7059a870b6beee6fa2dd628ebbf181551586f","after":"83047b2dbae40c356b157dea02ca88db359d768b","ref":"refs/heads/master","pushedAt":"2024-05-13T02:55:39.000Z","pushType":"pr_merge","commitsCount":1,"pusher":{"login":"workingloong","name":"Qinlong Wang","path":"/workingloong","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/18071380?s=80&v=4"},"commit":{"message":"Refactor /examples/pytorch/NanoGPT/*_train.py (#1111)\n\n* Refactored examples/pytorch/NanoGPT/\r\n\r\n* Refactored examples/pytorch/NanoGPT/\r\n\r\n* Refactored examples/pytorch/NanoGPT/\r\n\r\n* Refactored examples/pytorch/NanoGPT/ds_train.py\r\n\r\n* Refactored examples/pytorch/NanoGPT/fsdp_train.py\r\n\r\n* Refactor /examples/pytorch/NanoGPT/*_train.py\r\n\r\n* Commit pre-commit changes.\r\n\r\n* Fix a little bug in examples/pytorch/nanogpt/fsdp_train.py\r\n\r\n* Change the \"use_native\" case in train.py, when loading data.","shortMessageHtmlLink":"Refactor /examples/pytorch/NanoGPT/*_train.py (#1111)"}}],"hasNextPage":true,"hasPreviousPage":false,"activityType":"all","actor":null,"timePeriod":"all","sort":"DESC","perPage":30,"cursor":"djE6ks8AAAAEWznILQA","startCursor":null,"endCursor":null}},"title":"Activity · intelligent-machine-learning/dlrover"}