{"payload":{"feedbackUrl":"https://github.com/orgs/community/discussions/53140","repo":{"id":763697768,"defaultBranch":"main","name":"veScale","ownerLogin":"volcengine","currentUserCanPush":false,"isFork":false,"isEmpty":false,"createdAt":"2024-02-26T19:01:27.000Z","ownerAvatar":"https://avatars.githubusercontent.com/u/67365215?v=4","public":true,"private":false,"isOrgOwned":true},"refInfo":{"name":"","listCacheKey":"v0:1717139577.0","currentOid":""},"activityList":{"items":[{"before":"f645c306500723aa74c01ea8794c1a6135241a4b","after":null,"ref":"refs/heads/opensource_053024","pushedAt":"2024-05-31T07:12:57.000Z","pushType":"branch_deletion","commitsCount":0,"pusher":{"login":"MingjiHan99","name":"MingjiHan","path":"/MingjiHan99","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/9149201?s=80&v=4"}},{"before":"55c7f8a24206c3768009ed147e05723e2d69581f","after":"c4afc72aea866239fe688079a8607a5f95874fec","ref":"refs/heads/main","pushedAt":"2024-05-31T07:12:56.000Z","pushType":"pr_merge","commitsCount":1,"pusher":{"login":"MingjiHan99","name":"MingjiHan","path":"/MingjiHan99","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/9149201?s=80&v=4"},"commit":{"message":"[checkpoint] feat: open source fast checkpoint system (#38)\n\n## Summary\r\n\r\nWe improved `vescale.checkpoint` with the following new features for\r\nfast checkpointing (where front three features are built-in techniques\r\nwithout necessitating manual activation):\r\n\r\n- **Saving Plan Caching**: During training, the program may save model\r\nand optimizer checkpoints every n steps. Once a saving plan is created,\r\nit remains unchanged as long as the model does. We implemented plan\r\ncaching to avoid regenerating the plan when checkpointing a model or\r\noptimizer multiple times, reducing unnecessary compute and communication\r\ncosts. 
As of 05/30/2024, PyTorch DCP does not support plan caching.\r\n\r\n- **Saving Plan Load-Balancing**: In data parallel training, models are\r\nreplicated across GPUs with different data parallel ranks but the same\r\npipeline and tensor parallel ranks. Existing PyTorch DCP (as of\r\n05/30/2024) deduplicates replicated tensors using a simple algorithm,\r\ncausing GPUs with data parallel rank 0 to save the entire model, leading\r\nto load imbalance. We implemented a load-balancing algorithm to address\r\nthis issue when deduplicating model tensors.\r\n\r\n- **D2H Tensor Copying via Pinned Memory**: When copying tensors from\r\nGPU to host memory, `vescale.checkpoint` uses pinned host memory,\r\nreducing memory allocation costs each time a checkpoint is saved. As of\r\n05/30/2024, PyTorch DCP does not support pinned memory.\r\n\r\n- **Checkpoint Broadcasting**: In data parallel training, models are\r\nreplicated across GPUs with different data parallel ranks but the same\r\npipeline and tensor parallel ranks. If `broadcast_checkpoint` is\r\nenabled, `vescale.checkpoint.load` lets GPUs with data parallel rank 0\r\nto load the model and broadcast it to other GPUs with higher data\r\nparallel ranks. If GPUs are connected with NCCL and I/O bandwidth is\r\nfully utilized, broadcasting model tensors speeds up checkpoint loading\r\ncompared to all GPUs loading models from persistent storage. E.g.:\r\n\r\n ```python\r\n # prepare checkpoint state for the model and optimizer\r\ncheckpoint_state = { \"model\": distributed_model, \"optimizer\":\r\ndistributed_optimizer }\r\n # load the checkpoint\r\nvescale.checkpoint.load(\"/user/vescale/gpt/\", checkpoint_state,\r\nbroadcast_checkpoint=True)\r\n ```\r\n\r\n- **Asynchronous Checkpointing**: When `vescale.checkpoint.save` is\r\ncalled, it first generates a saving plan and then synchronously copies\r\ntensors from GPU to host memory. 
If `async_checkpoint` is enabled, the\r\ntraining program can continue after the D2H copying, while\r\n`vescale.checkpoint.save` continues to serialize tensors and dump the\r\ncheckpoint to persistent storage asynchronously without blocking\r\ntraining. As of 05/30/2024, PyTorch DCP does not support asynchronous\r\ncheckpointing. E.g.:\r\n\r\n ```python\r\n # prepare checkpoint state for the model and optimizer\r\ncheckpoint_state = { \"model\": distributed_model, \"optimizer\":\r\ndistributed_optimizer }\r\n # save the checkpoint asynchronuously\r\nvescale.checkpoint.save(\"/user/vescale/gpt/\", checkpoint_state,\r\nasync_checkpoint=True)\r\n ```\r\n\r\n## Acknowledgement\r\n\r\nWe sincerely appreciate all contributors including but not limited to\r\n@shanesyy-1992 @raywan-110 @lazychao @AHEADer @MingjiHan99","shortMessageHtmlLink":"[checkpoint] feat: open source fast checkpoint system (#38)"}},{"before":null,"after":"f645c306500723aa74c01ea8794c1a6135241a4b","ref":"refs/heads/opensource_053024","pushedAt":"2024-05-31T05:00:52.000Z","pushType":"branch_creation","commitsCount":0,"pusher":{"login":"MingjiHan99","name":"MingjiHan","path":"/MingjiHan99","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/9149201?s=80&v=4"},"commit":{"message":"open source fast checkpoint system","shortMessageHtmlLink":"open source fast checkpoint 
system"}},{"before":"28ddefa528cc9e45cf051adfb5b7e48598b45609","after":null,"ref":"refs/heads/opensource_053024","pushedAt":"2024-05-31T04:58:03.000Z","pushType":"branch_deletion","commitsCount":0,"pusher":{"login":"MingjiHan99","name":"MingjiHan","path":"/MingjiHan99","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/9149201?s=80&v=4"}},{"before":null,"after":"28ddefa528cc9e45cf051adfb5b7e48598b45609","ref":"refs/heads/opensource_053024","pushedAt":"2024-05-31T04:47:52.000Z","pushType":"branch_creation","commitsCount":0,"pusher":{"login":"MingjiHan99","name":"MingjiHan","path":"/MingjiHan99","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/9149201?s=80&v=4"},"commit":{"message":"open source fast checkpoint","shortMessageHtmlLink":"open source fast checkpoint"}},{"before":"f044f7214934c3df91193f7663ac60cdf1b017b5","after":null,"ref":"refs/heads/opensource_053024","pushedAt":"2024-05-31T04:45:59.000Z","pushType":"branch_deletion","commitsCount":0,"pusher":{"login":"MingjiHan99","name":"MingjiHan","path":"/MingjiHan99","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/9149201?s=80&v=4"}},{"before":null,"after":"f044f7214934c3df91193f7663ac60cdf1b017b5","ref":"refs/heads/opensource_053024","pushedAt":"2024-05-31T04:34:10.000Z","pushType":"branch_creation","commitsCount":0,"pusher":{"login":"MingjiHan99","name":"MingjiHan","path":"/MingjiHan99","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/9149201?s=80&v=4"},"commit":{"message":"open source fast checkpoint","shortMessageHtmlLink":"open source fast 
checkpoint"}},{"before":"a32b3347076eebc6263a2c27497a55bd064f090d","after":null,"ref":"refs/heads/opensource_053024","pushedAt":"2024-05-31T03:48:44.000Z","pushType":"branch_deletion","commitsCount":0,"pusher":{"login":"MingjiHan99","name":"MingjiHan","path":"/MingjiHan99","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/9149201?s=80&v=4"}},{"before":null,"after":"a32b3347076eebc6263a2c27497a55bd064f090d","ref":"refs/heads/opensource_053024","pushedAt":"2024-05-31T02:46:35.000Z","pushType":"branch_creation","commitsCount":0,"pusher":{"login":"MingjiHan99","name":"MingjiHan","path":"/MingjiHan99","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/9149201?s=80&v=4"},"commit":{"message":"update readme","shortMessageHtmlLink":"update readme"}},{"before":"2a072bfe2a4697f934325c0ad415a420691146f6","after":"55c7f8a24206c3768009ed147e05723e2d69581f","ref":"refs/heads/main","pushedAt":"2024-05-24T03:43:38.000Z","pushType":"pr_merge","commitsCount":1,"pusher":{"login":"leonardo0lyj","name":"Youjie Li","path":"/leonardo0lyj","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/16678974?s=80&v=4"},"commit":{"message":"[Example] Add comments to example codes (#36)\n\nIn this PR, we add comments explaining VeScale APIs in the nanoGPT\r\nexample.","shortMessageHtmlLink":"[Example] Add comments to example codes (#36)"}},{"before":"9047a730c08c2e467d2bfd624891b19b4fc28513","after":"2a072bfe2a4697f934325c0ad415a420691146f6","ref":"refs/heads/main","pushedAt":"2024-05-21T17:51:42.000Z","pushType":"pr_merge","commitsCount":1,"pusher":{"login":"leonardo0lyj","name":"Youjie Li","path":"/leonardo0lyj","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/16678974?s=80&v=4"},"commit":{"message":"[DTensor&DModule&DDP&Examples] feature updates and new examples (#35)\n\nIn this PR, we add two examples and update some features in DTensor,\r\nDModule, and DDP.\r\n\r\n## Examples\r\n\r\n1. 4D finetuning the llama2_3b model.\r\n2. 
4D pretraining a mixtral MOE-based model\r\n\r\n## DTensor\r\n\r\n1. Update op strategies on `Partial`ed and `InterleavedShard`ed\r\ndtensors.\r\n2. Add all-to-all communications.\r\n\r\n## DModule\r\n\r\n1. Support factory methods for nested submodules\r\n\r\n## DDP\r\n\r\n1. Unblock gradient allreduce for sparse modules in DDP","shortMessageHtmlLink":"[DTensor&DModule&DDP&Examples] feature updates and new examples (#35)"}},{"before":"aa390759e132cdaebfbc23585c53d4e21cd579cd","after":null,"ref":"refs/heads/youjie/readme","pushedAt":"2024-05-12T01:31:59.000Z","pushType":"branch_deletion","commitsCount":0,"pusher":{"login":"pengyanghua","name":"yhpeng","path":"/pengyanghua","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/13825158?s=80&v=4"}},{"before":"dd44ba5effb3c3ce3ee26b9515eea982567e351e","after":"9047a730c08c2e467d2bfd624891b19b4fc28513","ref":"refs/heads/main","pushedAt":"2024-05-12T01:31:59.000Z","pushType":"pr_merge","commitsCount":1,"pusher":{"login":"pengyanghua","name":"yhpeng","path":"/pengyanghua","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/13825158?s=80&v=4"},"commit":{"message":"[docs] fix: update README (#34)","shortMessageHtmlLink":"[docs] fix: update README (#34)"}},{"before":null,"after":"aa390759e132cdaebfbc23585c53d4e21cd579cd","ref":"refs/heads/youjie/readme","pushedAt":"2024-05-11T23:27:02.000Z","pushType":"branch_creation","commitsCount":0,"pusher":{"login":"leonardo0lyj","name":"Youjie 
Li","path":"/leonardo0lyj","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/16678974?s=80&v=4"},"commit":{"message":"update","shortMessageHtmlLink":"update"}},{"before":"be51034ed4f4744493fe480a6472882ee0eca6d6","after":null,"ref":"refs/heads/open_source_042524","pushedAt":"2024-04-26T17:46:26.000Z","pushType":"branch_deletion","commitsCount":0,"pusher":{"login":"MackZackA","name":null,"path":"/MackZackA","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/14083881?s=80&v=4"}},{"before":"97735b16f989d5bc1c85d1a0a6f85288a396c33f","after":"dd44ba5effb3c3ce3ee26b9515eea982567e351e","ref":"refs/heads/main","pushedAt":"2024-04-26T17:46:25.000Z","pushType":"pr_merge","commitsCount":1,"pusher":{"login":"MackZackA","name":null,"path":"/MackZackA","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/14083881?s=80&v=4"},"commit":{"message":"DeviceMesh initialization fix and VeDeviceMesh 2.0 (#33)","shortMessageHtmlLink":"DeviceMesh initialization fix and VeDeviceMesh 2.0 
(#33)"}},{"before":null,"after":"be51034ed4f4744493fe480a6472882ee0eca6d6","ref":"refs/heads/open_source_042524","pushedAt":"2024-04-26T07:15:36.000Z","pushType":"branch_creation","commitsCount":0,"pusher":{"login":"MackZackA","name":null,"path":"/MackZackA","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/14083881?s=80&v=4"},"commit":{"message":"committed","shortMessageHtmlLink":"committed"}},{"before":"a1e7c98cb60fbcfc59aadf56eed5c80660d37970","after":null,"ref":"refs/heads/open_source_041824","pushedAt":"2024-04-23T06:45:50.000Z","pushType":"branch_deletion","commitsCount":0,"pusher":{"login":"MackZackA","name":null,"path":"/MackZackA","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/14083881?s=80&v=4"}},{"before":"937ba9105083a2ae4f7990dc37942cd5a5897359","after":null,"ref":"refs/heads/open_source_041824_before_randomness","pushedAt":"2024-04-23T06:44:37.000Z","pushType":"branch_deletion","commitsCount":0,"pusher":{"login":"MackZackA","name":null,"path":"/MackZackA","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/14083881?s=80&v=4"}},{"before":"2f2daaa3688f2f32f5524c66374d019e9e75c17c","after":"97735b16f989d5bc1c85d1a0a6f85288a396c33f","ref":"refs/heads/main","pushedAt":"2024-04-23T06:44:36.000Z","pushType":"pr_merge","commitsCount":1,"pusher":{"login":"MackZackA","name":null,"path":"/MackZackA","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/14083881?s=80&v=4"},"commit":{"message":"added veDeviceMesh (#32)\n\nThis PR introduces veDeviceMesh, the device mesh API that integrates\r\nhandling of submeshes and process groups in performing training with\r\nDDP, TP/SP, distributed optimizer and checkpointing. 
It also updates\r\nfixes and patches related to veDeviceMesh API to the repository since\r\nlast PR.","shortMessageHtmlLink":"added veDeviceMesh (#32)"}},{"before":"fc2555f895efe474d88cc60393a3864d71ec5c15","after":"937ba9105083a2ae4f7990dc37942cd5a5897359","ref":"refs/heads/open_source_041824_before_randomness","pushedAt":"2024-04-23T05:39:15.000Z","pushType":"push","commitsCount":1,"pusher":{"login":"MackZackA","name":null,"path":"/MackZackA","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/14083881?s=80&v=4"},"commit":{"message":"added flash attention","shortMessageHtmlLink":"added flash attention"}},{"before":"beae0a218e61b86d56b834c20971db6025a88a54","after":"fc2555f895efe474d88cc60393a3864d71ec5c15","ref":"refs/heads/open_source_041824_before_randomness","pushedAt":"2024-04-23T04:31:38.000Z","pushType":"push","commitsCount":1,"pusher":{"login":"MackZackA","name":null,"path":"/MackZackA","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/14083881?s=80&v=4"},"commit":{"message":"removed files","shortMessageHtmlLink":"removed files"}},{"before":"d2d5b405bbbdd64ce516cf33c1c243d2cafd0178","after":"beae0a218e61b86d56b834c20971db6025a88a54","ref":"refs/heads/open_source_041824_before_randomness","pushedAt":"2024-04-23T00:51:06.000Z","pushType":"push","commitsCount":1,"pusher":{"login":"MackZackA","name":null,"path":"/MackZackA","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/14083881?s=80&v=4"},"commit":{"message":"committed veDeviceMesh and restructured files","shortMessageHtmlLink":"committed veDeviceMesh and restructured files"}},{"before":"e638ba53b5f7431a380056bf53082fb01ed9d052","after":"d2d5b405bbbdd64ce516cf33c1c243d2cafd0178","ref":"refs/heads/open_source_041824_before_randomness","pushedAt":"2024-04-22T19:04:27.000Z","pushType":"push","commitsCount":1,"pusher":{"login":"MackZackA","name":null,"path":"/MackZackA","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/14083881?s=80&v=4"},"commit":{"message":"test 
case","shortMessageHtmlLink":"test case"}},{"before":"b050069afb27b1b61595bad3c27201d827deb0ea","after":"e638ba53b5f7431a380056bf53082fb01ed9d052","ref":"refs/heads/open_source_041824_before_randomness","pushedAt":"2024-04-22T18:00:34.000Z","pushType":"push","commitsCount":1,"pusher":{"login":"MackZackA","name":null,"path":"/MackZackA","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/14083881?s=80&v=4"},"commit":{"message":"updated patches","shortMessageHtmlLink":"updated patches"}},{"before":null,"after":"b050069afb27b1b61595bad3c27201d827deb0ea","ref":"refs/heads/open_source_041824_before_randomness","pushedAt":"2024-04-22T07:26:45.000Z","pushType":"branch_creation","commitsCount":0,"pusher":{"login":"MackZackA","name":null,"path":"/MackZackA","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/14083881?s=80&v=4"},"commit":{"message":"added veDeviceMesh","shortMessageHtmlLink":"added veDeviceMesh"}},{"before":"addf61c9523800ec8608555412661d3fd54bd927","after":"a1e7c98cb60fbcfc59aadf56eed5c80660d37970","ref":"refs/heads/open_source_041824","pushedAt":"2024-04-20T07:07:33.000Z","pushType":"push","commitsCount":0,"pusher":{"login":"MackZackA","name":null,"path":"/MackZackA","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/14083881?s=80&v=4"}},{"before":null,"after":"addf61c9523800ec8608555412661d3fd54bd927","ref":"refs/heads/open_source_041824","pushedAt":"2024-04-19T23:51:07.000Z","pushType":"branch_creation","commitsCount":0,"pusher":{"login":"MackZackA","name":null,"path":"/MackZackA","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/14083881?s=80&v=4"}},{"before":"bbf2860c34ca557cad3c59503666b70b86b405fc","after":"2f2daaa3688f2f32f5524c66374d019e9e75c17c","ref":"refs/heads/main","pushedAt":"2024-04-17T22:37:38.000Z","pushType":"pr_merge","commitsCount":1,"pusher":{"login":"liwenchangbdbz","name":"Li-Wen 
Chang","path":"/liwenchangbdbz","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/120213201?s=80&v=4"},"commit":{"message":"[PyTorch] Add patches for distributed randomness (#29)\n\nIn this PR, we update the pytorch patch for including the DTensor\r\nsharding info in Cuda RNG states.","shortMessageHtmlLink":"[PyTorch] Add patches for distributed randomness (#29)"}},{"before":"455920611f98d359033943ff2bb58d3c23ce9808","after":null,"ref":"refs/heads/checkpoint_open_source_2","pushedAt":"2024-04-10T19:38:41.000Z","pushType":"branch_deletion","commitsCount":0,"pusher":{"login":"MingjiHan99","name":"MingjiHan","path":"/MingjiHan99","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/9149201?s=80&v=4"}}],"hasNextPage":true,"hasPreviousPage":false,"activityType":"all","actor":null,"timePeriod":"all","sort":"DESC","perPage":30,"cursor":"djE6ks8AAAAEWLx6pgA","startCursor":null,"endCursor":null}},"title":"Activity · volcengine/veScale"}