Llama training with FP8 #331

Draft: pbelevich wants to merge 1 commit into main
Conversation

pbelevich (Collaborator)
No description provided.

@KeitaW (Contributor) left a comment

Overall, this looks like a great feature addition to 10.FSDP. Do you think TE support could be added to the existing test case instead of creating a new one?

@@ -0,0 +1,2 @@
checkpoints
slurm-*.out

@@ -0,0 +1,183 @@
# Copyright (c) 2022-2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
#
# See LICENSE for license information.

@pbelevich (Collaborator, Author)

@KeitaW thanks for the review! I was thinking about adding FP8 support to the FSDP example, but there are two reasons why I decided to create a separate example instead:

  1. Transformer Engine requires Nvidia's container to run (or, as an alternative, a relatively complicated build from source with CUDA headers, cuDNN, etc.), and I don't want to complicate the FSDP example with it.
  2. This example is bound to the Llama model (taken from the TE examples), but the FSDP example supports multiple models that I don't want to rewrite with FP8.

So, in terms of importance, this example is about Llama with FP8; the FSDP training here is just scaffolding.
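
For context, the FP8 part of such an example boils down to wrapping the forward pass in TE's fp8_autocast context with a scaling recipe. A minimal sketch follows (illustrative layer sizes and recipe settings, not the exact code from this PR):

```python
# Minimal FP8 training step with Transformer Engine (illustrative only).
import torch
import transformer_engine.pytorch as te
from transformer_engine.common.recipe import DelayedScaling, Format

# Delayed-scaling recipe: HYBRID uses E4M3 in forward, E5M2 in backward.
fp8_recipe = DelayedScaling(
    fp8_format=Format.HYBRID,
    amax_history_len=16,
    amax_compute_algo="max",
)

# Stand-in for the example's Llama blocks; TE modules such as te.Linear
# and te.TransformerLayer are the FP8-capable building blocks.
model = te.Linear(4096, 4096, bias=True).cuda()
optim = torch.optim.AdamW(model.parameters(), lr=1e-4)

x = torch.randn(8, 4096, device="cuda")
with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    out = model(x)                    # GEMMs run in FP8 inside this region
loss = out.float().pow(2).mean()      # dummy loss for the sketch
loss.backward()
optim.step()
```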

pbelevich changed the title from "Llama FSDP training with FP8" to "Llama training with FP8" on May 15, 2024
@sbhavani

@pbelevich FYI, the AWS Deep Learning Container (DLC) for PyTorch also includes TE.
