
[FEATURE] Use of Nvidia Transformer Engine #1858

Closed
yazdanimehdi opened this issue Jun 21, 2023 · 1 comment
Labels: enhancement (New feature or request)

Comments

@yazdanimehdi

Using https://github.com/NVIDIA/TransformerEngine to speed up transformer-based models on the new NVIDIA Hopper GPUs and to enable float8 training.

Ideally it would detect that you are using Ada-based GPUs and switch to Transformer Engine accordingly.
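For reference, a minimal sketch of the kind of integration being requested, assuming the transformer_engine package and an FP8-capable GPU (Hopper H100 or Ada, e.g. a 4090); the layer sizes and the FP8 recipe are illustrative only, not anything timm currently does:

```python
import torch
import transformer_engine.pytorch as te
from transformer_engine.common import recipe

# Drop-in TE replacements for nn.LayerNorm / nn.Linear (illustrative sizes)
norm = te.LayerNorm(768).cuda()
proj = te.Linear(768, 3072, bias=True).cuda()

# Delayed-scaling FP8 recipe; HYBRID = E4M3 forward, E5M2 backward
fp8_recipe = recipe.DelayedScaling(margin=0, fp8_format=recipe.Format.HYBRID)

x = torch.randn(4096, 768, device="cuda")
with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    y = proj(norm(x))
```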


yazdanimehdi added the enhancement (New feature or request) label on Jun 21, 2023
@rwightman
Collaborator

@yazdanimehdi finally got around to picking up a 4090. It's nice, and gives a decent boost when using torch.compile.

I tried fiddling with Transformer Engine and FP8 autocast and it wasn't very helpful. I feel it would need the models rewritten to use fused layers and to fully integrate the attention. Just doing the 'easy' bits, such as converting nn.Linear and nn.LayerNorm and using te autocast, is slower than using torch native AMP with F.sdpa + bfloat16. With TE, the attention won't be using a fast kernel, some of the matmuls won't be cast to lower precision, and it seems you cannot combine torch autocast with te autocast.
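For comparison, a minimal sketch of the baseline described above: plain torch autocast in bfloat16 with F.scaled_dot_product_attention dispatching to a fused kernel. Shapes and names are illustrative, not taken from any timm model:

```python
import torch
import torch.nn.functional as F

# Illustrative ViT-like attention shapes: (batch, heads, tokens, head_dim)
q = torch.randn(8, 12, 197, 64, device="cuda")
k = torch.randn_like(q)
v = torch.randn_like(q)

# Native torch AMP in bfloat16; F.scaled_dot_product_attention picks a
# fused flash / memory-efficient kernel where one is available.
with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    out = F.scaled_dot_product_attention(q, k, v)
```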

So, until torch includes Ada/Hopper compatible FP8 support & casting plus optimized kernels for e.g. F.sdpa, I don't think there is much point; I am not going to maintain multiple TE and non-TE versions of the various blocks / models, etc.
