
Precision Problem between nemo model and hugging face model #9137

Closed
ChencongZJU opened this issue May 8, 2024 · 2 comments
Labels
bug Something isn't working

Comments

@ChencongZJU

Describe the bug

We are using NeMo to train our large vision-language model. When converting models from the NeMo format to the Hugging Face format, we found that, given the same inputs and weights, we get different outputs.

We verified that even with the same weights and inputs, the outputs after layer normalization are different.
NeMo uses Transformer Engine and the following code for the calculation:

[screenshot: Transformer Engine layer-norm implementation]

Hugging Face, by contrast, uses native PyTorch:

[screenshot: PyTorch layer-norm implementation]
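For reference, a minimal sketch of this comparison (assuming a CUDA GPU with Transformer Engine installed; the hidden size, eps, and tensor shapes here are illustrative, not the actual model config):

```python
# Minimal sketch (assumes a CUDA GPU with Transformer Engine installed;
# hidden size, eps, and tensor shapes are illustrative, not the real config).
import torch
import transformer_engine.pytorch as te

hidden, eps = 4096, 1e-5
x = torch.randn(2, 8, hidden, device="cuda", dtype=torch.bfloat16)

ln_hf = torch.nn.LayerNorm(hidden, eps=eps).to("cuda", torch.bfloat16)
ln_te = te.LayerNorm(hidden, eps=eps).to("cuda", torch.bfloat16)

# Copy identical parameters into both modules so only the kernels differ.
with torch.no_grad():
    ln_te.weight.copy_(ln_hf.weight)
    ln_te.bias.copy_(ln_hf.bias)

diff = (ln_hf(x) - ln_te(x)).abs()
print("max abs diff:", diff.max().item())  # small but typically nonzero
```

Even with identical weights, a fused kernel can accumulate in a different order than PyTorch's native one, so a small nonzero difference is expected.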

I also found a small precision gap in the rotary positional embedding, attention, and FFN.

Expected behavior

Is the precision gap caused by the different computation kernels? How can I fix it?

Thank you!

@ChencongZJU added the bug label May 8, 2024
@yaoyu-33 (Collaborator) commented May 8, 2024

Hi, we are aware that some TE implementations won't generate identical results to those of HF (which uses native PyTorch). We use our fused version of operations to speed up training. It seems you are using the Llama model as a foundation model. NeMo thoroughly tests Llama models to ensure that even though the results are not bit-wise matching, the overall performance (benchmarks) is on par.

If you have more concerns about the behavior, please provide us with more details. What model are you converting, what command are you using, and how large is the gap? We can check whether the gap is reasonable.
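For example, a generic sketch for quantifying such a gap (not a NeMo utility; `ref` and `test` stand in for, e.g., the HF model's and the converted model's logits on the same inputs):

```python
import torch

def report_gap(ref: torch.Tensor, test: torch.Tensor) -> None:
    """Print absolute and relative error statistics between two outputs."""
    ref32, test32 = ref.float(), test.float()
    diff = (ref32 - test32).abs()
    rel = diff / ref32.abs().clamp_min(1e-6)  # guard against division by zero
    print(f"max abs diff:  {diff.max().item():.3e}")
    print(f"mean abs diff: {diff.mean().item():.3e}")
    print(f"max rel diff:  {rel.max().item():.3e}")
```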

@ChencongZJU (Author)


Thanks for your patient reply. We have also verified that the precision gap does not affect performance.
