
Precision Problem between nemo model and hugging face model #9137

Closed
ChencongZJU opened this issue May 8, 2024 · 2 comments
Labels
bug Something isn't working

Comments

@ChencongZJU

Describe the bug

We are using NeMo to train our large vision-language model. When converting models from the NeMo format to the Hugging Face format, we found that, given the same inputs and weights, we get different outputs.

We verified that even with the same weights and inputs, the outputs after layer normalization are different.
NeMo uses Transformer Engine and the following code for the calculation:

[screenshot: Transformer Engine layer-norm implementation]

Hugging Face, by contrast, uses native PyTorch:

[screenshot: PyTorch layer-norm implementation]
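For reference, a minimal sketch of this comparison (assuming a CUDA GPU with Transformer Engine installed; the hidden size, eps, and tensor shapes here are illustrative, not the actual model config):

```python
# Minimal sketch (assumes a CUDA GPU with Transformer Engine installed;
# hidden size, eps, and tensor shapes are illustrative, not the real config).
import torch
import transformer_engine.pytorch as te

hidden, eps = 4096, 1e-5
x = torch.randn(2, 8, hidden, device="cuda", dtype=torch.bfloat16)

ln_hf = torch.nn.LayerNorm(hidden, eps=eps).to("cuda", torch.bfloat16)
ln_te = te.LayerNorm(hidden, eps=eps).to("cuda", torch.bfloat16)

# Copy identical parameters into both modules so only the kernels differ.
with torch.no_grad():
    ln_te.weight.copy_(ln_hf.weight)
    ln_te.bias.copy_(ln_hf.bias)

diff = (ln_hf(x) - ln_te(x)).abs()
print("max abs diff:", diff.max().item())  # small but typically nonzero
```

Even with identical weights, a fused kernel can accumulate in a different order than PyTorch's native one, so a small nonzero difference is expected.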

I also found a small precision gap in the rotary positional embedding, attention, and FFN.

Expected behavior

Is the precision gap caused by the different computation kernels? How can I fix it?

Thank you!

@ChencongZJU added the bug label May 8, 2024
@yaoyu-33 (Collaborator) commented May 8, 2024

Hi, we are aware that some TE implementations won't generate identical results to those of HF (which uses native PyTorch). We use our fused version of operations to speed up training. It seems you are using the Llama model as a foundation model. NeMo thoroughly tests Llama models to ensure that even though the results are not bit-wise matching, the overall performance (benchmarks) is on par.

If you have more concerns about the behavior, please provide us with more details. What model are you converting, what command are you using, and how large is the gap? We can check whether the gap is reasonable.
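For example, a generic sketch for quantifying such a gap (not a NeMo utility; `ref` and `test` stand in for, e.g., the HF model's and the converted model's logits on the same inputs):

```python
import torch

def report_gap(ref: torch.Tensor, test: torch.Tensor) -> None:
    """Print absolute and relative error statistics between two outputs."""
    ref32, test32 = ref.float(), test.float()
    diff = (ref32 - test32).abs()
    rel = diff / ref32.abs().clamp_min(1e-6)  # guard against division by zero
    print(f"max abs diff:  {diff.max().item():.3e}")
    print(f"mean abs diff: {diff.mean().item():.3e}")
    print(f"max rel diff:  {rel.max().item():.3e}")
```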

@ChencongZJU (Author)


Thanks for your patient reply. We have also verified that the precision gap does not affect performance.
