[BUG] Scaling with 3 GPUs is not as good as expected compared with 2 GPUs; NVIDIA vs AMD performance; flash attention not supported on AMD GPUs #5503

Open
0781532 opened this issue May 6, 2024 · 0 comments
Labels
bug, training

Comments

0781532 commented May 6, 2024

I have encountered some challenges when using DeepSpeed that I hope to address with your expertise.

  1. While fine-tuning LLama-7b-chat-hf and LLama-13b-chat-hf with multiple GPUs, I observed the following throughput: 1 GPU (60 tokens/s), 2 GPUs (178 tokens/s), 3 GPUs (230 tokens/s), and 4 GPUs (300 tokens/s). Surprisingly, throughput did not increase proportionally with the number of GPUs beyond two; in particular, 3 GPUs scale worse than expected relative to 2 GPUs (see the scaling sketch after these questions). Is there a technical explanation for this?

  2. Under identical conditions on a TRX50 motherboard, I compared two configurations:
    Case 1: NVIDIA RTX 4090 x 2 cards
    Case 2: AMD Radeon Pro W7900 x 2 cards
    In tests two months ago, the AMD Radeon Pro W7900 outperformed the RTX 4090 in speed (tokens/s) for the LLama-7b-chat-hf and LLama-13b-chat-hf models. In my recent tests, however, the RTX 4090 surpasses the AMD Radeon Pro W7900, both with and without flash-attn enabled.

I would appreciate your insights on these issues. Is there an explanation for these performance fluctuations? Are certain versions of DeepSpeed optimized for specific GPU types, such as the RTX 4090 or the AMD W7900?

  3. Why is flash-attn not supported on AMD GPUs (Radeon Pro W7800, W7900)? (A small environment-check sketch follows below.)
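
To make question 1 concrete, here is a minimal sketch of the scaling gap I mean. It is plain Python with no DeepSpeed calls; the throughput figures are the measurements listed above, and the 2-GPU run is taken as the reference point since that is where scaling starts to fall off.

```python
# Sketch: compare each measured throughput against linear scaling
# extrapolated from the 2-GPU measurement.
tokens_per_sec = {1: 60.0, 2: 178.0, 3: 230.0, 4: 300.0}  # GPUs -> tokens/s

ref_gpus, ref_tps = 2, tokens_per_sec[2]
for n, tps in tokens_per_sec.items():
    expected = ref_tps * n / ref_gpus  # linear extrapolation from 2 GPUs
    print(f"{n} GPU(s): measured {tps:5.1f} tok/s, "
          f"linear-from-2-GPUs {expected:5.1f} tok/s, "
          f"ratio {tps / expected:.0%}")
```

With the 2-GPU run as the reference, the 3-GPU and 4-GPU runs land at roughly 85% of linear scaling; that gap is what I would like to understand.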
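
For question 3, this is the kind of environment check I run on both machines; it is only a sketch, assuming a PyTorch build (CUDA or ROCm) and, optionally, the flash-attn package. Nothing in it is DeepSpeed-specific.

```python
import torch

# Which backend this PyTorch build targets: torch.version.cuda is set on
# CUDA builds, torch.version.hip on ROCm builds (the other is None).
print("PyTorch:", torch.__version__)
print("CUDA runtime:", torch.version.cuda)
print("ROCm/HIP runtime:", torch.version.hip)
if torch.cuda.is_available():
    print("Device 0:", torch.cuda.get_device_name(0))

# flash-attn is a separate, optional package; the except branch simply
# reports when it is not importable on a given machine.
try:
    import flash_attn
    print("flash-attn:", flash_attn.__version__)
except ImportError as exc:
    print("flash-attn not importable here:", exc)
```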

Thank you!
Le

0781532 added the bug and training labels on May 6, 2024