Questions about the hardware specs for fiddler #7

Open
fangyu29 opened this issue May 9, 2024 · 1 comment

fangyu29 commented May 9, 2024

Copying ~300 MB of weight parameters (one expert of Mixtral-8x7B) from CPU to GPU in 50 ms implies a PCIe bandwidth of only 0.3 GB / 50 ms = 6 GB/s, which is much lower than the PCIe bandwidth reported for the L4 GPU (PCIe Gen4 x16, 64 GB/s) at https://www.nvidia.com/en-us/data-center/l4/. Is there an explanation for this? Thanks.
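For reference, a minimal sketch (not code from the fiddler repo) that times a single host-to-device copy of an expert-sized fp16 tensor from ordinary pageable memory, assuming one expert is three 4096 x 14336 fp16 matrices; it shows how the achieved bandwidth can fall well below the quoted PCIe figure:

```python
import time
import torch

# One expert of Mixtral-8x7B is assumed here to be three 4096 x 14336
# fp16 matrices, i.e. roughly 0.33 GiB of weights.
x_cpu = torch.empty(3, 4096, 14336, dtype=torch.float16)  # pageable host memory
nbytes = x_cpu.numel() * x_cpu.element_size()

torch.ones(1).cuda()        # warm up the CUDA context before timing
torch.cuda.synchronize()

t0 = time.perf_counter()
x_gpu = x_cpu.cuda()        # host-to-device copy from pageable memory
torch.cuda.synchronize()    # wait for the copy to finish before stopping the clock
t1 = time.perf_counter()

print(f"{nbytes / 2**30:.3f} GiB in {(t1 - t0) * 1e3:.2f} ms "
      f"-> {nbytes / 2**30 / (t1 - t0):.2f} GiB/s")
```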

fangyu29 commented May 9, 2024

I suspect the reason is that the weight copy via expert_placeholder.load_state_dict() is slow. I have implemented another version of the CPU-to-GPU weight copy using torch.Tensor.copy_ and pin_memory in https://github.com/dingfangyu/fiddler/blob/a6a09cca2c0e95dbcdd39a6e9296e890fd56d4cd/benchmarks/microbench.py#L39

On my RTX 4090 GPU, the profiling results are as follows:

1) Weight copy, CPU -> GPU
mean: 13.30 ms, std: 0.01 ms

5) Execution, GPU batch=1
mean: 0.59 ms, std: 0.14 ms

6) Execution, CPU batch=1
mean: 11.41 ms, std: 0.57 ms

The CPU-to-GPU bandwidth is therefore 0.328 GB / 0.0133 s ≈ 24.66 GB/s.
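For comparison, here is a minimal sketch of the pinned-memory path (illustrative shapes and names, not the exact benchmarks/microbench.py code): keep the expert weights in page-locked host memory, preallocate GPU buffers once, and time Tensor.copy_ with CUDA events:

```python
import torch

# Illustrative expert weights: three 4096 x 14336 fp16 matrices, placed
# in pinned (page-locked) host memory so DMA transfers run at full speed.
cpu_weights = [torch.randn(4096, 14336, dtype=torch.float16).pin_memory()
               for _ in range(3)]
# Preallocated GPU placeholders, reused for every copy (no allocation in the timed region).
gpu_buffers = [torch.empty_like(w, device="cuda") for w in cpu_weights]

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)

# Warm-up copy.
for w, buf in zip(cpu_weights, gpu_buffers):
    buf.copy_(w, non_blocking=True)
torch.cuda.synchronize()

start.record()
for w, buf in zip(cpu_weights, gpu_buffers):
    buf.copy_(w, non_blocking=True)   # async H2D copy from pinned memory
end.record()
torch.cuda.synchronize()

ms = start.elapsed_time(end)
gib = sum(w.numel() * w.element_size() for w in cpu_weights) / 2**30
print(f"{gib:.3f} GiB in {ms:.2f} ms -> {gib / (ms / 1e3):.2f} GiB/s")
```

Copies from pageable host memory are typically staged through an internal pinned buffer by the CUDA driver, which is one common reason the load_state_dict path measures well below what a pinned-memory copy_ achieves.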
