
Problems encountered while reproducing the benchmark #128

Open
wahaha22 opened this issue Jan 23, 2024 · 8 comments
Labels
question Further information is requested

Comments

@wahaha22

Hello authors, I am reproducing PowerInfer and comparing its performance against llama.cpp. During the benchmarking stage I ran into some unexpected results.

Environment

  • Code
  • Operating system
    • CentOS Linux 7
    • uname -m -r: 4.19.95-35 x86_64
  • Hardware
    • nvidia-smi: NVIDIA-SMI 470.161.03 Driver Version: 470.161.03 CUDA Version: 11.4, 2× A100, 80 GB VRAM per card.
  • Software (from cmake)
    • GNU: 8.3.0
    • CUDAToolkit: 11.2.152

Build

  • PowerInfer:
    cmake -S . -B build -DLLAMA_CUBLAS=ON
    cmake --build build --config Release

  • llama.cpp
    cmake -S . -B build -DLLAMA_CUBLAS=ON
    cmake --build build --config Release

Models

Model used by PowerInfer: https://huggingface.co/PowerInfer/ReluLLaMA-7B-PowerInfer-GGUF
Model used by llama.cpp: https://huggingface.co/SparseLLM/ReluLLaMA-7B
convert command: python3 convert.py ./ReluLLaMA-7B --outtype f16, which produces ggml-model-f16.gguf

Run & Results

  • PowerInfer:
    ./build/bin/main -m ../models/PowerInfer_ReluLLaMA-7B-PowerInfer-GGUF/llama-7b-relu.powerinfer.gguf -n 128 -t 8 -p "Once upon a time" --vram-budget 1000
llama_print_timings:        load time =    1314.45 ms
llama_print_timings:      sample time =      16.28 ms /    89 runs   (    0.18 ms per token,  5465.49 tokens per second)
llama_print_timings: prompt eval time =     153.64 ms /     5 tokens (   30.73 ms per token,    32.54 tokens per second)
llama_print_timings:        eval time =    5131.53 ms /    88 runs   (   58.31 ms per token,    17.15 tokens per second)
llama_print_timings:       total time =    5331.24 ms
Log end
  • llama.cpp:
    ./build/bin/main -m ../models/ggml-model-f16.gguf -n 128 -t 8 -p "Once upon a time" -ngl 100
llama_print_timings:        load time =    1900.48 ms
llama_print_timings:      sample time =      20.78 ms /   128 runs   (    0.16 ms per token,  6160.66 tokens per second)
llama_print_timings: prompt eval time =      38.83 ms /     5 tokens (    7.77 ms per token,   128.77 tokens per second)
llama_print_timings:        eval time =    1988.74 ms /   127 runs   (   15.66 ms per token,    63.86 tokens per second)
llama_print_timings:       total time =    2084.53 ms
Log end
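
A note on comparability: in the two runs above, PowerInfer stopped after 89 sampled tokens (presumably hitting an end-of-sequence token) while llama.cpp produced the full 128, so the per-token eval numbers are averaged over different generation lengths. One way to pin both runs to the same length is the --ignore-eos flag from upstream llama.cpp, assuming this PowerInfer build has kept it, for example:

    ./build/bin/main -m ../models/PowerInfer_ReluLLaMA-7B-PowerInfer-GGUF/llama-7b-relu.powerinfer.gguf -n 128 -t 8 -p "Once upon a time" --vram-budget 1000 --ignore-eos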

Question

For eval time, PowerInfer runs at 17.15 tokens per second while llama.cpp runs at 63.86 tokens per second. Is this caused by a wrong PowerInfer configuration on my side? Any suggestions would be much appreciated.

@wahaha22 wahaha22 added the question (Further information is requested) label on Jan 23, 2024
@YixinSong-e
Collaborator

Hello, thank you for your interest in our project.
Currently, the open-source code of PowerInfer targets scenarios where the model exceeds the capacity of the GPU memory. For scenarios where the model fits entirely within GPU memory, all computation should take place on the GPU, and in theory the PowerInfer framework should still provide a 1.5x to 2x speedup (which should align with the results in the DejaVu paper). However, at present we have not designed the computation graph or implemented the operators for the case where computation is completely offloaded to the GPU. Therefore, my suggestion is to try running the non-quantized versions of the Falcon-40B or llama-70B models.
In the future, we plan to support acceleration even when the model fits entirely on the GPU.
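
For reference, a PowerInfer run of the Falcon-40B suggestion could look like the command below. The Hugging Face repo and GGUF file name are assumptions based on the 7B naming pattern above (PowerInfer also publishes a ReluFalcon-40B GGUF), and the --vram-budget value is only illustrative; adjust both to the actual download and your hardware:

    ./build/bin/main -m ../models/ReluFalcon-40B-PowerInfer-GGUF/falcon-40b-relu.powerinfer.gguf -n 128 -t 8 -p "Once upon a time" --vram-budget 40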

@wahaha22
Author

wahaha22 commented Jan 23, 2024

Thanks for the reply!
"In theory, the PowerInfer framework should still provide a 1.5x–2x speedup (which should align with the DejaVu paper), but we have not yet designed the computation graph or implemented the operators for the fully-offloaded-to-GPU scenario."
My understanding is that the theoretical 1.5x–2x speedup should come from the predictors, but it did not show up in my experiments. I would like to know whether it is due to one of the following:

  1. My run parameters are wrong, so the predictors are not taking effect, or the GPU is not being used as much as possible?
  2. PowerInfer does not support fully offloading the computation to the GPU (whereas llama.cpp does, via the -ngl flag)?

@hodlen
Collaborator

hodlen commented Jan 24, 2024

Your second guess is correct. The current limitation stems from our hybrid inference implementation: even when the weights can be fully offloaded to the GPU, there are still numerous unnecessary CPU-GPU synchronization points. We plan to resolve this in the very near future and achieve pure GPU inference. Please stay tuned for our updates 💪
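
A quick way to observe the effect of those synchronization points on the setup above is to watch GPU utilization in a second terminal while the PowerInfer benchmark runs; with frequent CPU-GPU round trips the reported SM utilization would be expected to stay noticeably lower than during the fully offloaded llama.cpp run. For example:

    nvidia-smi dmon -s u -d 1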

@wahaha22
Author

Hello authors, thank you for your patient replies. I have a few questions I would like to explore further:

  1. To achieve fully GPU-offloaded inference, which parts of the code need to be modified at the moment? If you have a guide or checklist of the relevant changes (e.g. specific functions), could you share it for reference?
  2. So far I only see the MLP predictors in the code, but I could not find a predictor for the Attention part, unlike DejaVu's two kinds of predictors (MLP predictor + attention head predictor). Do you plan to support predictors for the Attention layers in the future?

@hodlen
Collaborator

hodlen commented Jan 27, 2024

The main changes needed for full GPU offload are in the FFN's computation graph: we need to provide a fast path that removes the parts related to CPU-GPU hybrid computation, such as the GPU index and GPU buckets, which are unnecessary in that case. This code lives in the llm_build_ffn_sparse function in llama.cpp. We will start on this work soon. In addition, once CPU-GPU hybrid computation no longer needs to be considered, a similar fast path can be provided in the underlying GPU operators, whose code is in ggml-cuda.cu.

The sparsity of the Attention layers varies considerably across models and is not significant in the models we currently support, so we do not plan to support it in this open-source code. See the discussion in #111.

@sleepcoo

sleepcoo commented Feb 6, 2024

Hi, may I ask roughly how long it will take to finish the full-GPU inference work? @hodlen

@hodlen
Collaborator

hodlen commented Feb 6, 2024

We have just started working on this and expect it to take one to two weeks to complete development and performance testing.

@pkumc

pkumc commented Mar 4, 2024

@hodlen How is the development of full-GPU inference going? Is there an expected release date?
