Problems reproducing the benchmark #128
Comments
Hello, thank you for your interest in our project.
Thanks for the reply!
Your second guess is correct. The current limitation stems from our hybrid inference implementation: even when the weights can be fully offloaded to the GPU, there are still numerous unnecessary CPU-GPU synchronization points. We plan to resolve this in the near future to achieve pure GPU inference. Please stay tuned for our updates 💪
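(For readers unfamiliar with the sync-point problem described above, here is a minimal, self-contained CUDA sketch, not PowerInfer's actual code; `ffn_stub` and all buffer names are made up. It contrasts a hybrid-style loop that round-trips activations through the host every layer against a pure-GPU loop that synchronizes once.)

```cuda
// Toy illustration of CPU-GPU synchronization overhead.
// Everything here is a stand-in, not PowerInfer's actual code.
#include <cuda_runtime.h>
#include <cstdio>
#include <vector>

__global__ void ffn_stub(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = x[i] * 1.0001f + 0.01f;  // stand-in for real FFN math
}

int main() {
    const int n = 1 << 20, layers = 32, threads = 256;
    std::vector<float> host(n, 1.0f);
    float *dev;
    cudaMalloc(&dev, n * sizeof(float));
    cudaMemcpy(dev, host.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    // Hybrid-style loop: activations are copied back to the host after
    // every layer (e.g. for a CPU-resident part of the FFN); each
    // blocking cudaMemcpy is an implicit device synchronization.
    for (int l = 0; l < layers; ++l) {
        ffn_stub<<<(n + threads - 1) / threads, threads>>>(dev, n);
        cudaMemcpy(host.data(), dev, n * sizeof(float), cudaMemcpyDeviceToHost);
        cudaMemcpy(dev, host.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    }

    // Pure-GPU loop: kernels queue back-to-back on the default stream and
    // the host synchronizes exactly once at the end.
    for (int l = 0; l < layers; ++l)
        ffn_stub<<<(n + threads - 1) / threads, threads>>>(dev, n);
    cudaDeviceSynchronize();

    cudaFree(dev);
    printf("done\n");
    return 0;
}
```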
Hello authors, thanks for your patient reply. I have a few questions I'd like to explore further:
The primary modifications needed for full GPU offload are in the FFN's computation graph. We need to provide a fast path that removes the parts related to CPU-GPU hybrid computation, such as the GPU index and GPU buckets, which are unnecessary in that case. This part of the code is in the … The sparsity of the Attention layer varies significantly across models and is not prominent in the models we currently support, so we have no plans to support it in this open-source code. You can refer to the discussion in #111.
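(As a rough illustration of what such a fast-path dispatch could look like: all names below, `ffn_ctx`, `full_offload`, `gpu_index`, `gpu_bucket`, `run_dense_ffn_gpu`, are hypothetical, not identifiers from PowerInfer's codebase.)

```cpp
// Hypothetical sketch of a fast-path dispatch in the FFN forward pass.
// None of these identifiers come from PowerInfer's codebase.
#include <cstdio>

struct ffn_ctx {
    bool   full_offload;  // true when all FFN weights fit in VRAM
    // Hybrid-only state, unused on the fast path:
    int   *gpu_index;     // maps predicted-active neurons to GPU slots
    float *gpu_bucket;    // staging buffer for CPU<->GPU neuron exchange
};

static void run_dense_ffn_gpu(const ffn_ctx &) {
    // Dense GPU matmuls over the full weight matrices; no gather/scatter,
    // no host round-trips, hence no per-layer CPU-GPU sync points.
    std::puts("fast path: dense FFN entirely on GPU");
}

static void run_hybrid_ffn(const ffn_ctx &) {
    // Predictor picks active neurons; some run on the CPU, the rest are
    // gathered through gpu_index into gpu_bucket and computed on the GPU.
    std::puts("hybrid path: split CPU/GPU sparse FFN");
}

void ffn_forward(const ffn_ctx &ctx) {
    if (ctx.full_offload) {
        run_dense_ffn_gpu(ctx);  // skip all hybrid bookkeeping
        return;
    }
    run_hybrid_ffn(ctx);
}

int main() {
    ffn_forward({ /*full_offload=*/true,  nullptr, nullptr });
    ffn_forward({ /*full_offload=*/false, nullptr, nullptr });
}
```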
Hello, may I ask how long the full-GPU-inference development is expected to take? @hodlen
We have only just started development; we expect it to take one to two weeks to finish development and performance testing.
@hodlen Sorry to bother you, but how is the full GPU inference development going? Is there an expected release date?
Hello authors, I am reproducing PowerInfer and comparing its performance against llama.cpp. During benchmarking I ran into some unexpected results.
Environment
Build
PowerInfer:
cmake -S . -B build -DLLAMA_CUBLAS=ON
cmake --build build --config Release
llama.cpp:
cmake -S . -B build -DLLAMA_CUBLAS=ON
cmake --build build --config Release
Models
Model used by PowerInfer: https://huggingface.co/PowerInfer/ReluLLaMA-7B-PowerInfer-GGUF
Model used by llama.cpp: https://huggingface.co/SparseLLM/ReluLLaMA-7B
Conversion command: python3 convert.py ./ReluLLaMA-7B --outtype f16, which produces ggml-model-f16.gguf
Run & Results
PowerInfer:
./build/bin/main -m ../models/PowerInfer_ReluLLaMA-7B-PowerInfer-GGUF/llama-7b-relu.powerinfer.gguf -n 128 -t 8 -p "Once upon a time" --vram-budget 1000
llama.cpp:
./build/bin/main -m ../models/ggml-model-f16.gguf -n 128 -t 8 -p "Once upon a time" -ngl 100
Question
At eval time, PowerInfer reaches 17.15 tokens per second while llama.cpp reaches 63.86 tokens per second. Could a misconfiguration on my PowerInfer side be causing this? Any suggestions would be much appreciated!