Use SGMV for prefill BGMV for decode #464

Merged
merged 33 commits into main from fix-graph-lora on May 14, 2024

Conversation

@tgaddair (Contributor) commented May 9, 2024

Closes #333.

There were broadly two main issues holding LoRAX throughput back relative to vLLM on single-adapter workloads:

  1. vLLM's use of CUDA graph compilation by default
  2. LoRAX's use of the SGMV kernel during decode (see the kernel sketch below)
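
For reference, here is a rough PyTorch sketch of what the two Punica-style kernels compute (pure-Python loops for illustration only; the real kernels are fused CUDA implementations, and these function names and signatures are illustrative, not LoRAX's actual API):

```python
import torch

def bgmv_ref(y, x, w_all, indices):
    """Batched Gather Matrix-Vector: one token per sequence (decode).

    y:       (B, out_dim) output, accumulated in place
    x:       (B, in_dim)  one hidden state per sequence
    w_all:   (num_adapters, in_dim, out_dim) stacked LoRA weights
    indices: (B,) adapter id for each sequence
    """
    for i in range(x.shape[0]):
        y[i] += x[i] @ w_all[indices[i]]

def sgmv_ref(y, x, w_all, seg_starts, seg_ends, adapter_ids):
    """Segmented Gather Matrix-Vector: contiguous token blocks (prefill).

    Each request's prompt occupies a contiguous segment of rows in x and
    shares one adapter, so each segment is a matrix-matrix product that
    amortizes the weight gather across all of the segment's tokens.
    """
    for start, end, aid in zip(seg_starts, seg_ends, adapter_ids):
        y[start:end] += x[start:end] @ w_all[aid]
```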

In this PR, we make BGMV the default kernel during decode and apply it in CUDA graph mode as well. We retain SGMV for prefill (which runs outside CUDA graphs), and add just-in-time tracing of the specific LoRA layers in use to avoid replaying computation for unused LoRA layers. All in all, we are now a good bit ahead of vLLM on single-LoRA inference, and do even better at multi-LoRA scale. We further find that using Medusa gives an additional boost that makes LoRA inference faster than base model performance (no adapter).
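
A minimal sketch of that prefill/decode dispatch, reusing the reference kernels above (the `BatchMeta` fields and `apply_lora` signature are assumptions for illustration, not LoRAX's internals):

```python
import torch
from dataclasses import dataclass

@dataclass
class BatchMeta:
    # Hypothetical batch metadata, not LoRAX's actual structure.
    seg_starts: list          # prefill: first row of each request's segment
    seg_ends: list            # prefill: one past the last row of each segment
    adapter_ids: list         # prefill: adapter id per segment
    indices: torch.Tensor     # decode: adapter id per sequence (one token each)

def apply_lora(y, x, a_all, b_all, meta: BatchMeta, prefill: bool):
    """Add the LoRA delta y += (x @ A) @ B, choosing the kernel by phase.

    a_all: (num_adapters, in_dim, r), b_all: (num_adapters, r, out_dim),
    with the LoRA alpha/r scaling assumed folded into b_all.
    """
    r = a_all.shape[-1]
    v = torch.zeros(x.shape[0], r, device=x.device, dtype=x.dtype)
    if prefill:
        # Prefill: many contiguous tokens share one adapter per request,
        # so segment-level matrix-matrix products (SGMV) win.
        sgmv_ref(v, x, a_all, meta.seg_starts, meta.seg_ends, meta.adapter_ids)
        sgmv_ref(y, v, b_all, meta.seg_starts, meta.seg_ends, meta.adapter_ids)
    else:
        # Decode: exactly one token per sequence and a fixed batch shape,
        # which is what BGMV expects -- and a static launch shape is also
        # what makes the op safe to bake into a captured CUDA graph.
        bgmv_ref(v, x, a_all, meta.indices)
        bgmv_ref(y, v, b_all, meta.indices)
    return y
```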

| Configuration | Throughput (tokens/s) |
| --- | --- |
| vLLM + compile (baseline) | 61 |
| LoRAX (baseline, SGMV only) | 52 |
| LoRAX + BGMV | 59 |
| LoRAX + BGMV + compile | 65 |
| LoRAX + BGMV + compile + Medusa | 73 |
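
The `+ compile` rows above refer to CUDA graph compilation. As a minimal sketch of why the decode path must be static-shaped to benefit (`decode_step` is a hypothetical stand-in for one decode forward pass; the capture/replay pattern uses PyTorch's public `torch.cuda.CUDAGraph` API):

```python
import torch

def decode_step(x):
    # Hypothetical stand-in for one decode forward pass; in LoRAX this
    # would run the model with the BGMV LoRA path.
    return x * 2.0

static_x = torch.zeros(8, 4096, device="cuda")
static_y = torch.zeros(8, 4096, device="cuda")

# Warm up on a side stream before capture, per the PyTorch docs.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    static_y.copy_(decode_step(static_x))
torch.cuda.current_stream().wait_stream(s)

# Capture one decode step: replaying reruns the recorded kernels with
# fixed shapes, which BGMV satisfies but SGMV's per-batch segment layout
# does not -- hence SGMV staying on the eager prefill path.
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    static_y.copy_(decode_step(static_x))

# Each decode step: refill the static input buffer, then replay.
static_x.copy_(torch.randn(8, 4096, device="cuda"))
g.replay()
```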

@tgaddair marked this pull request as ready for review May 13, 2024 22:18
@tgaddair merged commit 7306d49 into main May 14, 2024
1 check passed
@tgaddair deleted the fix-graph-lora branch May 14, 2024 04:18
Successfully merging this pull request may close these issues:

Performance drop when using a LoRA Adapter (#333)