
Mixtral OffLoading/GGUF/ExLlamaV2, which approach to use? #11

Open
LeMoussel opened this issue Jan 2, 2024 · 1 comment

Comments

@LeMoussel

I'm a bit lost among the different quantization/inference approaches: GGUF, ExLlamaV2, and this project.
Are they the same thing? Is one approach faster than the others?

GGUF: TheBloke/Mixtral-8x7B-v0.1-GGUF
ExLlamaV2: turboderp/Mixtral-8x7B-instruct-exl2

@lavawolfiee
Collaborator

lavawolfiee commented Jan 2, 2024

No, they're not the same thing.

Regarding ExLlamaV2 and llama.cpp (GGUF), I think it depends on your setup. As far as I know, ExLlamaV2 is faster on GPU but doesn't support CPU inference. llama.cpp, on the other hand, can split layers between CPU and GPU, reducing VRAM usage, and supports pure CPU inference (it was originally developed for CPU inference). Both are optimized for fast LLM inference and do their job well. Note that they also use different quantization formats.
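
To make the CPU/GPU split concrete, here's a minimal sketch using the llama-cpp-python bindings; the model path, layer count, and context size below are placeholders, not recommendations:

```python
# Minimal sketch: GGUF inference with llama-cpp-python, splitting layers
# between CPU and GPU. Paths and parameters are placeholders.
from llama_cpp import Llama

llm = Llama(
    model_path="mixtral-8x7b-v0.1.Q4_K_M.gguf",  # any GGUF quant, e.g. from TheBloke's repo
    n_gpu_layers=20,   # number of layers offloaded to the GPU; 0 = pure CPU
    n_ctx=2048,        # context window
)

out = llm("Explain mixture-of-experts in one sentence.", max_tokens=64)
print(out["choices"][0]["text"])
```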

As for this project, we focus specifically on optimizing inference for MoE-based models on consumer-class GPUs. I can't tell you for sure right now when our method is faster or slower than the others; we're currently researching that. It's also important to note that we use HQQ quantization, which gives good quality but currently isn't very fast because it lacks good CUDA kernels. Our team is actively working on supporting other quantization methods along with fast kernels, and on further ways to improve inference speed and quality.
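
For intuition, here's a toy, library-free sketch of group-wise weight quantization with a per-group scale and zero-point. This is not HQQ's actual algorithm, just the general idea; the speed of real methods depends heavily on fused CUDA kernels for dequantization + matmul:

```python
# Toy illustration of group-wise 4-bit weight quantization (scale + zero-point).
# NOT HQQ's actual algorithm; shapes and group size are arbitrary examples.
import torch

def quantize_groupwise(w: torch.Tensor, group_size: int = 64, bits: int = 4):
    qmax = 2 ** bits - 1
    w = w.reshape(-1, group_size)                   # one row per quantization group
    w_min = w.min(dim=1, keepdim=True).values
    w_max = w.max(dim=1, keepdim=True).values
    scale = (w_max - w_min).clamp(min=1e-8) / qmax  # per-group scale
    zero = w_min                                    # per-group zero-point
    q = ((w - zero) / scale).round().clamp(0, qmax).to(torch.uint8)
    return q, scale, zero

def dequantize_groupwise(q, scale, zero, shape):
    return (q.float() * scale + zero).reshape(shape)

w = torch.randn(4096, 4096)
q, scale, zero = quantize_groupwise(w)
w_hat = dequantize_groupwise(q, scale, zero, w.shape)
print("mean abs error:", (w - w_hat).abs().mean().item())
```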

Therefore, I believe our method is useful at least if you don't have much GPU VRAM (e.g., in Google Colab) or you want to fit a bigger model (with better quality) into it. We will do our best to implement new features and get them to you as quickly as possible.
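
As a rough mental model of expert offloading, here's a conceptual sketch in plain PyTorch (not this repo's actual code; a real implementation adds expert caching and quantized weights on top): expert weights live in CPU RAM and are copied to the GPU only when the router selects them.

```python
# Conceptual sketch of MoE expert offloading: experts stay in CPU RAM and are
# moved to the GPU only when selected. NOT this repo's implementation.
import torch
import torch.nn as nn

class OffloadedExperts(nn.Module):
    def __init__(self, num_experts: int, hidden: int):
        super().__init__()
        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        # Experts are kept on CPU to save VRAM.
        self.experts = nn.ModuleList(
            [nn.Linear(hidden, hidden) for _ in range(num_experts)]
        )

    def forward(self, x: torch.Tensor, expert_idx: int) -> torch.Tensor:
        expert = self.experts[expert_idx].to(self.device)  # copy weights to GPU on demand
        out = expert(x.to(self.device))
        self.experts[expert_idx].to("cpu")                 # release VRAM again
        return out
```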
