
Extend support for Phi-3 models #651

Open · wants to merge 1 commit into main
Conversation

@davidgxue commented Apr 27, 2024

Description

Technical Details

  • Straightforward implementation following the adding-custom-models guide in the README.
  • My only concern is that Phi-3 appears to fuse the QKV and MLP modules. From a look at the code, quantizing them directly seems fine since, for example, qkv_proj is just an nn.Linear. I could be wrong though, and it would be great if someone could nudge me in the right direction. (See the model-definition sketch after this list.)
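
For reference, a minimal sketch of what the model definition might look like when following the custom-models guide. The class name, file path, and comments are illustrative assumptions based on Phi-3's fused qkv_proj / gate_up_proj layout, not a copy of this PR's diff:

```python
# auto_gptq/modeling/phi3.py  (illustrative path)
from ._base import BaseGPTQForCausalLM


class Phi3GPTQForCausalLM(BaseGPTQForCausalLM):
    # Decoder layer class name in transformers' Phi-3 implementation.
    layer_type = "Phi3DecoderLayer"
    # Where the stack of decoder layers lives in the module tree.
    layers_block_name = "model.layers"
    # Modules outside the decoder layers that are not quantized per-layer.
    outside_layer_modules = ["model.embed_tokens", "model.norm"]
    # Quantization order inside each decoder layer; note the fused projections.
    inside_layer_modules = [
        ["self_attn.qkv_proj"],
        ["self_attn.o_proj"],
        ["mlp.gate_up_proj"],
        ["mlp.down_proj"],
    ]
```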

Tests

  • I tested by making GPTQ quants of the two Phi-3-mini instruct models (4k and 128k context length). Both work fine with HF's text generation pipeline. (A sketch of the quantization flow is included after this list.)
    • Used wikitext as the calibration dataset, with a sequence length of 4096 and 500 samples each.
  • I am following the vLLM team's discussion of Phi-3 support (there are some minor issues they are fixing), and I think this should work once their sliding-window assertion problem is fixed.
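
A minimal sketch of that quantization flow, assuming the standard AutoGPTQ API. The repo id, output path, bit width, and single toy calibration example are placeholders standing in for the actual setup (wikitext, 4096 sequence length, 500 samples):

```python
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

model_id = "microsoft/Phi-3-mini-4k-instruct"  # or the 128k variant

quantize_config = BaseQuantizeConfig(
    bits=4,          # 8-bit was what was tested first in this PR
    group_size=128,
    desc_act=False,
)

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoGPTQForCausalLM.from_pretrained(
    model_id, quantize_config, trust_remote_code=True
)

# Calibration data: the PR used 500 wikitext samples at sequence length 4096;
# one toy example keeps this sketch short.
examples = [
    tokenizer("The quick brown fox jumps over the lazy dog.", return_tensors="pt")
]

model.quantize(examples)
model.save_quantized("Phi-3-mini-4k-instruct-gptq")
tokenizer.save_pretrained("Phi-3-mini-4k-instruct-gptq")
```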

Side Note

  • Hi AutoGPTQ team, this is my first time contributing to the AutoGPTQ library. Please feel free to point me in the right direction if needed. I couldn't find a contributing markdown file for guidance, so I've tried to make the formatting as clean as possible.

Related Issues

closes #652

@Qubitium (Contributor)

@davidgxue Did you get stable average losses, and/or did you compare the perplexity of the pre-quant model vs. the post-quant model to see whether the fused layers pose a problem for the quantizer? I know from the dbrx tests that fused layers are really bad for quantization.
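
For context, a minimal sketch of the kind of pre- vs. post-quant perplexity comparison being asked for here; the window size and evaluation text are placeholders, not values from this thread:

```python
import torch


@torch.no_grad()
def perplexity(model, tokenizer, text, window=2048):
    """Non-overlapping sliding-window perplexity over one long evaluation text."""
    ids = tokenizer(text, return_tensors="pt").input_ids.to(model.device)
    total_nll, total_tokens = 0.0, 0
    for start in range(0, ids.size(1) - 1, window):
        chunk = ids[:, start : start + window]
        if chunk.size(1) < 2:
            break
        out = model(chunk, labels=chunk)  # HF shifts labels; loss is mean NLL
        total_nll += out.loss.item() * (chunk.size(1) - 1)
        total_tokens += chunk.size(1) - 1
    return torch.exp(torch.tensor(total_nll / total_tokens))


# Compare perplexity(fp16_model, tok, eval_text) against
# perplexity(quantized_model, tok, eval_text); a large gap suggests the
# fused layers (or something else) are hurting quantization.
```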

@davidgxue (Author)

Yes, let me upload that. I have only tested at 8 bits so far; I can do some more testing at 4 bits as well and come back to this.

@davidgxue (Author)

There have been some delays due to #657. I am unable to get around the NaN logits or gibberish output, which stem from issues with our library's integration with transformers. This was fine when I originally quantized Phi-3, but something has since changed.

@bhardwajsapna

Hey @davidgxue,
Thank you for this contribution. I tried this locally and installed auto-gptq with these changes. The model packing after layer quantization is taking a very long time (ETA around 10 hours). Was this the case for you too? Any idea why this is happening?

@Qubitium (Contributor)

@bhardwajsapna Try my packing fix, PR #642. If you have lots of cores, you may be hitting something like a 100x regression: the more cores you have, the worse it becomes.
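
A rough, untested stopgap while that PR lands, assuming the slowdown really is CPU thread oversubscription during packing (the cap of 8 threads is an arbitrary example value):

```python
import os

# Set before torch/numpy create their thread pools.
os.environ["OMP_NUM_THREADS"] = "8"
os.environ["MKL_NUM_THREADS"] = "8"

import torch

torch.set_num_threads(8)  # cap intra-op CPU parallelism in PyTorch
```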
