
Extend support for Phi-3 models #651

Open · wants to merge 1 commit into main
Conversation

@davidgxue commented Apr 27, 2024

Description

Technical Details

  • Straightforward implementation following the adding-custom-models guide in the README.
  • My only concern is that Phi-3 appears to fuse the QKV and MLP modules. From a look at the code, quantizing them directly seems fine since, for example, qkv_proj is just an nn.Linear. I could be wrong though, and it would be great if someone could nudge me in the right direction. (See the model-definition sketch after this list.)
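
For reference, a minimal sketch of what the model definition might look like when following the custom-models guide. The class name, file path, and comments are illustrative assumptions based on Phi-3's fused qkv_proj / gate_up_proj layout, not a copy of this PR's diff:

```python
# auto_gptq/modeling/phi3.py  (illustrative path)
from ._base import BaseGPTQForCausalLM


class Phi3GPTQForCausalLM(BaseGPTQForCausalLM):
    # Decoder layer class name in transformers' Phi-3 implementation.
    layer_type = "Phi3DecoderLayer"
    # Where the stack of decoder layers lives in the module tree.
    layers_block_name = "model.layers"
    # Modules outside the decoder layers that are not quantized per-layer.
    outside_layer_modules = ["model.embed_tokens", "model.norm"]
    # Quantization order inside each decoder layer; note the fused projections.
    inside_layer_modules = [
        ["self_attn.qkv_proj"],
        ["self_attn.o_proj"],
        ["mlp.gate_up_proj"],
        ["mlp.down_proj"],
    ]
```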

Tests

  • I tested by making GPTQ quants of the two Phi-3-mini instruct models (4k and 128k context length). Both work fine with HF's text generation pipeline. (A sketch of the quantization flow is included after this list.)
    • Used wikitext as the calibration dataset, with a sequence length of 4096 and 500 samples each.
  • I am following the vLLM team's discussion of Phi-3 support (there are some minor issues they are fixing), and I think this should work once their sliding-window assertion problem is fixed.
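
A minimal sketch of that quantization flow, assuming the standard AutoGPTQ API. The repo id, output path, bit width, and single toy calibration example are placeholders standing in for the actual setup (wikitext, 4096 sequence length, 500 samples):

```python
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

model_id = "microsoft/Phi-3-mini-4k-instruct"  # or the 128k variant

quantize_config = BaseQuantizeConfig(
    bits=4,          # 8-bit was what was tested first in this PR
    group_size=128,
    desc_act=False,
)

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoGPTQForCausalLM.from_pretrained(
    model_id, quantize_config, trust_remote_code=True
)

# Calibration data: the PR used 500 wikitext samples at sequence length 4096;
# one toy example keeps this sketch short.
examples = [
    tokenizer("The quick brown fox jumps over the lazy dog.", return_tensors="pt")
]

model.quantize(examples)
model.save_quantized("Phi-3-mini-4k-instruct-gptq")
tokenizer.save_pretrained("Phi-3-mini-4k-instruct-gptq")
```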

Side Note

  • Hi AutoGPTQ team, this is my first time contributing to the AutoGPTQ library. Please feel free to point me in the right direction if needed. I couldn't find a contributing markdown file for guidance, so I've tried to make the formatting as clean as possible.

Related Issues

closes #652

@Qubitium (Contributor)

@davidgxue Did you get stable average losses, and/or did you compare the perplexity of the pre-quant model vs. the post-quant model to see whether the fused layers pose a problem for the quantizer? I know from the dbrx tests that fused layers are really bad for quantization.
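
For context, a minimal sketch of the kind of pre- vs. post-quant perplexity comparison being asked for here; the window size and evaluation text are placeholders, not values from this thread:

```python
import torch


@torch.no_grad()
def perplexity(model, tokenizer, text, window=2048):
    """Non-overlapping sliding-window perplexity over one long evaluation text."""
    ids = tokenizer(text, return_tensors="pt").input_ids.to(model.device)
    total_nll, total_tokens = 0.0, 0
    for start in range(0, ids.size(1) - 1, window):
        chunk = ids[:, start : start + window]
        if chunk.size(1) < 2:
            break
        out = model(chunk, labels=chunk)  # HF shifts labels; loss is mean NLL
        total_nll += out.loss.item() * (chunk.size(1) - 1)
        total_tokens += chunk.size(1) - 1
    return torch.exp(torch.tensor(total_nll / total_tokens))


# Compare perplexity(fp16_model, tok, eval_text) against
# perplexity(quantized_model, tok, eval_text); a large gap suggests the
# fused layers (or something else) are hurting quantization.
```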

@davidgxue (Author)

Yes, let me upload that. I have only tested at 8 bits so far; I can do some more testing at 4 bits as well and come back to this.

@davidgxue (Author)

There have been some delays due to #657. I am unable to get around the NaN logits or gibberish output, which stem from issues with our library's integration with transformers. This was fine when I originally quantized Phi-3, but something has since changed.

@bhardwajsapna

Hey @davidgxue,
Thank you for this contribution. I tried this locally and installed auto-gptq with these changes. The model packing after layer quantization is taking a very long time (ETA around 10 hours). Was this the case for you too? Any idea why this is happening?

@Qubitium (Contributor)

@bhardwajsapna Try my packing fix, PR #642. If you have lots of cores, you may be hitting something like a 100x regression: the more cores you have, the worse it becomes.
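
A rough, untested stopgap while that PR lands, assuming the slowdown really is CPU thread oversubscription during packing (the cap of 8 threads is an arbitrary example value):

```python
import os

# Set before torch/numpy create their thread pools.
os.environ["OMP_NUM_THREADS"] = "8"
os.environ["MKL_NUM_THREADS"] = "8"

import torch

torch.set_num_threads(8)  # cap intra-op CPU parallelism in PyTorch
```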
