
[BUG/FEATURE] Fix Sym=False, new checkpoint_format = gptq_v2 #640

Open · wants to merge 65 commits into base: main

Conversation

@Qubitium (Contributor) commented Apr 12, 2024

@qwopqwop200 This is a rebase of your PR #559 with some modifications. It should be ready soon, once we verify quantization and inference and add some tests.

Reason For PR:

sym=False was practically unusable because its post-quantization avg_loss per layer and PPL were much worse than with sym=True. @qwopqwop200 fixed the bad/suboptimal math. Now sym=False will most likely match or decrease avg_loss per layer vs sym=True and improve post-quant PPL for many models.

Core Changes:

  1. Rebase PR "remove (zeros -= 1)" #559 onto main: makes sym=False quantization usable and introduces checkpoint_format=gptq_v2 as the new stored checkpoint format. For compatibility, checkpoint_format=gptq checkpoints are dynamically converted to gptq_v2 at load time (a conceptual sketch of the conversion follows).
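
Conceptually, the runtime v1 → v2 conversion just adds back the 1 that the old (zeros -= 1) pack step removed from every packed zero point. A simplified illustration of the idea (the helper name and numpy-based math are mine, not the PR's actual conversion code; it assumes bits-wide fields packed into int32 words and no field overflow after the +1):

import numpy as np

def convert_qzeros_v1_to_v2(qzeros: np.ndarray, bits: int = 4) -> np.ndarray:
    # v1 checkpoints store (zero - 1) so the kernels add 1 back at dequant time;
    # v2 stores the true zero point, so add 1 to every bits-wide field of the
    # packed int32 words (e.g. eight 4-bit fields per word).
    ones = sum(1 << (bits * i) for i in range(32 // bits))  # 0x11111111 for 4-bit
    return (qzeros.view(np.uint32) + np.uint32(ones)).view(np.int32)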

Misc Changes not directly related to the sym=False code:

  1. Complete TODO: use accelerate 0.29.2 to load checkpoints.
  2. Consistency: move cohere/starcoder2 to the 4.39.0 release check.
  3. Usability: catch the quant/torch error caused by low damp/nsamples ([BUG] torch._C._LinAlgError: linalg.cholesky always raised #572) and tell the user how to fix it, as this happens much more frequently than I had expected. I ran into 2-3 instances of this error on multiple models while testing this PR with low nsamples + low damp=0.005 to speed up quants.
  4. Simplify: optimized the packing regression alert message/code shown to the user.
  5. Feature: Quant Stat Log 1/2: store per-layer quant stats (layer #, module name, avg loss, duration) in a dict/slice and return them to the user via quant_log = model.quantize().
  6. Feature: Quant Stat Log 2/2: pass a saved quant_log to quantize(quant_log=saved_quant_log) to generate an automatic avg_loss diff during progress (see the usage sketch after this list). Sample diff output appears in later messages of this discussion.
  7. Usability: use tqdm for the layer loop so users get an estimate of the remaining quantization time.
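
A minimal usage sketch of the quant-log flow from items 5 and 6 (the model id and calibration text are placeholders, and the exact keys of each log entry may differ from the final code):

from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

pretrained = "facebook/opt-125m"  # placeholder model
tokenizer = AutoTokenizer.from_pretrained(pretrained)
examples = [tokenizer("auto-gptq is an easy-to-use quantization library.", return_tensors="pt")]

model = AutoGPTQForCausalLM.from_pretrained(pretrained, BaseQuantizeConfig(bits=4, group_size=128, sym=False))

# Quant Stat Log 1/2: quantize() now also returns per-layer stats
# (layer #, module name, avg loss, duration)
quant_log = model.quantize(examples)
for entry in quant_log:
    print(entry)

# Quant Stat Log 2/2: pass a previously saved log back in to get an
# automatic avg_loss diff per layer while quantizing
# quant_log_new = model.quantize(examples, quant_log=quant_log)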

TODO:

  • Add sym=False tests
  • Validate and fix failing tests
  • Failed: check whether third-party vllm/sglang need kernel modifications for the new gptq_v2 format
  • Check whether third-party vllm/sglang need kernel modifications for the gptq (v1) format with sym=False from this PR

PASSING TESTS:

  • Compat: vllm/sglang are compatible with gptq (v1) checkpoints generated by this PR
  • Compat: loading of checkpoint_format=gptq (v1)
  • sym=False consistently generates lower avg_loss than sym=True
  • Regression test: sym=True in this PR generates the same math/avg_loss per layer as sym=True in main
  • test_serialization.py
  • test_quantization.py
  • test_shared_loading.py
  • test_awq_compatibility_generation.py (note: the awq cache generated by main is not compatible with this PR; fixed by adding "v2" to the cache file name)
  • test_q4.py

FAILING TESTS:

  • Compat: vllm/sglang are not compatible with gptq_v2
  • test_triton.py (never got this to work on main)
  • test_repacking.py (never got this to work on main)

Original PR #559 notes duplicated here for ref:

  • check marlin works
  • check exllama works
  • check exllama2 works
  • check qigen works
  • check triton works
  • check cuda works
  • check cuda-old works (there is a bug in autogptq's main unrelated to this PR, so it cannot be confirmed)
  • check cuda pytorch works
  • check cuda-old pytorch works
  • check old-version save is supported
  • check old-version load is supported
  • check new-version save is supported
  • check new-version load is supported

I am removing this line because it is not only computationally unnecessary but also makes sym=False impossible. However, removing it breaks backwards compatibility, so I am making the old save format the default.

Related PRs:

#354
#325

I removed the following line: https://github.com/AutoGPTQ/AutoGPTQ/blob/main/auto_gptq/nn_modules/qlinear/qlinear_cuda.py#L169. It is an unnecessary line, and it makes sym=False impossible. With sym=False I get a reduction in PPL.

opt-125m (act-order)   Bits   group-size   Wikitext2
sym=True               4      128          29.875
sym=False              4      128          29.221

llama2 (act-order)     Bits   group-size   Wikitext2
sym=True               4      128          5.254
sym=False              4      128          5.214
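
For reference, Wikitext2 numbers like the ones above are usually produced with a fixed-stride perplexity loop; a generic sketch (not necessarily the exact script behind this table, and the 2048-token window is an assumption):

import torch
from datasets import load_dataset
from transformers import AutoTokenizer

def wikitext2_ppl(model, tokenizer_name, seqlen=2048, device="cuda"):
    tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
    test = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")
    ids = tokenizer("\n\n".join(test["text"]), return_tensors="pt").input_ids.to(device)
    nlls = []
    for i in range(ids.shape[1] // seqlen):
        chunk = ids[:, i * seqlen:(i + 1) * seqlen]
        with torch.no_grad():
            # causal-LM loss with labels == inputs gives the mean NLL per token
            nlls.append(model(chunk, labels=chunk).loss.float() * seqlen)
    return torch.exp(torch.stack(nlls).sum() / (len(nlls) * seqlen)).item()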

@Qubitium (Contributor Author) commented:

Fixed the awq pack/unpack thread regression in cddbe23.

Unpacking/packing to awq in test_awq is now ~2.6x faster: 2m12s vs 5m46s.

@Qubitium (Contributor Author) commented Apr 24, 2024

All tests passing with transformers 4.40.1

  • test_q4 (took multiple runs to succeed with the super flaky mixtral-tiny)
  • test_quantization
  • test_serialization
  • test_sharded_load
  • test_awq_compat

@Qubitium (Contributor Author) commented Apr 24, 2024

Removed the highly flaky q4 test of Mixtral-Tiny, a model that doesn't exist in any production environment. The test 1) fails at a very high rate and 2) uses nonsensical reference input/output that no human can read as testing anything meaningful. Only the model creator (hf) knows what is inside this black genie bottle in terms of weights.

811ca39

I have no clue how this is a good test beyond exercising the Mixtral code paths without crashing.

reference_output = """<s> I am in Paris andpublishedющиеcs performancesension manual offset亡VIDEO Kel RepubliczwDrawlichen LondresPSungspfn CreahooEESlider laughselvesлександTrytpl recallслу Ор coldsubset########serdeacion providestrm thoughts président oktobermulticol../редβ themselvesterraряд conflictscommandMass diagonal選 ptrTY還 Havepliedument relate redu"""

@Qubitium (Contributor Author) commented Apr 24, 2024

We no longer need the use_unsafe_math param in from_quantized() now that we can detect the producer from meta. use_unsafe_math now remains only in save_quantized(), to handle the case of sym=False with gptq v1 as the save target.

2104cae

@Qubitium (Contributor Author) commented Apr 24, 2024

Latest quant test result:

model: command-r-v01

bfloat16

vllm ppl: 1.8202
transformers ppl: 1.8219

sym=False checkpoint_format=gptq (v1)

vllm ppl: 1.8313
autogptq ppl: 1.8282

sym=False checkpoint_format=gptq_v2

autogptq ppl: 1.8282

vllm and hf/autogptq have slightly different log_probs, so the comparison is not exactly 1:1, but even comparing vllm against vllm, the unsafe sym=False checkpoint (gptq v1 weights) running on vllm has been ultra stable in all the tests we have run.

@Qubitium (Contributor Author) commented Apr 24, 2024

Removed all use_unsafe_math code and defaulted the save to the gptq (v1) format for maximum compatibility, with minimal loss vs saving to v2, as measured in our internal PPL tests across 4 different models: tinyllama-1.1b, yi-9b, command-r-v01, and llama3-8b.

@fxmarty Feel free to edit at will and revert the last commit if you want to go the conservative route. 528a8fc

@Qubitium (Contributor Author) commented Apr 24, 2024

@fxmarty So after all the back-and-forth with underflow/overflow, the current changes since your last review boil down to three things:

  • a meta field change so we can actually distinguish v1 models produced by new vs old code
  • better tests in test_quantization for sym=False, v2, and the meta field
  • threadpool limits applied to the convert_v1/v2 code and the awq unpack/pack code for a huge performance boost in environments with lots of cores (see the sketch below)
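
The thread-pool limiting in the last bullet is the sort of thing threadpoolctl handles; a generic sketch of capping native BLAS/OpenMP threads around a CPU-heavy pack loop (not the PR's actual code; pack_all, pack_one, and layers are placeholders):

from threadpoolctl import threadpool_limits

def pack_all(layers, pack_one):
    # On machines with many cores, oversubscribed BLAS/OpenMP thread pools can
    # make the per-layer pack/unpack loops dramatically slower; capping the
    # native pools around the hot loop avoids that contention.
    with threadpool_limits(limits=1):
        for name, layer in layers.items():
            pack_one(name, layer)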

@Qubitium (Contributor Author) commented Apr 25, 2024

Intel/auto-round by default uses sym=False and saves to the gptq (v1) format by importing the autogptq lib and calling pack() directly. As it stands, we would reject loading such a model. I need to test this. My lord, testing for this PR never ends. I am going to pull my hair out.

{
  "bits": 4,
  "group_size": 128,
  "damp_percent": 0.01,
  "desc_act": false,
  "static_groups": false,
  "sym": false,
  "true_sequential": false,
  "model_name_or_path": null,
  "model_file_base_name": "model",
  "quant_method": "gptq",
  "checkpoint_format": "gptq",
  "meta": {
    "quantizer": "intel/auto-round:0.1",
    "iters": 10,
    "lr": 0.1,
    "minmax_lr": 0.1,
    "enable_minmax_tuning": true,
    "use_quant_input": true,
    "scale_dtype": "torch.float16"
  }
}

@Qubitium (Contributor Author) commented Apr 25, 2024

Intel/auto-round by default uses sym=False and saves to the gptq (v1) format by importing the autogptq lib and calling pack() directly. As it stands, we would reject loading such a model. I need to test this.

Added meta.packer so it is clear who is doing what: auto-round does the quantization and autogptq (imported code) does the packing to v1.

autoround model config with sym=False

{
  "bits": 4,
  "group_size": 128,
  "damp_percent": 0.01,
  "desc_act": false,
  "static_groups": false,
  "sym": false,
  "true_sequential": false,
  "model_name_or_path": null,
  "model_file_base_name": "model",
  "quant_method": "gptq",
  "checkpoint_format": "gptq",
  "meta": {
    "quantizer": "intel/auto-round:0.1",
    "packer": "autogptq:0.8.0.dev1",
    "iters": 20,
    "lr": 0.05,
    "minmax_lr": 0.05,
    "enable_minmax_tuning": true,
    "use_quant_input": true,
    "scale_dtype": "torch.float16"
  }
}

@Qubitium (Contributor Author) commented Apr 25, 2024

Commit/Refactor 59be4b3

Meta tooling fingerprints are now split into meta.quantizer and meta.packer, as shown by intel/auto-round's use of the autogptq packer. Models quantized by autogptq do not set meta.packer, but the field is checked for 3rd-party tools that do set it. v1 sym=False loading checks both meta.quantizer and meta.packer to see whether the checkpoint was either quantized or packed by autogptq; a sketch of this check follows the samples below.

# sample auto-round meta field
 "meta": {
    "quantizer": "intel/auto-round:0.1",
    "packer": "autogptq:0.8.0.dev1",
}

# sample autogptq meta field
 "meta": {
    "quantizer": "autogptq:0.8.0.dev1",
}
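
A rough sketch of that producer check (the helper name and parsing are illustrative, not the PR's actual code; meta entries follow the "tool:version" form shown above):

def produced_by_autogptq(quantize_config: dict) -> bool:
    # A v1 sym=False checkpoint is accepted if it was either quantized or packed
    # by autogptq: tools like intel/auto-round quantize themselves but call
    # autogptq's pack() for the v1 serialization, so meta.packer matters too.
    meta = quantize_config.get("meta") or {}
    for key in ("quantizer", "packer"):
        tool = meta.get(key, "").split(":", 1)[0].lower()
        if tool == "autogptq":
            return True
    return False

With the auto-round sample above, the check passes via meta.packer even though meta.quantizer is intel/auto-round.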

@Qubitium changed the title from "Fix Sym=False, new checkpoint_format = gptq_v2" to "[BUG/FEATURE] Fix Sym=False, new checkpoint_format = gptq_v2" on Apr 28, 2024
@wenhuach21 commented May 15, 2024

We found a significant discrepancy between the real quantized model and the QDQ model at W2G32 asym. For W2G32 sym and W4G128 asym there doesn't seem to be a severe issue, though sym is much worse than asym at W2G32 on the QDQ model.

As this PR also supports the v2 format and we are not familiar with the details, we're unsure whether we can test it directly or whether we need to merge the PR in auto-round.

@Qubitium (Contributor Author) commented:

@wenhuach21 Can you elaborate with more details?

Which model/weights and code reproduce the quantization discrepancies, so that @qwopqwop200 can better look at the potential sym=True vs sym=False math discrepancies?

@wenhuach21 commented May 16, 2024

These are the results. We used auto-round to generate the QDQ model and the real autogptq model with --deployment_device fake,gpu, then evaluated both models. lm-eval 0.4.2 was used.

asym at w2g32
qdq: lambada_openai: 0.5405  winogrande: 0.6417
gpu: lambada_openai: 0.2505  winogrande: 0.6006

sym at w2g32
qdq: lambada_openai: 0.3501  winogrande: 0.5817
gpu: lambada_openai: 0.3629  winogrande: 0.5856

Reference command:
git clone https://github.com/intel/auto-round.git
cd auto-round/examples/language-modeling
pip install -r requirements.txt

python3 main.py --model_name /data5/llama3_8b_instruct/ --bits 2 --group_size 32 --n_samples 512 --iters 200 --deployment_device fake,gpu --disable_eval --seqlen 512 --minmax_lr 0.01 --scale_dtype fp32 --train_bs 4 --output_dir "./tmp_signround"
