
[BUG/FEATURE] Fix Sym=False, new checkpoint_format = gptq_v2 #640

Open · wants to merge 65 commits into base: main

Conversation

@Qubitium (Contributor) commented Apr 12, 2024

@qwopqwop200 This is a rebase of your PR #559 with some modifications. It should be ready soon, once we verify quantization and inference and add some tests.

Reason For PR:

sym=False was practically unusable because its post-quantization avg_loss per layer and PPL were much worse than with sym=True. @qwopqwop200 fixed the bad/suboptimal math. Now sym=False will most likely match or decrease avg_loss per layer vs sym=True and improve post-quant PPL for many models.

Core Changes:

  1. Rebase PR "remove (zeros -= 1)" #559 onto main: makes sym=False quantization usable and introduces checkpoint_format=gptq_v2 as the new stored checkpoint format. For compatibility, checkpoint_format=gptq checkpoints are dynamically converted to gptq_v2 at load time (a conceptual sketch of the conversion follows).
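
Conceptually, the runtime v1 → v2 conversion just adds back the 1 that the old (zeros -= 1) pack step removed from every packed zero point. A simplified illustration of the idea (the helper name and numpy-based math are mine, not the PR's actual conversion code; it assumes bits-wide fields packed into int32 words and no field overflow after the +1):

import numpy as np

def convert_qzeros_v1_to_v2(qzeros: np.ndarray, bits: int = 4) -> np.ndarray:
    # v1 checkpoints store (zero - 1) so the kernels add 1 back at dequant time;
    # v2 stores the true zero point, so add 1 to every bits-wide field of the
    # packed int32 words (e.g. eight 4-bit fields per word).
    ones = sum(1 << (bits * i) for i in range(32 // bits))  # 0x11111111 for 4-bit
    return (qzeros.view(np.uint32) + np.uint32(ones)).view(np.int32)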

Misc Changes not directly related to the sym=False code:

  1. Complete TODO: use accelerate 0.29.2 to load checkpoints.
  2. Consistency: move cohere/starcoder2 to the 4.39.0 release check.
  3. Usability: catch the quant/torch error caused by low damp/nsamples ([BUG] torch._C._LinAlgError: linalg.cholesky always raised #572) and tell the user how to fix it, as this happens much more frequently than I had expected. I ran into 2-3 instances of this error on multiple models while testing this PR with low nsamples + low damp=0.005 to speed up quants.
  4. Simplify: optimized the packing regression alert message/code shown to the user.
  5. Feature: Quant Stat Log 1/2: store per-layer quant stats (layer #, module name, avg loss, duration) in a dict/slice and return them to the user via quant_log = model.quantize().
  6. Feature: Quant Stat Log 2/2: pass a saved quant_log to quantize(quant_log=saved_quant_log) to generate an automatic avg_loss diff during progress (see the usage sketch after this list). Sample diff output appears in later messages of this discussion.
  7. Usability: use tqdm for the layer loop so users get an estimate of the remaining quantization time.
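
A minimal usage sketch of the quant-log flow from items 5 and 6 (the model id and calibration text are placeholders, and the exact keys of each log entry may differ from the final code):

from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

pretrained = "facebook/opt-125m"  # placeholder model
tokenizer = AutoTokenizer.from_pretrained(pretrained)
examples = [tokenizer("auto-gptq is an easy-to-use quantization library.", return_tensors="pt")]

model = AutoGPTQForCausalLM.from_pretrained(pretrained, BaseQuantizeConfig(bits=4, group_size=128, sym=False))

# Quant Stat Log 1/2: quantize() now also returns per-layer stats
# (layer #, module name, avg loss, duration)
quant_log = model.quantize(examples)
for entry in quant_log:
    print(entry)

# Quant Stat Log 2/2: pass a previously saved log back in to get an
# automatic avg_loss diff per layer while quantizing
# quant_log_new = model.quantize(examples, quant_log=quant_log)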

TODO:

  • Add sym=False tests
  • Validate and fix failing tests
  • Failed: check whether third-party vllm/sglang need kernel modifications for the new gptq_v2 format
  • Check whether third-party vllm/sglang need kernel modifications for the gptq (v1) format with sym=False from this PR

PASSING TESTS:

  • Compat: vllm/sglang are compatible with gptq (v1) checkpoints generated by this PR
  • Compat: loading of checkpoint_format=gptq (v1)
  • sym=False consistently generates lower avg_loss than sym=True
  • Regression test: sym=True in this PR generates the same math/avg_loss per layer as sym=True in main
  • test_serialization.py
  • test_quantization.py
  • test_shared_loading.py
  • test_awq_compatibility_generation.py (note: the awq cache generated by main is not compatible with this PR; fixed by adding "v2" to the cache file name)
  • test_q4.py

FAILING TESTS:

  • Compat: vllm/sglang are not compatible with gptq_v2
  • test_triton.py (never got this to work on main)
  • test_repacking.py (never got this to work on main)

Original PR #559 notes duplicated here for ref:

  • check marlin works
  • check exllama works
  • check exllama2 works
  • check qigen works
  • check triton works
  • check cuda works
  • check cuda-old works (there is a bug in autogptq's main unrelated to this PR, so it cannot be confirmed)
  • check cuda pytorch works
  • check cuda-old pytorch works
  • check old-version save is supported
  • check old-version load is supported
  • check new-version save is supported
  • check new-version load is supported

I am removing this line because it is not only computationally unnecessary but also makes sym=False impossible. However, removing it breaks backwards compatibility, so I am making the old save format the default.

Related PRs:

#354
#325

I removed the following line: https://github.com/AutoGPTQ/AutoGPTQ/blob/main/auto_gptq/nn_modules/qlinear/qlinear_cuda.py#L169. It is an unnecessary line, and it makes sym=False impossible. With sym=False I get a reduction in PPL.

opt-125m (act-order)   Bits   group-size   Wikitext2
sym=True               4      128          29.875
sym=False              4      128          29.221

llama2 (act-order)     Bits   group-size   Wikitext2
sym=True               4      128          5.254
sym=False              4      128          5.214
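
For reference, Wikitext2 numbers like the ones above are usually produced with a fixed-stride perplexity loop; a generic sketch (not necessarily the exact script behind this table, and the 2048-token window is an assumption):

import torch
from datasets import load_dataset
from transformers import AutoTokenizer

def wikitext2_ppl(model, tokenizer_name, seqlen=2048, device="cuda"):
    tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
    test = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")
    ids = tokenizer("\n\n".join(test["text"]), return_tensors="pt").input_ids.to(device)
    nlls = []
    for i in range(ids.shape[1] // seqlen):
        chunk = ids[:, i * seqlen:(i + 1) * seqlen]
        with torch.no_grad():
            # causal-LM loss with labels == inputs gives the mean NLL per token
            nlls.append(model(chunk, labels=chunk).loss.float() * seqlen)
    return torch.exp(torch.stack(nlls).sum() / (len(nlls) * seqlen)).item()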

@Qubitium (Contributor Author) commented:

Fixed the awq pack/unpack thread regression in cddbe23.

Unpacking/packing to awq in test_awq is now ~2.6x faster: 2m12s vs 5m46s.

@Qubitium (Contributor Author) commented Apr 24, 2024

All tests passing with transformers 4.40.1

  • test_q4 (took multiple runs to succeed with the super flaky mixtral-tiny)
  • test_quantization
  • test_serialization
  • test_sharded_load
  • test_awq_compat

@Qubitium (Contributor Author) commented Apr 24, 2024

Removed the highly flaky q4 test of Mixtral-Tiny, a model that doesn't exist in any production environment. The test 1) fails at a very high rate and 2) uses nonsensical reference input/output that no human can read as testing anything meaningful. Only the model creator (hf) knows what is inside this black genie bottle in terms of weights.

811ca39

I have no clue how this is a good test beyond exercising the Mixtral code paths without crashing.

reference_output = """<s> I am in Paris andpublishedющиеcs performancesension manual offset亡VIDEO Kel RepubliczwDrawlichen LondresPSungspfn CreahooEESlider laughselvesлександTrytpl recallслу Ор coldsubset########serdeacion providestrm thoughts président oktobermulticol../редβ themselvesterraряд conflictscommandMass diagonal選 ptrTY還 Havepliedument relate redu"""

@Qubitium (Contributor Author) commented Apr 24, 2024

We no longer need the use_unsafe_math param in from_quantized() now that we can detect the producer from meta. use_unsafe_math now remains only in save_quantized(), to handle the case of sym=False with gptq v1 as the save target.

2104cae

@Qubitium (Contributor Author) commented Apr 24, 2024

Latest quant test result:

model: command-r-v01

bfloat16

vllm ppl: 1.8202
transformers ppl: 1.8219

sym=False checkpoint_format=gptq (v1)

vllm ppl: 1.8313
autogptq ppl: 1.8282

sym=False checkpoint_format=gptq_v2

autogptq ppl: 1.8282

vllm and hf/autogptq have slightly different log_probs, so the comparison is not exactly 1:1, but even comparing vllm against vllm, the unsafe sym=False checkpoint (gptq v1 weights) running on vllm has been ultra stable in all the tests we have run.

@Qubitium (Contributor Author) commented Apr 24, 2024

Removed all use_unsafe_math code and defaulted the save to the gptq (v1) format for maximum compatibility, with minimal loss vs saving to v2, as measured in our internal PPL tests across 4 different models: tinyllama-1.1b, yi-9b, command-r-v01, and llama3-8b.

@fxmarty Feel free to edit at will and revert the last commit if you want to go the conservative route. 528a8fc

@Qubitium (Contributor Author) commented Apr 24, 2024

@fxmarty So after all the back-and-forth with underflow/overflow, the current changes since your last review boil down to three things:

  • a meta field change so we can actually distinguish v1 models produced by new vs old code
  • better tests in test_quantization for sym=False, v2, and the meta field
  • threadpool limits applied to the convert_v1/v2 code and the awq unpack/pack code for a huge performance boost in environments with lots of cores (see the sketch below)
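
The thread-pool limiting in the last bullet is the sort of thing threadpoolctl handles; a generic sketch of capping native BLAS/OpenMP threads around a CPU-heavy pack loop (not the PR's actual code; pack_all, pack_one, and layers are placeholders):

from threadpoolctl import threadpool_limits

def pack_all(layers, pack_one):
    # On machines with many cores, oversubscribed BLAS/OpenMP thread pools can
    # make the per-layer pack/unpack loops dramatically slower; capping the
    # native pools around the hot loop avoids that contention.
    with threadpool_limits(limits=1):
        for name, layer in layers.items():
            pack_one(name, layer)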

@Qubitium (Contributor Author) commented Apr 25, 2024

Intel/auto-round by default uses sym=False and saves to the gptq (v1) format by importing the autogptq lib and calling pack() directly. As it stands, we would reject loading such a model. I need to test this. My lord, testing for this PR never ends. I am going to pull my hair out.

{
  "bits": 4,
  "group_size": 128,
  "damp_percent": 0.01,
  "desc_act": false,
  "static_groups": false,
  "sym": false,
  "true_sequential": false,
  "model_name_or_path": null,
  "model_file_base_name": "model",
  "quant_method": "gptq",
  "checkpoint_format": "gptq",
  "meta": {
    "quantizer": "intel/auto-round:0.1",
    "iters": 10,
    "lr": 0.1,
    "minmax_lr": 0.1,
    "enable_minmax_tuning": true,
    "use_quant_input": true,
    "scale_dtype": "torch.float16"
  }
}

@Qubitium (Contributor Author) commented Apr 25, 2024

Intel/auto-round by default uses sym=False and saves to the gptq (v1) format by importing the autogptq lib and calling pack() directly. As it stands, we would reject loading such a model. I need to test this.

Added meta.packer so it is clear who is doing what: auto-round does the quantization and autogptq (imported code) does the packing to v1.

autoround model config with sym=False

{
  "bits": 4,
  "group_size": 128,
  "damp_percent": 0.01,
  "desc_act": false,
  "static_groups": false,
  "sym": false,
  "true_sequential": false,
  "model_name_or_path": null,
  "model_file_base_name": "model",
  "quant_method": "gptq",
  "checkpoint_format": "gptq",
  "meta": {
    "quantizer": "intel/auto-round:0.1",
    "packer": "autogptq:0.8.0.dev1",
    "iters": 20,
    "lr": 0.05,
    "minmax_lr": 0.05,
    "enable_minmax_tuning": true,
    "use_quant_input": true,
    "scale_dtype": "torch.float16"
  }
}

@Qubitium (Contributor Author) commented Apr 25, 2024

Commit/Refactor 59be4b3

Meta tooling fingerprints are now split into meta.quantizer and meta.packer, as shown by intel/auto-round's use of the autogptq packer. Models quantized by autogptq do not set meta.packer, but the field is checked for 3rd-party tools that do set it. v1 sym=False loading checks both meta.quantizer and meta.packer to see whether the checkpoint was either quantized or packed by autogptq; a sketch of this check follows the samples below.

# sample auto-round meta field
 "meta": {
    "quantizer": "intel/auto-round:0.1",
    "packer": "autogptq:0.8.0.dev1",
}

# sample autogptq meta field
 "meta": {
    "quantizer": "autogptq:0.8.0.dev1",
}
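
A rough sketch of that producer check (the helper name and parsing are illustrative, not the PR's actual code; meta entries follow the "tool:version" form shown above):

def produced_by_autogptq(quantize_config: dict) -> bool:
    # A v1 sym=False checkpoint is accepted if it was either quantized or packed
    # by autogptq: tools like intel/auto-round quantize themselves but call
    # autogptq's pack() for the v1 serialization, so meta.packer matters too.
    meta = quantize_config.get("meta") or {}
    for key in ("quantizer", "packer"):
        tool = meta.get(key, "").split(":", 1)[0].lower()
        if tool == "autogptq":
            return True
    return False

With the auto-round sample above, the check passes via meta.packer even though meta.quantizer is intel/auto-round.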

@Qubitium changed the title from "Fix Sym=False, new checkpoint_format = gptq_v2" to "[BUG/FEATURE] Fix Sym=False, new checkpoint_format = gptq_v2" on Apr 28, 2024
@wenhuach21 commented May 15, 2024

We found a significant discrepancy between the real quantized model and the QDQ model at W2G32 asym. For W2G32 sym and W4G128 asym there doesn't seem to be a severe issue, though sym is much worse than asym at W2G32 on the QDQ model.

As this PR also supports the v2 format and we are not familiar with the details, we're unsure whether we can test it directly or whether we need to merge the PR in auto-round.

@Qubitium (Contributor Author) commented:

@wenhuach21 Can you elaborate with more details?

Which model/weights and code reproduce the quantization discrepancies, so that @qwopqwop200 can better look at the potential sym=True vs sym=False math discrepancies?

@wenhuach21 commented May 16, 2024

These are the results. We used auto-round to generate the QDQ model and the real autogptq model with --deployment_device fake,gpu, then evaluated both models. lm-eval 0.4.2 was used.

asym at w2g32
qdq: lambada_openai: 0.5405  winogrande: 0.6417
gpu: lambada_openai: 0.2505  winogrande: 0.6006

sym at w2g32
qdq: lambada_openai: 0.3501  winogrande: 0.5817
gpu: lambada_openai: 0.3629  winogrande: 0.5856

Reference command:
git clone https://github.com/intel/auto-round.git
cd auto-round/examples/language-modeling
pip install -r requirements.txt

python3 main.py --model_name /data5/llama3_8b_instruct/ --bits 2 --group_size 32 --n_samples 512 --iters 200 --deployment_device fake,gpu --disable_eval --seqlen 512 --minmax_lr 0.01 --scale_dtype fp32 --train_bs 4 --output_dir "./tmp_signround"
