[RFC] Porting INC SmoothQuant recipes to IPEX autotune API #1502

Open

xin3he opened this issue Dec 27, 2023 · 2 comments

xin3he commented Dec 27, 2023

https://github.com/intel-innersource/frameworks.ai.pytorch.ipex-cpu/issues/2404


xin3he commented Jan 3, 2024

Motivation

SmoothQuant is a popular method for improving the accuracy of INT8 quantization. Intel Extension for PyTorch (IPEX) already supports SmoothQuant and delivers strong performance optimizations. Intel Neural Compressor (INC) adds finer-grained alpha tuning for the SmoothQuant algorithm, which yields better accuracy for LLMs such as Llama2. Integrating this feature into IPEX to get both good accuracy and good performance is a win-win.

Design

Original Interface

import intel_extension_for_pytorch as ipex
# Calibrate the model
qconfig = ipex.quantization.default_static_qconfig
calibrated_model = ipex.quantization.prepare(model_to_be_calibrated, qconfig, example_inputs=example_inputs)
for data in calibration_data_set:
    calibrated_model(data)
# Autotune the model
calib_dataloader = torch.utils.data.DataLoader(...)
def eval_func(model):
    # Return accuracy value
    ...
    return accuracy
tuned_model = ipex.quantization.autotune(
                calibrated_model, calib_dataloader, eval_func,
                sampling_sizes=[100], accuracy_criterion={'relative': 0.01}, tuning_time=0
            )
# Convert the model to jit model
quantized_model = ipex.quantization.convert(tuned_model)
with torch.no_grad():
    traced_model = torch.jit.trace(quantized_model, example_input)
    traced_model = torch.jit.freeze(traced_model)
# Do inference
y = traced_model(x)

New Interface for SmoothQuant

SmoothQuant introduces a hyperparameter alpha that balances how much quantization difficulty is shifted from the activations (inputs) to the weights, reducing overall quantization error. Intel Neural Compressor inherits and enhances this functionality, supporting automatic global alpha tuning as well as automatic layer-by-layer alpha tuning for the best INT8 accuracy. The available arguments are listed below, followed by a short sketch of how alpha is applied.

| Arguments | Default Value | Available Values | Comments |
| --- | --- | --- | --- |
| alpha | 'auto' | [0-1] / 'auto' | A value to balance input and weight quantization error. |
| init_alpha | 0.5 | [0-1] / 'auto' | A value to get the baseline quantization error for auto-tuning. |
| alpha_min | 0.0 | [0-1] | Min value of the auto-tuning alpha search space. |
| alpha_max | 1.0 | [0-1] | Max value of the auto-tuning alpha search space. |
| alpha_step | 0.1 | [0-1] | Step size of the auto-tuning alpha search space. |
| shared_criterion | "mean" | ["min", "mean", "max"] | Criterion for the input LayerNorm op of a transformer block. |
| enable_blockwise_loss | False | [True, False] | Whether to enable block-wise auto-tuning. |
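
For intuition, here is a minimal sketch of the per-channel smoothing that alpha controls, based on the original SmoothQuant formulation. The helper name and tensor shapes are illustrative only and are not part of the IPEX or INC API:

import torch

def smooth_linear(act_max_abs, weight, alpha=0.5):
    # act_max_abs: per-input-channel max |activation| collected during calibration, shape [in_features]
    # weight: linear layer weight, shape [out_features, in_features]
    w_max_abs = weight.abs().amax(dim=0)                         # per input channel
    scale = act_max_abs.pow(alpha) / w_max_abs.pow(1.0 - alpha)  # alpha trades difficulty between the two
    scale = torch.clamp(scale, min=1e-5)                         # avoid division by zero
    smoothed_weight = weight * scale                             # fold the scale into the weights
    # At runtime the activations are divided by `scale` (usually folded into the
    # preceding LayerNorm), so the matmul output is unchanged before quantization.
    return scale, smoothed_weight

A larger alpha moves more of the quantization difficulty onto the weights; 'auto' lets INC search for the value (globally or per layer) that minimizes accuracy loss.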

Proposal 1

In the original interface, calibration is redundant: the model is calibrated before autotune, and autotune then calibrates it again with calib_dataloader. We therefore propose to simplify the code as shown below, saving compute and time for users while keeping the changes compatible with the original design.

Impact:

  • Little development effort; can target IPEX version 2.2.
  • The changes are backward compatible with the original design.
import intel_extension_for_pytorch as ipex
# Set the tune space of SmoothQuant
smoothquant_args={
    "alpha": "auto",
    "auto_alpha_args"{
        "init_alpha": 0.5,
        "alpha_min": 0.0,
        "alpha_max": 1.0,
        "alpha_step": 0.1,
        "shared_criterion": "max",
        "enable_blockwise_loss": False,
    }
}
int8_tuned_model = ipex.quantization.autotune(
    model, calib_dataloader, eval_func, smoothquant_args=smoothquant_args,
    sampling_sizes=[100], accuracy_criterion={'relative': 0.01}, tuning_time=0
)
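
For context, a minimal sketch of the calib_dataloader and eval_func that the call above expects; the random calibration data and the validation_loader used for accuracy are placeholders, not part of the proposal:

import torch

# A toy calibration set of random inputs stands in for real calibration data.
example_inputs = torch.randn(8, 3, 224, 224)
calib_dataset = torch.utils.data.TensorDataset(example_inputs)
calib_dataloader = torch.utils.data.DataLoader(calib_dataset, batch_size=1)

def eval_func(model):
    # Evaluate the candidate quantized model and return a scalar accuracy; autotune
    # compares it against the FP32 baseline under accuracy_criterion={'relative': 0.01}.
    correct, total = 0, 0
    with torch.no_grad():
        for inputs, labels in validation_loader:  # user-provided labeled loader (placeholder)
            preds = model(inputs).argmax(dim=-1)
            correct += (preds == labels).sum().item()
            total += labels.numel()
    return correct / total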

Proposal 2 (Not recommended)

This proposal follows the previous design, but it is not ready yet. The main blocker is that an IPEX-prepared model cannot be jit-traced inside INC's internal SmoothQuant ([JIRA] prepared model cannot do jit trace). INC relies on jit tracing to recover the relationships between operations and detect which operations share the same input, so this proposal is not recommended.

Impact:

  • INC SmoothQuant is designed to accept an eager model, not an IPEX-prepared model.
  • Dependency: [JIRA] prepared model cannot do jit trace
  • Requires significant development work; cannot target IPEX version 2.2.
import intel_extension_for_pytorch as ipex
# Calibrate the model
qconfig = ipex.quantization.get_smooth_quant_qconfig_mapping()
calibrated_model = ipex.quantization.prepare(model_to_be_calibrated, qconfig, example_inputs=example_inputs)
for data in calibration_data_set:
    calibrated_model(data)
# Autotune the model
calib_dataloader = torch.utils.data.DataLoader(...)
def eval_func(model):
    # Return accuracy value
    ...
    return accuracy

# Set the tune space of SmoothQuant
smoothquant_args={
    "alpha": "auto",
    "auto_alpha_args"{
        "init_alpha": 0.5,
        "alpha_min": 0.0,
        "alpha_max": 1.0,
        "alpha_step": 0.1,
        "shared_criterion": "max",
        "enable_blockwise_loss": False,
    }
}
tuned_model = ipex.quantization.autotune(
    calibrated_model, calib_dataloader, eval_func, smoothquant_args=smoothquant_args,
    sampling_sizes=[100], accuracy_criterion={'relative': 0.01}, tuning_time=0
)
# Convert the model to jit model
quantized_model = ipex.quantization.convert(tuned_model)
with torch.no_grad():
    traced_model = torch.jit.trace(quantized_model, example_input)
    traced_model = torch.jit.freeze(traced_model)
# Do inference
y = traced_model(x)

Proposal 3 (Combination of Proposals 1 & 2) (Final choice)

This proposal partially follows the previous design but removes the separate preparation and calibration steps. It eliminates the redundant calibration and avoids the jit-trace blocker that appears after preparation.

Impact:

  • Little development effort; can target IPEX version 2.2.
import intel_extension_for_pytorch as ipex
# Set the tune space of SmoothQuant
smoothquant_args={
    "alpha": "auto",
    "auto_alpha_args"{
        "init_alpha": 0.5,
        "alpha_min": 0.0,
        "alpha_max": 1.0,
        "alpha_step": 0.1,
        "shared_criterion": "max",
        "enable_blockwise_loss": False,
    }
}
tuned_model = ipex.quantization.autotune(
    model, calib_dataloader, eval_func, smoothquant_args=smoothquant_args,
    sampling_sizes=[100], accuracy_criterion={'relative': 0.01}, tuning_time=0
)
# Convert the model to jit model
quantized_model = ipex.quantization.convert(tuned_model)
with torch.no_grad():
    traced_model = torch.jit.trace(quantized_model, example_input)
    traced_model = torch.jit.freeze(traced_model)
# Do inference
y = traced_model(x)

Commonly used smoothquant_args settings

auto global alpha tuning

import numpy

smoothquant_args={
    "alpha": numpy.arange(0.0, 1.0, 0.1).tolist(),
}

auto layer-wise alpha tuning

smoothquant_args={
    "alpha": "auto",
    "auto_alpha_args"{
        "init_alpha": 0.8,
        "alpha_min": 0.8,
        "alpha_max": 0.99,
        "alpha_step": 0.01,
        "shared_criterion": "mean",
        "enable_blockwise_loss": False,
    }
}
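
Either dictionary is then passed to autotune exactly as in Proposal 3; for example, reusing the layer-wise settings above:

import intel_extension_for_pytorch as ipex

tuned_model = ipex.quantization.autotune(
    model, calib_dataloader, eval_func, smoothquant_args=smoothquant_args,
    sampling_sizes=[100], accuracy_criterion={'relative': 0.01}, tuning_time=0
)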


xin3he commented Jan 3, 2024

After syncing in a meeting, we decided to take Proposal 3 to keep the flexibility of post-processing after automatic tuning.
