[RFC] Porting INC SmoothQuant recipes to IPEX autotune API #1502
## Motivation

SmoothQuant is a popular method to improve the accuracy of int8 quantization. Intel Extension for PyTorch (IPEX) already supports SmoothQuant and provides good performance optimizations. Intel Neural Compressor (INC) provides finer-grained alpha tuning for the SmoothQuant algorithm, yielding better accuracy for LLMs such as Llama2. Integrating this feature into IPEX for good accuracy and performance is a win-win.

## Design

### Original Interface

```python
import intel_extension_for_pytorch as ipex
# Calibrate the model
qconfig = ipex.quantization.default_static_qconfig
calibrated_model = ipex.quantization.prepare(model_to_be_calibrated, qconfig, example_inputs=example_inputs)
for data in calibration_data_set:
    calibrated_model(data)
# Autotune the model
calib_dataloader = torch.utils.data.DataLoader(...)
def eval_func(model):
    # Return accuracy value
    ...
    return accuracy
tuned_model = ipex.quantization.autotune(
    calibrated_model, calib_dataloader, eval_func,
    sampling_sizes=[100], accuracy_criterion={'relative': 0.01}, tuning_time=0
)
# Convert the model to jit model
quantized_model = ipex.quantization.convert(tuned_model)
with torch.no_grad():
    traced_model = torch.jit.trace(quantized_model, example_input)
    traced_model = torch.jit.freeze(traced_model)
# Do inference
y = traced_model(x)
```

### New Interface for SmoothQuant

SmoothQuant introduces an alpha parameter to calculate the ratio between input and weight scaling, reducing quantization error. Intel Neural Compressor inherits and enhances this functionality, allowing automatic global alpha tuning as well as automatic layer-by-layer alpha tuning for the best INT8 accuracy.
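For intuition, alpha balances how much quantization difficulty is migrated from activations to weights. Below is a minimal, illustrative sketch of the core SmoothQuant transform, not IPEX or INC internals; the helper `smooth_scales` and the toy tensors are made up for this example:

```python
import torch

def smooth_scales(act_max, weight, alpha=0.5, eps=1e-8):
    # Per-input-channel SmoothQuant scales: s_j = max|X_j|^alpha / max|W_j|^(1 - alpha).
    # act_max: per-channel max of |activations| collected during calibration, shape [in_features]
    # weight:  [out_features, in_features]
    w_max = weight.abs().amax(dim=0)
    return act_max.clamp(min=eps).pow(alpha) / w_max.clamp(min=eps).pow(1.0 - alpha)

# Toy example: channel 0 of the activations carries large outliers.
x = torch.randn(4, 8) * torch.tensor([10.0, 1, 1, 1, 1, 1, 1, 1])
w = torch.randn(16, 8)
s = smooth_scales(x.abs().amax(dim=0), w, alpha=0.5)

# Dividing activations by s and multiplying weights by s leaves the linear
# layer's output mathematically unchanged while smoothing activation outliers.
assert torch.allclose((x / s) @ (w * s).T, x @ w.T, atol=1e-3)
```

A larger alpha migrates more of the quantization difficulty from activations into the weights, which is why the best value is model- and layer-dependent and worth tuning automatically.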
### Proposal 1

The separate calibration step in the previous code is redundant in this proposal and is removed; autotune performs calibration internally using calib_dataloader. Impact:

```python
import intel_extension_for_pytorch as ipex
# Set the tune space of SmoothQuant
smoothquant_args = {
    "alpha": "auto",
    "auto_alpha_args": {
        "init_alpha": 0.5,
        "alpha_min": 0.0,
        "alpha_max": 1.0,
        "alpha_step": 0.1,
        "shared_criterion": "max",
        "enable_blockwise_loss": False,
    },
}
int8_tuned_model = ipex.quantization.autotune(
    model, calib_dataloader, eval_func, smoothquant_args=smoothquant_args,
    sampling_sizes=[100], accuracy_criterion={'relative': 0.01}, tuning_time=0
)
```
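For reference, the `calib_dataloader` and `eval_func` passed to `autotune` above could be defined roughly as in the sketch below; `calib_dataset` and `val_dataloader` are hypothetical placeholders for the user's own data:

```python
import torch

# Hypothetical calibration set: any dataset yielding model inputs works here.
calib_dataloader = torch.utils.data.DataLoader(calib_dataset, batch_size=1)

def eval_func(model):
    # autotune only needs a callable that takes a candidate model and returns
    # a scalar accuracy; this classification-style loop is just one example.
    correct, total = 0, 0
    with torch.no_grad():
        for inputs, labels in val_dataloader:
            preds = model(inputs).argmax(dim=-1)
            correct += (preds == labels).sum().item()
            total += labels.numel()
    return correct / total
```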
### Proposal 2 (Not recommended)

This proposal follows the previous design, but it is not ready yet: an IPEX-prepared model cannot be jit-traced inside INC's internal SmoothQuant flow ([JIRA] prepared model cannot do jit trace). INC relies on jit trace to get the relationships between operations and detect which operations share the same input. This proposal is therefore not recommended. Impact:

```python
import intel_extension_for_pytorch as ipex
# Calibrate the model
qconfig = ipex.quantization.get_smooth_quant_qconfig_mapping()
calibrated_model = ipex.quantization.prepare(model_to_be_calibrated, qconfig, example_inputs=example_inputs)
for data in calibration_data_set:
    calibrated_model(data)
# Autotune the model
calib_dataloader = torch.utils.data.DataLoader(...)
def eval_func(model):
    # Return accuracy value
    ...
    return accuracy
# Set the tune space of SmoothQuant
smoothquant_args = {
    "alpha": "auto",
    "auto_alpha_args": {
        "init_alpha": 0.5,
        "alpha_min": 0.0,
        "alpha_max": 1.0,
        "alpha_step": 0.1,
        "shared_criterion": "max",
        "enable_blockwise_loss": False,
    },
}
tuned_model = ipex.quantization.autotune(
    calibrated_model, calib_dataloader, eval_func, smoothquant_args=smoothquant_args,
    sampling_sizes=[100], accuracy_criterion={'relative': 0.01}, tuning_time=0
)
# Convert the model to jit model
quantized_model = ipex.quantization.convert(tuned_model)
with torch.no_grad():
    traced_model = torch.jit.trace(quantized_model, example_input)
    traced_model = torch.jit.freeze(traced_model)
# Do inference
y = traced_model(x)
```

### Proposal 3 (Combination of Proposals 1 & 2) (Final choice)

This proposal partially follows the previous design but removes the preparation and calibration steps. It eliminates the redundant calibration and avoids the jit-trace blocking issue after prepare. Impact:

```python
import intel_extension_for_pytorch as ipex
# Set the tune space of SmoothQuant
smoothquant_args = {
    "alpha": "auto",
    "auto_alpha_args": {
        "init_alpha": 0.5,
        "alpha_min": 0.0,
        "alpha_max": 1.0,
        "alpha_step": 0.1,
        "shared_criterion": "max",
        "enable_blockwise_loss": False,
    },
}
tuned_model = ipex.quantization.autotune(
    model, calib_dataloader, eval_func, smoothquant_args=smoothquant_args,
    sampling_sizes=[100], accuracy_criterion={'relative': 0.01}, tuning_time=0
)
# Convert the model to jit model
quantized_model = ipex.quantization.convert(tuned_model)
with torch.no_grad():
    traced_model = torch.jit.trace(quantized_model, example_input)
    traced_model = torch.jit.freeze(traced_model)
# Do inference
y = traced_model(x)
```
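Because Proposal 3 leaves tracing and freezing in the user's hands, the tuned model can be post-processed and serialized with the standard TorchScript APIs. A brief continuation of the snippet above (the file name is arbitrary):

```python
import torch

# Save the traced model for deployment, then reload it elsewhere.
traced_model.save("smoothquant_int8_model.pt")
loaded = torch.jit.load("smoothquant_int8_model.pt")
y = loaded(x)
```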
"alpha": numpy.arange(0.0, 1.0, 0.1).tolist(),
} auto layer-wise alpha tuningsmoothquant_args={
"alpha": "auto",
"auto_alpha_args"{
"init_alpha": 0.8,
"alpha_min": 0.8,
"alpha_max": 0.99,
"alpha_step": 0.01,
"shared_criterion": "mean",
"enable_blockwise_loss": False,
}
} |
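To make the difference concrete: a list value for "alpha" defines a fixed global grid of candidates shared by all layers, while "alpha": "auto" searches per layer within [alpha_min, alpha_max] at alpha_step granularity. A small sketch of the grid produced by the global setting above (note that numpy.arange excludes the stop value):

```python
import numpy

# Ten candidates from 0.0 to 0.9; 1.0 itself is excluded, and some entries
# carry float rounding artifacts (e.g. 0.30000000000000004).
print(numpy.arange(0.0, 1.0, 0.1).tolist())
```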
After a sync-up meeting, we decided to take option 3 to ensure the flexibility of post-processing after automatic tuning.
https://github.com/intel-innersource/frameworks.ai.pytorch.ipex-cpu/issues/2404