Speed Up Speech Recognition Inference on CPU Devices with Post-Training Quantization of the Nvidia NeMo ASR Model

Model quantization is a performance optimization technique that speeds up inference and decreases memory requirements by performing computations and storing tensors at lower bitwidths (such as INT8 or FLOAT16) than full floating-point precision. This is particularly beneficial during model deployment.
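
As a minimal illustration of the memory saving (not from this repository), the same weight matrix stored as INT8 takes a quarter of the space of FLOAT32:

import torch

w_fp32 = torch.randn(1024, 1024)  # 32-bit floating-point weights (~4 MB)
w_int8 = torch.quantize_per_tensor(w_fp32, scale=0.1, zero_point=0, dtype=torch.qint8)

print(w_fp32.element_size() * w_fp32.nelement())  # 4194304 bytes
print(w_int8.element_size() * w_int8.nelement())  # 1048576 bytes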

There are two model quantization methods:

  • Quantization Aware Training (QAT)
  • Post-training Quantization (PTQ)

QAT mimics the effects of quantization during training: the computations are carried out in floating-point precision, but the quantization error they will later incur is simulated. The weights and activations are quantized to lower precision only for inference, once training is completed.
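
A hedged sketch of QAT with PyTorch's eager-mode API (QAT is not used in this project; `MyFloatModel` is a placeholder):

import torch

model_fp32 = MyFloatModel()  # placeholder float model
model_fp32.train()
model_fp32.qconfig = torch.quantization.get_default_qat_qconfig('fbgemm')
model_prepared = torch.quantization.prepare_qat(model_fp32)

# ... run the usual training loop on model_prepared (fake-quantized ops) ...

model_prepared.eval()
model_int8 = torch.quantization.convert(model_prepared)  # real INT8 weights for inference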

PTQ quantizes an already fine-tuned model without retraining. The weights and activations of its ops are converted to lower precision to save memory and computation.


In this project, Post-Training Static Quantization is applied: it quantizes both the weights and the activations of the model statically. Follow the steps below to apply post-training static quantization; a compact sketch of the whole flow is given after the list.

  • Pre-trained Model
  • Prepare
  • Fuse Modules
  • Insert Stubs and Observers
  • Calibration
  • Quantization
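
A compact, hedged sketch of that flow on a generic PyTorch module (the layer names and `calibration_loader` are placeholders, not the exact code used for the NeMo model):

import torch

model.eval()

# Fuse Modules: merge patterns such as Conv + BatchNorm + ReLU into single ops
torch.quantization.fuse_modules(model, [['conv', 'bn', 'relu']], inplace=True)

# Insert Stubs and Observers: pick a backend qconfig and prepare the model
model.qconfig = torch.quantization.get_default_qconfig('qnnpack')
torch.quantization.prepare(model, inplace=True)

# Calibration: run representative data so the observers record activation ranges
with torch.no_grad():
    for batch in calibration_loader:
        model(batch)

# Quantization: swap observed modules for their INT8 implementations
torch.quantization.convert(model, inplace=True)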

Prepare Quantization Backend for Hardware

PyTorch currently provides two quantization backends:

  • FBGEMM is specific to x86 CPUs and is intended for deployments of quantized models on server CPUs.
  • QNNPACK has a range of targets that includes ARM CPUs (typically found in mobile/embedded devices).

# Choose the backend (QNNPACK for ARM CPUs) and attach its default qconfig,
# then insert observers so activation statistics can be collected
model.qconfig = torch.quantization.get_default_qconfig('qnnpack')
torch.quantization.prepare(model, inplace=True)
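
For deployment on x86 server CPUs, the FBGEMM backend would be selected instead (sketch):

model.qconfig = torch.quantization.get_default_qconfig('fbgemm')
torch.backends.quantized.engine = 'fbgemm'  # make the runtime use the same backend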

Insert Stubs at the Inputs and Outputs

# Insert the QuantStub() before the first layer of the model
model.quant = torch.quantization.QuantStub()
model.encoder.quant = torch.quantization.QuantStub()

# Insert a DeQuantStub() at the end of the model
model.dequant = torch.quantization.DeQuantStub()
model.decoder.dequant = torch.quantization.DeQuantStub()
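
The remaining steps from the list above, Calibration and Quantization, would then look roughly like this (the NeMo forward-argument names and `calibration_loader` are assumptions, not taken from this repository):

model.eval()

# Calibration: feed representative audio so the observers see realistic activations
with torch.no_grad():
    for audio_signal, audio_length in calibration_loader:  # placeholder data loader
        model(input_signal=audio_signal, input_signal_length=audio_length)

# Quantization: convert observed modules to INT8 versions for CPU inference
torch.quantization.convert(model, inplace=True)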

Reference pages: