In recent years deep learning models have become huge, reaching hundreds of billions of parameters, hence the need to reduce their size. Of course, this has to be done without sacrificing accuracy. Enter quantization.

Background

As you might know, deep learning models eat numbers, both during training and inference. When the task has to do with images we're already good to go, since images are nothing more than matrices of pixels. When we have to deal with text, things get a little trickier, but tokenization and embedding techniques come to our aid.

The point is: at the end we always have numbers. More precisely, floating-point numbers.

Floating-point numbers are meant to represent real numbers. A floating-point number is typically written using the following notation:

$$ s \times b^e $$

  • $s$ is the significand (also called fraction, mantissa, coefficient, argument etc., just to confuse us), whose length determines the precision to which numbers can be represented
  • $b$ is the base (also called radix)
  • $e$ is the exponent (one name here, as far as I know)

Why are they called “floating-point” numbers? The reason is that the radix point, i.e. the symbol that separates the integer part of a value from its fractional part, can “float” in any position according to the exponent.

Two important notes:

  • When you use a fixed number of bits (as in our case) there is a limit on the set of numbers you can represent. In any range of real numbers, no matter how small, there are infinitely many values, while with $n$ bits we can represent at most $2^n$ distinct numbers (the snippet below shows a couple of consequences of this)
  • Although there are ways to speed it up, floating-point arithmetic is much less efficient than integer arithmetic
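A quick NumPy check makes the first point tangible (nothing framework-specific here, just the standard IEEE formats):

```python
import numpy as np

# float16 keeps only 10 significand bits, float32 keeps 23: neither can
# represent every real number, and each runs out of integer precision eventually.
print(float(np.float16(0.1)))         # 0.0999755859375 (nearest representable value)
print(float(np.float32(16_777_217)))  # 16777216.0 (2**24 + 1 doesn't fit in float32)
print(float(np.float16(2049)))        # 2048.0 (2**11 + 1 doesn't fit in float16)
```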

Floating-point numbers in Deep Learning

The two most commonly used floating-point representations in deep learning are 32-bit (FP32) and 16-bit (FP16) floats. There are also double-precision formats such as float64 (FP64), but they are not typically used: as you can imagine, they allow for more accurate results, but at the cost of greater computational power, memory usage, and data transfer.

FP32 representation is called single precision and as the name suggests it uses 32 bits distributed as follows:

  • 1 sign bit, which is 0 for positive numbers and 1 for negative numbers
  • 8 bits represent the exponent
  • 23 bits represent the significand

FP16 representation is called half-precision and it uses 16 bits (half the bits), distributed as follows (a small snippet after this list shows how to pull these fields out of an actual number):

  • 1 sign bit, which is 0 for positive numbers and 1 for negative numbers
  • 5 bits represent the exponent
  • 10 bits represent the significand
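To make the layout concrete, here is a small sketch (plain Python, standard struct module) that extracts the three FP32 fields from a number. One IEEE 754 detail glossed over above: the exponent is stored with a bias of 127, and the significand's leading 1 is implicit.

```python
import struct

def fp32_bits(x: float):
    """Split a float into the sign, exponent and significand bits of its FP32 encoding."""
    # Reinterpret the 4 bytes of the float32 encoding as an unsigned 32-bit integer.
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31                 # 1 bit
    exponent = (bits >> 23) & 0xFF    # 8 bits (stored with a bias of 127)
    significand = bits & 0x7FFFFF     # 23 bits (the leading 1 is implicit)
    return sign, exponent, significand

s, e, m = fp32_bits(-6.25)            # -6.25 = -1.5625 * 2**2
print(s, e - 127, bin(m))             # 1, 2, and the fraction bits of 1.5625
```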

Quantization

Now that we have seen a bit of theoretical background, understanding quantization will be much easier.

The idea is very simple: reduce the model size by converting high-precision floating-point representations to low-precision floating-point or even integer representations. Empirically, it has been shown that even with a simple 8-bit integer (INT8) representation the accuracy of the model is often barely affected, especially when weighed against how much we gain in terms of performance.
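As a back-of-the-envelope check, here is what the size reduction looks like for a toy tensor of one million parameters (the scale factor used here is completely arbitrary; how to pick it properly is exactly what the calibration section below is about):

```python
import numpy as np

# A toy "layer" with 1M parameters: storing it in INT8 instead of FP32
# cuts the memory footprint by 4x (precision questions aside).
weights_fp32 = np.random.randn(1_000_000).astype(np.float32)
weights_int8 = np.clip(np.round(weights_fp32 / 0.02), -128, 127).astype(np.int8)  # arbitrary scale

print(weights_fp32.nbytes)  # 4000000 bytes
print(weights_int8.nbytes)  # 1000000 bytes
```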

Is it that easy though?

Not really: in order to achieve a successful quantization, we need to consider a few other aspects:

  • it depends on the model we’re using
  • it typically requires extensive fine-tuning
  • it can in fact reduce the model accuracy in a significant way, especially if we go from the wide dynamic range of FP32 to a range of just 256 values, as in the case of INT8
  • it needs to be supported by the hardware we’re using

Calibration

In practice, when we quantize weights/activations we are essentially scaling the floating-point value by some factor and rounding the result to a whole number.

Let’s suppose we want to go from FP32 to INT8. As we know, only 256 values can be represented in INT8. If $[a, b]$ is our FP32 range, we need to project it to the INT8 subspace.

If $x$ is our floating-point number, the quantized version becomes: $$ x_q = \text{clip}(\text{round}(x/S + Z), \text{round}(a/S + Z), \text{round}(b/S + Z)) $$ where $S$ and $Z$ are the quantization parameters:

  • $S$ is the scale (FP32 value)
  • $Z$ is called the zero-point and it is the INT8 value that corresponds to the value $0$ in the FP32 realm

You can see how $S$ and $Z$ are computed in the paper Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference.

In the above formula, we're projecting our floating-point values onto the signed INT8 range $[-2^{n-1}, 2^{n-1} - 1] = [-128, 127]$ (with $n = 8$ bits). However, it's common to use a symmetric version of this scheme, meaning that we want our final range to be symmetric around zero, i.e. $[-127, 127]$ in our example. Why? The reason is that doing so forces the zero-point $Z$ to be zero, and thus we can improve performance even more by skipping the addition of $Z$.
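Here is a minimal NumPy sketch of both variants. The specific formulas used for $S$ and $Z$ are the common min/max-based choices, not necessarily the exact ones from the paper, so treat this as an illustration rather than a reference implementation:

```python
import numpy as np

def quantize(x, a, b, n_bits=8, symmetric=False):
    """Map FP32 values whose calibrated range is [a, b] to signed integers (a sketch)."""
    if symmetric:
        # Symmetric scheme: [-alpha, alpha] is mapped to [-127, 127] and Z is forced to 0.
        alpha = max(abs(a), abs(b))
        qmin, qmax = -(2 ** (n_bits - 1) - 1), 2 ** (n_bits - 1) - 1   # [-127, 127]
        S, Z = alpha / qmax, 0
    else:
        # Asymmetric (affine) scheme: [a, b] is mapped to the full [-128, 127] range.
        qmin, qmax = -(2 ** (n_bits - 1)), 2 ** (n_bits - 1) - 1       # [-128, 127]
        S = (b - a) / (qmax - qmin)
        Z = int(round(qmin - a / S))   # the integer that represents 0.0
    x_q = np.clip(np.round(x / S + Z), qmin, qmax).astype(np.int8)
    return x_q, S, Z

def dequantize(x_q, S, Z):
    """Go back to (approximate) FP32 values."""
    return (x_q.astype(np.float32) - Z) * S

x = np.array([-0.7, 0.0, 0.31, 1.2], dtype=np.float32)
x_q, S, Z = quantize(x, a=-1.0, b=1.5)
print(x_q)                      # [-97 -26   6  96]
print(dequantize(x_q, S, Z))    # the original values, up to quantization error
```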

At this point one could ask: how is the $[a, b]$ range calculated?

  • for weights we know the range at quantization-time, since we can simply look at the stored values
  • for activations there exist different approaches (a toy sketch of the simplest one, plain min/max tracking, follows this list)
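The simplest approach is to run some data through the model and track the minimum and maximum value seen by each activation; real frameworks also offer smarter observers (histogram or percentile based), but a toy min/max version looks roughly like this:

```python
import numpy as np

class MinMaxObserver:
    """Toy calibration observer: track the running min/max of the activations it sees."""
    def __init__(self):
        self.a, self.b = np.inf, -np.inf

    def observe(self, activations):
        self.a = min(self.a, float(activations.min()))
        self.b = max(self.b, float(activations.max()))

obs = MinMaxObserver()
for _ in range(10):                        # a handful of "calibration" batches
    obs.observe(np.random.randn(32, 128))  # stand-in for one layer's activations
print(obs.a, obs.b)                        # the [a, b] range from which S and Z are derived
```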

Post-training Quantization

As the name suggests, post-training quantization is applied once the model is already trained.

Post-training Dynamic Quantization

Dynamic range quantization is typically the recommended starting point because it can be easily applied without any extra effort. The model parameters are known, so they are converted ahead of time and stored in INT8 form. By contrast, the scale factor for activations is determined dynamically, according to the data range observed at runtime.
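In PyTorch, for example, this boils down to a one-liner. The sketch below uses the eager-mode torch.quantization API (newer releases expose the same thing under torch.ao.quantization, so check the docs of your version), and the toy model is of course just a stand-in:

```python
import torch
import torch.nn as nn

# A toy model; dynamic quantization in PyTorch targets layer types such as
# nn.Linear (and LSTM variants), which is where most of the weights usually live.
model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 10))

quantized_model = torch.quantization.quantize_dynamic(
    model,                 # the trained FP32 model
    {nn.Linear},           # layer types to quantize
    dtype=torch.qint8,     # store weights as INT8
)

x = torch.randn(1, 128)
print(quantized_model(x).shape)   # inference works as before, activations quantized on the fly
```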

The PyTorch documentation says that

Arithmetic in the quantized model is done using vectorized INT8 instructions. Accumulation is typically done with INT16 or INT32 to avoid overflow. This higher precision value is scaled back to INT8 if the next layer is quantized or converted to FP32 for output.

while the TensorFlow documentation says

The outputs are still stored using floating point so the increased speed of dynamic-range ops is less than a full fixed-point computation.

so I guess it depends on the framework you use.

One drawback of this type of quantization is that it can be a bit slower than static quantization, since computing the activation ranges at runtime introduces some computational overhead.

Post-training Static Quantization

Same as above, but the range for each activation is computed at quantization-time. This means we need to run a few inference cycles, so the converter requires a representative dataset to calibrate the activation ranges; this can be a small subset of the training or validation data.
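Here is a hedged sketch of what the eager-mode PyTorch workflow looks like. The QuantStub/DeQuantStub modules mark where tensors enter and leave the quantized region; the toy model and the random "calibration" data are stand-ins, and recent versions expose the same functions under torch.ao.quantization:

```python
import torch
import torch.nn as nn

class ToyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.quant = torch.quantization.QuantStub()      # FP32 -> INT8 boundary
        self.fc1 = nn.Linear(128, 256)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(256, 10)
        self.dequant = torch.quantization.DeQuantStub()  # INT8 -> FP32 boundary

    def forward(self, x):
        x = self.quant(x)
        x = self.fc2(self.relu(self.fc1(x)))
        return self.dequant(x)

model = ToyModel().eval()
model.qconfig = torch.quantization.get_default_qconfig("fbgemm")  # x86 backend; use "qnnpack" on ARM
prepared = torch.quantization.prepare(model)       # inserts observers on each layer

# Calibration: run a few batches of representative (unlabelled) data through the model
for _ in range(32):
    prepared(torch.randn(8, 128))

quantized = torch.quantization.convert(prepared)   # freezes the observed [a, b] ranges into S and Z
```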

Quantization-Aware Training (QAT)

Until now we’ve seen how to apply quantization as a kind of post-processing technique, after the model is trained. What if we embed quantization into the training process? This is exactly what Quantization-Aware Training does.

This approach is similar to the static one, except for the fact that the range for each activation is computed at training-time. Instead of just observing the values resulting from inference, we use their quantized version to let the model adapt to it. This typically allows the model to retain much of its original accuracy. Moreover, QAT allows for finer-grained control over the quantization process, as the kind of quantization can be set according to each layer's sensitivity to quantization errors.
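A rough PyTorch sketch, with the same caveats as the static example above (toy model, random data, stand-in training loop):

```python
import torch
import torch.nn as nn

# Same toy architecture as before, wrapped with Quant/DeQuant stubs for eager-mode QAT.
model = nn.Sequential(
    torch.quantization.QuantStub(),
    nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 10),
    torch.quantization.DeQuantStub(),
).train()

model.qconfig = torch.quantization.get_default_qat_qconfig("fbgemm")
prepared = torch.quantization.prepare_qat(model)     # inserts fake-quantize modules

optimizer = torch.optim.SGD(prepared.parameters(), lr=0.01)
for _ in range(100):                                 # stand-in training loop on random data
    x, y = torch.randn(8, 128), torch.randint(0, 10, (8,))
    loss = nn.functional.cross_entropy(prepared(x), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

quantized = torch.quantization.convert(prepared.eval())  # the actual INT8 model for inference
```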

Conclusions

| Quantization technique | Data requirements | Size reduction | Accuracy |
| --- | --- | --- | --- |
| Post-training dynamic range quantization | No data | Up to 75% | Smallest accuracy loss |
| Post-training static quantization | Unlabelled representative sample | Up to 75% | Small accuracy loss |
| Quantization-aware training | Labelled training data | Up to 75% | Smallest accuracy loss |

There are many more details about quantization and, as you can imagine, over the years researchers have dug deeper and deeper, trying to squeeze every drop from the rock. Moreover, each framework has its own peculiarities and features. Hence, if you want to use quantization for your project, take a look at the documentation of the tools/libraries you're using to see what is possible and what is not.

References