How to Perform Quantization in Machine Learning (Math Behind It)
Quantization Is a Must-Have Step for Fine-Tuning Large Language Models
Suppose you are trying to fit a large number of books into a small suitcase. You can't take all of them, so you must decide which ones to bring and which to leave behind. This process of squeezing what you need into limited space is quite similar to what we do in machine learning when we perform quantization.
Quantization is a technique for reducing the number of bits needed to represent data. Applied to machine learning models, it compresses them, making them smaller, faster, and more efficient.
In this article, we'll delve into the concept of quantization, its types, and how to perform it effectively.
What is Quantization and Why is it Important?
Quantization in the context of machine learning refers to the process of mapping a large set of input values to a smaller set.
This is primarily done to reduce the computational and memory requirements of machine learning models, making them more efficient without significantly sacrificing accuracy.
Why Quantization Matters
Efficiency: Quantized models run faster and consume less power, which is crucial for deploying models on edge devices like smartphones and IoT devices.
Storage: Reducing the size of models means they take up less storage space, which is beneficial for both cloud storage and local storage on devices.
Latency: Faster models result in lower latency, which is critical for real-time applications like autonomous driving and interactive AI systems.
You might be wondering: how?
Consider a neural network model that uses 32-bit floating-point numbers to represent weights. If we quantize these weights to 8-bit integers, the model size is reduced by a factor of four.
This reduction can lead to significant improvements in speed and efficiency, especially when deploying the model on resource-constrained devices.
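To make that saving concrete, here is a minimal NumPy sketch (the 10-million-parameter weight array is purely illustrative) comparing the raw storage of 32-bit floats and 8-bit integers:

```python
import numpy as np

# A hypothetical layer with 10 million weights (the size is purely illustrative).
n_params = 10_000_000
weights_fp32 = np.zeros(n_params, dtype=np.float32)   # 32-bit floats
weights_int8 = np.zeros(n_params, dtype=np.int8)      # 8-bit integers

print(f"float32: {weights_fp32.nbytes / 1e6:.0f} MB")  # 40 MB
print(f"int8:    {weights_int8.nbytes / 1e6:.0f} MB")  # 10 MB, a 4x reduction
```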
Types of Quantization
Quantization can be broadly classified into two types:
Symmetric, and
Asymmetric.
Symmetric Quantization
In symmetric quantization, the range of the input values is symmetrically mapped around zero. This means that the positive and negative ranges are equal.
The formula for symmetric quantization is:
Q(x) = round(x / scale)
Where:
Q(x) is the quantized value.
x is the original value.
scale is a factor that determines the range of the quantized values.
Asymmetric Quantization
In asymmetric quantization, the range of the input values is not symmetrically mapped around zero. This means that the positive and negative ranges can be different.
The formula for asymmetric quantization is:
Q(x) = round((x - zero_point) / scale)
Where:
Q(x) is the quantized value.
x is the original value.
zero_point is an offset that shifts the range of the input values.
scale is a factor that determines the range of the quantized values.
How to Perform Quantization
Symmetric Quantization
Determine the Range: Calculate the range of the input values.
Calculate the Scale Factor: The scale factor is determined by the maximum absolute value of the input range divided by the maximum value of the quantized range.
scale = max(|input_range|) / max(quantized_range)
Quantize the Values: Apply the symmetric quantization formula to convert the original values to quantized values.
Q(x) = round(x / scale)
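Here is a minimal Python/NumPy sketch of these three steps (the function name and sample values are illustrative; it uses the same [-10, 10] range as the example below):

```python
import numpy as np

def quantize_symmetric(x: np.ndarray, num_bits: int = 8):
    """Sketch of symmetric quantization following the steps above."""
    # Step 1: determine the range via the maximum absolute input value.
    max_abs = np.max(np.abs(x))
    # Step 2: scale = max(|input_range|) / max(quantized_range).
    q_max = 2 ** (num_bits - 1) - 1          # 127 for 8-bit signed integers
    scale = max_abs / q_max
    # Step 3: Q(x) = round(x / scale), clipped to the symmetric signed range.
    q = np.clip(np.round(x / scale), -q_max, q_max).astype(np.int8)
    return q, scale

x = np.array([-10.0, -7.5, 0.0, 5.0, 10.0])
q, scale = quantize_symmetric(x)
print(q, round(scale, 4))   # roughly [-127  -95    0   64  127], scale = 0.0787
```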
Example
Let's say we have an input range of [-10, 10] and we want to quantize it to 8-bit signed integers, which have a range of [-128, 127].
Determine the Range:
The input range is [-10, 10].
Calculate the Scale Factor:
scale = max(|input_range|) / max(quantized_range)
scale = 10 / 127 ≈ 0.0787
Quantize the Values:
Q(x) = round(x / scale)
Now, let's break this down:
The scale factor (0.0787) maps the input range [-10, 10] onto the signed 8-bit range; symmetric quantization uses [-127, 127], leaving -128 unused.
This scaling lets us use nearly the full range of the 8-bit representation, maximizing precision.
Let's see how some values would be quantized:
Q(0) = round(0 / 0.0787) = 0
Q(10) = round(10 / 0.0787) ≈ round(127.07) = 127
Q(-10) = round(-10 / 0.0787) ≈ round(-127.07) = -127
Q(5) = round(5 / 0.0787) ≈ round(63.53) = 64
Q(-7.5) = round(-7.5 / 0.0787) ≈ round(-95.3) = -95
To dequantize, we would use:
x = Q(x) * scale
This symmetric quantization:
Maintains zero at zero
Scales the input range to utilize the full range of the quantized values
Preserves the sign of the input
💡 Note:
This approach is often used in machine learning, especially for quantizing weights and activations in neural networks, as it allows for efficient storage and computation while minimizing information loss.
Asymmetric Quantization
Determine the Range: Calculate the range of the input values.
Calculate the Scale Factor: The scale factor is the width of the input range (maximum minus minimum) divided by the width of the quantized range.
scale = (max(input_range) - min(input_range)) / (max(quantized_range) - min(quantized_range))
Calculate the Zero Point: The zero point is chosen so that the minimum input value maps to the minimum quantized value. It depends on the scale factor, so it is computed second.
zero_point = min(input_range) - min(quantized_range) * scale
Quantize the Values: Apply the asymmetric quantization formula to convert the original values to quantized values.
Q(x) = round((x - zero_point) / scale)
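Here is a minimal Python/NumPy sketch of these steps for unsigned 8-bit integers (the function name and sample values are illustrative; it uses the same [10, 260] range as the example below):

```python
import numpy as np

def quantize_asymmetric(x: np.ndarray, num_bits: int = 8):
    """Sketch of asymmetric quantization following the steps above."""
    q_min, q_max = 0, 2 ** num_bits - 1          # [0, 255] for 8-bit unsigned
    x_min, x_max = np.min(x), np.max(x)
    # Scale = width of the input range / width of the quantized range.
    scale = (x_max - x_min) / (q_max - q_min)
    # Zero point maps the minimum input value to the minimum quantized value.
    zero_point = x_min - q_min * scale
    # Q(x) = round((x - zero_point) / scale), clipped to the unsigned range.
    q = np.clip(np.round((x - zero_point) / scale), q_min, q_max).astype(np.uint8)
    return q, scale, zero_point

x = np.array([10.0, 135.0, 260.0])
q, scale, zero_point = quantize_asymmetric(x)
print(q, round(scale, 4), zero_point)   # roughly [  0 128 255], 0.9804, 10.0
```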
Example
Let's say we have an input range of [10, 260] and we want to quantize it to 8-bit unsigned integers, which have a range of [0, 255].
Determine the Range:
The input range is [10, 260].
Calculate the Scale Factor:
scale = (max(input_range) - min(input_range)) / (max(quantized_range) - min(quantized_range))
scale = (260 - 10) / (255 - 0) = 250 / 255 ≈ 0.9804
Calculate the Zero Point:
zero_point = min(input_range) - min(quantized_range) * scale
zero_point = 10 - 0 * 0.9804 = 10
Quantize the Values:
Q(x) = round((x - zero_point) / scale)
Q(x) = round((x - 10) / 0.9804)
Now, let's break this down:
The scale factor (0.9804) maps the input range [10, 260] to the full range of 8-bit integers [0, 255].
The zero point (10) shifts the input range so that the minimum input value maps to 0 in the quantized range.
Let's see how some values would be quantized:
Q(10) = round((10 - 10) / 0.9804) = round(0) = 0 (minimum quantized value)
Q(260) = round((260 - 10) / 0.9804) ≈ round(255) = 255 (maximum quantized value)
Q(135) = round((135 - 10) / 0.9804) ≈ round(127.5) = 128
To dequantize, we would use:
x = (Q(x) * scale) + zero_point
This asymmetric quantization:
Handles input ranges that don't start at zero
Scales and shifts the input range to utilize the full range of the quantized values
Preserves the relative distances between values in the original range
This approach is particularly useful in scenarios where:
The input range is not centred around zero
You're dealing with strictly positive (or negative) values
You want to maximize the use of the available quantization range
💡 Note:
Asymmetric quantization is often used in machine learning for quantizing activations or input data that have a skewed or non-zero-centred distribution, allowing for more efficient storage and computation while minimizing information loss.
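As a rough illustration of why the asymmetric variant helps for skewed, non-zero-centred data, the sketch below compares the round-trip (quantize then dequantize) error of both schemes on a synthetic, strictly positive "activation" distribution (the exponential distribution and the 8-bit ranges are assumptions made for this demo):

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic "activations": strictly positive and skewed, so not centred on zero.
x = rng.exponential(scale=2.0, size=10_000).astype(np.float32)

# Symmetric: the scale covers [-max|x|, max|x|], wasting half the range on negatives.
s_sym = np.max(np.abs(x)) / 127
q_sym = np.clip(np.round(x / s_sym), -127, 127)
x_sym = q_sym * s_sym

# Asymmetric: the scale and zero point cover exactly [min(x), max(x)].
s_asym = (x.max() - x.min()) / 255
zp = x.min()
q_asym = np.clip(np.round((x - zp) / s_asym), 0, 255)
x_asym = q_asym * s_asym + zp

print("symmetric  MSE:", np.mean((x - x_sym) ** 2))
print("asymmetric MSE:", np.mean((x - x_asym) ** 2))  # typically about 4x smaller
```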
Conclusion
Quantization is a powerful technique that can significantly improve the efficiency of machine learning models. By understanding the types of quantization and how to perform them, you can make your models faster and more efficient without sacrificing too much accuracy.
Key Takeaways
Quantization reduces the number of bits needed to represent information, making models more efficient.
There are two main types of quantization: symmetric and asymmetric.
Symmetric quantization maps input values symmetrically around zero, while asymmetric quantization does not.
The key components of quantization are the zero point and the scale factor.
Properly applying quantization can lead to significant improvements in model efficiency and performance.
Have you found this article useful? Please let me know in the comments.
Please consider ❤ liking this article. Also, you can support me here.