Compacting an AI model to run faster. AI quantization is performed primarily on the inference (user) side so that models run more quickly on phones and desktop computers. For example, whereas a model's weights (parameters) may be 32-bit floating point numbers during training, they might be reduced to 16-bit floating point or 8-bit integers. Even 4-bit floating point numbers might be used (see
FP4).
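
The arithmetic can be sketched in a few lines. The following Python snippet is a minimal illustration, not a production method: the weight values are made up, and real quantizers add refinements such as per-channel scales, zero points, and calibration. It maps 32-bit floats to 8-bit integers using a single symmetric scale factor.

  import numpy as np

  # Hypothetical float32 weights, as they might look after training.
  weights_fp32 = np.array([0.82, -1.91, 0.004, 1.25, -0.33], dtype=np.float32)

  # Symmetric 8-bit quantization: map the largest absolute weight to 127.
  scale = np.abs(weights_fp32).max() / 127.0
  weights_int8 = np.round(weights_fp32 / scale).astype(np.int8)

  # At inference time, the integers are rescaled to approximate the originals.
  weights_restored = weights_int8.astype(np.float32) * scale

  print(weights_int8)      # e.g., [  55 -127    0   83  -22]
  print(weights_restored)  # close to, but not exactly, the original values

Each weight now occupies one byte instead of four, which shrinks memory use and lets inference hardware use faster integer arithmetic, at the cost of a small rounding error.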
Pruning and Sparsity
Whereas quantization reduces the precision of the parameter values, pruning removes parameters or entire neurons to make the model more compact. Sparsity sets selected parameters to zero so that computation can skip them, making the model more efficient (a minimal sketch follows the cross-references below). See
AI training vs. inference,
AI weights and biases and
floating point.
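
To make pruning and sparsity concrete, the sketch below uses made-up values and is not an excerpt from any framework. It applies magnitude-based pruning: every weight whose absolute value falls below a threshold is set to zero, and the resulting fraction of zeros is the sparsity that specialized kernels can exploit by skipping the zeroed entries.

  import numpy as np

  # Hypothetical weight matrix; real models hold millions of such values.
  weights = np.array([[ 0.73, -0.02,  1.10],
                      [-0.01,  0.55, -0.88],
                      [ 0.03, -1.40,  0.04]], dtype=np.float32)

  # Magnitude pruning: zero out every weight below the chosen threshold.
  threshold = 0.1
  pruned = np.where(np.abs(weights) < threshold, 0.0, weights)

  # Sparsity is the fraction of zeroed parameters, which sparse-matrix
  # kernels can skip entirely during inference.
  sparsity = np.mean(pruned == 0.0)
  print(pruned)
  print(f"sparsity: {sparsity:.0%}")  # 44% of the weights are now zero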