# SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models
## Abstract
SmoothQuant is a post-training quantization solution for LLMs. It reduces memory usage and accelerates inference through 8-bit weight, 8-bit activation (W8A8) quantization while preserving model accuracy.
The core idea is to migrate the quantization difficulty from activations to weights, smoothing out the activation outliers via a mathematically equivalent transformation.
## Introduction
- Existing quantization methods have limitations: once an LLM grows beyond ~6.7B parameters, outliers emerge in the activations, causing large quantization errors and accuracy degradation.
    - Weights are distributed fairly uniformly, meaning their values are mostly close to one another, so they are easy to quantize.
    - Activations are more spread out and contain some very large values; these are the outliers.
- SmoothQuant
    - relies on a key observation: even though activations are much harder to quantize than weights due to the presence of outliers, different tokens exhibit similar variations across their channels.
    - SmoothQuant migrates the quantization difficulty from activations to weights offline.
## Quantization Difficulty
- Activations are harder to quantize than weights.
    - The weight distribution is quite uniform and flat.
- Outliers make activation quantization difficult.
    - The scale of outliers in activations is ~100 \(\times\) larger than most of the activation values.
    - The large outliers dominate the maximum magnitude measurement, leading to low effective quantization bits for the non-outlier channels.
- Outliers persist in fixed channels.
    - If one channel has an outlier, it persistently appears in all tokens.
    - The variance of a given channel's magnitudes across tokens is small.
- So if we could perform per-channel quantization of the activations, the quantization error would be much smaller (see the sketch after this list).
    - This is infeasible in practice, though: hardware-accelerated INT8 GEMM kernels only support scaling along the outer dimensions (tokens for activations, output channels for weights), not the inner channel dimension.
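
Below is a minimal NumPy sketch of this argument on synthetic data (the helper `quantize_int8` and the outlier channel indices are illustrative, not from the paper): per-tensor activation quantization lets a few outlier channels dominate the scale, while hypothetical per-channel scaling would keep the error small.

```python
# Compare per-tensor vs. per-channel INT8 fake-quantization of activations
# whose outliers live in a few fixed channels.
import numpy as np

rng = np.random.default_rng(0)
tokens, channels = 128, 512

# "Normal" activations plus a few channels that are ~100x larger (outliers).
X = rng.normal(0.0, 1.0, size=(tokens, channels))
X[:, [7, 133, 400]] *= 100.0

def quantize_int8(x, scale):
    """Symmetric INT8 fake-quantization: quantize, then dequantize."""
    q = np.clip(np.round(x / scale), -127, 127)
    return q * scale

# Per-tensor: one scale for the whole matrix -> outliers dominate it.
scale_tensor = np.abs(X).max() / 127.0
err_tensor = np.abs(X - quantize_int8(X, scale_tensor)).mean()

# Per-channel: one scale per input channel -> non-outlier channels keep
# their effective bits (not hardware-friendly for activations, shown only
# to illustrate why the error would be much smaller).
scale_channel = np.abs(X).max(axis=0, keepdims=True) / 127.0
err_channel = np.abs(X - quantize_int8(X, scale_channel)).mean()

print(f"per-tensor  MAE: {err_tensor:.4f}")
print(f"per-channel MAE: {err_channel:.4f}")  # much smaller
```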
## SmoothQuant
Overall idea: for \(Y=XW\), divide the activations channel-wise by a scaling factor and multiply the weights by the same factor:
\[
\mathbf{Y}=\left(\mathbf{X}\operatorname{diag}(\mathbf{s})^{-1}\right)\cdot\left(\operatorname{diag}(\mathbf{s})\,\mathbf{W}\right)=\hat{\mathbf{X}}\hat{\mathbf{W}}
\]
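
A quick numerical check of this equivalence, as a minimal NumPy sketch (shapes and names are illustrative): scaling each activation channel down and the corresponding weight row up by the same positive factor leaves the product unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))          # activations: (tokens, in_channels)
W = rng.normal(size=(8, 3))          # weights:     (in_channels, out_channels)
s = rng.uniform(0.5, 2.0, size=8)    # any positive per-channel factors

X_hat = X / s                        # X · diag(s)^(-1)
W_hat = W * s[:, None]               # diag(s) · W

assert np.allclose(X @ W, X_hat @ W_hat)
```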
- Migrate the quantization difficulty from activations to weights.
    - Choose a per-channel smoothing factor \(\mathbf{s}\) such that \(\hat{\mathbf{X}}\) is easy to quantize.
    - To reduce quantization error, we want every channel to use more effective quantization bits. The total effective quantization bits are largest when all channels share the same maximum magnitude.
    - A simple choice is \(s_j=\max(|X_j|)\); after dividing by \(\mathbf{s}\), every channel has the same maximum magnitude.
    - However, this choice pushes all of the quantization difficulty onto the weights.
    - The other extreme, \(s_j=\frac{1}{\max(|W_j|)}\), pushes all of the difficulty onto the activations.
    - We therefore need a balance, controlled by a hyperparameter \(\alpha\) (the migration strength); see the sketch at the end of this section.
    - \(s_j=\max(|X_j|)^{\alpha}/\max(|W_j|)^{1-\alpha}\)
- Applying SmoothQuant to Transformer blocks.
    - The compute-intensive operators, the linear layers in self-attention and the FFN, as well as the batched matrix multiplies (BMMs) in attention, are quantized to INT8.
    - Other lightweight element-wise operations, such as ReLU, Softmax, and LayerNorm, keep their activations in FP16.
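
To tie the pieces together, here is a minimal NumPy sketch of the smoothing factor with migration strength \(\alpha=0.5\) on synthetic data (the helpers `fake_quant_int8` and `w8a8_error` are illustrative, not from the paper): it computes \(s_j\), smooths \(X\) and \(W\) offline, and compares the per-tensor W8A8 error of the matmul with and without smoothing.

```python
import numpy as np

rng = np.random.default_rng(0)
tokens, c_in, c_out = 128, 512, 256

X = rng.normal(0.0, 1.0, size=(tokens, c_in))
X[:, [7, 133, 400]] *= 100.0                    # fixed outlier channels
W = rng.normal(0.0, 0.05, size=(c_in, c_out))   # flat weight distribution

def fake_quant_int8(t):
    """Symmetric per-tensor INT8 fake-quantization (quantize + dequantize)."""
    scale = np.abs(t).max() / 127.0
    return np.clip(np.round(t / scale), -127, 127) * scale

def w8a8_error(x, w):
    """Mean absolute error of an INT8 x INT8 matmul vs. the FP reference."""
    y_ref = x @ w
    y_q = fake_quant_int8(x) @ fake_quant_int8(w)
    return np.abs(y_ref - y_q).mean()

alpha = 0.5
# s_j = max|X_j|^alpha / max|W_j|^(1-alpha), one factor per input channel.
s = np.abs(X).max(axis=0) ** alpha / np.abs(W).max(axis=1) ** (1 - alpha)

X_hat = X / s            # activations scaled down per channel
W_hat = W * s[:, None]   # weights scaled up per channel (done offline, fused)

print(f"W8A8 error without smoothing: {w8a8_error(X, W):.4f}")
print(f"W8A8 error with smoothing:    {w8a8_error(X_hat, W_hat):.4f}")
```

In a real deployment the division by \(s\) is fused into the previous layer (e.g., LayerNorm) and the scaled weights are stored offline, so the smoothing itself adds no inference overhead.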