量化将浮点数转换为定点数进行运算,推理过程中只使用整数,因此对硬件实现更加友好高效。文中在表1所示的两种量化方法中进行选择。其中,$ round\left(x\right) $ 函数为四舍五入;乘积运算因子 $ s $ 定义为 $ s={w}_{max}/\left({2}^{{b}_{i}-1}-1\right) $,代表量化运算前后的缩放尺度;$ fl={b}_{i}-1-\lceil {\mathrm{log}}_{2}{w}_{max}\rceil $ 为量化的移位因子,代表量化运算的移位步长。移位量化方法降低了量化运算的功耗开销,在提高运算效率的同时也增大了精度损失。文中使用YOLOV5 s网络在VOC2007数据集上分别对两种方法进行了测试,结果如表2所示:乘积量化方法在低位宽量化时整体拥有比移位量化方法更高的精度;随着量化位宽的降低,两种方法的精度损失逐步增大,其中移位量化方法的精度下降更快,而乘积量化方法在7位量化时仍保持不错的精度。

表 1 乘积量化方法与移位量化方法
Table 1. Product quantization method and shift quantization method
| Quantization method | Operation |
| Multiplication | ${q}\left(w,{b}_{i}\right)=round\left(w/s\right)$ |
| Shift | ${q}\left(w,{b}_{i}\right)=round\left(w×{2}^{fl}\right)$ |

表 2 不同量化方法在VOC2007数据集上的表现
Table 2. The performance of different quantification methods on the VOC2007 dataset
| Network model | Dataset | bit | mAP.5-.95 (Shift) | mAP.5-.95 (Multiplication) |
| YOLOV5 s | VOC | 8 | 63.4% | 77.9% |
| | | 7 | 26.5% | 68.8% |
| | | 6 | 4.6% | 39.5% |
| | | 32 | 81.8% | - |
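表1中的两种量化操作可以用如下Python代码示意(示意实现,非文中原始代码,函数名为笔者假设):

```python
import math

import numpy as np

def product_quantize(w, bi):
    """乘积量化: q = round(w / s), 其中 s = w_max / (2^(b_i - 1) - 1)"""
    w_max = np.abs(w).max()
    s = w_max / (2 ** (bi - 1) - 1)
    return np.round(w / s).astype(np.int32), s

def shift_quantize(w, bi):
    """移位量化: q = round(w * 2^fl), 其中 fl = b_i - 1 - ceil(log2(w_max)),
    乘以2的幂在硬件上可用移位实现"""
    w_max = np.abs(w).max()
    fl = bi - 1 - math.ceil(math.log2(w_max))
    return np.round(w * (2 ** fl)).astype(np.int32), fl
```

移位量化将缩放因子限制为2的幂,硬件上只需移位即可完成,但可表示的动态范围更粗,这与表2中其精度损失更大的现象一致。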
硬件设备中一般存储着卷积神经网络的权重和偏置等参数。观察网络各层的权重分布可以发现,其中存在少量离群值,数量虽小,却决定了该层权重的最大值,扩大了权重分布范围,进而影响量化后网络的准确率。因此,对权重数据先截断再量化,可以减少量化带来的精度损失。如图1(c)所示,YOLOV5 s前20层中实际采用的截断值约为原始最大值的1/2。文中采用截断操作优化乘积运算的量化方法。具体来说,该方法逐层将权值截断至
$ \left[-c,c\right] $ 的范围,后将其量化至$ {b}_{i} $ 位。量化方法为:$$ {q}\left(w,{b}_{i},c\right)=round\left(clamp\left(w,c\right)/s\right) $$ (1) 式中:
$ \mathrm{clamp}\left(w,c\right) $ 表示将权值$ w $ 截断到$ \left[-c,c\right] $ 的范围;$ {b}_{i} $ 为第$ i $层量化位宽。截断值$ c $的选择如下:$$ c=\underset{x}{\mathrm{argmin}}\ {d}_{MSE}\left(w\,\|\,{w}_{q}\left(w,x,s\right)\right) $$ (2) 式中:
$ {d}_{MSE} $ 表示原始权值分布和量化模拟后的权重分布之间的均方误差。文中基于YOLOV5 s网络在VOC2007数据集上对使用截断方法前后的性能进行了测试对比,结果如表3所示。其中,MAX代表最大值量化,MSE表示基于均方误差(mean squared error, MSE)的截断量化方法。可以看出,截断方法能够有效控制量化误差,在各个量化位宽下相较于无截断方法均有一定的精度提升,且位宽越低,性能对比越明显:当量化位宽为5 bit和6 bit时,采用MSE截断量化方法比无截断量化方法性能分别提升了27.7%和22.3%,说明基于均方误差的量化方法能有效恢复检测精度。
表 3 不同方法量化前后的网络准确率
Table 3. Network accuracy before and after quantization with different truncation methods
| bit | 8 | 7 | 6 | 5 | 32 |
| mAP (MAX) | 78.9% | 67.4% | 46.7% | 4.0% | 82.6% |
| mAP (MSE) | 82.7% | 76.0% | 69.0% | 31.7% | - |
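式(1)、(2)所述的截断量化与基于均方误差的截断值搜索可按如下思路实现(示意代码,候选截断值采用的线性搜索网格为笔者假设):

```python
import numpy as np

def clamp_quantize(w, bi, c):
    """式(1): 先将权值截断到[-c, c], 再以 s = c/(2^(b_i-1)-1) 线性量化;
    返回反量化后的浮点值, 便于与原权值计算MSE"""
    s = c / (2 ** (bi - 1) - 1)
    w_c = np.clip(w, -c, c)
    return np.round(w_c / s) * s

def search_clamp(w, bi, num_steps=100):
    """式(2): 在(0, w_max]上网格搜索使量化前后MSE最小的截断值c"""
    w_max = np.abs(w).max()
    best_c, best_mse = w_max, np.inf
    for k in range(1, num_steps + 1):
        c = w_max * k / num_steps
        mse = np.mean((w - clamp_quantize(w, bi, c)) ** 2)
        if mse < best_mse:
            best_c, best_mse = c, mse
    return best_c
```

当权重中存在少量离群值时,搜索得到的截断值通常明显小于最大值,这与图1(c)中截断值约为原始最大值1/2的观察一致。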
量化过程中的误差主要来自于两个方面:取整损失与截断损失。截断操作的引入可以缩小量化区间,减少量化过程中的取整损失。在此分析软件端模拟量化过程中的取整损失。
输入量化操作:
$$ {q}_{in}=round\left({in}/ {{s}_{1}}\right) $$ (3) 权重量化操作:
$$ {q}_{w}=round\left({w}/ {{s}_{2}}\right) $$ (4) 偏置量化操作:
$$ {q}_{b}=round\left({bias}/ {{s}_{1}\times {s}_{2}}\right) $$ (5) 激活量化操作为:
$$ {q}_{activ}=round\left({activ}/{{s}_{3}}\right) $$ (6) $$ activ=({q}_{in}\otimes {q}_{w}+{q}_{b})\times {s}_{1}\times {s}_{2} $$ (7) 由于四舍五入取整操作带来的损失在0.5以内,因此输入量化、权重量化以及激活量化操作带来的误差均在$ 0.5\times \mathrm{scale} $范围内,且与$ \mathrm{scale} $的大小成正相关。文中认为每一层对最终结果的影响是相同的,因此可以选择一种误差限制的量化策略:通过对卷积层的放缩因子$ \mathrm{scale} $进行限制,得到不同卷积层的量化精度。使用γ作为卷积层的误差限制参数,初始时设置所有卷积层的量化位宽为8,如果放缩因子$ \mathrm{scale} $小于γ,则该层位宽减1,直至该层放缩因子超过γ。整体流程如图2所示。文中通过循环计算获得最佳γ:初始位宽设置为8,逐层推理并调整位宽大小,从而确定最佳位宽策略。由于卷积层中的最大值出现在激活部分,因此使用激活部分的放缩因子
$ \mathrm{scale} $ 与γ进行对比,如果$ \mathrm{scale} $小于γ,则该层位宽减1,直至该层放缩因子超过γ,再进入下一卷积层。笔者课题组在YOLOV5 s训练VOC2007数据集的量化方法探索中,测试了不同γ值对量化结果的影响,如表4所示。γ可设置在$ \left[0.08,0.25\right] $之间:在最佳点之前,网络精度随着压缩率的增加不断波动;达到最佳点之后,随着压缩率的增加,网络精度整体呈现下降趋势。相较于表3,VOC数据集统一量化到7位和5位的精度分别为76%和31.7%,而混合量化到平均6.49位和5.07位的精度分别为79.6%和63.3%,混合精度量化方法均能在更小的模型中达到更高的检测精度。

表 4 误差限制参数γ取值对比

Table 4. Error limit parameter γ value comparison

| γ | Compression ratio | Average bit | mAP |
| 0.08 | 4.93 | 6.49 | 79.6% |
| 0.10 | 5.13 | 6.23 | 77.8% |
| 0.125 | 5.74 | 5.57 | 72.3% |
| 0.142 | 6.11 | 5.23 | 62.8% |
| 0.166 | 6.31 | 5.07 | 63.3% |
| 0.20 | 7.14 | 4.48 | 21.0% |
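图2所示的基于误差限制的位宽分配流程可概括为如下示意代码(假设已统计得到各卷积层激活的最大值,函数与参数名为笔者假设):

```python
def assign_bits(act_maxes, gamma, init_bit=8, min_bit=2):
    """逐层确定量化位宽: 激活放缩因子 scale = act_max / (2^(b-1) - 1),
    若 scale 小于误差限制参数 gamma, 则位宽减1, 直至 scale 超过 gamma"""
    bits = []
    for act_max in act_maxes:
        b = init_bit
        while b > min_bit and act_max / (2 ** (b - 1) - 1) < gamma:
            b -= 1
        bits.append(b)
    return bits
```

由于取整误差在$ 0.5\times \mathrm{scale} $以内,该策略相当于为每层设置统一的误差上界:激活范围较小的层被分配更低的位宽,从整体上平衡各层的量化误差。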
由于大部分网络的激活函数都使用ReLU函数,因此每一层输出的分布范围为$ \left[0,c\right] $。对于ReLU激活函数的缩放因子$ s $,文中定义为$ s=c/({2}^{{b}_{i}}-1) $;对于Leaky ReLU、SiLU等值域涉及负数的激活函数,文中使用$ s=c/({2}^{{b}_{i}-1}-1) $作为量化放缩因子。
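上述两类激活函数的放缩因子选择可写成一个简单的分支函数(示意实现,函数名为笔者假设):

```python
def activation_scale(c, bi, signed):
    """ReLU等非负激活(值域[0, c]): s = c / (2^b_i - 1);
    Leaky ReLU、SiLU等含负值激活: s = c / (2^(b_i - 1) - 1)"""
    if signed:
        return c / (2 ** (bi - 1) - 1)
    return c / (2 ** bi - 1)
```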
部分卷积神经网络中包含链接(concat)模块与残差模块:链接模块用于在指定维度上拼接两个张量,残差模块则包含张量相加操作。
链接操作:
$$ {r}_{3}=concat\left[{q}_{activ1},{q}_{activ2}\right] $$ (8) 相加操作:
$$ {r}_{3}=add\left[{q}_{activ1},{q}_{activ2}\right] $$ (9) 当神经网络中涉及多个张量之间的操作时,文中对各张量采用统一的缩放因子与量化位宽,即:
$$ {b}_{i}=max\left({b}_{i1},{b}_{i2}\right) $$ (10) $$ \mathrm{scale}=max\left({scale}_{1},{scale}_{2}\right) $$ (11)
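式(10)、(11)的统一操作可示意如下(基于NumPy的简化实现,假定两路激活已各自量化为整数,函数名为笔者假设):

```python
import numpy as np

def unify_quant_params(b1, s1, b2, s2):
    """式(10)(11): 多张量操作前统一量化位宽与缩放因子"""
    return max(b1, b2), max(s1, s2)

def concat_quantized(q1, s1, q2, s2, axis=0):
    """链接操作: 先将两路整数激活重量化到统一的scale, 再拼接"""
    s = max(s1, s2)
    r1 = np.round(q1 * (s1 / s)).astype(np.int32)
    r2 = np.round(q2 * (s2 / s)).astype(np.int32)
    return np.concatenate([r1, r2], axis=axis), s
```

取两路中较大的scale可保证重量化不会溢出,但scale较小一路的分辨率会相应降低。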
为了验证文中的方法,笔者课题组在搭载显存为24 GB的Geforce RTX 3090的服务器上进行实验,系统环境为Ubuntu16.04,Pytorch版本为1.10.1,Torchvision版本为0.4.2,Python版本为3.6.0,网络学习率为
$ {10}^{-4} $,YOLOV5网络结构使用已有的Pytorch实现与其提供的YOLOV5 s.pt预训练模型。
YOLOV5网络是目前单阶段目标检测网络中性能最好的网络之一,文中在COCO数据集和VOC2011数据集上对统一量化方法与基于误差限制的混合截断量化方法进行了对比测试,如表5所示。
表 5 不同量化方法对COCO数据集和VOC2011数据集的测试结果
Table 5. Test results of different quantification methods on COCO dataset and VOC2011 dataset
| Dataset | Method | bit | γ | mAP@0.5 | mAP@0.5-0.95 | Model size/MB |
| COCO | Unified bit | 7 | - | 0.567 | 0.345 | 6.35 |
| | | 6 | - | 0.503 | 0.301 | 5.45 |
| | | 5 | - | 0.386 | 0.215 | 4.54 |
| | Mixed bit | 6.49 | 0.08 | 0.602 | 0.368 | 5.89 |
| | | 5.57 | 0.125 | 0.546 | 0.322 | 5.05 |
| | | 5.07 | 0.166 | 0.446 | 0.260 | 4.60 |
| | Ori model | 32 | - | 0.636 | 0.411 | 29.07 |
| VOC2011 | Unified bit | 7 | - | 0.950 | 0.732 | 6.35 |
| | | 6 | - | 0.925 | 0.643 | 5.45 |
| | | 5 | - | 0.533 | 0.295 | 4.54 |
| | Mixed bit | 6.49 | 0.08 | 0.950 | 0.706 | 5.89 |
| | | 5.57 | 0.125 | 0.981 | 0.669 | 5.05 |
| | | 5.07 | 0.166 | 0.782 | 0.456 | 4.60 |
| | Ori model | 32 | - | 0.950 | 0.786 | 29.07 |

在COCO数据集上,与统一6位量化(模型大小为5.45 MB)相比,文中的方法在5.05 MB时拥有更高的精度,性能分别提升了3.3% (mAP@0.5)和2.1% (mAP@0.5-0.95);与统一5位量化相比,文中的方法在相似的模型大小下性能分别提升了6% (mAP@0.5)和4.5% (mAP@0.5-0.95)。
在VOC2011数据集上,相比全精度模型,文中的方法将模型压缩到5.05 MB的同时mAP@0.5提升了3.1%,部分图片的检测效果有所提升;与统一精度量化方法相比,文中的方法在相似的模型大小(统一5位量化与平均5位量化)下性能分别提升了24.9% (mAP@0.5)和16.1% (mAP@0.5-0.95)。
整体上看,随着量化位宽的降低,统一精度量化方法和混合精度量化方法带来的精度损失都在逐步增加,但文中的方法能够将量化误差平均分配在各个卷积层,减少了部分层误差较大情况的出现,因此精度下降较缓,整体上相对于统一精度量化方法拥有更高的精度。

表6展示了VOC2011数据集采用不同方法进行检测时各类别的精度。可以看出,与统一精度量化方法相比,混合精度量化方法在Dog类别上精度有所下降,在Bird、Chair、Sheep、Train类别上精度相同,在Aeroplane、Bicycle、Boat、Bottle、Person、Tvmonitor类别上精度均有所上升。

图3展示了YOLOV5 s对COCO数据集采用不同量化方法的检测结果。其中,图3(a)为训练真值,标识出了全部待检测目标;图3(b)为全精度检测结果;图3(c)为采用文中的方法进行混合6位量化的检测结果;图3(d)为统一6位量化方法的检测结果。从图中可以看出,当γ为0.09时,模型压缩到5.43 MB,即平均6位量化,对目标图像的检测效果与原始网络相当,而统一6位量化方法会丢失一些小目标与背景模糊物体的检测框:第一张图片丢失了小目标背包的检测框;第二张图片丢失了桌子上的部分小目标检测框,对被遮挡人物的检测也有所损失;第三张图片损失了两辆背景模糊的车辆检测框。此外,在第二张图片中,原始网络对中间人物的检测出现精度损失,而6位混合量化方法恢复了该检测损失。
表 6 VOC2011数据集类别精度检测表
Table 6. VOC2011 dataset category accuracy detection table
| Dataset | Method | bit | mAP@0.5 | Aeroplane | Bicycle | Bird | Boat | Bottle | Chair | Dog | Person | Sheep | Train | Tvmonitor |
| VOC2011 | Unified | 5 | 0.533 | 0.232 | 0.324 | 0.497 | 0.484 | 0.209 | 0.995 | 0.332 | 0.455 | 0.995 | 0.995 | 0.340 |
| | Mixed | 5.07 | 0.782 | 0.753 | 0.435 | 0.497 | 0.995 | 0.801 | 0.995 | 0.249 | 0.897 | 0.995 | 0.995 | 0.995 |
Mixed-precision quantization for neural networks based on error limit (Invited)
摘要: 基于卷积神经网络的深度学习算法展现出卓越性能的同时也带来了冗杂的数据量和计算量,大量的存储与计算开销也成了该类算法在硬件平台部署过程中的最大阻碍。而神经网络模型量化使用低精度定点数代替原始模型中的高精度浮点数,在损失较小精度的前提下可有效压缩模型大小,减少硬件资源开销,提高模型推理速度。现有的量化方法大多将模型各层数据量化至相同精度,混合精度量化则根据不同层的数据分布设置不同的量化精度,旨在相同压缩比下达到更高的模型准确率,但寻找合适的混合精度量化策略仍十分困难。因此,提出一种基于误差限制的混合精度量化策略,通过对神经网络卷积层中的放缩因子进行统一等比限制,确定各层的量化精度,并使用截断方法线性量化权重和激活至低精度定点数,在相同压缩比下,相比统一精度量化方法有更高的准确率。随后,将卷积神经网络的经典目标检测算法YOLOV5s作为基准模型,测试了方法的效果。在COCO数据集和VOC数据集上,该方法与统一精度量化相比,压缩到5位的模型平均精度均值(mean Average Precision, mAP)分别提高了6%和24.9%。

Abstract: The deep learning algorithm based on convolutional neural network exhibits excellent performance, but also brings a complex amount of data and calculation. A large amount of storage and computing overhead has also become the biggest obstacle to the deployment of such algorithms on hardware platforms. Neural network model quantization uses low-precision fixed-point numbers instead of the high-precision floating-point numbers in the original model, which can effectively compress the model size, reduce hardware resource overhead, and improve model inference speed at the cost of a small loss of precision. Most existing quantization methods quantize the data of each layer to the same precision, while mixed-precision quantization sets a different quantization precision for each layer according to its data distribution, aiming to achieve higher model accuracy under the same compression ratio; however, finding a suitable mixed-precision quantization strategy is still very difficult. Therefore, a mixed-precision quantization strategy based on error limitation was proposed. By uniformly and proportionally limiting the scaling factors in each convolutional layer of the neural network, the quantization precision of each layer was determined, and a truncation method was used to linearly quantize the weights and activations to low-precision fixed-point numbers. Under the same compression ratio, this method had higher accuracy than the unified-precision quantization method. Then, the classical object detection algorithm YOLOV5s based on convolutional neural network was used as the benchmark model to test the effect of the method. On the COCO dataset and VOC dataset, compared with unified-precision quantization, the mean average precision (mAP) of the model compressed to 5 bits was improved by 6% and 24.9%, respectively.
Key words:
- deep learning
- mixed precision
- truncated quantization
- YOLOV5
图 1 (a)深度学习卷积8位量化过程[6];(b)YOLOV5 s网络前20层权重最值分布趋势;(c)YOLOV5 s网络量化过程中的激活最大值与截断值分布
Figure 1. (a) Diagram of the deep learning convolutional 8-bit quantization process[6]; (b) The distribution trend of the maximum weights in the first 20 layers of the YOLOV5 s network; (c) Distribution of activation maximum and cutoff value during network quantization in YOLOV5 s
[1] Krizhevsky A, Sutskever I, Hinton G E. ImageNet classification with deep convolutional neural networks [J]. Communications of the ACM, 2017, 60(6): 84-90. doi: 10.1145/3065386
[2] Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition [J]. arXiv preprint, 2014: 1409.1556.
[3] He K, Zhang X Y, Ren S Q, et al. Deep residual learning for image recognition[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016: 770-778.
[4] Vanhoucke V, Senior A, Mao M Z. Improving the speed of neural networks on CPUs[C]//Advances in Neural Information Processing Systems, 2011.
[5] Gupta S, Agrawal A, Gopalakrishnan K, et al. Deep learning with limited numerical precision[C]//International Conference on Machine Learning, 2015, 37: 1737-1746.
[6] Jacob B, Kligys S, Chen B, et al. Quantization and training of neural networks for efficient integer-arithmetic-only inference[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018: 2704-2713.
[7] Cai Y H, Yao Z W, Dong Z, et al. ZeroQ: A novel zero shot quantization framework[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2020: 13169-13178.
[8] Wang K, Liu Z J, Lin Y J, et al. HAQ: Hardware-aware automated quantization with mixed precision[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019: 8612-8620.
[9] Huang Z Z, Du H M, Chang L B. Mixed-clipping quantization for convolutional neural networks [J]. Journal of Computer-Aided Design & Computer Graphics, 2021, 33(4): 553-559. (in Chinese) doi: 10.3724/SP.J.1089.2021.18509
[10] Zeng H Q, Hu H L, Lin X W, et al. Deep neural network compression and acceleration: An overview [J]. Journal of Signal Processing, 2022, 38(1): 183-194. (in Chinese) doi: 10.16798/j.issn.1003-0530.2022.01.021
[11] Chen W L, Wilson J T, Tyree S, et al. Compressing neural networks with the hashing trick[C]//32nd International Conference on Machine Learning, 2015, 37: 2285-2294.
[12] Liu Z, Li J G, Shen Z Q, et al. Learning efficient convolutional networks through network slimming[C]//2017 IEEE International Conference on Computer Vision (ICCV), 2017: 2775-2763.
[13] Xu Y F, Zhang D Z, Wang L, et al. Lightweight feature fusion network design for local feature recognition of non-cooperative target [J]. Infrared and Laser Engineering, 2020, 49(7): 20200170. (in Chinese) doi: 10.3788/IRLA20200170
[14] Lin M, Ji R, Wang Y, et al. HRank: Filter pruning using high-rank feature map[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2020: 1529-1538.
[15] He Y, Ding Y, Liu P, et al. Learning filter pruning criteria for deep convolutional neural networks acceleration[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020: 2006-2015.
[16] Han S, Mao H, Dally W J. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding[C]//International Conference on Learning Representations, 2016.
[17] Gong R, Liu X, Jiang S, et al. Differentiable soft quantization: Bridging full-precision and low-bit neural networks[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2019: 4852-4861.
[18] Zhu F, Gong R, Yu F, et al. Towards unified INT8 training for convolutional neural network[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020: 1969-1979.
[19] Redmon J, Farhadi A. YOLOv3: An incremental improvement [J]. arXiv preprint, 2018: 1804.02767. doi: 10.48550/arXiv.1804.02767