-
Dense optical flow based on point-by-point matching is computationally expensive and cannot meet the stringent real-time requirements of hypersonic vehicle guidance and control. This paper therefore adopts a feature-point matching approach: corner information of the aerial target is extracted, static background noise is suppressed, and the features are concentrated in the moving region to generate a mask image. A pyramidal Lucas-Kanade (LK) scheme then computes the optical flow of the highly dynamic target, a two-dimensional sparse optical-flow feature map is constructed from the result, and finally 3D convolution extracts the temporal features.
-
A corner lies at the intersection of target edges; a small displacement in any direction at a corner causes large changes in gradient direction and magnitude. Matching feature corners can therefore effectively replace pixel-by-pixel dense optical flow while reducing computation. In highly dynamic aerial target detection scenes, however, the complex background buries the true target corners among a large number of spurious ones, degrading extraction accuracy. Building on the Shi-Tomasi corner detector [16], this paper uses an image mask to select moving objects so that the feature corners concentrate on the moving region, avoiding a large amount of wasted computation and improving extraction efficiency.
The mask is obtained by first computing the inter-frame difference of two consecutive frames It−1 and It:
$$ D_t(x,y) = \left| I_t(x,y) - I_{t - 1}(x,y) \right| $$ (1) where (x, y) are the pixel coordinates and Dt(x,y) is the inter-frame difference image.
To simplify subsequent processing, and to discard non-moving regions (i.e. noise) while preserving the moving aerial target as far as possible, a relatively large threshold λ is applied to binarize the difference image:
$$ B_t(x,y) = \begin{cases} 255, & \text{if } D_t(x,y) > \lambda \\ 0, & \text{otherwise} \end{cases} $$ (2) where Bt(x,y) is the binarized difference image. Because λ is set large, the local maximum of Bt(x,y) is then taken to avoid masking out the true target:
$$ M_t(x,y) = \mathop {\max }\limits_{ - k \leqslant x' \leqslant k,\; - k \leqslant y' \leqslant k} B_t(x + x',y + y') $$ (3) where Mt is the final image mask and the filter kernel has size (2k+1)×(2k+1).
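The three-step mask pipeline of Eqs (1)-(3) can be sketched in NumPy as follows; the threshold `lam` and the kernel half-size `k` are illustrative values, not the ones used in the paper:

```python
import numpy as np

def motion_mask(prev, curr, lam=40, k=2):
    """Eqs (1)-(3): frame difference, thresholding, local-maximum filtering."""
    # Eq (1): absolute inter-frame difference (widened dtype avoids uint8 wrap)
    diff = np.abs(curr.astype(np.int16) - prev.astype(np.int16))
    # Eq (2): binarize with a large threshold lam
    binary = np.where(diff > lam, 255, 0).astype(np.uint8)
    # Eq (3): local maximum over a (2k+1) x (2k+1) window (grey-scale dilation)
    padded = np.pad(binary, k)          # zero padding outside the image
    h, w = binary.shape
    windows = np.stack([padded[dy:dy + h, dx:dx + w]
                        for dy in range(2 * k + 1)
                        for dx in range(2 * k + 1)])
    return windows.max(axis=0)
```

The dilation in the last step grows each retained motion pixel into a (2k+1)×(2k+1) patch, so a true target thinned out by the aggressive threshold is still covered by the mask.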
-
Optical flow is widely used in target detection on consecutive frames. Its core idea is, for a pixel I(x,y) at time t, to find the displacement of that pixel in each direction at time t+1. In highly dynamic scenes, the temporal-continuity assumption of the LK method no longer holds. This paper therefore processes the optical flow of highly dynamic targets with a pyramidal LK structure, as shown in Fig. 4.
When pixels move at high speed, the image is decomposed into pyramid levels, each level downscaled to half the size of the one below, with the full-resolution image at the bottom and the lowest-resolution image at the top. Downscaling shrinks the pixel displacements so that the temporal-continuity assumption of the LK method is again satisfied. The algorithm first computes on the top level and passes the result down as an initial value; the next level refines the optical flow and the affine transformation matrix between the two frames on this basis, and the flow and affine matrix are propagated downward level by level until the original bottom level is reached. This top-down iteration solves the optical flow of fast-moving targets, and a two-dimensional sparse optical-flow feature map is constructed from the result.
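To make the coarse-to-fine idea concrete, the following NumPy sketch runs two-level pyramidal LK at a single point: it estimates the flow on the downsampled images, scales the estimate by two, warps the second frame by the integer part, and refines the residual at full resolution. This is an illustrative single-point, pure-translation version, not the paper's full implementation (which also propagates an affine matrix):

```python
import numpy as np

def lk_step(I1, I2, x, y, win=7):
    """One Lucas-Kanade solve in a (2*win+1)^2 window centred at (x, y)."""
    Ix = (np.roll(I1, -1, axis=1) - np.roll(I1, 1, axis=1)) / 2.0
    Iy = (np.roll(I1, -1, axis=0) - np.roll(I1, 1, axis=0)) / 2.0
    It = I2 - I1
    sl = (slice(y - win, y + win + 1), slice(x - win, x + win + 1))
    ix, iy, it = Ix[sl].ravel(), Iy[sl].ravel(), It[sl].ravel()
    A = np.array([[ix @ ix, ix @ iy], [ix @ iy, iy @ iy]])
    b = -np.array([ix @ it, iy @ it])
    return np.linalg.solve(A, b)                  # flow (u, v) at this point

def downsample(I):
    """Average-pool by 2: one pyramid level up."""
    return 0.25 * (I[::2, ::2] + I[1::2, ::2] + I[::2, 1::2] + I[1::2, 1::2])

def pyramid_lk(I1, I2, x, y, levels=2, win=7):
    """Coarse-to-fine LK: solve at the top level, scale and refine below."""
    if levels == 1:
        return lk_step(I1, I2, x, y, win)
    coarse = 2.0 * pyramid_lk(downsample(I1), downsample(I2),
                              x // 2, y // 2, levels - 1, win)
    gx, gy = int(round(coarse[0])), int(round(coarse[1]))
    I2w = np.roll(I2, (-gy, -gx), axis=(0, 1))    # undo the coarse estimate
    return np.array([gx, gy]) + lk_step(I1, I2w, x, y, win)

# Synthetic check: a quadratic bowl translated by (5, 3) pixels, a shift far
# too large for single-level LK linearization but handled by the pyramid.
yy, xx = np.mgrid[0:64, 0:64].astype(float)
I1 = (xx - 32) ** 2 + (yy - 32) ** 2
I2 = np.roll(I1, (3, 5), axis=(0, 1))             # true flow (u, v) = (5, 3)
u, v = pyramid_lk(I1, I2, 32, 32)
```

After the coarse estimate is removed by warping, the residual motion at the bottom level is small enough for the linearized LK solve to be valid, which is exactly why the downscaling restores the temporal-continuity assumption.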
-
On top of the two-dimensional optical-flow feature map, the 3D convolution module shown in Fig. 5 performs convolution to extract the target's dynamic temporal features.
Each convolution block consists of a 3D convolution operator, a batch normalization (BN) layer and an ELU (Exponential Linear Unit) activation. During training, the BN layer normalizes each channel of the feature maps in a batch to zero mean and unit standard deviation; the ELU activation drives the mean activation of the neurons towards zero and is more robust to noise, which benefits the extraction of highly dynamic target features.
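As an illustration of the block's three stages, the NumPy sketch below chains a single-channel valid 3D convolution with the normalization and ELU described above. In the actual network, BN is applied per channel across the batch with learned scale and shift parameters, which are omitted here:

```python
import numpy as np

def conv3d(x, w):
    """Valid single-channel 3D convolution: x is (T, H, W), w is (kt, kh, kw)."""
    kt, kh, kw = w.shape
    T, H, W = x.shape
    out = np.zeros((T - kt + 1, H - kh + 1, W - kw + 1))
    for t in range(out.shape[0]):
        for i in range(out.shape[1]):
            for j in range(out.shape[2]):
                out[t, i, j] = np.sum(x[t:t + kt, i:i + kh, j:j + kw] * w)
    return out

def batch_norm(x, eps=1e-5):
    """Normalize to zero mean, unit standard deviation (per channel in practice)."""
    return (x - x.mean()) / np.sqrt(x.var() + eps)

def elu(x, alpha=1.0):
    """ELU: identity for x > 0, smooth saturation towards -alpha for x < 0."""
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

def conv_block(x, w):
    """One Conv3D -> BN -> ELU block, as in Fig. 5."""
    return elu(batch_norm(conv3d(x, w)))
```

Because the BN output already has zero mean, the ELU's near-linear behaviour around zero keeps the mean activation small, which is the property the text attributes to this pairing.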
-
When a target is detected by the infrared detector carried on a hypersonic vehicle, the large relative velocity causes large displacements and obvious changes in size and shape within a short time. Owing to experimental constraints, real-time detection of aerial infrared targets from a hypersonic vehicle cannot be performed in the laboratory. This paper therefore builds a sequence of 1 500 consecutive 640×512 infrared images of an unmanned aerial vehicle (UAV) moving at ordinary speed, and selects multi-frame-interval images containing multi-scale, multi-shape targets as the test set; backgrounds include buildings, trees and clouds, simulating aerial target detection against complex backgrounds. To validate the proposed method, three groups of consecutive frames containing interference from buildings, birds (point noise) and cloud layers were selected for comparative experiments against C3D [17], TSN [18], ECO [19], 3DLocalCNN [20] and TAda [21].
Algorithm performance is evaluated by recognition accuracy, real-time capability and computational resources, as listed in Tab. 1. Recognition accuracy is Accuracy = (TP+TN)/(TP+TN+FP+FN), i.e. the proportion of correctly predicted positive and negative samples among all samples; real-time capability is measured in FPS (frames per second), the number of image frames the network processes per second; resource usage (run memory) is measured in GB (gigabytes).
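The accuracy metric above follows directly from the confusion counts; a minimal sketch with hypothetical predictions and labels:

```python
def confusion(preds, labels):
    """Count TP/TN/FP/FN from binary predictions and ground-truth labels."""
    tp = sum(p and l for p, l in zip(preds, labels))
    tn = sum((not p) and (not l) for p, l in zip(preds, labels))
    fp = sum(p and (not l) for p, l in zip(preds, labels))
    fn = sum((not p) and l for p, l in zip(preds, labels))
    return tp, tn, fp, fn

def accuracy(tp, tn, fp, fn):
    """Accuracy = (TP + TN) / (TP + TN + FP + FN)."""
    return (tp + tn) / (tp + tn + fp + fn)
```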
Table 1. Comparison of detection performance of different algorithms on the self-built dataset
From the recognition results on the consecutive frames and Tab. 1, TSN, ECO, 3DLocalCNN and TAda recognize the UAV target fairly well, but produce many false detections on aerial point noise and cloud backgrounds, giving a high false-alarm rate. C3D suppresses background noise well but cannot track the target across consecutive frames in real time, dropping frames and yielding low accuracy. The proposed target recognition method based on deep spatial-temporal feature fusion effectively suppresses noise in complex backgrounds and greatly reduces the false-alarm rate; while remaining real-time, it reaches a recognition accuracy of 89.87%, outperforming existing spatial-temporal fusion methods.
To verify the advantage of the proposed deep-learning method over traditional ones, four traditional methods — PSTNN [1], NRAM [2], TDLMS [3] and Top-hat [4] — were tested on the three groups of consecutive frames in Fig. 6; the results are shown in Fig. 7.
Figure 6. Comparison of UAV target recognition results on three groups of consecutive frames
From the traditional-method results: PSTNN produces few false detections but extracts only high-temperature parts such as the engine and rotors, failing to detect the whole target, and performs poorly when the target overlaps the background. NRAM likewise cannot detect the whole UAV and degrades when many high-temperature objects are present in the background. TDLMS extracts the moving target with fairly high accuracy but leaves an obvious motion trail that degrades recognition. Top-hat extracts the target accurately but produces many false detections, giving an excessive false-alarm rate.
The above analysis of the spatial-temporal fusion method against traditional methods demonstrates the effectiveness of the proposed approach in hypersonic-vehicle guidance scenarios and its ability to meet the requirements of intelligent infrared target detection and recognition under highly dynamic conditions.
Highly dynamic aerial polymorphic target detection method based on deep spatial-temporal feature fusion (Invited)
-
Abstract: Aiming at the problem of reliable detection and accurate recognition of highly dynamic aerial targets by infrared detectors carried on hypersonic vehicles against complex backgrounds, an aerial polymorphic target detection method based on deep spatial-temporal feature fusion was proposed. A weighted bidirectional cyclic feature pyramid structure was designed to extract the static features of polymorphic targets, and switchable atrous convolution was introduced to enlarge the receptive field while reducing spatial information loss. For the extraction of temporal motion features, a feature-point matching method was used to generate a mask image that suppresses complex background noise and concentrates the corner information in the moving region; the optical flow was then computed, and a sparse optical-flow feature map was designed from the results. The temporal features contained in multiple consecutive frames were extracted by 3D convolution to generate a 3D temporal motion feature map. Finally, deep spatial-temporal feature fusion was realized by concatenating the static image features and the temporal motion features along the channel dimension. Extensive comparative experiments show that the method significantly reduces false recognition in complex backgrounds and, with high real-time performance, reaches a target recognition accuracy of 89.87%, meeting the needs of intelligent infrared target detection and recognition under highly dynamic conditions.
-
Key words:
- object detection /
- feature fusion /
- multi-scale pyramid /
- sparse optical flow /
- 3D convolution
-
-
[1] Jiang Taixiang, Huang Tingzhu, Zhao Xile, et al. Multi-dimensional imaging data recovery via minimizing the partial sum of tubal nuclear norm [C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
[2] Zhang Landan, Peng Lingbing, Zhang Tianfang, et al. Infrared small target detection via non-convex rank approximation minimization joint l2,1 norm [J]. Remote Sensing, 2018, 10(11): 1821. doi: 10.3390/rs10111821
[3] Hadhoud M M, Thomas D W. The two-dimensional adaptive LMS (TDLMS) algorithm [J]. IEEE Transactions on Circuits and Systems, 1988, 35(5): 485-494. doi: 10.1109/31.1775
[4] Bai Xiangzhi, Zhou Fugen. Analysis of new top-hat transformation and the application for infrared dim small target detection [J]. Pattern Recognition, 2010, 43(6): 2145-2156. doi: 10.1016/j.patcog.2009.12.023
[5] Zhao Lu, Xiong Sen. Target recognition based on multi-view infrared images [J]. Infrared and Laser Engineering, 2021, 50(11): 20210206. (in Chinese)
[6] Tang Peng, Liu Yi, Wei Hongguang, et al. Automatic recognition algorithm of digital instrument reading in offshore booster station based on Mask-RCNN [J]. Infrared and Laser Engineering, 2021, 50(S2): 20211057. (in Chinese)
[7] Beery S, Wu G, Rathod V, et al. Context R-CNN: Long term temporal context for per-camera object detection [C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020: 13075-13085.
[8] Li Jingyu, Yang Jing, Kong Bin, et al. Multi-scale vehicle and pedestrian detection algorithm based on attention mechanism [J]. Optics and Precision Engineering, 2021, 29(6): 1448-1458. (in Chinese) doi: 10.37188/OPE.20212906.1448
[9] Simonyan K, Zisserman A. Two-stream convolutional networks for action recognition in videos [J]. Advances in Neural Information Processing Systems, 2014, 27: 568-576.
[10] Zhang Hongying, An Zheng. Human action recognition based on improved two-stream spatiotemporal network [J]. Optics and Precision Engineering, 2021, 29(2): 420-429. (in Chinese) doi: 10.37188/OPE.20212902.0420
[11] Donahue J, Hendricks L A, Guadarrama S, et al. Long-term recurrent convolutional networks for visual recognition and description [C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015: 2625-2634.
[12] Ji S, Xu W, Yang M, et al. 3D convolutional neural networks for human action recognition [J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2012, 35(1): 221-231.
[13] Carreira J, Zisserman A. Quo vadis, action recognition? A new model and the Kinetics dataset [C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017: 6299-6308.
[14] Wu Haibin, Wei Xiying, Liu Meihong, et al. Improved YOLOv4 for dangerous goods detection in X-ray inspection combined with atrous convolution and transfer learning [J]. Chinese Optics, 2021, 14(6): 1417-1425. (in Chinese) doi: 10.37188/CO.2021-0078
[15] Zhang Ruiyan, Jiang Xiujie, An Junshe, et al. Design of global-contextual detection model for optical remote sensing targets [J]. Chinese Optics, 2020, 13(6): 1302-1313. (in Chinese) doi: 10.37188/CO.2020-0057
[16] Shi J, Tomasi C. Good features to track [C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 1994: 593-600.
[17] Karpathy A, Toderici G, Shetty S, et al. Large-scale video classification with convolutional neural networks [C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014: 1725-1732.
[18] Wang L, Xiong Y, Wang Z, et al. Temporal segment networks: Towards good practices for deep action recognition [C]//Proceedings of the European Conference on Computer Vision (ECCV), 2016: 20-36.
[19] Zolfaghari M, Singh K, Brox T. ECO: Efficient convolutional network for online video understanding [C]//Proceedings of the European Conference on Computer Vision (ECCV), 2018: 695-712.
[20] Huang Zhen, Xue Dixiu, Shen Xu, et al. 3D local convolutional neural networks for gait recognition [C]//Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2021: 14920-14929.
[21] Huang Ziyuan, Zhang Shiwei, Pan Liang, et al. TAda! Temporally-adaptive convolutions for video understanding [C]//International Conference on Learning Representations (ICLR), 2022.