Siamese networks tracking algorithm integrating channel-interconnection-spatial attention

Cui Zhoujuan; An Junshe; Cui Tianshu

doi:10.3788/IRLA20200148

Tracking algorithms based on Siamese networks show great potential in both tracking accuracy and speed. However, adapting an offline-trained model to online tracking remains challenging. To improve the feature extraction and discrimination ability of the algorithm in complex scenes, a real-time Siamese network tracking algorithm that combines channel, interconnection, and spatial attention mechanisms is proposed. First, a Siamese tracking framework with the deep convolutional network VGG-Net-16 as its backbone was built to strengthen feature extraction. Second, the channel-interconnection-spatial attention module (CISAM) was integrated to enhance the adaptability and discrimination ability of the model. Third, the multi-layer response maps were weighted and fused to obtain more accurate tracking results. Finally, the network was trained end to end on large-scale datasets and tested on the OTB-2015 benchmark. The experimental results show that, compared with current mainstream algorithms, the proposed algorithm is more robust and adapts better to complex scenes involving target appearance changes, similar distractors, and occlusion. On an NVIDIA RTX 2060 GPU, the average tracking speed reaches 37 FPS, which meets real-time requirements.



Layer name	Kernel size	Channel map (out×in)	Template size	Search size	Output channels	Stride	CISAM
Input	-	-	127×127	255×255	3	-	No
Conv1_1	3×3	64×3	125×125	253×253	64	1	No
Conv1_2	3×3	64×64	123×123	251×251	64	1	No
Pool1	2×2	-	61×61	125×125	64	2	No
Conv2_1	3×3	128×64	59×59	123×123	128	1	No
Conv2_2	3×3	128×128	57×57	121×121	128	1	Yes
Pool2	2×2	-	28×28	60×60	128	2	No
Conv3_1	3×3	256×128	26×26	58×58	256	1	No
Conv3_2	3×3	256×256	24×24	56×56	256	1	No
Conv3_3	3×3	256×256	22×22	54×54	256	1	Yes
Pool3	2×2	-	11×11	27×27	256	2	No
Conv4_1	3×3	512×256	9×9	25×25	512	1	No
Conv4_2	3×3	512×512	7×7	23×23	512	1	No
Conv4_3	3×3	512×512	5×5	21×21	512	1	Yes
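The template and search sizes in the table above follow from ordinary unpadded convolution and pooling arithmetic. The sketch below reproduces them in Python, assuming no padding and floor rounding in the pooling layers (an assumption that matches the listed numbers but is not stated in the source). The names `LAYERS`, `out_size`, and `trace` are illustrative only. The response-map sizes at the end assume a SiamFC-style sliding cross-correlation of the template features over the search features at each CISAM layer; which layers the paper actually fuses is not given here.

```python
from typing import List, Tuple

# (layer name, kernel size, stride, CISAM attached?) copied from the table.
LAYERS = [
    ("Conv1_1", 3, 1, False),
    ("Conv1_2", 3, 1, False),
    ("Pool1", 2, 2, False),
    ("Conv2_1", 3, 1, False),
    ("Conv2_2", 3, 1, True),
    ("Pool2", 2, 2, False),
    ("Conv3_1", 3, 1, False),
    ("Conv3_2", 3, 1, False),
    ("Conv3_3", 3, 1, True),
    ("Pool3", 2, 2, False),
    ("Conv4_1", 3, 1, False),
    ("Conv4_2", 3, 1, False),
    ("Conv4_3", 3, 1, True),
]


def out_size(size: int, kernel: int, stride: int) -> int:
    """Side length after an unpadded conv/pool layer (floor rounding, assumed)."""
    return (size - kernel) // stride + 1


def trace(input_size: int) -> List[Tuple[str, int, bool]]:
    """Propagate a square input through LAYERS and record each output size."""
    sizes = []
    size = input_size
    for name, kernel, stride, cisam in LAYERS:
        size = out_size(size, kernel, stride)
        sizes.append((name, size, cisam))
    return sizes


template = trace(127)  # template branch, 127x127 input
search = trace(255)    # search branch, 255x255 input

# Assumed: valid cross-correlation at each CISAM layer, so the response
# map side length is (search size - template size + 1).
response_sizes = {}
for (name, z, cisam), (_, x, _) in zip(template, search):
    print(f"{name:8s} template {z}x{z}  search {x}x{x}")
    if cisam:
        response_sizes[name] = x - z + 1

print(response_sizes)
```

Under these assumptions the three candidate response maps have different side lengths, so they would have to be brought to a common resolution before any weighted fusion.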

Device	Product model	Memory
CPU	Intel(R) Core(TM) i7-9700, base frequency 3.0 GHz	16 GB
GPU	NVIDIA GeForce RTX 2060	6 GB



