Volume 51 Issue 10
Oct.  2022

Citation: Zhang Jinpu, Wang Yuehuan. A survey of siamese networks tracking algorithm integrating detection technology[J]. Infrared and Laser Engineering, 2022, 51(10): 20220042. doi: 10.3788/IRLA20220042

A survey of siamese networks tracking algorithm integrating detection technology

doi: 10.3788/IRLA20220042
  • Received Date: 2022-01-13
  • Rev Recd Date: 2022-03-22
  • Available Online: 2022-11-02
  • Publish Date: 2022-10-28
  • [1] Laurense V A, Goh J Y, Gerdes J C. Path-tracking for autonomous vehicles at the limit of friction[C]//Proceedings of the American Control Conference, 2017: 5586-5591.
    [2] Wang Y H, Chai H W, Yang D Y. Improved KCF real-time target tracking algorithm [J]. Journal of Huazhong University of Science and Technology, 2020, 48(1): 5. (in Chinese)
    [3] Wu Y, Lim J, Yang M H. Object tracking benchmark [J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2015, 37(9): 1834-1848. doi:  10.1109/TPAMI.2014.2388226
    [4] Li P, Wang D, Wang L, et al. Deep visual tracking: Review and experimental comparison [J]. Pattern Recognition, 2018, 76: 323-338. doi:  10.1016/j.patcog.2017.11.007
    [5] Bolme D S, Beveridge J R, Draper B A, et al. Visual object tracking using adaptive correlation filters[C]//Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2010: 2544-2550.
    [6] Henriques J F, Caseiro R, Martins P, et al. High-speed tracking with kernelized correlation filters [J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2015, 37(3): 583-596. doi:  10.1109/TPAMI.2014.2345390
    [7] Danelljan M, Hager G, Khan F S, et al. Discriminative scale space tracking [J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 39(8): 1561-1575. doi:  10.1109/TPAMI.2016.2609928
    [8] Dalal N, Triggs B. Histograms of oriented gradients for human detection[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2005: 886-893.
    [9] Van De Weijer J, Schmid C, Verbeek J. Learning color names from real-world images[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2007.
    [10] Ma C, Huang J B, Yang X, et al. Hierarchical convolutional features for visual tracking[C]//Proceedings of the IEEE International Conference on Computer Vision, 2015: 3074-3082.
    [11] Danelljan M, Robinson A, Shahbaz Khan F, et al. Beyond correlation filters: Learning continuous convolution operators for visual tracking[C]//European Conference on Computer Vision, 2016: 472-488.
    [12] Luo H B, Xu L Y, Hui B, et al. Status and prospect of target tracking based on deep learning [J]. Infrared and Laser Engineering, 2017, 46(5): 0502002. (in Chinese) doi:  10.3788/IRLA201746.0502002
    [13] Russakovsky O, Deng J, Su H, et al. ImageNet large scale visual recognition challenge [J]. International Journal of Computer Vision, 2015, 115(3): 211-252. doi:  10.1007/s11263-015-0816-y
    [14] Bertinetto L, Valmadre J, Henriques J F, et al. Fully-convolutional siamese networks for object tracking[C]//Proceedings of the European Conference on Computer Vision, 2016: 850-865.
    [15] Valmadre J, Bertinetto L, Henriques J, et al. End-to-end representation learning for correlation filter based tracking[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017: 2805-2813.
    [16] Dai K, Wang Y, Yan X. Long-term object tracking based on siamese network[C]//IEEE International Conference on Image Processing (ICIP), 2017: 3640-3644.
    [17] Chopra S, Hadsell R, LeCun Y. Learning a similarity metric discriminatively, with application to face verification[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2005: 539-546.
    [18] Li B, Wu W, Wang Q, et al. Siamrpn++: Evolution of siamese visual tracking with very deep networks[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019: 4282-4291.
    [19] Zhang Z, Peng H, Fu J, et al. Ocean: Object-aware anchor-free tracking[C]//Proceedings of the European Conference on Computer Vision, 2020, 12366: 771-787.
    [20] Yan B, Peng H, Wu K, et al. LightTrack: Finding lightweight neural networks for object tracking via one-shot architecture search[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2021: 15180-15189.
    [21] Wang G, Luo C, Sun X, et al. Tracking by instance detection: A meta-learning approach[C]//Conference on Computer Vision and Pattern Recognition, 2020: 6287-6296.
    [22] Zou Z, Shi Z, Guo Y, et al. Object detection in 20 years: A survey[DB/OL]. (2019-05-16)[2022-01-13]. https://doi.org/10.48550/arXiv.1905.05055.
    [23] Chen Y F, Wu Y, Zhang W. Survey of target tracking algorithm based on siamese network structure [J]. Computer Engineering and Applications, 2020, 56(6): 10-18. (in Chinese) doi:  10.3778/j.issn.1002-8331.1911-0127
    [24] Kristan M, Lukežič A, Drbohlav O, et al. The Eighth Visual Object Tracking VOT2020 Challenge Results[M]. Switzerland: Springer, 2020.
    [25] He A, Luo C, Tian X, et al. A twofold siamese network for real-time object tracking[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018: 4834-4843.
    [26] Wang Q, Teng Z, Xing J, et al. Learning attentions: residual attentional siamese network for high performance online visual tracking[C]//Conference on Computer Vision and Pattern Recognition, 2018: 4854-4863.
    [27] Dong X, Shen J. Triplet Loss in Siamese Network for Object Tracking[M]. Switzerland: Springer, 2018: 472-488.
    [28] Cui Z J, An J S, Cui T S. Siamese networks tracking algorithm integrating channel-interconnection-spatial attention [J]. Infrared and Laser Engineering, 2021, 50(3): 20200148. (in Chinese) doi:  10.3788/IRLA20200148
    [29] Li B, Yan J, Wu W, et al. High performance visual tracking with siamese region proposal network[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018: 8971-8980.
    [30] Ren S, He K, Girshick R, et al. Faster R-CNN: Towards real-time object detection with region proposal networks [J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 39(6): 1137-1149. doi:  10.1109/TPAMI.2016.2577031
    [31] Wang Q, Zhang L, Bertinetto L, et al. Fast online object tracking and segmentation: A unifying approach[C]//Conference on Computer Vision and Pattern Recognition, 2019: 1328-1338.
    [32] Chen B X, Tsotsos J K. Fast visual object tracking with rotated bounding boxes[DB/OL]. (2019-09-12)[2022-01-13]. https://doi.org/10.48550/arXiv.1907.03892.
    [33] Zhou W, Wen L, Zhang L, et al. SiamMan: Siamese motion-aware network for visual tracking[DB/OL]. (2020-01-18)[2022-01-13]. https://doi.org/10.48550/arXiv.1912.05515.
    [34] Liao B, Wang C, Wang Y, et al. Pg-net: Pixel to global matching network for visual tracking[C]//European Conference on Computer Vision, 2020: 429-444.
    [35] Zhu Z, Wang Q, Li B, et al. Distractor-aware siamese networks for visual object tracking[C]//Proceedings of the European Conference on Computer Vision, 2018: 101-117.
    [36] He K, Zhang X, Ren S, et al. Deep residual learning for image recognition[C]//Conference on Computer Vision and Pattern Recognition, 2016: 770-778.
    [37] Zhang Z, Peng H. Deeper and wider siamese networks for real-time visual tracking[C]//Conference on Computer Vision and Pattern Recognition, 2019: 4586-4595.
    [38] Li B, Wu W, Wang Q, et al. Siamrpn++: Evolution of siamese visual tracking with very deep networks[C]//Conference on Computer Vision and Pattern Recognition, 2019: 4282-4291.
    [39] Lin T Y, Dollár P, Girshick R, et al. Feature pyramid networks for object detection[C]//Conference on Computer Vision and Pattern Recognition, 2017: 936-944.
    [40] Guo D, Wang J, Cui Y, et al. SiamCAR: siamese fully convolutional classification and regression for visual tracking[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2020: 6268-6276.
    [41] Xu Y, Wang Z, Li Z, et al. SiamFC++: Towards robust and accurate visual tracking with target estimation guidelines[C]//Proceedings of the AAAI Conference on Artificial Intelligence, 2020: 12549-12556.
    [42] Tian Z, Shen C, Chen H, et al. FCOS: Fully convolutional one-stage object detection[C]//Proceedings of the IEEE International Conference on Computer Vision, 2019: 9627-9636.
    [43] Chen Z, Zhong B, Li G, et al. Siamese box adaptive network for visual tracking[C]//Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2020: 6667-6676.
    [44] Zhang Z, Liu Y, Li B, et al. Toward accurate pixelwise object tracking via attention retrieval [J]. IEEE Transactions on Image Processing, 2021, 30: 8553-8566. doi:  10.1109/TIP.2021.3117077
    [45] Cui Y, Jiang C, Wang L, et al. Fully convolutional online tracking[DB/OL]. (2021-09-26)[2022-01-13]. https://doi.org/10.48550/arXiv.2004.07109.
    [46] Zhou X, Wang D, Krähenbühl P. Objects as points[DB/OL]. (2019-04-29)[2022-01-13]. https://doi.org/10.48550/arXiv.1904.07850.
    [47] Law H, Deng J. Cornernet: Detecting objects as paired keypoints[C]//Proceedings of the European Conference on Computer Vision, 2018: 765-781.
    [48] Gao P, Yuan R, Wang F, et al. Siamese attentional keypoint network for high performance visual tracking [J]. Knowledge-based Systems, 2020, 193: 105448. doi:  10.1016/j.knosys.2019.105448
    [49] Peng S, Wang K, Yu Y, et al. Accurate anchor free tracking[DB/OL]. (2020-06-13)[2022-01-13]. https://doi.org/10.48550/arXiv.2006.07560.
    [50] Du F, Liu P, Zhao W, et al. Correlation-guided attention for corner detection based visual tracking[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2020: 6835-6844.
    [51] Yan B, Zhang X, Wang D, et al. Alpha-refine: Boosting tracking performance by precise bounding box estimation[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2021: 5289-5298.
    [52] Ma Z, Wang L, Zhang H, et al. Rpt: Learning point set representation for siamese visual tracking[C]//European Conference on Computer Vision, 2020: 653-665.
    [53] Yang Z, Liu S, Hu H, et al. Reppoints: Point set representation for object detection[C]//Proceedings of the IEEE International Conference on Computer Vision, 2019: 9657-9666.
    [54] Sauer A, Aljalbout E, Haddadin S. Tracking holistic object representations[DB/OL]. (2019-08-06)[2022-01-13]. https://doi.org/10.48550/arXiv.1907.12920.
    [55] Yu Y, Xiong Y, Huang W, et al. Deformable siamese attention networks for visual object tracking[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2020: 6727-6736.
    [56] Xu T, Feng Z H, Wu X J, et al. AFAT: Adaptive failure-aware tracker for robust visual object tracking[DB/OL]. (2020-05-27)[2022-01-13]. https://doi.org/10.48550/arXiv.2005.13708.
    [57] Zhang L, Gonzalez-Garcia A, van de Weijer J, et al. Learning the model update for siamese trackers[C]//Proceedings of the IEEE International Conference on Computer Vision, 2019: 4009-4018.
    [58] Zhou J, Wang P, Sun H. Discriminative and robust online learning for siamese visual tracking[C]//Proceedings of the AAAI Conference on Artificial Intelligence, 2020, 34(7): 13017-13024.
    [59] Wang G, Luo C, Xiong Z, et al. Spm-tracker: Series-parallel matching for real-time visual object tracking[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019: 3643-3652.
    [60] Sung F, Yang Y, Zhang L, et al. Learning to compare: Relation network for few-shot learning[C]//Proceedings of the IEEE conference on computer vision and pattern recognition, 2018: 1199-1208.
    [61] Yan B, Zhao H, Wang D, et al. “Skimming-perusal” tracking: A framework for real-time and robust long-term tracking[C]//Proceedings of the IEEE International Conference on Computer Vision, 2019: 2385-2393.
    [62] Zhang H W, Li X X, Zhu B, et al. Two-stage object tracking method based on siamese neural network [J]. Infrared and Laser Engineering, 2021, 50(9): 20200491. (in Chinese) doi:  10.3788/IRLA20200491
    [63] Fan H, Ling H. Siamese cascaded region proposal networks for real-time visual tracking[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019: 7952-7961.
    [64] Li Q, Qin Z, Zhang W, et al. Siamese keypoint prediction network for visual object tracking[DB/OL]. (2020-06-07)[2022-01-13]. https://doi.org/10.48550/arXiv.2006.04078.
    [65] Bhat G, Danelljan M, Van Gool L, et al. Learning discriminative model prediction for tracking[C]//Proceedings of the IEEE International Conference on Computer Vision, 2019: 6181-6190.
    [66] Danelljan M, van Gool L, Timofte R. Probabilistic regression for visual tracking[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2020: 7181-7190.
    [67] Choi J, Kwon J, Lee K M. Visual Tracking by Tridentalign and Context Embedding[M]. Switzerland: Springer, 2020: 504-520.
    [68] Huang L, Zhao X, Huang K. Globaltrack: A simple and strong baseline for long-term tracking[C]//Proceedings of the AAAI Conference on Artificial Intelligence, 2020, 34(7): 11037-11044.
    [69] Voigtlaender P, Luiten J, Torr P H S, et al. Siam R-CNN: Visual tracking by re-detection[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2020: 6577-6587.
    [70] Dave A, Tokmakov P, Schmid C, et al. Learning to track any object[DB/OL]. (2019-10-25)[2022-01-13]. https://doi.org/10.48550/arXiv.1910.11844.
    [71] Huang L, Zhao X, Huang K. Bridging the gap between detection and tracking: A unified approach[C]//Proceedings of the IEEE International Conference on Computer Vision, 2019: 3998-4008.
    [72] Danelljan M, Bhat G, Khan F S, et al. ATOM: Accurate tracking by overlap maximization[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019: 4655-4664.
    [73] Jiang B, Luo R, Mao J, et al. Acquisition of localization confidence for accurate object detection[C]//Proceedings of the European Conference on Computer Vision, 2018.
    [74] Finn C, Abbeel P, Levine S. Model-agnostic meta-learning for fast adaptation of deep networks[C]//International Conference on Machine Learning, 2017: 1126-1135.
    [75] Antoniou A, Edwards H, Storkey A. How to train your maml[C]//International Conference on Learning Representations, 2019.
    [76] Li Z, Zhou F, Chen F, et al. Meta-SGD: Learning to learn quickly for few-shot learning[DB/OL].(2017-09-28)[2022-01-13]. https://doi.org/10.48550/arXiv.1707.09835.
    [77] Kristan M, Leonardis A, Matas J, et al. The Sixth Visual Object Tracking VOT2018 Challenge Results[M]. Switzerland: Springer, 2018: 3-53.
    [78] Huang L, Zhao X, Huang K. GOT-10k: A large high-diversity benchmark for generic object tracking in the wild [J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019, 43(5): 1562-1577.
    [79] Fan H, Lin L, Yang F, et al. Lasot: A high-quality benchmark for large-scale single object tracking[C]//Conference on Computer Vision and Pattern Recognition, 2019: 5374-5383.
    [80] Han G, Du H, Liu J, et al. Fully conventional anchor-free siamese networks for object tracking [J]. IEEE Access, 2019, 7: 123934-123943. doi:  10.1109/ACCESS.2019.2937998
    [81] Danelljan M, Gool L Van, Timofte R. Probabilistic regression for visual tracking[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2020: 7181-7190.
    [82] Choi J, Chun D, Kim H, et al. Gaussian yolov3: An accurate and fast object detector using localization uncertainty for autonomous driving[C]//Proceedings of the IEEE International Conference on Computer Vision, 2019: 502-511.
    [83] He Y, Zhu C, Wang J, et al. Bounding box regression with uncertainty for accurate object detection[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019: 2888-2897.
    [84] Zhu B, Wang J, Jiang Z, et al. Autoassign: Differentiable label assignment for dense object detection[DB/OL]. (2020-11-25)[2022-01-13]. https://doi.org/10.48550/arXiv.2007.03496.
    [85] Li X, Wang W, Wu L, et al. Generalized focal loss: Learning qualified and distributed bounding boxes for dense object detection[C]//Advances in Neural Information Processing Systems, 2020.
    [86] Oksuz K, Cam B C, Kalkan S, et al. Imbalance problems in object detection: A review [J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020, 43(10): 3388-3415.
    [87] Vaswani A, Shazeer N, Parmar N, et al. Attention is all you need[C]//Advances in Neural Information Processing Systems, 2017: 5998-6008.
    [88] Dosovitskiy A, Beyer L, Kolesnikov A, et al. An image is worth 16×16 words: Transformers for image recognition at scale[C]//International Conference on Learning Representations, 2021.
    [89] Chen X, Yan B, Zhu J, et al. Transformer tracking[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2021: 8126-8135.
    [90] Yan B, Peng H, Fu J, et al. Learning spatio-temporal transformer for visual tracking[C]//Proceedings of the IEEE International Conference on Computer Vision, 2021: 10448-10457.
    [91] Wang N, Zhou W, Wang J, et al. Transformer meets tracker: Exploiting temporal context for robust visual tracking[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2021: 1571-1580.
    [92] Lin L, Fan H, Xu Y, et al. SwinTrack: A simple and strong baseline for transformer tracking[DB/OL]. (2021-12-08)[2022-01-13]. https://doi.org/10.48550/arXiv.2112.00995.

  • School of Artificial Intelligence and Automation, Huazhong University of Science and Technology, Wuhan 430074, China

Abstract: In recent years, siamese tracking networks have achieved promising performance in visual tracking. However, there is still large room for improvement in the challenge of target state estimation and complex aberrances for siamese trackers. With the success of deep learning in object detection, more and more object detection technologies are used to guide object tracking. This survey reviews the siamese tracking algorithms integrating detection technologies. Firstly, we introduce the relation and difference between detection and tracking, and analyze the feasibility of improving siamese tracking algorithms by detection technologies. Then, we elaborate the existing siamese trackers based on different detection frameworks. Furthermore, we conduct extensive experiments to compare and analyze the representative methods on the popular OTB100, VOT2018, GOT-10k, and LaSOT benchmarks. Finally, we summarize our manuscript and prospect the further trends of visual tracking.

    • Object tracking is one of the research hotspots in computer vision. Its basic pipeline is: given the bounding box of a target in the initial frame of a video sequence, infer the position and shape of that target in subsequent frames. Tracking is widely applied in autonomous driving, video surveillance, human-computer interaction, UAV reconnaissance and other fields [1-2]. Despite great achievements in these applications, current trackers still cannot cope with all scenarios, owing to the complexity and diversity of tracking scenes and to deformation, occlusion, motion blur, illumination change, scale change and fast motion of the target [3]. Designing accurate and robust tracking algorithms therefore remains a challenging task.

      Mainstream tracking algorithms can currently be divided into correlation-filter-based trackers and siamese-network-based trackers [4]. Correlation-filter methods are discriminative: they cast tracking as a foreground/background binary classification problem solved by ridge regression. Traditional correlation filters [5-7] use hand-crafted features such as HOG [8] and CN [9]; they are fast and run in real time on a CPU, but their accuracy is limited. Correlation filters combined with deep features [10-12] obtain more robust representations, but the extra computation greatly reduces speed and hinders deployment. Moreover, such methods extract features with models pre-trained offline on datasets such as ImageNet [13] and cannot optimize the whole model end-to-end for the tracking task. Siamese-network-based trackers [14-16] use two weight-sharing branches [17] to turn tracking into a similarity measurement between a template and a search region; they can be optimized end-to-end and achieve a good balance between accuracy and speed. In recent studies [18-20], siamese trackers have surpassed deep-feature correlation filters in accuracy and are easier to deploy on edge devices, so they have become the current research focus.

      Object tracking is closely related to object detection. Wang [21] regards tracking as a special detection task, namely instance detection. The relation and difference between detection and tracking are illustrated in Fig.1. Both tasks recognize a target and localize it precisely in complex scenes. The difference is that object detection deals with several predefined categories: a detector only detects objects of these categories and its output does not distinguish instances within a category, whereas object tracking is category-agnostic: an arbitrary instance is given only in the initial frame and that specific instance must be found in subsequent frames. Therefore, with proper initialization, a detector can quickly be converted into a tracker by learning a new instance from a single image. In recent years, with the mature application of deep learning in object detection [22], more and more researchers have borrowed detection methods to guide the design of trackers and remedy the deficiencies of existing tracking methods.

      Figure 1.  The relation and difference between object detection and object tracking

      Following the taxonomy of deep-learning detection frameworks, namely the state estimation scheme (anchor-based/anchor-free), the number of stages (one-stage/two-stage) and other categories, this paper gives a complete review of siamese tracking algorithms guided by different detection frameworks, and analyzes their results on the OTB100, VOT2018, GOT-10k and LaSOT datasets, aiming to solve key tracking problems with object detection technology and to provide a reference for the further development of object tracking.

    • The siamese network [17] is a similarity-measurement approach. In recent years, siamese-network-based methods have attracted wide attention for their superior performance and speed [23-24]. SiamFC [14] is the most representative early siamese tracker; its architecture is shown in Fig.2. The target template $z$ and the candidate search image $x$ are fed into two weight-sharing CNN branches $\varphi$, and a cross-correlation function $g$ measures their similarity, written as $f(x,z) = g(\varphi (z),\varphi (x))$; the position with maximum similarity is taken as the target location. SiamFC makes full use of offline data to learn, end-to-end, a general matching function that describes object motion and appearance, and can localize targets unseen during training.

      Figure 2.  The architecture of SiamFC
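      As an illustrative sketch (assuming a PyTorch backbone $\varphi$ that maps images to feature maps; this is not taken from the original implementation), the SiamFC correlation $f(x,z) = g(\varphi(z),\varphi(x))$ can be written as:

```python
import torch
import torch.nn.functional as F

def siamfc_response(phi_z: torch.Tensor, phi_x: torch.Tensor) -> torch.Tensor:
    """Cross-correlate template features phi_z (1, C, h, w) over search features
    phi_x (1, C, H, W); the peak of the returned response map gives the target
    position, i.e. g is realised as a convolution with the template as kernel."""
    return F.conv2d(phi_x, phi_z)  # shape (1, 1, H-h+1, W-w+1)

# Usage with a hypothetical weight-sharing backbone `phi`:
#   response = siamfc_response(phi(z), phi(x))
#   peak_index = response.view(-1).argmax()
```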

      SiamFC has received extensive attention since it was proposed, and many trackers improve upon it. CFNet [15] embeds the correlation filter as a differentiable layer into the siamese framework for end-to-end learning. SA-Siam [25] designs a twofold siamese network that fuses appearance and semantic features to improve generalization. RASNet [26] introduces attention mechanisms into the siamese network to learn more discriminative features. Dong [27] uses a triplet loss to further exploit the latent relations among samples and obtain better training. Cui [28] designs a channel-interconnection-spatial attention module to enhance the adaptability and discriminability of the model.

      Although the above siamese trackers and their variants developed rapidly, several shortcomings kept their accuracy below that of contemporary correlation-filter methods: (1) scale estimation by a multi-level pyramid is inaccurate, the aspect ratio of the box cannot change, and the computation is heavy; (2) they are easily disturbed by semantically similar objects: as shown in Fig.3, every semantic object in the search region produces a large response; (3) only the template of the first frame is used, so performance relies entirely on the generalization ability of the siamese network, which makes it difficult to handle target variations in complex scenes.

      Figure 3.  The impact of distractors on SiamFC

      To address these problems, researchers observed that bounding-box regression in object detection gives excellent target state estimation, that two-stage structures capture finer differences among samples through stage-wise processing, and that an independent localization-quality estimate keeps classification and regression consistent. These detection techniques can effectively remedy the limitations of siamese tracking models, so this paper summarizes the works that improve siamese trackers with such detection-inspired models and ideas.

    • This section reviews and analyzes siamese trackers that integrate detection technology in detail. Existing methods are classified by state estimation and stage number, and some special methods are grouped separately as "others"; the overall framework is shown in Fig.4. The anchor-based structure mainly contains RPN-based methods; the anchor-free structure is divided into FCOS-based and keypoint-based methods; the one-stage part introduces how model update optimizes siamese trackers; the remaining methods are divided into IOUNet-based prediction and detector-transformed trackers.

      Figure 4.  Taxonomy of detector-based siamese tracking methods

    • SiamRPN [29] is the first work that integrates a detection network into a siamese tracker. Borrowing the region proposal network (RPN) of Faster R-CNN [30], it places multiple anchors at each sliding-window position and predicts anchor offsets for state estimation. In this way boxes of various scales and aspect ratios can be predicted and multi-scale search is avoided. The architecture of SiamRPN is shown in Fig.5: it consists of a siamese network and an RPN, unified in an end-to-end framework by convolutions that raise the channel dimension. The RPN outputs a classification branch separating the target from the background and a regression branch predicting the target scale. Inference is a one-shot detection process that runs far beyond real time, reaching 160 FPS on a GTX 1060.

      With SiamRPN, the anchor mechanism and the RPN gradually became a new paradigm for siamese tracking, and many later works improved it [31-35]. DaSiamRPN [35] introduces hard negative mining, adding semantic hard negatives during training to overcome the data imbalance between non-semantic background and semantic distractors. SiamMask [31] adds a mask prediction branch, unifies video object tracking and video object segmentation, and obtains more accurate rotated boxes with the help of segmentation. SiamMan [33] adds a localization branch to better adapt to the motion patterns of different targets and reduce the dependence on preset anchors.

      Mainstream detectors extract features with deep networks of stronger semantic discriminability such as ResNet [36], but the performance of siamese trackers drops when deep networks are used directly, because deep networks rely on padding to preserve the feature resolution, and padding breaks the spatial translation invariance of the siamese architecture. To solve this problem, SiamDW [37] designs a residual cropping module that removes the extra padding, while SiamRPN++ [38] proposes a simple yet effective spatial-aware sampling strategy that shifts positive samples around the center with a uniform distribution, eliminating the spatial bias caused by padding. In addition, SiamRPN++ proposes an FPN-like [39] layer-wise aggregation module to learn rich feature representations from shallow to deep layers, and a lighter depth-wise cross-correlation to balance the parameters of the template and search branches. SiamRPN++ is the first data-driven (end-to-end learned) deep method to outperform correlation filters, and it runs in real time on a GPU.

      Figure 5.  The architecture of SiamRPN
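      The regression branch of such trackers predicts offsets relative to the preset anchors. A minimal sketch of the standard decoding (the Faster R-CNN box parameterisation that SiamRPN adopts; variable names are illustrative) is:

```python
import torch

def decode_anchors(anchors: torch.Tensor, deltas: torch.Tensor) -> torch.Tensor:
    """Apply predicted offsets (dx, dy, dw, dh) to anchors (cx, cy, w, h).
    anchors, deltas: (N, 4) tensors -> refined boxes (N, 4) in (cx, cy, w, h)."""
    cx = anchors[:, 0] + deltas[:, 0] * anchors[:, 2]   # shift centre by a fraction of anchor size
    cy = anchors[:, 1] + deltas[:, 1] * anchors[:, 3]
    w = anchors[:, 2] * torch.exp(deltas[:, 2])          # scale width/height exponentially
    h = anchors[:, 3] * torch.exp(deltas[:, 3])
    return torch.stack([cx, cy, w, h], dim=1)
```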

    • Although anchor-based and RPN-based siamese trackers achieve excellent performance, they still have several limitations:

      (1) The classification score of anchor-based methods measures the similarity between the anchor and the target rather than the confidence of the target itself, which easily produces false positives, i.e. an unreasonably high score in a non-target region;

      (2) The preset anchors rely on substantial prior knowledge (scales/aspect ratios) and require heuristic hyper-parameter tuning, which harms generalization;

      (3) The regression branch of anchor-based methods is trained only on anchors whose IOU with the ground truth exceeds a threshold (0.6); during tracking, when the overlap between the anchors and the target is small, the boxes can hardly be adjusted, so weak predictions cannot be rectified [19];

      (4) The quality of the target state estimate can only be judged by the classification confidence; there is no independent quality-assessment mechanism [40-41].

      To address these problems, anchor-free regression from object detection was introduced into tracking; it avoids anchor-related hyper-parameters and is more flexible and general. Current anchor-free regression methods mainly fall into FCOS-based and keypoint-based approaches.

      (1) FCOS-based methods

      SiamFC++ [41] first borrows the anchor-free idea of the FCOS [42] detector to solve the above problems, as shown in Fig.6. SiamFC++ is a per-pixel prediction model: each location on the feature map is treated as a training sample and the distances from positive locations to the four sides of the ground-truth box are regressed, thus avoiding the mismatch of anchors and the dependence on hyper-parameters. Moreover, anchor-free regression considers all pixels inside the annotated box during training, so the target size can be predicted even if only a small region is identified as foreground, which allows the tracker to correct poor predictions during inference to some extent. The authors also propose a Prior Spatial Score that evaluates the quality of the state estimate and suppresses low-quality boxes produced far from the target center. With a simpler, more flexible structure and faster inference, SiamFC++ laid the foundation for subsequent anchor-free siamese trackers.

      Figure 6.  The architecture of SiamFC++
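      A minimal sketch of this per-pixel (FCOS-style) box decoding, with illustrative names and shapes, is:

```python
import torch

def decode_ltrb(points: torch.Tensor, ltrb: torch.Tensor) -> torch.Tensor:
    """Per-pixel anchor-free decoding: each feature-map location (mapped back to
    image coordinates, points: (N, 2)) predicts its distances (l, t, r, b) to the
    four box sides; the box follows directly, with no anchor priors involved."""
    x1 = points[:, 0] - ltrb[:, 0]
    y1 = points[:, 1] - ltrb[:, 1]
    x2 = points[:, 0] + ltrb[:, 2]
    y2 = points[:, 1] + ltrb[:, 3]
    return torch.stack([x1, y1, x2, y2], dim=1)  # (N, 4) boxes in (x1, y1, x2, y2)
```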

      Contemporaneous works include [19, 40, 43-45]. SiamBAN [43] uses elliptical labels when dividing positive and negative samples during training, which marks them more accurately than traditional rectangular labels. Ocean [19] additionally designs a feature-alignment module that aligns the sampling positions of the classification kernels with the box-regression results, so as to learn target-aware features and adapt to scale changes. OceanPlus [44] adds an attention retrieval network and a multi-resolution multi-stage segmentation network on top of Ocean for instance segmentation. LightTrack [20] focuses on lightweight network design for mobile devices and uses neural architecture search (NAS) to design a lighter, more efficient anchor-free tracker.

      (2) Keypoint-based methods

      Another branch borrows the keypoint (center/corner) prediction models of CenterNet [46] and CornerNet [47] for siamese trackers. The earliest method, SATIN [48], predicts the target center and the top-left and bottom-right corners. ASFN [49] follows CenterNet and predicts the target center, center offset and scale. Du [50] argues that the conventional cross-correlation cannot encode the spatial information of corners and proposes pixel-wise correlation, where every position of the template feature is correlated with the search feature; Alpha-Refine [51] is a similar work.
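      As a rough sketch (not the exact implementation of any of the cited trackers), pixel-wise correlation can be realised by treating each template position as a 1×1 kernel:

```python
import torch
import torch.nn.functional as F

def pixelwise_correlation(phi_z: torch.Tensor, phi_x: torch.Tensor) -> torch.Tensor:
    """Every spatial position of the template feature map is correlated with the
    search features, preserving the template's spatial layout for corner/keypoint
    heads. phi_z: (1, C, h, w), phi_x: (1, C, H, W) -> (1, h*w, H, W)."""
    _, c, h, w = phi_z.shape
    kernels = phi_z.permute(0, 2, 3, 1).reshape(h * w, c, 1, 1)  # h*w kernels of size 1x1
    return F.conv2d(phi_x, kernels)
```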

      RPT [52], the winner of the VOT2020 [24] short-term challenge, is inspired by RepPoints [53] and represents the target state as a set of feature points (including semantic keypoints and boundary extreme points), which improves the modeling of pose and geometric-structure changes.

    • One-stage siamese trackers are structured like one-stage detectors: they densely sample all positions of the feature map and either generate candidate boxes (anchor-based) or make per-pixel (anchor-free) classification and regression predictions directly; the main methods have been described in detail in Section 2.1.

      The above one-stage methods turn tracking into independent per-frame detection: the target template is initialized only in the first frame and kept fixed, so performance depends entirely on the generalization ability of the model, and large appearance changes often cause failure when the model is never updated. Thanks to the simplicity of the one-stage structure, the tracking community has proposed many ways of integrating model update into siamese trackers [54-58]. Zhang [57] proposes UpdateNet, a convolutional update module that jointly learns from the first-frame template, the accumulated historical template and the current-frame template to generate an optimal template. UpdateNet solves the exponential decay of template information over time caused by linear update, and updates different feature dimensions to different degrees, improving adaptability to all kinds of dynamic changes. SiamAttn [55] proposes a cross-attention mechanism that encodes the rich contextual information of the search branch into the template branch, providing an implicit template update. THOR [54] builds a global dynamic target representation, collecting templates that are as far apart as possible in feature space into a template set to enrich the diversity of the tracked object's features. AFAT [56] designs a quality prediction network that extracts implicit decision information from multi-frame response maps with convolutions and an LSTM, giving reliable, robust spatio-temporal prediction of potential tracking failures. FCOT [45] exploits the simple anchor-free structure to optimize the regression branch online, so that the tracker handles target deformation more effectively.
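      A hypothetical sketch in the spirit of UpdateNet follows (the fusion head and layer sizes are illustrative assumptions, not the published architecture); it replaces a fixed linear moving average with a learned fusion of the three templates:

```python
import torch
import torch.nn as nn

class TemplateUpdater(nn.Module):
    """Hypothetical template-update head: fuse the first-frame template, the
    accumulated template and the current-frame template with 1x1 convolutions."""
    def __init__(self, channels: int):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Conv2d(3 * channels, channels, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=1),
        )

    def forward(self, z0, z_acc, z_cur):
        # The residual connection to the first-frame template keeps the update
        # anchored to the ground-truth appearance given in the initial frame.
        return z0 + self.fuse(torch.cat([z0, z_acc, z_cur], dim=1))
```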

    • A major contradiction in object tracking is the difficulty of balancing robustness (adapting to target appearance changes) and strong discriminability (not drifting to similar objects). Siamese trackers are template-based methods and, as mentioned in Section 1, their discriminability against semantic distractors is weak; coarse-to-fine matching with a two-stage structure can effectively alleviate this problem.

      Inspired by the two-stage structure of Faster R-CNN, SPM [59] trains robustness and discriminability in two separate stages. The coarse matching stage outputs several top-scoring candidate targets and feeds them to the fine matching stage, which distinguishes the target from background distractors by few-shot learning [60] and performs box regression; the outputs of the two stages are finally fused with weights. SPLT [61] adopts a similar idea and adds a re-detection module for long-term tracking. Zhang et al. [62] adaptively update the template through correlation-filter modulation in the first stage of two-stage tracking and use temporal information to filter out easily distinguishable negative samples. CGACD [50] designs an anchor-free two-stage corner detection network so as to better distinguish the corners of the target from those of background objects.

      Fan [63] proposes the multi-level tracking framework C-RPN, which cascades several RPNs to perform layer-by-layer hard negative sampling, solving the imbalance between positive and negative samples while fully exploiting the features of each level for robust visual tracking. Similarly, SiamKPN [64] cascades several anchor-free keypoint prediction heads and achieves coarse-to-fine matching by gradually shrinking the coverage of the label heatmaps.

    • The anchor-based/anchor-free and one-stage/two-stage methods introduced above are all built on structurally symmetric siamese networks similar to Fig.2. This subsection complements them with siamese-like trackers [65-72] that keep the parallel siamese architecture but are not strictly symmetric; they likewise draw on object detection technology.

      (1) IOUNet-based state estimation

      Martin et al. [72, 65] argue that the quality of state estimation cannot simply be measured by the classification quality and, following IoU-Net [73], evaluate the state estimate through its IOU; the overall framework is shown in Fig.7. Unlike previous siamese trackers, the reference branch finally produces not a feature map but an encoded modulation vector. The test branch combines this modulation vector to predict the IOU between the tracking box and the ground truth, and refines the box by gradient ascent that maximizes the predicted IOU. PrDiMP [66] further proposes a probabilistic regression method that predicts the conditional probability density of the target state and models annotation noise and uncertainty; the network is trained by minimizing the KL divergence between the two distributions so that it can express the uncertainty of the state estimate.

      Figure 7.  The architecture of IOU-Predictor
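      A minimal sketch of the gradient-ascent refinement step is shown below; `iou_predictor` stands for any differentiable callable mapping a candidate box (conditioned on the modulation vector) to a scalar predicted IOU, which is an assumed interface rather than a specific library API:

```python
import torch

def refine_box(box: torch.Tensor, iou_predictor, steps: int = 5, lr: float = 1.0) -> torch.Tensor:
    """Refine a coarse box (1, 4) by gradient ascent on the predicted IOU."""
    box = box.clone().detach().requires_grad_(True)
    for _ in range(steps):
        predicted_iou = iou_predictor(box)   # scalar, differentiable w.r.t. the box
        predicted_iou.backward()
        with torch.no_grad():
            box += lr * box.grad             # move the box toward higher predicted overlap
            box.grad.zero_()
    return box.detach()
```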

      (2) Detector-transformed trackers

      Another family of methods [67-71] tracks directly with a detector: the information of the target instance is encoded into the image to be detected in some way, turning category-aware detection into instance-aware tracking. Huang [71] proposes a general framework that narrows the gap between detection and tracking; it is built on Faster R-CNN and uses the meta-learning method MAML [74] to learn, from few samples and few iterations, an instance classifier that separates the target from distractors. Reference [21] also initializes the detector with meta-learning, and borrows MAML++ [75] and Meta-SGD [76] to make the meta-training more stable.

      GlobalTrack [68] injects target information into both the RPN and the prediction head of Faster R-CNN to guide the detector to search for the specific instance, avoiding the instability of meta-learning updates and thus being better suited to long-term tracking. LTAO [70] trains the weights of a linear classifier end-to-end as guidance, which gives stronger discriminability. Siam R-CNN [69] uses both the first-frame annotation and the previous-frame detection as guidance for re-detection, and designs a tracklet-based dynamic programming algorithm that can re-detect the tracked object after long occlusions. TACT [67] adds a TridentAlign module to the two-stage structure, mapping the target features into multiple spatial dimensions to form a feature pyramid that copes with scale changes. All these detector-based trackers can search the whole image, which helps re-capture the target after violent motion and in long-term tracking.

    • This section evaluates the more than forty trackers discussed above on four public datasets: OTB100 [3], VOT2018 [77], GOT-10k [78] and LaSOT [79]. The datasets and their evaluation protocols are introduced first, and the experimental results are then compared and analyzed. All results come from the original papers or the official source code.

    • (1) OTB100

      OTB100, proposed by Wu et al. [3] in 2015, is one of the most commonly used tracking datasets. It contains 100 fully annotated video sequences covering 11 tracking attributes: illumination variation, scale variation, occlusion, deformation, motion blur, fast motion, in-plane rotation, out-of-plane rotation, out-of-view, background clutter and low resolution. The evaluation metrics are distance precision and overlap success, and testing uses one-pass evaluation (OPE).

      (2) VOT2018

      The VOT2018 [77] dataset contains 60 sequences annotated with rotated boxes, covering six attributes: occlusion, illumination change, motion change, size change, camera motion and unassigned. VOT has a restart mechanism: the tracker is re-initialized when the overlap drops to zero. The evaluation metrics of VOT2018 are accuracy (A), robustness (R) and expected average overlap (EAO).

      (3) GOT-10k

      GOT-10k [78] is a large-scale generic object tracking dataset with more than 10 k video sequences, 563 classes and over 1.5 million annotated boxes, covering as many challenging real-world scenarios as possible. Its training and test sets do not overlap, which guarantees the generalization ability of models. The evaluation metrics are average overlap (AO) and success rate (SR).

      (4) LaSOT

      LaSOT [79] contains 1,400 videos and more than 3.5 M manually annotated frames and is currently the largest densely annotated single-object tracking dataset. It covers 70 classes with 20 sequences per class and an average of 2,512 frames per sequence, emphasizing long-term tracking with relatively high difficulty. 280 sequences are reserved for testing; the evaluation follows OTB, with an additional normalized precision metric.
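      To make the overlap-based metrics used by these benchmarks concrete, a minimal sketch of the OTB-style success/AUC computation from per-frame IoUs is given below (an illustrative helper, not the official toolkit code):

```python
import numpy as np

def overlap_success_auc(ious: np.ndarray, thresholds: np.ndarray = np.linspace(0, 1, 21)) -> float:
    """For each threshold, the success rate is the fraction of frames whose IoU with
    the ground truth exceeds it; the reported AUC averages this rate over thresholds."""
    success = np.array([(ious > t).mean() for t in thresholds])
    return float(success.mean())

# Example: per-frame IoUs of one sequence -> a single AUC score
# auc = overlap_success_auc(np.array([0.80, 0.65, 0.0, 0.72]))
```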

      Table 1 shows the quantitative comparison of all trackers. On OTB100 and LaSOT, the top five by success rate (AUC) are RPT, DROL, CGACD, SiamRCNN and SiamCAR on OTB100, and SiamRCNN, PrDiMP, TACT, FCOT and DiMP on LaSOT; by precision (PR), the top five are RPT, DROL, SiamDW, CGACD and Ocean on OTB100, and SiamRCNN, PrDiMP, FCOT, TACT and Ocean on LaSOT. The results show that for longer sequences such as LaSOT, most leading algorithms rely on two-stage structures and model update. The balance of robustness and discriminability provided by the two-stage structure copes effectively with distractors and model drift in long-term tracking, and discriminative update methods handle the various changes of the target and the scene in time.

      | Tracker | Type (A/S) | OTB100 (AUC/PR) | LaSOT (AUC/NPR) | GOT-10k (AO/SR0.50/SR0.75) | VOT2018 (A/R/EAO) |
      |---|---|---|---|---|---|
      | SiamRPN [29] | T/1 | 0.637/0.851 | 0.457/- | -/-/- | -/-/- |
      | DaSiamRPN [35] | T/1 | 0.658/0.88 | 0.415/0.496 | -/-/- | 0.59/0.276/0.383 |
      | SiamRPN++ [38] | T/1 | 0.696/0.915 | 0.496/0.569 | 0.518/0.618/0.325 | 0.6/0.234/0.414 |
      | SiamDW [37] | T/1 | 0.674/0.923 | 0.384/0.476 | 0.416/-/- | -/-/0.27 |
      | SiamMask [31] | T/1 | -/- | -/- | 0.514/0.587/0.366 | 0.61/0.276/0.38 |
      | SiamMan [33] | T/1 | 0.705/0.919 | -/- | -/-/- | 0.605/0.183/0.462 |
      | THOR [54] | T/1 | 0.648/0.791 | -/- | 0.447/0.538/0.204 | 0.582/0.234/0.416 |
      | DROL [58] | T/1 | 0.715/0.934 | 0.537/0.624 | -/-/- | 0.616/-/0.481 |
      | SiamAttn [55] | T/1 | 0.712/0.926 | 0.56/0.648 | -/-/- | 0.636/0.16/0.47 |
      | AFAT [56] | T/1 | 0.663/0.874 | 0.492/0.574 | -/-/- | 0.605/0.239/0.419 |
      | UpdateNet [57] | T/1 | -/- | 0.475/0.56 | -/-/- | -/-/0.393 |
      | SiamFC++ [41] | F/1 | 0.683/0.896 | 0.544/0.623 | 0.595/0.695/0.479 | 0.587/0.183/0.426 |
      | AFSN [49] | F/1 | 0.675/0.868 | -/- | -/-/- | 0.589/0.204/0.398 |
      | SATIN [48] | F/1 | 0.641/0.844 | -/- | -/-/- | -/-/- |
      | SiamBAN [43] | F/1 | 0.696/0.91 | 0.514/0.598 | -/-/- | 0.597/0.178/0.452 |
      | SiamCAR [40] | F/1 | 0.697/0.91 | -/- | 0.569/0.67/0.415 | -/-/- |
      | CGACD [50] | F/1 | 0.713/0.922 | 0.518/0.626 | -/-/- | 0.615/0.173/0.449 |
      | FCAF [80] | F/1 | 0.649/0.86 | -/- | -/-/- | -/-/- |
      | FCOT [45] | F/1 | 0.693/0.913 | 0.569/0.678 | 0.64/0.763/0.517 | 0.6/0.108/0.508 |
      | PGNet [34] | F/1 | 0.691/0.892 | 0.531/0.605 | -/-/- | 0.618/0.192/0.447 |
      | Ocean [19] | F/1 | 0.684/0.92 | 0.56/- | 0.611/0.721/0.473 | 0.592/0.117/0.489 |
      | Ocean+ [44] | F/1 | -/- | -/- | -/-/- | -/-/- |
      | RPT [52] | F/- | 0.715/0.936 | -/- | 0.624/0.73/0.504 | 0.629/0.103/0.51 |
      | AlphaRef [51] | -/1 | -/- | 0.589/0.649 | -/-/- | 0.633/0.136/0.476 |
      | SiamKPN [64] | F/2 | 0.712/0.927 | 0.498/- | 0.529/0.606/0.362 | 0.606/0.192/0.44 |
      | SPLT [61] | T/2 | -/- | 0.426/0.494 | -/-/- | -/-/- |
      | CRPN [63] | T/2 | 0.663/- | 0.455/0.542 | -/-/- | -/-/- |
      | SPM [59] | T/2 | 0.687/0.889 | 0.485/- | 0.513/0.593/0.359 | 0.58/0.3/0.338 |
      | TACT [67] | T/2 | -/- | 0.575/0.66 | 0.578/0.665/0.477 | -/-/- |
      | SiamRCNN [69] | T/2 | 0.701/0.891 | 0.648/0.722 | 0.649/0.728/0.597 | 0.609/0.22/0.408 |
      | GlobalT [68] | T/2 | -/- | 0.521/0.599 | -/-/- | -/-/- |
      | LTAO [70] | T/2 | -/- | -/- | -/-/- | -/-/- |
      | ATOM [72] | others | 0.667/0.879 | 0.514/0.576 | 0.556/0.635/0.402 | 0.59/0.204/0.401 |
      | DiMP [65] | others | 0.686/0.899 | 0.569/0.648 | 0.611/0.717/0.492 | 0.597/0.153/0.44 |
      | PrDiMP [66] | others | 0.696/0.897 | 0.598/- | 0.634/0.738/0.543 | 0.618/0.165/0.442 |
      | SSD-MAML [71] | others | 0.62/- | -/- | -/-/- | -/-/- |
      | FRCNN-MAML [71] | others | 0.647/- | -/- | -/-/- | -/-/- |
      | FCOS-MAML [21] | others | 0.704/0.905 | 0.523/- | -/-/- | 0.635/0.22/0.392 |
      | Retina-MAML [21] | others | 0.712/0.926 | 0.48/- | -/-/- | 0.604/0.159/0.452 |

      Note: In the original table, bold fonts mark the top-3 results per metric. '-' means the corresponding result is not given in the original literature. 'Type' combines the anchor column 'A' (anchor-based 'T' / anchor-free 'F') and the stage column 'S' (one-stage '1' / two-stage '2'); 'others' indicates the other classes delineated in this paper.

      Table 1.  Performance comparison of siamese tracking methods on OTB100, LaSOT, GOT-10k and VOT2018

      On VOT2018, the leading methods in accuracy (A) are SiamAttn, Alpha-Refine, RPT, PGNet and DROL; in robustness (R) they are RPT, FCOT, Ocean, Alpha-Refine and DiMP; and in EAO they are RPT, FCOT, Ocean, DROL and Alpha-Refine. The restart mechanism of VOT2018 makes the robustness metric fluctuate widely (the gap between the best and worst trackers is 0.056 in accuracy but 0.197 in robustness). Most of the leading methods use flexible anchor-free structures, which can better rectify predictions with small IOU and thus avoid failures and restarts.

      On GOT-10k, the leading methods in average overlap (AO) are SiamRCNN, FCOT, PrDiMP, RPT and DiMP; in success rate at IOU threshold 0.5 (SR0.50) they are FCOT, PrDiMP, RPT, SiamRCNN and DiMP; and in success rate at threshold 0.75 (SR0.75) they are SiamRCNN, PrDiMP, FCOT, RPT and DiMP. Evidently, methods with special treatment of box prediction (two-stage prediction, uncertainty prediction, online optimization, keypoint representation, etc.) generally perform better on SR0.75.

    • Based on the above method descriptions and experimental analysis, Table 2 summarizes the advantages and disadvantages of different detection techniques for siamese tracking according to the taxonomy of this paper, from which six design lessons for detection-integrated siamese trackers are drawn: (1) the prediction head of a detection network can improve the accuracy of target state estimation; (2) anchor-free structures adapt to target deformation better than anchor-based ones; (3) two-stage structures are more discriminative in scenes with complex distractors, while one-stage structures are faster; (4) integrating temporal information into the detection framework handles changes of the target and the scene better; (5) a separate assessment of state-estimation quality further improves the precision of the predicted box; (6) detectors have the potential to be turned directly into trackers. These lessons can provide guidance for the design of future tracking algorithms.

      | Taxonomy | Category | Advantage | Limitation |
      |---|---|---|---|
      | State estimation | Anchor-based | First introduces RPN detection technology; discards multi-scale search and can predict boxes with arbitrary aspect ratio | Relies on prior knowledge; incapable of rectifying weak predictions |
      | State estimation | Anchor-free | Fewer parameters and faster speed; corrects weak predictions caused by deformation and fast movement | Requires additional constraints (such as localization quality) due to the lack of prior knowledge |
      | Stage number | One-stage | Fast speed; easy to add extra modules (e.g. model update) | Weak discriminability against semantic interference |
      | Stage number | Two-stage | Better balance of robustness and discriminability | Complex structure and slow speed |
      | Others | IOUNet-based prediction | More accurate evaluation of localization quality | - |
      | Others | Detector-transformed tracker | Narrows the differences between detection and tracking with a common pattern to solve both problems | - |
      Table 2.  Comparison of advantages and disadvantages of siamese trackers with different detection techniques

    • With the combination of siamese networks and object detection technology, object tracking has made great progress in scale estimation and robustness to complex environments, but designing accurate, robust and real-time trackers for complex environments remains difficult. Based on the existing methods, experimental results and the latest research ideas, we discuss the open problems and future research directions of object tracking.

      (1) Uncertainty of target state estimation

      In complex scenes the bounding-box representation is highly uncertain, which makes annotation and the learning of box-regression functions difficult. As shown in Fig.8, non-rigid deformation, occlusion and motion blur all make the bounding box hard to delineate.

      Figure 8.  Boundaries with uncertainty. (a) Non-rigid deformation; (b) Occlusion; (c) Motion blur

      Martin [81] proposes a probabilistic regression method that predicts the conditional probability density of the target state and models the label noise arising from inaccurate annotations and ambiguous cases. There is also a large body of work on state-uncertainty estimation in object detection [82-85]. Applying such uncertainty estimation to tracking is of great significance for further improving the accuracy of target state prediction.

      (2) Imbalance of training samples

      During training, a tracking network sees only one positive sample per image, so the sample-imbalance problem is more severe than in detection; the information learned from large amounts of simple background or semantic distractors has weak discriminability, and simply transferring a detection model to tracking cannot fully exploit its advantages. Oksuz [86] summarizes the imbalance problems in object detection, including class imbalance, scale imbalance, spatial imbalance and imbalance in multi-task loss optimization, and gives solutions in terms of sampling strategies, features, loss functions and generative methods. Tracking can start from the same angles and develop its own methods for the imbalance problems in tracker training, further strengthening its data-driven capability.

      (3) Domain adaptation for tracking

      Siamese networks rely on large amounts of offline data to train the similarity measure; for categories not contained in the training set, the learned measure is not necessarily reliable, which leads to poor generalization. An ideal tracker should have domain-adaptation ability: faced with sequences of unknown categories, it should quickly adapt to the specific target instance from only a few samples. This paper has pointed out recent studies [71, 21] that initialize detectors with meta-learning, make full use of the information in the initial frame and reduce the negative effect of training-set bias. Studying meta-learning and other domain-adaptation methods will help improve the generalization of models in tracking and will be an important research direction.

      (4) Cross-fertilization with other fields

      The success of object detection has inspired object tracking in many ways. In the future, ideas and models from object detection, object segmentation, few-shot learning and other fields can continue to be borrowed for tracking. Conversely, tracking results can feed back into other fields: in video object detection or video object segmentation, the temporal modeling provided by tracking can reduce missed and false detections and further improve the accuracy of detection or segmentation.

      Recently, the Transformer [87] from NLP, with its excellent ability to build long-range dependencies and aggregate global information, has succeeded in many vision tasks [88]. The technique can also be applied to tracking. TransT [89] and Stark [90] replace the cross-correlation of siamese networks with attention, overcoming the lack of semantic and global information in the locally linear cross-correlation operation. TMT [91] uses the Transformer for feature enhancement, applying it to both a siamese tracker and the online DiMP. SwinTrack [92], based on the Swin Transformer, designs a tracker composed entirely of attention, achieving excellent performance and speed. It is foreseeable that the Transformer will remain a research hotspot for some time to come.

    • This paper has reviewed the recently popular siamese trackers that integrate detection technology and evaluated them through extensive experiments. The main contributions are threefold. First, existing detection-integrated siamese trackers are surveyed according to state estimation (anchor-based/anchor-free), stage number (one-stage/two-stage) and other aspects, and discussed from various angles. Second, extensive experiments on the mainstream OTB100, VOT2018, GOT-10k and LaSOT datasets compare the representative methods; this large-scale evaluation helps readers understand the benefit of detection frameworks for visual tracking. Third, by analyzing the development and experimental results of these methods, the remaining problems of object tracking are summarized in terms of uncertainty of target state estimation, imbalance of training samples, domain adaptation for tracking and cross-fertilization with other fields, and future research directions are discussed.
