Spatiotemporal interaction and multi-level feature embedding for cross-modal person re-identification

  • Abstract: To address the problems of spatio-temporal feature distribution shift, weak discriminability of modality-shared features, and coarse calibration of cross-modal spatio-temporal information in cross-modal person re-identification (ReID), a cross-modal ReID model based on spatio-temporal interaction and multi-level feature embedding is proposed. First, a spatio-temporal interaction attention is designed, which reconstructs cross-modal spatial correlations through a dynamic view-perception mechanism and effectively resolves the spatio-temporal feature distribution shift caused by differences in imaging mechanisms. Second, an adaptive fusion module is constructed, which performs channel-wise nonlinear fusion of multi-scale features guided by cross-modal feature correlation measures, significantly strengthening the discriminability of modality-shared features (see the sketch below). Finally, a time-space pooling enhancement module is designed, which jointly calibrates global and local features by combining spatial- and temporal-dimension operations, achieving fine-grained calibration of cross-modal spatio-temporal information. Experiments on the public SYSU-MM01 and RegDB datasets show that the proposed method improves mAP by 1.85% and 1.7% respectively, confirming the model's high recognition accuracy and robustness in complex scenes and its value for deploying cross-modal ReID in real-world security applications.
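To make the channel-wise fusion idea concrete, here is a minimal PyTorch sketch of one plausible reading of the adaptive fusion step: a gate computed from global descriptors of both modalities blends the visible and infrared feature maps channel by channel. The class name, gate design, and reduction ratio are illustrative assumptions, not the paper's implementation (see Fig.3 for the actual module).

```python
import torch
import torch.nn as nn

class AdaptiveFusion(nn.Module):
    """Channel-wise gate that blends visible and infrared feature maps
    (a sketch under assumptions, not the paper's exact design)."""
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(2 * channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, vis: torch.Tensor, ir: torch.Tensor) -> torch.Tensor:
        # vis, ir: (B, C, H, W) feature maps from the two modality branches
        b, c = vis.shape[:2]
        # global descriptors of both modalities jointly drive the per-channel gate
        desc = torch.cat([vis.mean(dim=(2, 3)), ir.mean(dim=(2, 3))], dim=1)  # (B, 2C)
        w = self.gate(desc).view(b, c, 1, 1)   # per-channel weight in (0, 1)
        return w * vis + (1.0 - w) * ir        # channel-wise nonlinear blend
```

Gating on a joint descriptor of both modalities is what makes the fusion correlation-driven: the weight assigned to each visible channel depends on the infrared evidence as well, rather than on either branch alone.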

     

    Abstract:
    Objective Person re-identification (ReID) is an important research direction in computer vision that aims to accurately match the same person across images captured by different cameras. The technology has broad application value in fields such as public security, intelligent surveillance, and traffic management. Most ReID methods rely on visible-light images collected under ideal lighting conditions. In low-illumination environments, however, visible-light cameras are limited by their imaging capability and often fail to produce high-quality images: insufficient brightness, blurred details, and loss of appearance information are common, which severely degrades recognition performance and falls short of the practical requirements of all-weather surveillance. Infrared imaging, in contrast, offers excellent night-time perception and can stably capture the thermal radiation contours of the human body in low-light or no-light environments. Visible-infrared cross-modal person re-identification, which fuses information from visible and infrared images while reducing the modality differences between them, has therefore become a crucial direction for improving recognition performance under low-light conditions and has attracted increasing attention.
    Methods A cross-modal person re-identification model based on spatio-temporal interaction and multi-level feature embedding is proposed (Fig.1). First, the extracted features are reconstructed in terms of spatial perspective dependencies and style correlations through a spatio-temporal interaction attention, enhancing their representational power (Fig.2). Second, an adaptive fusion module deeply fuses the visible and infrared features at both the global and channel levels and maps them to a unified feature space for accurate matching (Fig.3). Third, a time-space pooling enhancement module optimizes the global feature structure and focuses on local dynamic changes through spatial and temporal operations, further enriching the semantic information of the two modalities (Fig.4). Finally, a quaternion center triplet loss constrains samples of the same class to converge to the ideal center of their corresponding modality while enlarging the feature distance between centers of different classes, thereby significantly improving cross-modal person re-identification performance.
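The quaternion center triplet loss itself is not spelled out in this abstract. As a rough point of reference only, the sketch below implements the plain center-triplet constraint it extends: each embedding is pulled toward its own class center and pushed away from the nearest other-class center by a margin. Batch-wise mean centers, the function name, and the margin value are all illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def center_triplet_loss(feats: torch.Tensor, labels: torch.Tensor,
                        margin: float = 0.3) -> torch.Tensor:
    """Pull each embedding toward its class center and push it away from
    the nearest center of any other class by at least `margin`
    (a sketch of the base constraint, not the quaternion variant)."""
    classes = labels.unique()
    # class centers: mean embedding per identity within the mini-batch
    centers = torch.stack([feats[labels == c].mean(dim=0) for c in classes])  # (K, D)
    dist = torch.cdist(feats, centers)                        # (B, K) sample-to-center distances
    own = (labels.unsqueeze(1) == classes.unsqueeze(0)).long().argmax(dim=1)
    pos = dist.gather(1, own.unsqueeze(1)).squeeze(1)         # distance to own class center
    masked = dist.scatter(1, own.unsqueeze(1), float("inf"))  # hide own center
    neg = masked.min(dim=1).values                            # hardest other-class center
    return F.relu(pos - neg + margin).mean()
```

A quaternion-valued variant would apply the same pull/push constraint in a quaternion feature space; that extension is left to the paper.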
    Results and Discussions To verify the effectiveness of the model, tests are carried out on the SYSU-MM01 and RegDB datasets. On SYSU-MM01, compared with the baseline model MSCMNet, Rank-1 and mAP increase by 1.73% and 1.85% respectively in the all-search mode, and by 1.67% and 1.36% respectively in the indoor-search mode, indicating that the model can effectively mine cross-modal feature associations (Tab.5). On RegDB, Rank-1 and mAP reach 91.8% and 81.4% in the visible-to-infrared mode, up by 1.7% and 1.6% respectively, and 88.7% and 79.2% in the infrared-to-visible mode, up by 1.3% and 1.7% respectively (Tab.6), verifying the robustness of the model in both modality conversion directions.
    Conclusions The proposed cross-modal person re-identification model with spatio-temporal interaction and multi-level feature embedding alleviates the spatio-temporal feature distribution shift caused by cross-modal spectral differences, enhances the discriminability of modality-shared features, and achieves refined calibration of cross-modal spatio-temporal information, thereby realizing higher recognition accuracy. Specifically, a spatio-temporal interaction attention is designed that reconstructs spatial perspective dependencies and cross-style correlations, effectively alleviating the mismatching caused by misaligned cross-modal dynamic features. An adaptive fusion module is proposed that guides the global-channel fusion of visible and infrared features, enhancing the spatio-temporal feature representation. Finally, a time-space pooling enhancement module jointly calibrates global and local features through parallel spatial-temporal operations, significantly improving network performance. The model achieves stable improvements in key metrics across different datasets and retrieval modes, markedly enhancing the discriminability of pedestrian features and the recognition accuracy.
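For illustration only, below is a minimal sketch of the parallel spatial/temporal calibration idea behind the time-space pooling enhancement module, assuming sequence-style features of shape (B, T, C, H, W); the tensor layout, gating form, and function name are assumptions rather than the module in Fig.4.

```python
import torch

def time_space_pooling_enhance(x: torch.Tensor) -> torch.Tensor:
    """Two parallel pooling branches calibrate the input features:
    spatial pooling summarizes the global layout of each frame, while
    temporal pooling keeps each location's most salient response across
    frames; both act as multiplicative gates (a sketch under assumptions)."""
    # x: (B, T, C, H, W) sequence features
    spatial = x.mean(dim=(3, 4), keepdim=True)    # (B, T, C, 1, 1): global structure
    temporal = x.max(dim=1, keepdim=True).values  # (B, 1, C, H, W): local dynamics
    return x * torch.sigmoid(spatial) + x * torch.sigmoid(temporal)
```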

     
