Abstract:
Objective Person re-identification (ReID) is an important research direction in computer vision, aiming to accurately identify the same person across images captured by different cameras. The technology has broad application value in fields such as public security, intelligent surveillance, and traffic management. Most ReID methods rely on visible-light images collected under ideal lighting conditions. In low-illumination environments, however, visible-light cameras are limited by their imaging capability and often fail to produce high-quality images: insufficient brightness, blurred details, and loss of appearance information are common, which severely degrades recognition performance and makes it difficult to meet the demands of all-weather surveillance. In contrast, infrared imaging offers excellent night-time perception and can stably capture the thermal radiation contours of the human body in low-light or no-light environments. Visible-infrared cross-modal person re-identification, which fuses information from visible-light and infrared images and reduces the modality gap between them, has therefore become a key approach to improving recognition performance under low-light conditions and has attracted increasing attention.
Methods A cross-modal person re-identification model based on spatio-temporal interaction and multi-level feature embedding is proposed (Fig.1). First, the extracted features are reconstructed with respect to spatial perspective dependencies and style correlations through a spatio-temporal interaction attention mechanism, enhancing their expressive ability (Fig.2). Next, an adaptive fusion module deeply fuses the visible and infrared features at both the global and channel levels and maps them into a unified feature space for accurate matching (Fig.3). Then, a spatio-temporal pooling enhancement module is introduced, which optimizes the global feature structure and attends to local dynamic changes through spatial and temporal operations, further enriching the semantic information of the two modalities (Fig.4). Finally, a quaternion center triplet loss constrains samples of the same class to converge toward the ideal center of their corresponding modality while enlarging the distance between centers of different classes, thereby significantly improving cross-modal re-identification performance.
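The abstract does not give the exact formulation of the quaternion center triplet loss; as a rough, hypothetical illustration of a center-based constraint of the kind described above, the PyTorch-style sketch below pulls each sample toward the center of its own identity and modality and pushes centers of different identities apart by a margin. The function name center_triplet_loss, the margin value, and the per-identity, per-modality center computation are assumptions for illustration, not the paper's definition.

```python
import torch
import torch.nn.functional as F

def center_triplet_loss(feats, labels, modality, margin=0.3):
    """Hypothetical center-based triplet constraint (not the paper's exact loss).

    feats:    (N, D) embeddings from both modalities
    labels:   (N,) identity labels
    modality: (N,) 0 = visible, 1 = infrared
    Assumes the batch contains samples from more than one identity.
    """
    centers, keys = [], []
    # One center per (identity, modality) pair present in the batch.
    for pid in labels.unique():
        for m in (0, 1):
            mask = (labels == pid) & (modality == m)
            if mask.any():
                centers.append(feats[mask].mean(dim=0))
                keys.append((int(pid), m))
    centers = torch.stack(centers)                       # (C, D)

    # Pull: each sample moves toward the center of its own identity/modality.
    pull = sum((feats[(labels == pid) & (modality == m)] - centers[i]).norm(dim=1).mean()
               for i, (pid, m) in enumerate(keys)) / len(keys)

    # Push: centers of different identities should be at least `margin` apart.
    dist = torch.cdist(centers, centers)                 # (C, C) center-to-center distances
    diff_id = torch.tensor([[ki[0] != kj[0] for kj in keys] for ki in keys])
    push = F.relu(margin - dist[diff_id]).mean()

    return pull + push
```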
Results and Discussions To verify the effectiveness of the model, experiments are carried out on the SYSU-MM01 and RegDB datasets. On SYSU-MM01, compared with the baseline model MSCMNet, Rank-1 and mAP increase by 1.73% and 1.85% in the all-search mode, and by 1.67% and 1.36% in the indoor-search mode, indicating that the model can effectively mine cross-modal feature associations (Tab.5). On RegDB, Rank-1 and mAP reach 91.8% and 81.4% in the visible-to-infrared mode, improvements of 1.7% and 1.6%; in the infrared-to-visible mode they reach 88.7% and 79.2%, improvements of 1.3% and 1.7% (Tab.6), verifying the robustness of the model under different modality conversion directions.
Conclusions The proposed cross-modal person re-identification model with spatio-temporal interaction and multi-level feature embedding alleviates the spatio-temporal feature distribution shift caused by cross-modal spectral differences, enhances the discriminability of modality-shared features, and achieves refined calibration of cross-modal spatio-temporal information, thereby realizing higher recognition accuracy. Building on existing research, a spatio-temporal interaction attention mechanism is designed; by reconstructing features according to spatial perspective dependence and cross-style correlation, it effectively alleviates mismatching caused by misaligned cross-modal dynamic features. In addition, an adaptive fusion module is proposed; by guiding the global-channel fusion of visible and infrared features, it strengthens the spatio-temporal feature representation. Finally, a spatio-temporal pooling enhancement module is designed; through parallel spatial and temporal operations that jointly calibrate global and local features, it significantly improves network performance. The model achieves consistent improvements of key indicators across datasets and retrieval modes, significantly enhancing the discriminative power of pedestrian features and the recognition accuracy.
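As a complementary illustration of the global-channel fusion idea mentioned above, the minimal PyTorch-style sketch below combines visible and infrared feature maps with a scalar modality-balance gate and a channel re-weighting gate. The class name AdaptiveFusion, the two sigmoid gates, and the weighted-sum formulation are assumptions for illustration only, not the module defined in the paper.

```python
import torch
import torch.nn as nn

class AdaptiveFusion(nn.Module):
    """Hypothetical sketch: fuse visible and infrared features at global and channel levels."""

    def __init__(self, channels):
        super().__init__()
        # Channel-level gate: per-channel weights from globally pooled statistics.
        self.channel_gate = nn.Sequential(
            nn.Linear(2 * channels, channels),
            nn.ReLU(inplace=True),
            nn.Linear(channels, channels),
            nn.Sigmoid(),
        )
        # Global-level gate: a single scalar balancing the two modalities.
        self.global_gate = nn.Sequential(
            nn.Linear(2 * channels, 1),
            nn.Sigmoid(),
        )

    def forward(self, x_vis, x_ir):
        # x_vis, x_ir: (B, C, H, W) feature maps from the two modality branches.
        g_vis = x_vis.mean(dim=(2, 3))                   # (B, C) global descriptors
        g_ir = x_ir.mean(dim=(2, 3))
        stats = torch.cat([g_vis, g_ir], dim=1)          # (B, 2C)

        a = self.global_gate(stats).view(-1, 1, 1, 1)    # (B, 1, 1, 1) modality balance
        w = self.channel_gate(stats).view(-1, x_vis.size(1), 1, 1)  # (B, C, 1, 1)

        # Weighted sum of the two modalities, then channel-wise re-weighting
        # before projection into the shared embedding space.
        fused = a * x_vis + (1 - a) * x_ir
        return fused * w
```

In this sketch the scalar gate decides how much each modality contributes, while the channel gate re-weights the fused map; both gates are driven by the pooled statistics of the two input branches.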