Citation: ZHANG Lili, ZHANG Long, XIAO Ruiyang, et al. Cross-modal target detection based on adaptive calibration fusion[J]. Infrared and Laser Engineering, 2026, 55(1): 20250440. DOI: 10.3788/IRLA20250440

Cross-modal target detection based on adaptive calibration fusion

  • Objective To address modality misalignment, inter-modality interference, and the mismatch between spatial and semantic information prior to fusion in infrared and visible light fusion-based target detection, a cross-modal target detection model based on adaptive calibration fusion (ACF-YOLO) is proposed. ACF-YOLO uses a multi-kernel three-way perception module (MTPM) to extract features along the vertical, horizontal, and global directions through partial convolution and multi-scale convolution, which mitigates spatial information loss and alleviates feature mismatch. A cross-modal adaptive feature sampling module (MAFSM) learns the offset between the visible and infrared images, adjusts the spatial coordinates of targets via sampling operations to align the two modalities, and then adaptively fuses the aligned features according to their semantic and spatial information, suppressing inter-modality interference. Experimental results demonstrate that ACF-YOLO achieves an mAP50 of 83.7% and an mAP50-95 of 50.9% on the RGBT-Tiny dataset, improvements of 4.3% and 4.9% over the baseline, respectively. On the LLVIP dataset, it attains an mAP50 of 96.7% and an mAP50-95 of 62.6%, outperforming the other compared models.
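    The exact structure of the MTPM is not given in this abstract. The PyTorch sketch below is one plausible reading of "partial convolution plus multi-scale convolution along vertical, horizontal, and global directions": the FasterNet-style partial convolution, the strip-convolution kernel sizes, the depthwise global branch, and the residual fusion are all assumptions for illustration, not the authors' implementation.

# Hypothetical sketch of a multi-kernel three-way perception module (MTPM).
# Kernel sizes, the partial convolution, and the fusion layer are assumptions.
import torch
import torch.nn as nn


class PartialConv(nn.Module):
    """Convolve only the first 1/div of the channels; pass the rest through untouched."""
    def __init__(self, channels: int, div: int = 4):
        super().__init__()
        self.conv_ch = channels // div
        self.conv = nn.Conv2d(self.conv_ch, self.conv_ch, 3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x1, x2 = torch.split(x, [self.conv_ch, x.size(1) - self.conv_ch], dim=1)
        return torch.cat([self.conv(x1), x2], dim=1)


class MTPM(nn.Module):
    """Three-way (vertical / horizontal / global) multi-kernel perception."""
    def __init__(self, channels: int, kernels=(3, 7, 11)):
        super().__init__()
        self.pconv = PartialConv(channels)
        # Depthwise strip convolutions: k x 1 (vertical) and 1 x k (horizontal)
        self.vertical = nn.ModuleList(
            nn.Conv2d(channels, channels, (k, 1), padding=(k // 2, 0), groups=channels)
            for k in kernels)
        self.horizontal = nn.ModuleList(
            nn.Conv2d(channels, channels, (1, k), padding=(0, k // 2), groups=channels)
            for k in kernels)
        # Global direction: a plain depthwise convolution for full-plane context
        self.global_dw = nn.Conv2d(channels, channels, 5, padding=2, groups=channels)
        self.fuse = nn.Conv2d(channels, channels, 1)  # pointwise fusion

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.pconv(x)
        out = self.global_dw(x)
        for v, h in zip(self.vertical, self.horizontal):
            out = out + v(x) + h(x)          # aggregate multi-scale directional cues
        return x + self.fuse(out)            # residual path preserves spatial detail

    The module is shape-preserving, e.g. MTPM(256)(torch.randn(1, 256, 80, 80)) returns a tensor of the same 1 x 256 x 80 x 80 size, so it can be dropped into a backbone stage without changing the downstream layout.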
    Methods Given a pair of RGB and IR images, two modality-specific backbone networks first extract features. To improve the feature extraction ability of the backbones and to resolve the mismatch between spatial and semantic information before modal fusion, the MTPM is designed: it extracts multi-scale texture features across different receptive fields by applying non-dilated depthwise convolutions along different directions, thereby reducing the loss of spatial information in the network. The extracted RGB and IR feature maps are then concatenated and fed into the neck network for feature interaction and fusion. The MAFSM is introduced into the neck: its spatial-information and semantic-information branches predict the offset of the target, the spatial coordinates of objects are adjusted by a sampling operation, and the RGB and IR features are adaptively fused, addressing modal misalignment and modal interference at the same time. Finally, the aligned and adaptively fused RGB and IR feature maps are fed into the detection head, which outputs the bounding boxes and detection scores of all objects (Fig.1).
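    As a companion to the description above, the sketch below shows how an offset-predicting branch, a grid-sample-based alignment step, and a per-pixel fusion gate could be combined into a MAFSM-like block. The layer configuration, the normalized-offset convention, and the sigmoid gate are hypothetical choices that only follow the textual description, not the paper's actual design.

# Hypothetical sketch of a cross-modal adaptive feature sampling module (MAFSM).
import torch
import torch.nn as nn
import torch.nn.functional as F


class MAFSM(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        # Spatial branch: predicts a per-pixel (dx, dy) offset for the IR features
        self.offset = nn.Sequential(
            nn.Conv2d(2 * channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, 2, 3, padding=1))
        # Semantic branch: predicts a per-pixel fusion weight for the two modalities
        self.gate = nn.Sequential(
            nn.Conv2d(2 * channels, channels, 1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, 1, 1), nn.Sigmoid())

    def forward(self, rgb: torch.Tensor, ir: torch.Tensor) -> torch.Tensor:
        b, _, h, w = rgb.shape
        pair = torch.cat([rgb, ir], dim=1)

        # 1) Learn the offset between the modalities (in normalized [-1, 1] grid units)
        offset = self.offset(pair).permute(0, 2, 3, 1)          # B x H x W x 2

        # 2) Build an identity sampling grid and shift it by the predicted offset
        ys, xs = torch.meshgrid(
            torch.linspace(-1, 1, h, device=rgb.device),
            torch.linspace(-1, 1, w, device=rgb.device), indexing="ij")
        grid = torch.stack((xs, ys), dim=-1).unsqueeze(0).expand(b, -1, -1, -1)
        ir_aligned = F.grid_sample(ir, grid + offset, align_corners=True)

        # 3) Adaptively fuse the aligned features with a learned per-pixel gate
        w_rgb = self.gate(torch.cat([rgb, ir_aligned], dim=1))
        return w_rgb * rgb + (1.0 - w_rgb) * ir_aligned

    Under this sketch, the spatial branch handles misalignment (by warping the IR features toward the RGB geometry) while the gate handles interference (by down-weighting the less reliable modality at each location), which mirrors the division of labor described in the abstract.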
    Results and Discussions On the RGBT-Tiny dataset, the proposed ACF-YOLO is compared with strong object detectors such as RT-DETR, Mamba YOLO, DEIM, YOLOv5, YOLOv8, and YOLOv12. As shown in Table 2, detectors run on the visible modality significantly outperform their infrared counterparts, and dual-modality YOLOv8 reaches 79.4% mAP50 and 46.0% mAP50-95, the best result among the compared algorithms. ACF-YOLO clearly surpasses dual-modality YOLOv8, with a 4.3% higher mAP50 and a 4.9% higher mAP50-95. These gains come from the MTPM enhancements to the backbone networks and from the alignment and adaptive fusion of RGB and IR features performed by the MAFSM. On the LLVIP dataset, Table 3 compares ACF-YOLO with the state-of-the-art methods IRFS, SIDNet, GAFF, ProEn, CDDFuse, SwinFusion, PIAFusion, U2Fusion, and CSAA. ACF-YOLO outperforms all competitors with 96.7% mAP50 and 62.6% mAP50-95, exceeding the best existing method, PIAFusion, by 0.6% in mAP50 and 0.2% in mAP50-95.
    Conclusions To address modal misalignment, inter-modal interference, and the mismatch between spatial and semantic information before modal fusion in infrared and visible light fusion-based object detection, we propose a cross-modal object detector based on adaptive calibration fusion, called ACF-YOLO. ACF-YOLO uses the MTPM to extract features from feature maps along the vertical, horizontal, and global directions, reducing the loss of spatial information and mitigating the mismatch between spatial and semantic information. The MAFSM learns the offsets between the infrared and visible images to adjust the spatial coordinates of objects, achieving modal alignment, and adaptively fuses the aligned features based on semantic and spatial information. To validate the proposed method, comparative experiments are conducted on the RGBT-Tiny and LLVIP datasets. On RGBT-Tiny, the proposed method achieves an mAP50 of 83.7% and an mAP50-95 of 50.9%, which are 4.3% and 4.9% higher than the baseline, respectively. On LLVIP, the mAP50 reaches 96.7% and the mAP50-95 reaches 62.6%. These results indicate that ACF-YOLO not only surpasses single-modality models such as RT-DETR, Mamba YOLO, and DEIM, but also outperforms dual-modality models such as YOLOv8. Extensive ablation studies on the RGBT-Tiny dataset further demonstrate the effectiveness of each module.